HBase Datastores

HBase

DataNucleus supports persisting/retrieving objects to/from HBase datastores (using the datanucleus-hbase plugin, which makes use of the HBase/Hadoop jars). Specify your connection URL as follows:

datanucleus.ConnectionURL=hbase[:{server}:{port}]
datanucleus.ConnectionUserName=
datanucleus.ConnectionPassword=

If you specify the URL as just hbase, then you have a local HBase datastore; otherwise DataNucleus tries to connect to the datastore at {server}:{port}. Alternatively, put "hbase" as the URL and set the ZooKeeper details in "hbase-site.xml" as normal. You then create your PMF/EMF and use JDO/JPA as normal.
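As a minimal sketch of the settings above, the following builds the connection properties with plain java.util.Properties. The class name HBaseConnectionProps, the helper forHBase, and the host "myserver" with port 2181 are illustrative assumptions, not part of DataNucleus. In a real application the resulting properties would typically be handed to JDOHelper.getPersistenceManagerFactory(props) for JDO, or passed as the property map to Persistence.createEntityManagerFactory for JPA.

```java
import java.util.Properties;

public class HBaseConnectionProps {

    // Builds the datastore properties described above.
    // Pass null as the server to get the plain "hbase" URL (local datastore,
    // or ZooKeeper details taken from hbase-site.xml).
    static Properties forHBase(String server, int port) {
        String url = (server == null) ? "hbase" : "hbase:" + server + ":" + port;
        Properties props = new Properties();
        props.setProperty("datanucleus.ConnectionURL", url);
        props.setProperty("datanucleus.ConnectionUserName", "");
        props.setProperty("datanucleus.ConnectionPassword", "");
        return props;
    }

    public static void main(String[] args) {
        // "myserver" and 2181 are placeholder values for {server} and {port}.
        System.out.println(forHBase(null, 0).getProperty("datanucleus.ConnectionURL"));
        System.out.println(forHBase("myserver", 2181).getProperty("datanucleus.ConnectionURL"));
    }
}
```

Keeping the URL construction in one helper makes it easy to switch between the local form and the {server}:{port} form without touching the rest of the bootstrap code.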

The jars required to use DataNucleus HBase persistence are datanucleus-core, datanucleus-api-jdo or datanucleus-api-jpa, datanucleus-hbase, hbase, hadoop-core, and zookeeper.

Tutorials are available for using DataNucleus with HBase, both for JDO and for JPA.

Things to bear in mind with HBase usage :-

  • Creation of a PMF/EMF will create an internal HBaseConnectionPool.
  • Creation of a PM/EM will create/use an HConnection.
  • Querying can be performed using JDOQL or JPQL. Some components of a filter are handled in the datastore, and the remainder in-memory. Currently any expression involving a field (in the same table) or a literal is handled in-datastore, as are the operators &&, ||, >, >=, <, <=, ==, and !=.
  • The "row key" will be the PK field(s) when using "application-identity", and the generated id when using "datastore-identity".

Field/Column Naming

By default each field is mapped to a single column in the datastore, with the family name being the name of the table, and the column name based on the name of the field (but following JDO/JPA naming strategies for the precise column name). You can override this as follows:

@Column(name="{familyName}:{qualifierName}")
String myField;

replacing {familyName} with the family name you want to use, and {qualifierName} with the column name (qualifier name in HBase terminology) you want to use. Alternatively if you don't want to override the default family name (the table name), then you just omit the "{familyName}:" part and simply specify the column name.
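The naming rule above can be illustrated with a small standalone sketch. This is not DataNucleus's actual parsing code; the class ColumnNameSketch and the helper familyAndQualifier are hypothetical names used only to show how a "{familyName}:{qualifierName}" value splits, and how the table name serves as the family when no prefix is given.

```java
public class ColumnNameSketch {

    // Returns {family, qualifier} for a column name, falling back to the
    // table name as the family when there is no "family:" prefix.
    static String[] familyAndQualifier(String columnName, String tableName) {
        int sep = columnName.indexOf(':');
        if (sep < 0) {
            return new String[] { tableName, columnName };
        }
        return new String[] { columnName.substring(0, sep), columnName.substring(sep + 1) };
    }

    public static void main(String[] args) {
        String[] explicit = familyAndQualifier("meta:firstName", "MyClass");
        String[] defaulted = familyAndQualifier("firstName", "MyClass");
        System.out.println(explicit[0] + " / " + explicit[1]);   // prints "meta / firstName"
        System.out.println(defaulted[0] + " / " + defaulted[1]); // prints "MyClass / firstName"
    }
}
```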

MetaData Extensions

Some metadata extensions (@Extension) have been added to DataNucleus to support some HBase-specific table creation options. The attributes supported at table creation for a column family are:

  • bloomFilter : Bloom filters are an advanced HBase feature that can improve lookup times when you have a specific access pattern. Default is NONE. Possible values are: ROW -> use the row key for the filter, ROWKEY -> use the row key and column key (family+qualifier) for the filter.
  • inMemory : The in-memory flag defaults to false. Setting it to true does not guarantee that all blocks of a family are loaded into memory, nor that they stay there. It gives the blocks an elevated priority: they are kept in memory once loaded during a normal retrieval operation, until the pressure on the heap (the memory available to the Java-based server processes) becomes too high, at which point they are discarded by force.
  • maxVersions : Per family, you can specify how many versions of each value you want to keep. The default value is 3, but you may reduce it to 1, for example, if you know for sure that you will never want to look at older values.
  • keepDeletedCells : Column families can optionally keep deleted cells. That means deleted cells can still be retrieved with Get or Scan operations, as long as these operations have a time range specified that ends before the timestamp of any delete that would affect the cells. This allows for point-in-time queries even in the presence of deletes. Deleted cells are still subject to TTL, and there will never be more than "maximum number of versions" deleted cells. A "raw" scan option returns all deleted rows and the delete markers.
  • compression : HBase has pluggable compression algorithms. The default value is NONE. Possible values are GZ, LZO, SNAPPY.
  • blockCacheEnabled : As HBase reads entire blocks of data for efficient I/O usage, it retains these blocks in an in-memory cache so that subsequent reads do not need any disk operation. The default of true enables the block cache for every read operation. If your use case only ever has sequential reads on a particular column family, it is advisable to set this to false so that the family does not pollute the block cache.
  • timeToLive : HBase supports predicate deletions based on the number of versions kept for each value, and also on specific times. The time-to-live (or TTL) sets a threshold based on the timestamp of a value, and internal housekeeping automatically checks whether a value exceeds its TTL. If so, the value is dropped during major compactions.

To express these options, a format similar to a properties file is used such as:

hbase.columnFamily.[family name to apply property on].[attribute] = {value}

where:

  • attribute: One of the above defined attributes (inMemory, bloomFilter,...)
  • family name to apply property on: The column family affected.
  • value: Associated value for this attribute.
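The key format can be sketched as a simple string composition. The class ExtensionKeySketch and the helper columnFamilyKey are hypothetical names used for illustration only; the resulting string is what would go in the key attribute of an @Extension, with the value supplied separately.

```java
public class ExtensionKeySketch {

    // Composes an extension key of the form
    // hbase.columnFamily.[family name].[attribute]
    static String columnFamilyKey(String family, String attribute) {
        return "hbase.columnFamily." + family + "." + attribute;
    }

    public static void main(String[] args) {
        // Keys for a column family called "meta"
        System.out.println(columnFamilyKey("meta", "bloomFilter")); // prints "hbase.columnFamily.meta.bloomFilter"
        System.out.println(columnFamilyKey("meta", "inMemory"));    // prints "hbase.columnFamily.meta.inMemory"
    }
}
```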

An example that applies to the "meta" column family, setting the bloom filter option to ROWKEY and the in-memory flag to true, would look like:

@PersistenceCapable 
@Extensions({
    @Extension(vendorName = "datanucleus", key = "hbase.columnFamily.meta.bloomFilter", value = "ROWKEY"), 
    @Extension(vendorName = "datanucleus", key = "hbase.columnFamily.meta.inMemory", value = "true") 
}) 
public class MyClass
{
    @PrimaryKey 
    private long id; 

    // column family data, name of attribute blob 
    @Column(name = "data:blob") 
    private String blob; 

    // column family meta, name of attribute firstName 
    @Column(name = "meta:firstName") 
    private String firstName;

    // column family meta, name of attribute lastName
    @Column(name = "meta:lastName") 
    private String lastName;
   
   [ ... getter and setter ... ]