Understanding the extractHBaseCells Morphline
Command
The extractHBaseCells morphline command extracts cells
from an HBase result and transforms the values into a
SolrInputDocument. The command consists of an array
of zero or more mapping specifications.
Each mapping has:
- The
inputColumnparameter, which specifies the data from HBase for populating a field in Solr. It has the form of a column family name and qualifier, separated by a colon. The qualifier portion can end in an asterisk, which is interpreted as a wildcard. In this case, all matching column-family and qualifier expressions are used. The following are examples of validinputColumnvalues:-
mycolumnfamily:myqualifier -
mycolumnfamily:my* -
mycolumnfamily:*
-
- The
outputFieldparameter specifies the morphline record field to which to add output values. The morphline record field is also known as the Solr document field. Example:first_name. - Dynamic output fields are enabled by the
outputFieldparameter ending with a wildcard (*). For example:
In this case, if you make these puts in HBase:inputColumn : "mycolumnfamily:*" outputField : "belongs_to_*"
Then the fields of the Solr document are as follows:put 'table_name' , 'row1' , 'mycolumnfamily:1' , 'foo' put 'table_name' , 'row1' , 'mycolumnfamily:9' , 'bar'belongs_to_1 : foo belongs_to_9 : bar - The
typeparameter defines the data type of the content in HBase. All input data is stored in HBase as byte arrays, but all content in Solr is indexed as text, so a method for converting byte arrays to the actual data type is required. The type parameter can be the name of a type that is supported byorg.apache.hadoop.hbase.util.Bytes.to*(which currently includesbyte[],int,long,string,boolean,float,double,short, andbigdecimal). Use typebyte[]to pass the byte array through to the morphline without conversion.-
type:byte[] copies the byte array unmodified into the record output field -
type:int converts with org.apache.hadoop.hbase.util.Bytes.toInt -
type:long converts with org.apache.hadoop.hbase.util.Bytes.toLong -
type:string converts with org.apache.hadoop.hbase.util.Bytes.toString -
type:boolean converts with org.apache.hadoop.hbase.util.Bytes.toBoolean -
type:float converts with org.apache.hadoop.hbase.util.Bytes.toFloat -
type:double converts with org.apache.hadoop.hbase.util.Bytes.toDouble -
type:short converts with org.apache.hadoop.hbase.util.Bytes.toShort -
type:bigdecimal converts with org.apache.hadoop.hbase.util.Bytes.toBigDecimal
typeparameter can be the name of a Java class that implements thecom.ngdata.hbaseindexer.parse.ByteArrayValueMapperinterface.HBase data formatting does not always match what is specified byorg.apache.hadoop.hbase.util.Bytes.*. For example, this can occur with data of typefloatordouble. You can enable indexing of such HBase data by converting the data. There are various ways to do so, including:- Using Java morphline command to parse input data, converting
it to the expected output. For
example:
{ imports : "import java.util.*;" code: """ // manipulate the contents of a record field String stringAmount = (String) record.getFirstValue("amount"); Double dbl = Double.parseDouble(stringAmount); record.replaceValues("amount",dbl); return child.process(record); // pass record to next command in chain """ } - Creating table fields with binary format and then using types
such as double or float in a
morphline.conf. You could create a table in HBase for storing doubles using commands similar to:CREATE TABLE sample_lily_hbase ( id string, amount double, ts timestamp ) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,ti:amount#b,ti:ts,') TBLPROPERTIES ('hbase.table.name' = 'sample_lily');
-
- The
sourceparameter determines which portion of an HBase KeyValue is used as indexing input. Valid choices arevalueorqualifier. Whenvalueis specified, the HBase cell value is used as input for indexing. Whenqualifieris specified, then the HBase column qualifier is used as input for indexing. The default isvalue.
