Understanding the extractHBaseCells Morphline Command
The extractHBaseCells morphline command extracts cells from an HBase result and transforms the values into a SolrInputDocument. The command consists of an array of zero or more mapping specifications.
Each mapping has:
- The inputColumn parameter, which specifies the data from HBase for populating a field in Solr. It has the form of a column family name and qualifier, separated by a colon. The qualifier portion can end in an asterisk, which is interpreted as a wildcard. In this case, every cell whose column family and qualifier match the expression is used. The following are examples of valid inputColumn values:
  - mycolumnfamily:myqualifier
  - mycolumnfamily:my*
  - mycolumnfamily:*
- The outputField parameter specifies the morphline record field to which to add output values. The morphline record field is also known as the Solr document field. Example: first_name.
  - Dynamic output fields are enabled by the outputField parameter ending with a wildcard (*). For example:
      inputColumn : "mycolumnfamily:*"
      outputField : "belongs_to_*"
    In this case, if you make these puts in HBase:
      put 'table_name' , 'row1' , 'mycolumnfamily:1' , 'foo'
      put 'table_name' , 'row1' , 'mycolumnfamily:9' , 'bar'
    Then the fields of the Solr document are as follows:
      belongs_to_1 : foo
      belongs_to_9 : bar
- The type parameter defines the data type of the content in HBase. All input data is stored in HBase as byte arrays, but all content in Solr is indexed as text, so a method for converting byte arrays to the actual data type is required. The type parameter can be the name of a type that is supported by org.apache.hadoop.hbase.util.Bytes.to* (which currently includes byte[], int, long, string, boolean, float, double, short, and bigdecimal). Use type byte[] to pass the byte array through to the morphline without conversion.
  - type:byte[] copies the byte array unmodified into the record output field
  - type:int converts with org.apache.hadoop.hbase.util.Bytes.toInt
  - type:long converts with org.apache.hadoop.hbase.util.Bytes.toLong
  - type:string converts with org.apache.hadoop.hbase.util.Bytes.toString
  - type:boolean converts with org.apache.hadoop.hbase.util.Bytes.toBoolean
  - type:float converts with org.apache.hadoop.hbase.util.Bytes.toFloat
  - type:double converts with org.apache.hadoop.hbase.util.Bytes.toDouble
  - type:short converts with org.apache.hadoop.hbase.util.Bytes.toShort
  - type:bigdecimal converts with org.apache.hadoop.hbase.util.Bytes.toBigDecimal
  The type parameter can also be the name of a Java class that implements the com.ngdata.hbaseindexer.parse.ByteArrayValueMapper interface.
  HBase data formatting does not always match what is specified by org.apache.hadoop.hbase.util.Bytes.*. For example, this can occur with data of type float or double. You can enable indexing of such HBase data by converting the data. There are various ways to do so, including:
  - Using the java morphline command to parse input data, converting it to the expected output. For example:
      {
        java {
          imports : "import java.util.*;"
          code: """
            // manipulate the contents of a record field
            String stringAmount = (String) record.getFirstValue("amount");
            Double dbl = Double.parseDouble(stringAmount);
            record.replaceValues("amount", dbl);
            return child.process(record); // pass record to next command in chain
          """
        }
      }
  - Creating table fields with binary format and then using types such as double or float in a morphline.conf. You could create an HBase-backed table for storing doubles using Hive commands similar to:
      CREATE TABLE sample_lily_hbase (
        id string,
        amount double,
        ts timestamp
      )
      STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
      WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,ti:amount#b,ti:ts,')
      TBLPROPERTIES ('hbase.table.name' = 'sample_lily');
- The source parameter determines which portion of an HBase KeyValue is used as indexing input. Valid choices are value or qualifier. When value is specified, the HBase cell value is used as input for indexing. When qualifier is specified, then the HBase column qualifier is used as input for indexing. The default is value.
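The way the four mapping parameters interact can be sketched in plain Java. This is a simplified model for illustration only, not the command's actual implementation. The class name MappingSketch, its constructor arguments, and the apply method are hypothetical. The JDK's ByteBuffer stands in for org.apache.hadoop.hbase.util.Bytes (both read int, long, and double values as big-endian bytes), and only a subset of the types is covered. The sketch also assumes that the wildcard in a dynamic outputField is replaced by the part of the qualifier that follows the inputColumn prefix, which agrees with the belongs_to_* example above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.Optional;

/** Hypothetical, simplified model of one extractHBaseCells mapping. */
final class MappingSketch {
    private final String family;          // column family taken from inputColumn
    private final String qualifierPrefix; // qualifier text before an optional trailing '*'
    private final boolean inputWildcard;  // true when the inputColumn qualifier ends in '*'
    private final String outputField;
    private final String type;            // subset handled here: byte[], string, int, long, double
    private final String source;          // "value" (the default) or "qualifier"

    MappingSketch(String inputColumn, String outputField, String type, String source) {
        int colon = inputColumn.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("inputColumn must look like family:qualifier");
        }
        this.family = inputColumn.substring(0, colon);
        String qualifier = inputColumn.substring(colon + 1);
        this.inputWildcard = qualifier.endsWith("*");
        this.qualifierPrefix = inputWildcard
                ? qualifier.substring(0, qualifier.length() - 1)
                : qualifier;
        this.outputField = outputField;
        this.type = type;
        this.source = (source == null) ? "value" : source;
    }

    /** Returns (field name, converted value) for a matching cell, or empty if the cell does not match. */
    Optional<Map.Entry<String, Object>> apply(String cellFamily, byte[] cellQualifier, byte[] cellValue) {
        String qualifierText = new String(cellQualifier, StandardCharsets.UTF_8);
        boolean qualifierMatches = inputWildcard
                ? qualifierText.startsWith(qualifierPrefix)
                : qualifierText.equals(qualifierPrefix);
        if (!family.equals(cellFamily) || !qualifierMatches) {
            return Optional.empty();
        }
        String field = outputField;
        if (outputField.endsWith("*")) {
            // Assumption: '*' expands to the qualifier text after the inputColumn prefix.
            field = outputField.substring(0, outputField.length() - 1)
                    + qualifierText.substring(qualifierPrefix.length());
        }
        // The source parameter selects which part of the cell is converted.
        byte[] input = "qualifier".equals(source) ? cellQualifier : cellValue;
        Map.Entry<String, Object> entry = new SimpleEntry<>(field, convert(input));
        return Optional.of(entry);
    }

    /** Decodes the bytes according to the type parameter, using only the JDK. */
    private Object convert(byte[] bytes) {
        switch (type) {
            case "byte[]": return bytes; // passed through without conversion
            case "string": return new String(bytes, StandardCharsets.UTF_8);
            case "int":    return ByteBuffer.wrap(bytes).getInt();
            case "long":   return ByteBuffer.wrap(bytes).getLong();
            case "double": return ByteBuffer.wrap(bytes).getDouble();
            default: throw new IllegalArgumentException("type not covered by this sketch: " + type);
        }
    }

    public static void main(String[] args) {
        MappingSketch dynamic = new MappingSketch("mycolumnfamily:*", "belongs_to_*", "string", "value");
        byte[] qualifier = "1".getBytes(StandardCharsets.UTF_8);
        byte[] value = "foo".getBytes(StandardCharsets.UTF_8);
        System.out.println(dynamic.apply("mycolumnfamily", qualifier, value).get()); // belongs_to_1=foo
    }
}
```

The sketch also hints at why textual data needs conversion before a binary type can be applied: a double stored as the text "12.5" occupies four bytes, so decoding it with type double (eight bytes expected) fails in this model with a BufferUnderflowException.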