Use Spark
You can write data to HBase from Apache Spark by using saveAsHadoopDataset(conf: JobConf): Unit, which is defined on Spark's PairRDDFunctions.
This example is adapted from a post on the spark-users mailing
list.
// Note: the mapred package is used here, instead of the
// mapreduce package which contains the new hadoop APIs.
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapred.JobConf
// ... some other settings

val conf = HBaseConfiguration.create()

// general hbase settings
conf.set("hbase.rootdir",
         "hdfs://" + nameNodeURL + ":" + hdfsPort + "/hbase")
conf.setBoolean("hbase.cluster.distributed", true)
conf.set("hbase.zookeeper.quorum", hostname)
conf.setInt("hbase.client.scanner.caching", 10000)
// ... some other settings

val jobConfig: JobConf = new JobConf(conf, this.getClass)

// Note: TableOutputFormat is used as deprecated code
// because JobConf is an old hadoop API
jobConfig.setOutputFormat(classOf[TableOutputFormat])
jobConfig.set(TableOutputFormat.OUTPUT_TABLE, outputTable)
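As the comment notes, the deprecated mapred classes are used because JobConf is part of the old Hadoop API. If you prefer the newer mapreduce package, a roughly equivalent setup is sketched below. This sketch is not part of the original example; it assumes the same outputTable variable and is paired with saveAsNewAPIHadoopDataset rather than saveAsHadoopDataset.

// Sketch only: equivalent setup using the newer mapreduce API.
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.mapreduce.Job

val newConf = HBaseConfiguration.create()
// outputTable is assumed to be the same table name used above.
newConf.set(TableOutputFormat.OUTPUT_TABLE, outputTable)

// The Job object is used only as a holder for the output format configuration.
val job = Job.getInstance(newConf)
job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

// Later, write with: rdd.saveAsNewAPIHadoopDataset(job.getConfiguration)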
Next, provide the mapping between how the data looks in Spark and how
it should look in HBase. The following example assumes that your HBase
table has a single column family, cf, containing two columns, col_1 and col_2, and that your data is formatted in Spark as triples of the form (row_key, col_1, col_2).
def convert(triple: (Int, Int, Int)) = {
  // The first element of the tuple becomes the row key.
  val p = new Put(Bytes.toBytes(triple._1))
  // The remaining elements become the col_1 and col_2 columns of family cf.
  p.add(Bytes.toBytes("cf"), Bytes.toBytes("col_1"), Bytes.toBytes(triple._2))
  p.add(Bytes.toBytes("cf"), Bytes.toBytes("col_2"), Bytes.toBytes(triple._3))
  (new ImmutableBytesWritable, p)
}
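The localData value used in the final step is simply an RDD of these (Int, Int, Int) triples. It is not defined in the original post; a minimal sketch, assuming a SparkContext named sc, might be:

// Hypothetical input data: (row_key, col_1, col_2) triples.
val localData = sc.parallelize(Seq(
  (1, 10, 100),
  (2, 20, 200),
  (3, 30, 300)
))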
To write the data from Spark to HBase, you might
use:
new PairRDDFunctions(localData.map(convert)).saveAsHadoopDataset(jobConfig)
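Equivalently, once Spark's implicit conversion from an RDD of pairs to PairRDDFunctions is in scope (automatic in Spark 1.3 and later, or via import org.apache.spark.SparkContext._ in earlier versions), you can call the method directly on the mapped RDD:

localData.map(convert).saveAsHadoopDataset(jobConfig)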