Use Spark
You can write data to HBase from Apache Spark by using saveAsHadoopDataset(conf: JobConf): Unit, which is defined on Spark's PairRDDFunctions.
This example is adapted from a post on the spark-users mailing
list.
// Note: the mapred package is used here, instead of the
// mapreduce package which contains the new hadoop APIs.
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapred.JobConf
// ... some other settings

val conf = HBaseConfiguration.create()

// general hbase settings
conf.set("hbase.rootdir",
         "hdfs://" + nameNodeURL + ":" + hdfsPort + "/hbase")
conf.setBoolean("hbase.cluster.distributed", true)
conf.set("hbase.zookeeper.quorum", hostname)
conf.setInt("hbase.client.scanner.caching", 10000)
// ... some other settings

val jobConfig: JobConf = new JobConf(conf, this.getClass)

// Note: TableOutputFormat is used as deprecated code
// because JobConf is an old hadoop API
jobConfig.setOutputFormat(classOf[TableOutputFormat])
jobConfig.set(TableOutputFormat.OUTPUT_TABLE, outputTable)
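As the comment notes, the deprecated mapred classes are used because JobConf is part of the old Hadoop API. If you prefer the newer mapreduce package, a roughly equivalent setup is sketched below. This sketch is not part of the original example; it assumes the same outputTable variable and is paired with saveAsNewAPIHadoopDataset rather than saveAsHadoopDataset.

// Sketch only: equivalent setup using the newer mapreduce API.
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.mapreduce.Job

val newConf = HBaseConfiguration.create()
// outputTable is assumed to be the same table name used above.
newConf.set(TableOutputFormat.OUTPUT_TABLE, outputTable)

// The Job object is used only as a holder for the output format configuration.
val job = Job.getInstance(newConf)
job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

// Later, write with: rdd.saveAsNewAPIHadoopDataset(job.getConfiguration)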
Next, provide the mapping between how the data looks in Spark and how
it should look in HBase. The following example assumes that your HBase
table has a single column family, cf, containing two columns, col_1 and col_2, and that your data is formatted in Spark as triples of the form (row_key, col_1, col_2).
def convert(triple: (Int, Int, Int)) = {
  // The first element of the tuple becomes the row key.
  val p = new Put(Bytes.toBytes(triple._1))
  // The remaining elements become the col_1 and col_2 columns of family cf.
  p.add(Bytes.toBytes("cf"), Bytes.toBytes("col_1"), Bytes.toBytes(triple._2))
  p.add(Bytes.toBytes("cf"), Bytes.toBytes("col_2"), Bytes.toBytes(triple._3))
  (new ImmutableBytesWritable, p)
}
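The localData value used in the final step is simply an RDD of these (Int, Int, Int) triples. It is not defined in the original post; a minimal sketch, assuming a SparkContext named sc, might be:

// Hypothetical input data: (row_key, col_1, col_2) triples.
val localData = sc.parallelize(Seq(
  (1, 10, 100),
  (2, 20, 200),
  (3, 30, 300)
))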
To write the data from Spark to HBase, you might
use:
new PairRDDFunctions(localData.map(convert)).saveAsHadoopDataset(jobConfig)
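Equivalently, once Spark's implicit conversion from an RDD of pairs to PairRDDFunctions is in scope (automatic in Spark 1.3 and later, or via import org.apache.spark.SparkContext._ in earlier versions), you can call the method directly on the mapped RDD:

localData.map(convert).saveAsHadoopDataset(jobConfig)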