Example: Using the HBase-Spark connector with the dataset on a remote cluster
Learn how to use the HBase-Spark connector through an example scenario in which the dataset is located on a different, remote cluster. In this case, the HBase configuration used by the connector has to be overridden to point at the remote cluster.
Schema
In this example we want to store personal data in an HBase table: a name, an email address, a birth date, and a height (a floating point number). The contact information (email) is stored in the c column family, and the personal information (birth date, height) is stored in the p column family. The row key of the HBase table is the name attribute.
| | Spark | HBase |
|---|---|---|
| Type/Table | Person | person |
| Name | name: String | key |
| Email address | email: String | c:email |
| Birth date | birthDate: Date | p:birthDate |
| Height | height: Float | p:height |
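This schema translates directly into the hbase.columns.mapping option that the connector expects: a comma-separated list of Spark column name, Spark type, and HBase column, with :key marking the row key. The sketch below only illustrates how the mapping string used later in this example lines up with the table above; the columnsMapping name is purely illustrative:

```scala
// Illustration only: the mapping string used by the write and read snippets below,
// assembled column by column from the schema table above.
val columnsMapping =
  "name STRING :key, " +           // row key
  "email STRING c:email, " +       // contact column family
  "birthDate DATE p:birthDate, " + // personal column family
  "height FLOAT p:height"
```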
Create HBase table
Use the following command to create the HBase table:
```
shell> create 'person', 'p', 'c'
```
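If you prefer to create the table from code instead of the HBase shell, a minimal sketch using the standard HBase 2.x Admin API could look like the following; it assumes an HBase client configuration that already resolves to the remote cluster (see the Spark snippets below):

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ColumnFamilyDescriptorBuilder, ConnectionFactory, TableDescriptorBuilder}

// Assumes the configuration on the classpath points at the remote cluster.
val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
try {
  val descriptor = TableDescriptorBuilder.newBuilder(TableName.valueOf("person")).
    setColumnFamily(ColumnFamilyDescriptorBuilder.of("p")).
    setColumnFamily(ColumnFamilyDescriptorBuilder.of("c")).
    build()
  connection.getAdmin.createTable(descriptor)
} finally {
  connection.close()
}
```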
Insert data (Scala)
Use the following Spark code in spark-shell or spark3-shell to insert data into the HBase table:
```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.spark.HBaseContext

// Point the connector at the remote cluster by copying the HBase settings
// that were passed to Spark as spark.hadoop.* properties.
val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", spark.conf.get("spark.hadoop.hbase.zookeeper.quorum"))
conf.set("hbase.security.authentication", spark.conf.get("spark.hadoop.hbase.security.authentication"))
conf.set("hbase.regionserver.kerberos.principal", spark.conf.get("spark.hadoop.hbase.regionserver.kerberos.principal"))
conf.set("hbase.rpc.protection", spark.conf.get("spark.hadoop.hbase.rpc.protection"))

// The connector uses the most recently created HBaseContext.
new HBaseContext(spark.sparkContext, conf)

import java.sql.Date
case class Person(name: String, email: String, birthDate: Date, height: Float)
val personDS = Seq(
  Person("alice", "alice@alice.com", Date.valueOf("2000-01-01"), 4.5f),
  Person("bob", "bob@bob.com", Date.valueOf("2001-10-17"), 5.1f)).toDS

personDS.write.
  format("org.apache.hadoop.hbase.spark").
  option("hbase.columns.mapping", "name STRING :key, email STRING c:email, birthDate DATE p:birthDate, height FLOAT p:height").
  option("hbase.table", "person").
  option("hbase.spark.use.hbasecontext", true).
  save()
```
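The spark.hadoop.hbase.* values read above must already be present in the Spark configuration, typically supplied with --conf flags when the shell is launched. If they are not, the same four properties can be set on conf directly; the values below are hypothetical placeholders for a Kerberos-secured remote cluster:

```scala
// Hypothetical remote-cluster values; substitute your own.
conf.set("hbase.zookeeper.quorum", "remote-zk-1.example.com,remote-zk-2.example.com,remote-zk-3.example.com")
conf.set("hbase.security.authentication", "kerberos")
conf.set("hbase.regionserver.kerberos.principal", "hbase/_HOST@EXAMPLE.COM")
conf.set("hbase.rpc.protection", "privacy")
```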
Read data back (Scala)
Use the following snippet in spark-shell or spark3-shell to read the data back:
```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.spark.HBaseContext

// Point the connector at the remote cluster, exactly as in the insert snippet.
val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", spark.conf.get("spark.hadoop.hbase.zookeeper.quorum"))
conf.set("hbase.security.authentication", spark.conf.get("spark.hadoop.hbase.security.authentication"))
conf.set("hbase.regionserver.kerberos.principal", spark.conf.get("spark.hadoop.hbase.regionserver.kerberos.principal"))
conf.set("hbase.rpc.protection", spark.conf.get("spark.hadoop.hbase.rpc.protection"))

// The connector uses the most recently created HBaseContext.
new HBaseContext(spark.sparkContext, conf)

val sql = spark.sqlContext
val df = sql.read.
  format("org.apache.hadoop.hbase.spark").
  option("hbase.columns.mapping", "name STRING :key, email STRING c:email, birthDate DATE p:birthDate, height FLOAT p:height").
  option("hbase.table", "person").
  option("hbase.spark.use.hbasecontext", true).
  load()

// Register the DataFrame as a temporary view and query it with Spark SQL.
df.createOrReplaceTempView("personView")
val results = sql.sql("SELECT * FROM personView WHERE name = 'alice'")
results.show()
```
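Because the connector returns an ordinary DataFrame, the same kind of query can also be expressed with the DataFrame API instead of SQL; the threshold below is an arbitrary example value:

```scala
// DataFrame API equivalent of a SQL WHERE clause; 5.0f is an arbitrary example threshold.
df.filter(df("height") > 5.0f).show()
```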