Example: Using the HBase-Spark connector with the dataset on a remote cluster
Learn how to use the HBase-Spark connector through an example scenario in which the dataset is located on a different, remote cluster. In this case, the HBase configuration used by the connector has to be overridden to point at the remote cluster.
Schema
In this example we want to store personal data in an HBase table: a name, an email address, a birth date, and a height (a floating point number). The contact information (email) is stored in the c column family, and the personal information (birth date, height) is stored in the p column family. The row key of the HBase table is the name attribute.
| | Spark | HBase |
|---|---|---|
| Type/Table | Person | person |
| Name | name: String | key |
| Email address | email: String | c:email |
| Birth date | birthDate: Date | p:birthDate |
| Height | height: Float | p:height |
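This schema translates directly into the hbase.columns.mapping option that the connector expects: a comma-separated list of Spark column name, Spark type, and HBase column, with :key marking the row key. The sketch below only illustrates how the mapping string used later in this example lines up with the table above; the columnsMapping name is purely illustrative:

```scala
// Illustration only: the mapping string used by the write and read snippets below,
// assembled column by column from the schema table above.
val columnsMapping =
  "name STRING :key, " +           // row key
  "email STRING c:email, " +       // contact column family
  "birthDate DATE p:birthDate, " + // personal column family
  "height FLOAT p:height"
```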
Create HBase table
Use the following command to create the HBase table:
```
shell> create 'person', 'p', 'c'
```
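If you prefer to create the table from code instead of the HBase shell, a minimal sketch using the standard HBase 2.x Admin API could look like the following; it assumes an HBase client configuration that already resolves to the remote cluster (see the Spark snippets below):

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ColumnFamilyDescriptorBuilder, ConnectionFactory, TableDescriptorBuilder}

// Assumes the configuration on the classpath points at the remote cluster.
val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
try {
  val descriptor = TableDescriptorBuilder.newBuilder(TableName.valueOf("person")).
    setColumnFamily(ColumnFamilyDescriptorBuilder.of("p")).
    setColumnFamily(ColumnFamilyDescriptorBuilder.of("c")).
    build()
  connection.getAdmin.createTable(descriptor)
} finally {
  connection.close()
}
```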
Insert data (Scala)
Use the following Spark code in spark-shell or spark3-shell to insert data into the HBase table:
```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.spark.HBaseContext

// Point the connector at the remote cluster by copying the HBase settings
// that were passed to Spark as spark.hadoop.* properties.
val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", spark.conf.get("spark.hadoop.hbase.zookeeper.quorum"))
conf.set("hbase.security.authentication", spark.conf.get("spark.hadoop.hbase.security.authentication"))
conf.set("hbase.regionserver.kerberos.principal", spark.conf.get("spark.hadoop.hbase.regionserver.kerberos.principal"))
conf.set("hbase.rpc.protection", spark.conf.get("spark.hadoop.hbase.rpc.protection"))

// The connector uses the most recently created HBaseContext.
new HBaseContext(spark.sparkContext, conf)

import java.sql.Date
case class Person(name: String, email: String, birthDate: Date, height: Float)
val personDS = Seq(
  Person("alice", "alice@alice.com", Date.valueOf("2000-01-01"), 4.5f),
  Person("bob", "bob@bob.com", Date.valueOf("2001-10-17"), 5.1f)).toDS

personDS.write.
  format("org.apache.hadoop.hbase.spark").
  option("hbase.columns.mapping", "name STRING :key, email STRING c:email, birthDate DATE p:birthDate, height FLOAT p:height").
  option("hbase.table", "person").
  option("hbase.spark.use.hbasecontext", true).
  save()
```
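The spark.hadoop.hbase.* values read above must already be present in the Spark configuration, typically supplied with --conf flags when the shell is launched. If they are not, the same four properties can be set on conf directly; the values below are hypothetical placeholders for a Kerberos-secured remote cluster:

```scala
// Hypothetical remote-cluster values; substitute your own.
conf.set("hbase.zookeeper.quorum", "remote-zk-1.example.com,remote-zk-2.example.com,remote-zk-3.example.com")
conf.set("hbase.security.authentication", "kerberos")
conf.set("hbase.regionserver.kerberos.principal", "hbase/_HOST@EXAMPLE.COM")
conf.set("hbase.rpc.protection", "privacy")
```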
Read data back (Scala)
Use the following snippet in spark-shell or spark3-shell to read the data back:
```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.spark.HBaseContext

// Point the connector at the remote cluster, exactly as in the insert snippet.
val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", spark.conf.get("spark.hadoop.hbase.zookeeper.quorum"))
conf.set("hbase.security.authentication", spark.conf.get("spark.hadoop.hbase.security.authentication"))
conf.set("hbase.regionserver.kerberos.principal", spark.conf.get("spark.hadoop.hbase.regionserver.kerberos.principal"))
conf.set("hbase.rpc.protection", spark.conf.get("spark.hadoop.hbase.rpc.protection"))

// The connector uses the most recently created HBaseContext.
new HBaseContext(spark.sparkContext, conf)

val sql = spark.sqlContext
val df = sql.read.
  format("org.apache.hadoop.hbase.spark").
  option("hbase.columns.mapping", "name STRING :key, email STRING c:email, birthDate DATE p:birthDate, height FLOAT p:height").
  option("hbase.table", "person").
  option("hbase.spark.use.hbasecontext", true).
  load()

// Register the DataFrame as a temporary view and query it with Spark SQL.
df.createOrReplaceTempView("personView")
val results = sql.sql("SELECT * FROM personView WHERE name = 'alice'")
results.show()
```
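Because the connector returns an ordinary DataFrame, the same kind of query can also be expressed with the DataFrame API instead of SQL; the threshold below is an arbitrary example value:

```scala
// DataFrame API equivalent of a SQL WHERE clause; 5.0f is an arbitrary example threshold.
df.filter(df("height") > 5.0f).show()
```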