Using the HBase-Spark connector

The HBase-Spark Connector bridges the gap between the simple HBase key-value store and complex relational SQL queries, enabling users to run complex data analytics on HBase data using Spark.

An HBase DataFrame is a standard Spark DataFrame and can interact with other data sources such as Hive, ORC, Parquet, and JSON.

  1. Edit the HBase RegionServer configuration so that it can run Spark Filter.
    Spark Filter is used when Spark SQL queries contain WHERE clauses.
    1. In Cloudera Manager, select the HBase service.
    2. Click the Configuration tab.
    3. Search for regionserver environment.
    4. Find the RegionServer Environment Advanced Configuration Snippet (Safety Valve).
    5. Click the plus icon to add the following property:

      Key: HBASE_CLASSPATH

      Value: /opt/cloudera/parcels/CDH/lib/hbase_connectors/lib/[***HBASE-SPARK JAR NAME***].jar:/opt/cloudera/parcels/CDH/lib/hbase_connectors/lib/[***HBASE-SPARK PROTOCOL JAR NAME***].jar:/opt/cloudera/parcels/CDH/jars/scala-library-2.11.12.jar

    6. Ensure that the listed jars have the correct version number in their name.
    7. Click Save Changes.
    8. Restart the RegionServer.
  2. Invoke the Spark shell with the additional JARs using the following command:
    spark-shell --jars /opt/cloudera/parcels/CDH/lib/hbase_connectors/lib/[***HBASE-SPARK JAR NAME***].jar,/opt/cloudera/parcels/CDH/lib/hbase_connectors/lib/[***HBASE-SPARK PROTOCOL JAR NAME***].jar --files /etc/hbase/conf/hbase-site.xml --conf spark.driver.extraClassPath=/etc/hbase/conf

    Ensure that the listed jars have the correct version number in their name.
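
Both steps depend on JAR lists that are easy to mistype: the HBASE_CLASSPATH value is colon-separated, while the --jars argument is comma-separated. The following is a minimal sketch of two helper functions that build either list and fail early if a JAR file does not exist. The function names join_paths and spark_shell_command are hypothetical and are not part of the connector or Cloudera Manager. The sketch only prints the command and does not run it.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helpers, not shipped with the HBase-Spark connector.

# join_paths SEP FILE...
# Prints the given paths joined by SEP; fails if any path is not an existing file.
join_paths() {
  local sep="$1"; shift
  local out="" path
  for path in "$@"; do
    if [ ! -f "$path" ]; then
      echo "missing JAR: $path" >&2
      return 1
    fi
    # Add the separator only between entries, never before the first one.
    out="${out:+${out}${sep}}${path}"
  done
  printf '%s\n' "$out"
}

# spark_shell_command JAR...
# Prints (does not run) the spark-shell invocation from step 2 for the given JARs.
spark_shell_command() {
  local jars
  jars="$(join_paths , "$@")" || return 1
  printf 'spark-shell --jars %s --files /etc/hbase/conf/hbase-site.xml --conf spark.driver.extraClassPath=/etc/hbase/conf\n' "$jars"
}
```

For example, after sourcing these functions, calling spark_shell_command with the full paths of the two connector JARs prints the command from step 2, and calling join_paths with : as the separator and the same paths plus the scala-library JAR prints a value suitable for HBASE_CLASSPATH. The JAR file names must still be supplied with the version numbers that match your installation.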