Using the HBase-Spark connector

The HBase-Spark Connector bridges the gap between the simple HBase key-value store and complex relational SQL queries, enabling users to perform complex data analytics on top of HBase using Spark.

An HBase DataFrame is a standard Spark DataFrame, and can interact with any other data source, such as Hive, ORC, Parquet, or JSON.
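As a minimal sketch of what this looks like once the connector is set up (the steps below), an HBase table can be loaded through the connector's data source. The table name (`my_table`), column family (`cf`), and column names here are illustrative assumptions, not values from this document:

```scala
// Sketch: read an HBase table as a Spark DataFrame via the connector.
// "my_table", "cf", and the column names are illustrative assumptions.
val df = spark.read
  .format("org.apache.hadoop.hbase.spark")
  .option("hbase.columns.mapping",
    "id STRING :key, name STRING cf:name, age INT cf:age")
  .option("hbase.table", "my_table")
  .load()

// The result is a standard DataFrame, so it can be joined with any
// other Spark source (Hive, ORC, Parquet, JSON, ...).
df.printSchema()
```

The `hbase.columns.mapping` option maps DataFrame columns to the HBase row key (`:key`) and to `family:qualifier` pairs.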

  1. Edit the HBase RegionServer configuration so that it can run the Spark Filter.
    The Spark Filter is used when a Spark SQL WHERE clause is in use; the RegionServer needs the connector jars on its classpath to apply the filter server-side.
    1. In Cloudera Manager, select the HBase service.
    2. Click the Configuration tab.
    3. Search for regionserver environment.
    4. Find the RegionServer Environment Advanced Configuration Snippet (Safety Valve).
    5. Click the plus icon and add the following property:

      Key: HBASE_CLASSPATH

      Value: /opt/cloudera/parcels/CDH/lib/hbase_connectors/lib/hbase-spark-***VERSION NUMBER***-198.jar:/opt/cloudera/parcels/CDH/lib/hbase_connectors/lib/hbase-spark-protocol-shaded-***VERSION NUMBER***-198.jar:/opt/cloudera/parcels/CDH/jars/scala-library-2.11.12.jar

    6. Ensure that the listed jars have the correct version number in their name.
    7. Click Save Changes.
    8. Restart the RegionServer.
  2. Invoke the Spark shell with the additional connector jars, using the snippet that matches your Spark version.

    Spark 2 shell:

    spark-shell --jars /opt/cloudera/parcels/CDH/lib/hbase_connectors/lib/hbase-spark-***VERSION NUMBER***-198.jar,/opt/cloudera/parcels/CDH/lib/hbase_connectors/lib/hbase-spark-protocol-shaded-***VERSION NUMBER***-198.jar --files /etc/hbase/conf/hbase-site.xml --conf spark.driver.extraClassPath=/etc/hbase/conf

    Spark 3 shell:

    spark3-shell --jars /opt/cloudera/parcels/CDH/lib/hbase_connectors_for_spark3/lib/hbase-spark3-***VERSION NUMBER***-XXX.jar,/opt/cloudera/parcels/CDH/lib/hbase_connectors_for_spark3/lib/hbase-spark3-protocol-shaded-***VERSION NUMBER***-XXX.jar --files /etc/hbase/conf/hbase-site.xml --conf spark.driver.extraClassPath=/etc/hbase/conf

    Ensure that the listed jars have the correct version number in their name.

    The following blog post provides additional information about Spark and HBase usage in CDP Public Cloud: https://blog.cloudera.com/how-to-use-apache-spark-with-cdp-operational-database-experience/.
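Inside the shell, a Spark SQL WHERE clause against an HBase-backed DataFrame might look like the following sketch. With the connector jars on the RegionServer classpath (step 1), the predicate can be evaluated by the Spark Filter on the RegionServer side instead of in the Spark executors. The table name (`my_table`), column family (`cf`), and columns are illustrative assumptions:

```scala
// Sketch: a WHERE clause that the connector can push down to the
// RegionServers via the Spark Filter configured in step 1.
// "my_table", "cf", and the columns are illustrative assumptions.
val df = spark.read
  .format("org.apache.hadoop.hbase.spark")
  .option("hbase.columns.mapping",
    "id STRING :key, age INT cf:age")
  .option("hbase.table", "my_table")
  .load()

df.createOrReplaceTempView("my_table_view")

// The filter on age is applied server-side when pushdown is available.
spark.sql("SELECT id, age FROM my_table_view WHERE age > 21").show()
```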