You can use the HBase-Spark connector on your secure cluster to perform READ and WRITE
operations. The HBase-Spark connector bridges the gap between the simple HBase key-value
store and complex relational SQL queries, and enables users to perform complex data analytics
on top of HBase using Spark.
An HBase DataFrame is a standard Spark DataFrame, and can interact with any
other data source, such as Hive, ORC, Parquet, or JSON.
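For example, once the connector is set up as described in the steps below, an HBase table can be read as a DataFrame and combined with other sources. The following Scala sketch, run inside the Spark shell, assumes a hypothetical HBase table named person with a cf column family and a hypothetical Parquet path; adjust the table name, column mapping, and path to your environment.
// Read an HBase table as a standard Spark DataFrame.
val personDF = spark.read
  .format("org.apache.hadoop.hbase.spark")
  .option("hbase.columns.mapping",
    "id STRING :key, name STRING cf:name, email STRING cf:email")
  .option("hbase.table", "person")
  .option("hbase.spark.use.hbasecontext", false)
  .load()

// Because it is a standard DataFrame, it can be joined with any other
// data source, for example a Parquet dataset (hypothetical path).
val ordersDF = spark.read.parquet("/data/orders")
personDF.join(ordersDF, Seq("id")).show()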
-
Edit the HBase RegionServer configuration for running the Spark Filter.
The Spark Filter is used when Spark SQL WHERE clauses are in use (see the query example at the end of this procedure).
-
In Cloudera Manager, select the HBase service.
-
Click the Configuration tab.
-
Search for regionserver environment.
-
Find the RegionServer Environment Advanced Configuration Snippet (Safety Valve).
-
Click the plus icon to add the following property:
Key: HBASE_CLASSPATH
Value: /opt/cloudera/parcels/CDH/lib/hbase_connectors/lib/[***HBASE-SPARK JAR NAME***].jar:/opt/cloudera/parcels/CDH/lib/hbase_connectors/lib/[***HBASE-SPARK PROTOCOL SHADED JAR NAME***].jar:/opt/cloudera/parcels/CDH/jars/scala-library-2.11.12.jar
-
Ensure that the listed jars have the correct version number in their name; you can list the connector library directory to check, as shown below.
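For example, the following command lists the connector jars installed on the cluster (the grep pattern is only illustrative):
ls /opt/cloudera/parcels/CDH/lib/hbase_connectors/lib/ | grep hbase-spark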
-
Click Save Changes.
-
Restart the RegionServer.
-
Invoke the Spark shell with the additional jars using the following command:
spark-shell --jars /opt/cloudera/parcels/CDH/lib/hbase_connectors/lib/[***HBASE-SPARK JAR NAME***].jar,/opt/cloudera/parcels/CDH/lib/hbase_connectors/lib/[***HBASE-SPARK PROTOCOL SHADED JAR NAME***].jar --files /etc/hbase/conf/hbase-site.xml --conf spark.driver.extraClassPath=/etc/hbase/conf
Ensure that the listed jars have the correct version number in their name.
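After the shell starts, you can exercise the connector end to end. The following Scala sketch assumes a hypothetical HBase table named person with a cf column family that already exists (for example, created with create 'person', 'cf' in the HBase shell); the WHERE clause in the final query is the kind of predicate that the server-side Spark Filter configured above evaluates.
import spark.implicits._

case class Person(id: String, name: String, email: String)

// Write a small dataset to the (pre-existing, hypothetical) HBase table.
Seq(Person("1", "alice", "alice@example.com"),
    Person("2", "bob", "bob@example.com"))
  .toDF()
  .write
  .format("org.apache.hadoop.hbase.spark")
  .option("hbase.columns.mapping",
    "id STRING :key, name STRING cf:name, email STRING cf:email")
  .option("hbase.table", "person")
  .option("hbase.spark.use.hbasecontext", false)
  .save()

// Read the table back as a DataFrame and query it with Spark SQL.
val personDF = spark.read
  .format("org.apache.hadoop.hbase.spark")
  .option("hbase.columns.mapping",
    "id STRING :key, name STRING cf:name, email STRING cf:email")
  .option("hbase.table", "person")
  .option("hbase.spark.use.hbasecontext", false)
  .load()

personDF.createOrReplaceTempView("person")
spark.sql("SELECT * FROM person WHERE name = 'alice'").show()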
The following blog post provides additional information about Spark and HBase
usage in CDP Public Cloud: How to use Apache Spark with CDP Operational Database Experience.