You can configure Spark in your Cloudera Data Engineering (CDE) cluster to interact with
Cloudera Operational Database (COD). With this integration, Spark jobs running on CDE can
read from and write to COD through the spark-hbase connector.
If you want to use Phoenix instead of HBase, see COD-CDE using
Phoenix.
-
Download the COD client configurations.
To connect to COD, Spark in CDE requires the
hbase-site.xml configuration file of the COD cluster. Download it
as follows.
-
Go to the COD UI and click on the test-cod
database.
-
Go to the tab of the COD database, and copy the command under the
HBase Client Configuration URL field.
curl -f -o "hbase-config.zip" -u "<YOUR WORKLOAD USERNAME>" "https://cod--4wfxojpfxmwg-gateway.XXXXXXXXX.cloudera.site/clouderamanager/api/v41/clusters/cod--4wfxojpfxmwg/services/hbase/clientConfig"
- Enter your workload password when the
curl command prompts for it.
- Extract the downloaded zip file to obtain the
hbase-site.xml file.
-
Create the HBase table.
Create a new HBase table inside the COD database using the
Hue link.
-
Go to the COD UI and click on the test-cod
database.
-
Click on the Hue link under the SQL
EDITOR field.
-
On the Hue UI, click the HBase menu item on the
left navigation panel. Click New Table.
-
Enter a table name and one or more column families, and click
Submit. This example uses
the table name testtable and a single column family,
testcf.
-
Configure the job using CDE CLI.
-
Configure the CDE CLI to point to your CDE virtual cluster.
For more details, see Configuring the CLI
client.
-
Create a resource using the following command.
cde resource create --name cod-spark-resource
-
Upload hbase-site.xml.
cde resource upload --name cod-spark-resource --local-path /your/path/to/hbase-site.xml --resource-path conf/hbase-site.xml
-
Upload the demo app jar that was built earlier.
cde resource upload --name cod-spark-resource --local-path /path/to/your/spark-hbase-project.jar --resource-path spark-hbase-project.jar
-
Create the CDE job using a JSON definition.
{
  "mounts": [
    {
      "resourceName": "cod-spark-resource"
    }
  ],
  "name": "my-cod-spark-job",
  "spark": {
    "className": "<YOUR MAIN CLASS>",
    "conf": {
      "spark.executor.extraClassPath": "/app/mount/conf",
      "spark.driver.extraClassPath": "/app/mount/conf"
    },
    "args": [ "<YOUR ARGS IF ANY>"],
    "driverCores": 1,
    "driverMemory": "1g",
    "executorCores": 1,
    "executorMemory": "1g",
    "file": "spark-hbase-project.jar",
    "pyFiles": [],
    "files": ["conf/hbase-site.xml"],
    "numExecutors": 4
  }
}
-
Import the job using the following command, assuming that the above
JSON is saved as my-job-definition.json.
cde job import --file my-job-definition.json
The spark.driver.extraClassPath and
spark.executor.extraClassPath settings in the job definition
point to /app/mount/conf, which corresponds to the conf/ resource path
used when uploading hbase-site.xml to the CDE resource.
This way, hbase-site.xml is loaded automatically from the
classpath and you do not need to refer to it explicitly in your Spark code. You can
create the HBase configuration and context as follows.
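import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.spark.HBaseContext
val conf = HBaseConfiguration.create()
val hbaseContext = new HBaseContext(spark.sparkContext, conf)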
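The snippet above only creates the context. As an illustration of the read and write path, the following is a minimal sketch of a complete job that writes two rows to testtable and reads them back through the connector's DataFrame source. It assumes that the spark-hbase connector classes are available to the job and that the testtable table with the testcf column family exists as created earlier. The Person case class, the name column, the object name CodSparkDemo, and the application name are hypothetical and chosen only for this example.

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.spark.sql.SparkSession

// Hypothetical row type for illustration; not part of the original guide.
case class Person(id: String, name: String)

object CodSparkDemo {
  // Maps DataFrame columns to the HBase row key and to a column in the testcf family.
  // The "name" column is an assumption made for this sketch.
  val columnMapping: String = "id STRING :key, name STRING testcf:name"

  // Data source name of the spark-hbase connector.
  val connectorFormat: String = "org.apache.hadoop.hbase.spark"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cod-spark-demo").getOrCreate()
    import spark.implicits._

    // Picks up hbase-site.xml from the classpath (/app/mount/conf in the job definition).
    val conf = HBaseConfiguration.create()
    // Create the context before using the data source; with its default settings
    // the connector is expected to reuse the most recently created HBaseContext.
    new HBaseContext(spark.sparkContext, conf)

    // Write two sample rows to testtable.
    Seq(Person("row1", "alice"), Person("row2", "bob")).toDS()
      .write
      .format(connectorFormat)
      .option("hbase.table", "testtable")
      .option("hbase.columns.mapping", columnMapping)
      .save()

    // Read the rows back as a DataFrame and print them.
    val people = spark.read
      .format(connectorFormat)
      .option("hbase.table", "testtable")
      .option("hbase.columns.mapping", columnMapping)
      .load()
    people.show()

    spark.stop()
  }
}
```

If you package a class like this into spark-hbase-project.jar, the className field of the job definition would be CodSparkDemo. Adjust the column mapping to match the column families and qualifiers of your own table.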