Configure data engineering Spark to use with COD

You can configure Spark in your Data Engineering cluster to interact with the Cloudera Operational Database (COD). You can use this integration to read from and write to COD from Spark on CDE (Cloudera Data Engineering) using the spark-hbase connector.

If you want to leverage Phoenix instead of HBase, see COD-CDE using Phoenix.

  • COD is already provisioned and the database is created. For more information, see Onboarding COD users.

    For this example, let us assume test-cod as the database name.

  • CDE is already provisioned and the virtual cluster is already created. For more information, see Cloudera Data Engineering service.
  1. Download the COD client configurations.
    For Spark in CDE to connect to COD, it requires the hbase-site.xml configuration file of the COD cluster. Perform the following steps to obtain it.
    1. Go to the COD UI and click on the test-cod database.
    2. Go to the Connect > HBase tab of the COD database, and copy the command under the HBase Client Configuration URL field.
      curl -f -o "hbase-config.zip" -u "<YOUR WORKLOAD USERNAME>" "https://cod--4wfxojpfxmwg-gateway.XXXXXXXXX.cloudera.site/clouderamanager/api/v41/clusters/cod--4wfxojpfxmwg/services/hbase/clientConfig"
      • Ensure that you provide your workload password when the curl command prompts for it.
      • Extract the downloaded zip file to obtain the hbase-site.xml file.
  2. Create the HBase table.
    Create a new HBase table inside the COD database using the Hue link.
    1. Go to the COD UI and click on the test-cod database.
    2. Click on the Hue link under the SQL EDITOR field.
    3. On the Hue UI, click the HBase menu item on the left navigation panel. Click New Table.
    4. Choose a table name and column families, and click Submit. For example, use the table name testtable and a single column family testcf.
  3. Configure the job using CDE CLI.
    1. Configure the CDE CLI to point to the virtual cluster that you created earlier. For more details, see Configuring the CLI client.
    2. Create a resource using the following command.
      cde resource create --name cod-spark-resource
    3. Upload hbase-site.xml.
      cde resource upload --name cod-spark-resource --local-path /your/path/to/hbase-site.xml --resource-path conf/hbase-site.xml
    4. Upload the demo application jar that was built earlier. A minimal sketch of such an application is shown at the end of this section.
      cde resource upload --name cod-spark-resource --local-path /path/to/your/spark-hbase-project.jar --resource-path spark-hbase-project.jar
    5. Create the CDE job using a JSON definition.
      {
        "mounts": [
          {
            "resourceName": "cod-spark-resource"
          }
        ],
        "name": "my-cod-spark-job",
        "spark": {
          "className": "<YOUR MAIN CLASS>",
          "conf": {
            "spark.executor.extraClassPath": "/app/mount/conf",
            "spark.driver.extraClassPath": "/app/mount/conf"
          },
          "args": [ "<YOUR ARGS IF ANY>" ],
          "driverCores": 1,
          "driverMemory": "1g",
          "executorCores": 1,
          "executorMemory": "1g",
          "file": "spark-hbase-project.jar",
          "pyFiles": [],
          "files": ["conf/hbase-site.xml"],
          "numExecutors": 4
        }
      }
    6. Import the job using the following command, assuming that the above JSON is saved as my-job-definition.json.
      cde job import --file my-job-definition.json

The spark.driver.extraClassPath and spark.executor.extraClassPath settings in the job definition point to the same path that is used to upload the hbase-site.xml into the CDE resource.

This way, the hbase-site.xml is automatically loaded from the classpath, and you do not need to refer to it explicitly in your Spark code. You can create the HBase configuration as follows.

val conf = HBaseConfiguration.create()
val hbaseContext = new HBaseContext(spark.sparkContext, conf)
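
Building on the snippet above, the following is a minimal sketch of what a complete demo application (the main class referenced in the job definition) might look like, assuming the testtable table and testcf column family created earlier. The object name SparkHBaseDemo, the column qualifier col1, and the sample rows are illustrative placeholders only, not part of the original example.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{Put, Scan}
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.SparkSession

// Hypothetical demo application; object, qualifier, and row names are placeholders.
object SparkHBaseDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cod-spark-hbase-demo").getOrCreate()

    // hbase-site.xml is resolved from the classpath via the extraClassPath settings,
    // so no explicit ZooKeeper quorum configuration is needed here.
    val conf = HBaseConfiguration.create()
    val hbaseContext = new HBaseContext(spark.sparkContext, conf)

    // Write a few sample rows into testtable under the testcf column family.
    val rows = spark.sparkContext.parallelize(Seq(("row1", "value1"), ("row2", "value2")))
    hbaseContext.bulkPut[(String, String)](
      rows,
      TableName.valueOf("testtable"),
      (row: (String, String)) => {
        val put = new Put(Bytes.toBytes(row._1))
        put.addColumn(Bytes.toBytes("testcf"), Bytes.toBytes("col1"), Bytes.toBytes(row._2))
        put
      }
    )

    // Read the rows back and print their row keys.
    val readRdd = hbaseContext.hbaseRDD(TableName.valueOf("testtable"), new Scan())
    readRdd.map(r => Bytes.toString(r._1.copyBytes())).collect().foreach(println)

    spark.stop()
  }
}

Package an application along these lines into the jar that you upload with the cde resource upload command, and set its object name as the className in the job definition.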