Running Phoenix and HBase Spark Applications using CDE

Cloudera Data Engineering (CDE) supports running Spark applications in the CDP public cloud. You can use CDE to run Phoenix and HBase Spark applications against Cloudera Operational Database (COD).

  • Set your CDP workload password. For more information, see Setting the workload password.
  • Synchronize users from the User Management Service in the CDP Control Plane into the environment in which your COD database is running.
  • Ensure that the CDE service is enabled and a virtual cluster is created in the Data Engineering experience. For more information, see Enabling a Cloudera Data Engineering service and Creating virtual clusters.
  1. Set HBase and Phoenix versions in your Maven project.
    1. Use the describe-client-connectivity command to look up the HBase and Phoenix version information. The following snippet fetches the database connectivity information and parses out the HBase and Phoenix versions that you need to build your application.
      echo "HBase version"
      cdp opdb describe-client-connectivity --database-name my-database --environment-name my-env | jq ".connectors[] | select(.name == \"hbase\") | .version"
      echo "Phoenix Connector Version"
      cdp opdb describe-client-connectivity --database-name my-database --environment-name my-env | jq ".connectors[] | select(.name == \"phoenix-thick-jdbc\") | .version"
      HBase Version
      2.4.6.7.2.14.0-133
      Phoenix Spark Version
      "6.0.0.7.2.14.0-133"
    2. Update the HBase and Phoenix connector versions in your Maven project or configuration.
      <properties>
      ...
          <phoenix.connector.version>6.0.0.7.2.14.0-133</phoenix.connector.version>
          <hbase.version>2.4.6.7.2.14.0-133</hbase.version>
      ...
      </properties>
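      These properties are typically referenced from the dependency declarations in the same POM. The following is a minimal sketch that assumes the Apache groupIds and the artifact names used later in this procedure; adjust the coordinates to match your project:
      <dependencies>
      ...
          <!-- HBase shaded MapReduce artifact, versioned by the property above (groupId assumed) -->
          <dependency>
              <groupId>org.apache.hbase</groupId>
              <artifactId>hbase-shaded-mapreduce</artifactId>
              <version>${hbase.version}</version>
          </dependency>
          <!-- Phoenix Spark connector, versioned by the property above (groupId assumed) -->
          <dependency>
              <groupId>org.apache.phoenix</groupId>
              <artifactId>phoenix5-spark-shaded</artifactId>
              <version>${phoenix.connector.version}</version>
          </dependency>
      ...
      </dependencies>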
  2. Download the hbase-site.xml and hbase-omid-client-config.yml configuration files.
    1. Use the describe-client-connectivity command to determine the client configuration URL.
      cdp opdb describe-client-connectivity --database-name spark-connector --environment-name cod-7213 | jq ".connectors[] | select(.name == \"hbase\") | .configuration.clientConfigurationDetails[] | select(.name == \"HBASE\") | .url "
      "https://cod--XXXXXX-gateway0..xcu2-8y8x.dev.cldr.work/clouderamanager/api/v41/clusters/XXXXX/services/hbase/clientConfig"
    2. Using the URL obtained from the previous command, run the curl command to download the HBase configuration files.
      curl -f -o "hbase-config.zip" -u "<csso_user>" "https://cod--XXXXXX-gateway0.cod-7213....xcu2-8y8x.dev.cldr.work/clouderamanager/api/v41/clusters/cod--XXXX/services/hbase/clientConfig"
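      The two commands can also be chained so that you never copy the URL by hand; a minimal sketch using the same database and environment names as above:
      CONFIG_URL=$(cdp opdb describe-client-connectivity --database-name spark-connector --environment-name cod-7213 | jq -r ".connectors[] | select(.name == \"hbase\") | .configuration.clientConfigurationDetails[] | select(.name == \"HBASE\") | .url")
      curl -f -o "hbase-config.zip" -u "<csso_user>" "${CONFIG_URL}"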
    3. Unzip hbase-config.zip, and copy hbase-site.xml and hbase-omid-client-config.yml to the src/main/resources path in your Maven project.
      unzip hbase-config.zip
      cp hbase-conf/hbase-site.xml <path to src/main/resources>
      cp hbase-conf/hbase-omid-client-config.yml <path to src/main/resources>
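      If you are unsure of the archive layout, list it first; the client configuration typically extracts into an hbase-conf directory containing both files:
      unzip -l hbase-config.zip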
  3. Build the project.
    $ mvn package
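    The upload commands in the next step read the connector jars from the target/connector-libs directory. If your project does not already populate that directory, the maven-dependency-plugin is one way to do it; the following is a minimal sketch, not the only possible setup (the artifact list is assumed from the jar names used later in this procedure):
    <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-dependency-plugin</artifactId>
        <executions>
            <execution>
                <!-- Copy the connector jars into target/connector-libs during package -->
                <id>copy-connector-libs</id>
                <phase>package</phase>
                <goals>
                    <goal>copy-dependencies</goal>
                </goals>
                <configuration>
                    <outputDirectory>${project.build.directory}/connector-libs</outputDirectory>
                    <includeArtifactIds>hbase-shaded-mapreduce,phoenix5-spark-shaded,opentelemetry-api,opentelemetry-context</includeArtifactIds>
                </configuration>
            </execution>
        </executions>
    </plugin>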
  4. Create a CDE job.
    1. Configure CDE CLI to point to the virtual cluster. For more information, see Downloading the Cloudera Data Engineering command line interface.
    2. Create a resource using the following command.
      cde resource create --name phoenix-spark-app-resource
    3. Upload the required jars that the build downloaded to the target/connector-libs directory.
      cde resource upload --name phoenix-spark-app-resource --local-path ./target/connector-libs/hbase-shaded-mapreduce-2.4.6.7.2.14.0-133.jar --resource-path hbase-shaded-mapreduce-2.4.6.7.2.14.0-133.jar
      cde resource upload --name phoenix-spark-app-resource --local-path ./target/connector-libs/opentelemetry-api-0.12.0.jar --resource-path opentelemetry-api-0.12.0.jar
      cde resource upload --name phoenix-spark-app-resource --local-path ./target/connector-libs/opentelemetry-context-0.12.0.jar --resource-path opentelemetry-context-0.12.0.jar
      cde resource upload --name phoenix-spark-app-resource --local-path ./target/connector-libs/phoenix5-spark-shaded-6.0.0.7.2.14.0-133.jar --resource-path phoenix5-spark-shaded-6.0.0.7.2.14.0-133.jar
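      The four uploads can also be scripted as a loop over the same jar list; a minimal sketch:
      for jar in hbase-shaded-mapreduce-2.4.6.7.2.14.0-133.jar \
                 opentelemetry-api-0.12.0.jar \
                 opentelemetry-context-0.12.0.jar \
                 phoenix5-spark-shaded-6.0.0.7.2.14.0-133.jar; do
        cde resource upload --name phoenix-spark-app-resource --local-path "./target/connector-libs/${jar}" --resource-path "${jar}"
      done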
      
    4. Upload the Spark application jar that you built earlier.
      cde resource upload --name phoenix-spark-app-resource --local-path ./target/phoenix-spark-transactions-0.1.0.jar --resource-path phoenix-spark-transactions-0.1.0.jar
    5. Replace the HBase, Phoenix, and Phoenix Spark connector versions in spark-job.json, as shown in the following sample, and then create a CDE job using the cde job import command.
      {
         "mounts":[
            {
               "resourceName":"phoenix-spark-app-resource"
            }
         ],
         "name":"phoenix-spark-app",
         "spark":{
            "className":"com.cloudera.cod.examples.spark.SparkApp",
            "args":[
               "{{ phoenix_jdbc_url }}"
            ],
            "driverCores":1,
            "driverMemory":"1g",
            "executorCores":1,
            "executorMemory":"1g",
            "file":"phoenix-spark-transactions-0.1.0.jar",
            "pyFiles":[
               
            ],
            "files":[
              "hbase-shaded-mapreduce-2.4.6.7.2.14.0-133.jar",
              "opentelemetry-api-0.12.0.jar",
              "opentelemetry-context-0.12.0.jar"
              "phoenix5-spark-shaded-6.0.0.7.2.14.0-133.jar",
            ],
            "numExecutors":4
         }
      }
      cde job import --file spark-job.json
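      After the import succeeds, you can verify the job definition before running it; a quick check, assuming your CDE CLI version provides the job describe subcommand:
      cde job describe --name phoenix-spark-app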
      
  5. Run the job.
    1. Use the describe-client-connectivity command to determine the base JDBC URL to pass.
      cdp opdb describe-client-connectivity --database-name my-database --environment-name my-env | jq ".connectors[] | select(.name == \"phoenix-thick-jdbc\") | .configuration.jdbcUrl"
    2. Run the job, passing the JDBC URL obtained from the previous command as an argument to the job.
      cde job run --name phoenix-spark-app --variable phoenix_jdbc_url=<phoenix_jdbc_url>
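      The two steps can also be combined in a small script; a minimal sketch using the same database and environment names as in step 1, with jq -r stripping the quotes so that the URL is passed cleanly:
      PHOENIX_JDBC_URL=$(cdp opdb describe-client-connectivity --database-name my-database --environment-name my-env | jq -r ".connectors[] | select(.name == \"phoenix-thick-jdbc\") | .configuration.jdbcUrl")
      cde job run --name phoenix-spark-app --variable phoenix_jdbc_url="${PHOENIX_JDBC_URL}"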