Connecting Cloudera Data Engineering to Apache Phoenix

Apache Phoenix acts as a relational database engine and a programmatic, ANSI-compliant SQL layer on top of Apache HBase within the Cloudera ecosystem. When integrated with Cloudera Data Engineering, it enables Data Engineers to run high-performance Apache Spark batch jobs and pipelines against the Cloudera Operational Database using familiar SQL syntax rather than complex NoSQL APIs. Connect to Phoenix using Cloudera Data Engineering on the base cluster.

  1. You must download the following JAR files for compiling Scala application:
    • hbase-shaded-mapreduce-[***HBASE-CLOUDERA-RUNTIME-VERSION***].jar
    • phoenix5-spark3-shaded-[***PHOENIX-CLOUDERA-RUNTIME-VERSION***].jar
    • spark-sql_2.12-[***SPARK3-CLOUDERA-RUNTIME-VERSION***].jar
    • opentelemetry-api-0.12.0.jar
    • spark-catalyst_2.12-[***SPARK3-CLOUDERA-RUNTIME-VERSION***].jar
    • opentelemetry-context-0.12.0.jar
    • spark-core_2.12-[***SPARK3-CLOUDERA-RUNTIME-VERSION***].jar
  2. You must get the hbase-site.xml file from the base cluster.
    1. Open your terminal and log into the HBase Gateway node using your SSH credentials.
    2. Change your current working directory to /etc/hbase/conf and list the files in the directory.
      $ cd /etc/hbase/conf
      $ ls
      atlas-application.properties  __cloudera_metadata__  hbase-env.sh    hdfs-site.xml  log4j.properties  ssl-client.xml
      __cloudera_generation__       core-site.xml         hbase-site.xml  jaas.conf        ozone-site.xml
    3. Copy the hbase-site.xml file.
  3. You must provide read, write, run, and create permissions to the workload user for the Phoenix table from the Ranger UI. For more information, see Configure a resource-based policy: HBase.

    Also provide read, write, and run permissions for the user on SYSTEM:STATS Phoenix table. For more information, see Phoenix - Part 4 - working with Ranger.

  4. Create a Phoenix table using on of the following methods:
    • Using the phoenix-sqlline tool:
      1. Open your terminal and log into the HBase Gateway node using your SSH credentials.
      2. Run the kinit command using either the HBase administrator principal or a workload user principal that has access permissions to Phoenix.
      3. Connect to the cluster using the phoenix-sqlline tool and create the table.
        $ phoenix-sqlline
        Setting property: [incremental, false]
        Setting property: [isolation, TRANSACTION_READ_COMMITTED]
        issuing: !connect -p driver org.apache.phoenix.jdbc.PhoenixDriver -p user "none" -p password "none" "jdbc:phoenix:"
        Connecting to jdbc:phoenix:
        26/03/24 20:07:06 WARN impl.MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-phoenix.properties,hadoop-metrics2.properties
        Connected to: Phoenix (version 5.1)
        Driver: PhoenixEmbeddedDriver (version 5.1)
        Autocommit status: true
        Transaction isolation: TRANSACTION_READ_COMMITTED
        sqlline version 1.9.0
        0: jdbc:phoenix:> CREATE TABLE IF NOT EXISTS employees (id BIGINT NOT NULL PRIMARY KEY, name VARCHAR);
        No rows affected (0.015 seconds)
        0: jdbc:phoenix:> SELECT TABLE_NAME FROM SYSTEM.CATALOG WHERE TABLE_TYPE='u';
        +------------+
        | TABLE_NAME |
        +------------+
        | EMPLOYEES  |
        +------------+
        1 row selected (0.028 seconds)
        0: jdbc:phoenix:> SELECT COLUMN_NAME, DATA_TYPE FROM SYSTEM.CATALOG WHERE TABLE_NAME='EMPLOYEES';
        +-------------+-----------+
        | COLUMN_NAME | DATA_TYPE |
        +-------------+-----------+
        |             | null      |
        |             | null      |
        | ID          | 4         |
        | NAME        | 12        |
        +-------------+-----------+
        4 rows selected (0.016 seconds)
    • Using Cloudera Data Engineering jobs:
      1. Use the following Scala application code to create a Phoenix table in the default namespace or schema. The workload user must have access to create tables within the default namespace or schema.
      import java.sql.DriverManager
      import org.apache.spark.sql.{SparkSession, SaveMode}
      import java.util.Arrays
      import org.apache.spark.sql.types._
      import org.apache.spark.sql.Row
      
      object PhoenixCreate {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder()
            .appName("Phoenix-Scala-DDL-Write")
            .getOrCreate()
      
          val zkUrl = "[***ZOOKEEPER-URL***]"
          val jdbcUrl = s"jdbc:phoenix:$zkUrl"
          val tableName = "employees"
      
          val createTableSql = s"""
            CREATE TABLE IF NOT EXISTS $tableName (
              id INTEGER NOT NULL PRIMARY KEY,
              name VARCHAR
            )
          """
          
          var conn: java.sql.Connection = null
          try {
            conn = DriverManager.getConnection(jdbcUrl)
            val stmt = conn.createStatement()
            stmt.executeUpdate(createTableSql)
            stmt.close()
          } catch {
            case e: Exception => e.printStackTrace()
          } finally {
            if (conn != null) conn.close()
          }
          spark.stop()
        }
      }
  1. Create a new project with the files for the Cloudera Data Engineering jobs for Phoenix.
  2. Add the downloaded JAR files to the lib directory and build the project.
    Example
    $ ls lib/                                                                                                                            
    hbase-shaded-mapreduce-2.4.17.7.1.9.1064-1.jar   phoenix5-spark3-shaded-6.0.0.3.3.7191000.13-1.jar   spark-sql_2.12-3.3.2.3.3.7191000.13-1.jar
    opentelemetry-api-0.12.0.jar                     spark-catalyst_2.12-3.3.2.3.3.7191000.13-1.jar
    opentelemetry-context-0.12.0.jar                 spark-core_2.12-3.3.2.3.3.7191000.13-1.jar
    
    $ sbt compile && sbt package
    The Cloudera Data Engineering job JAR files are present in target/scala-[***SCALA-VERSION***] directory.
  3. In the Cloudera console, click the Data Engineering tile. The Cloudera Data Engineering Home page is displayed.
  4. Click Resources in the left navigation menu. The Resources page is displayed.
  5. Click the Create Resource button.
  6. In the Name filed, enter phoenix-resources and click Save.
  7. Go to the Details tab.
  8. Click the Upload Files button and upload the following files:
    • The downloaded JAR files
    • The hbase-site.xml file
    • The Cloudera Data Engineering job JAR files
  9. Create a new job named phoenix-table. For instructions about creating jobs, see Creating jobs in Cloudera Data Engineering.
    1. Create a Cloudera Data Engineering job with the following details and attach the hbase-site.xml file in the resources:
      • Name: phoenix-table
      • Application File: Upload the phoenix-scala_[***SCALA-VERSION***].jar file.
      • Main Class: PhoenixReadWriteExecute
      • Spark Configurations:

        Example

        spark.driver.extraClassPath=/app/mount/
        spark.executor.extraClassPath=/app/mount/
      • Files and Resources: Upload the hbase-site.xml file and the downloaded JAR files.
    2. Click Create and Run.
  10. In the Job Runs, verify if the phoenix-table job passes.
  11. Validate if the Phoenix table is created.
  12. Open your terminal and SSH into the Gateway node for HBase from the base cluster and run the following command:
    $ phoenix-sqlline
    Setting property: [incremental, false]
    Setting property: [isolation, TRANSACTION_READ_COMMITTED]
    issuing: !connect -p driver org.apache.phoenix.jdbc.PhoenixDriver -p user "none" -p password "none" "jdbc:phoenix:"
    Connecting to jdbc:phoenix:
    26/03/24 19:55:51 WARN impl.MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-phoenix.properties,hadoop-metrics2.properties
    Connected to: Phoenix (version 5.1)
    Driver: PhoenixEmbeddedDriver (version 5.1)
    Autocommit status: true
    Transaction isolation: TRANSACTION_READ_COMMITTED
    sqlline version 1.9.0
    0: jdbc:phoenix:> SELECT TABLE_NAME FROM SYSTEM.CATALOG WHERE TABLE_TYPE='u';
    +------------+
    | TABLE_NAME |
    +------------+
    | EMPLOYEES  |
    +------------+
    1 row selected (0.026 seconds)
    0: jdbc:phoenix:> SELECT * FROM EMPLOYEES;
    +----+-------+
    | ID | NAME  |
    +----+-------+
    | 1  | Alice |
    | 2  | Bob   |
    | 3  | Peter |
    | 4  | John  |
    +----+-------+
    4 rows selected (0.03 seconds)