Accessing CDW from CML

If you want to access data stored in a Cloudera Data Warehouse cluster from a CML workspace, you need to set up a connection. The CML and CDW instances must be within the same environment.

You need to add the following connection code to your project to establish the connection, and set the Spark properties described below. The properties can be set in the spark=defaults.conf file in the project, or in the Spark session itself.

Properties to set

Set the following properties in the following code sample to enable the connection.

  • spark.security.credentials.hiveserver2.enabled : FALSE
  • spark.datasource.hive.warehouse.read.via.llap : FALSE
  • spark.datasource.hive.warehouse.read.jdbc.mode : client
  • spark.sql.hive.hiveserver2.jdbc.url : <JDBC_URL>;user=<username>;password=<password>

Where:

  • JDBC_URL : JDBC URL fetched from the CDW virtual warehouse user interface.
  • Username : Username of the user.
  • Password : Password of the user.

Connection code

Enter this code in your project file, and run it in a session.

from pyspark.sql import SparkSession
from pyspark_llap.sql.session import HiveWarehouseSession

spark = SparkSession \
        .builder \
        .appName("Pyspark Test") \
        .config("spark.security.credentials.hiveserver2.enabled","false")\
        .config("spark.datasource.hive.warehouse.read.via.llap","false")\
        .config("spark.datasource.hive.warehouse.read.jdbc.mode", "client")\
        .config("spark.sql.hive.hiveserver2.jdbc.url", 
        "<JDBC_URL>;user=<Username>;password=<Password>")\
         .getOrCreate()
 
hive = HiveWarehouseSession.session(spark).build()
hive.showDatabases().show()

Enable the connection

Follow this procedure to enable this connection.

  1. In CML workspace, create a Project.
  2. In the Project, create a Python script and add the connection code.
  3. In CDW, select Option menu > Copy JDBC String.
  4. Paste the JDBC string into the following code. Append the user and password.

Test the connection

Run the connection code in your session. You can then test the connection with the following commands:
  • Show the available databases: hive.showDatabases().show()
  • Set a database to use in the session: hive.setDatabase(“default”)
  • List the tables in a specific database: hive.showTables().show()
  • Run a SQL query: hive.sql(“select * from <table-name>”).show()