Accessing CDW from CML
If you want to access data stored in a Cloudera Data Warehouse cluster from a CML workspace, you need to set up a connection. The CML and CDW instances must be within the same environment.
You need to add the following connection code to your project to establish the connection, and
set the Spark properties described below. The properties can be set in the
spark=defaults.conf
file in the project, or in the Spark session itself.
Properties to set
Set the following properties in the following code sample to enable the connection.
spark.security.credentials.hiveserver2.enabled : FALSE
spark.datasource.hive.warehouse.read.via.llap : FALSE
spark.datasource.hive.warehouse.read.jdbc.mode : client
spark.sql.hive.hiveserver2.jdbc.url : <JDBC_URL>;user=<username>;password=<password>
Where:
JDBC_URL
: JDBC URL fetched from the CDW virtual warehouse user interface.Username
: Username of the user.Password
: Password of the user.
Connection code
Enter this code in your project file, and run it in a session.
from pyspark.sql import SparkSession
from pyspark_llap.sql.session import HiveWarehouseSession
spark = SparkSession \
.builder \
.appName("Pyspark Test") \
.config("spark.security.credentials.hiveserver2.enabled","false")\
.config("spark.datasource.hive.warehouse.read.via.llap","false")\
.config("spark.datasource.hive.warehouse.read.jdbc.mode", "client")\
.config("spark.sql.hive.hiveserver2.jdbc.url",
"<JDBC_URL>;user=<Username>;password=<Password>")\
.getOrCreate()
hive = HiveWarehouseSession.session(spark).build()
hive.showDatabases().show()
Enable the connection
Follow this procedure to enable this connection.
- In CML workspace, create a Project.
- In the Project, create a Python script and add the connection code.
- In CDW, select .
- Paste the JDBC string into the following code. Append the user and password.
Test the connection
Run the connection code in your session. You can then test the connection with the following
commands:
- Show the available databases: hive.showDatabases().show()
- Set a database to use in the session: hive.setDatabase(“default”)
- List the tables in a specific database: hive.showTables().show()
- Run a SQL query: hive.sql(“select * from <table-name>”).show()