Connect to CDW

The Data Connection Snippet feature now suggests using the cml.data library to connect to CDW virtual warehouses - these code snippets pop up as suggestions for every new session in a project. For further information, see Using data connection snippets.

However, if you would still like to use raw Python code to connect, follow the below details.

You can access data stored in the data lake using a Cloudera Data Warehouse cluster from a CML workspace, using the impyla Python package.

Configuring the connection

The CDW connection requires a WORKLOAD_PASSWORD that can be configured following the steps described in Setting the workload password, linked below.

The VIRTUAL_WAREHOUSE_HOSTNAME can be extracted from the JDBC URL that can be found in CDW, by selecting the Option menu > Copy JDBC String on a Virtual Warehouse.

For example, if the JDBC string copied as described above is:
jdbc:impala//<your-vw-host-name.site>/default;transportMode=http;httpPath=cliservice;socketTimeout=60;ssl=true;auth=browser;

Then, the extracted hostname to assign to the VWH_HOST is: <your-vw-host-name.site>

Connection code

Enter this code in your project file, and run it in a session.

# This code assumes the impyla package to be installed. 
# If not, please pip install impyla

from impala.dbapi import connect
import os
USERNAME=os.getenv(HADOOP_USER_NAME)
PASSWORD=os.getenv(WORKLOAD_PASSWORD)

VWH_HOST = "<<VIRTUAL_WAREHOUSE_HOSTNAME>>"
VWH_PORT = 443
conn = connect(host=VWH_HOST, port=VWH_PORT, auth_mechanism="LDAP", user=USERNAME,  password=PASSWORD, use_http_transport=True, http_path="cliservice", use_ssl=True)

dbcursor = conn.cursor()
dbcursor.execute("<<INSERT SQL QUERY HERE>>")
for row in dbcursor:
  print(row)


#Sample pandas code 
#from impala.util import as_pandas
#import pandas
#dbcursor = conn.cursor()
#dbcursor.execute("<<INSERT SQL QUERY HERE>>")
#tables = as_pandas(cursor)
#tables
#dbcursor.close()