Connecting to Cloudera Data Warehouse
The Data Connection Snippet feature now suggests using the cml.data library to connect to Cloudera Data Warehouse virtual warehouses. These code snippets appear as suggestions in every new session in a project. For further information, see Using data connection snippets.
However, if you would still prefer to connect using raw Python code, follow the details below.
You can access data stored in the data lake through a Cloudera Data Warehouse cluster from a Cloudera AI Workbench by using the impyla Python package.
Configuring the connection
The Cloudera Data Warehouse connection requires a WORKLOAD_PASSWORD, which can be configured by following the steps described in Setting the workload password, linked below.
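Because the connection code reads its credentials from the HADOOP_USER_NAME and WORKLOAD_PASSWORD environment variables, a missing variable otherwise only shows up later as an authentication failure. The following is a minimal sketch of an early check; require_env is a hypothetical helper introduced here for illustration, not part of impyla or Cloudera AI.

```python
import os


def require_env(name, environ=os.environ):
    """Return the value of an environment variable or fail with a clear message.

    Hypothetical helper (assumption): not part of impyla or Cloudera AI.
    """
    value = environ.get(name)
    if not value:
        # os.getenv would silently return None here; fail early instead.
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

In a session, this could be called as require_env("WORKLOAD_PASSWORD") before opening the connection.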
The VIRTUAL_WAREHOUSE_HOSTNAME can be extracted from the JDBC URL, which can be found in Cloudera Data Warehouse by selecting the corresponding option on a Virtual Warehouse. The URL has the following form:
jdbc:impala://<your-vw-host-name.site>/default;transportMode=http;httpPath=cliservice;socketTimeout=60;ssl=true;auth=browser;
The hostname to assign to VWH_HOST is then: <your-vw-host-name.site>
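The hostname can also be extracted programmatically rather than copied by hand. The following is a minimal sketch using only the Python standard library; extract_vw_hostname is a hypothetical helper, the example hostname is made up, and the sketch assumes the URL starts with jdbc:impala:// as in the form shown above.

```python
from urllib.parse import urlsplit


def extract_vw_hostname(jdbc_url):
    """Return the host part of a JDBC URL (hypothetical helper, not a Cloudera API)."""
    prefix = "jdbc:"
    if not jdbc_url.startswith(prefix):
        raise ValueError("expected a URL starting with 'jdbc:'")
    # Drop the 'jdbc:' prefix so the remainder parses as an ordinary URL.
    parts = urlsplit(jdbc_url[len(prefix):])
    if not parts.hostname:
        raise ValueError("no hostname found in JDBC URL")
    # urlsplit returns the hostname in lower case and without any port.
    return parts.hostname


# Made-up example URL (assumption), following the format above.
example_url = (
    "jdbc:impala://vw-demo.example.site/default;transportMode=http;"
    "httpPath=cliservice;socketTimeout=60;ssl=true;auth=browser;"
)
VWH_HOST = extract_vw_hostname(example_url)
```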
Connection code
Enter this code in your project file, and run it in a session.
# This code assumes that the impyla package is installed.
# If it is not, run: pip install impyla
from impala.dbapi import connect
import os
# Environment variable names must be passed to os.getenv as strings.
USERNAME = os.getenv("HADOOP_USER_NAME")
PASSWORD = os.getenv("WORKLOAD_PASSWORD")
VWH_HOST = "<<VIRTUAL_WAREHOUSE_HOSTNAME>>"
VWH_PORT = 443
conn = connect(
    host=VWH_HOST,
    port=VWH_PORT,
    auth_mechanism="LDAP",
    user=USERNAME,
    password=PASSWORD,
    use_http_transport=True,
    http_path="cliservice",
    use_ssl=True,
)
dbcursor = conn.cursor()
dbcursor.execute("<<INSERT SQL QUERY HERE>>")
for row in dbcursor:
    print(row)
#Sample pandas code
#from impala.util import as_pandas
#import pandas
#dbcursor = conn.cursor()
#dbcursor.execute("<<INSERT SQL QUERY HERE>>")
#tables = as_pandas(dbcursor)
#tables
#dbcursor.close()
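The samples above leave the cursor open until dbcursor.close() is called explicitly. The following is a minimal sketch of a helper that always closes the cursor, relying only on the DB-API 2.0 cursor methods (cursor, execute, fetchall, close) that impyla provides. fetch_rows is a hypothetical helper, and the in-memory sqlite3 connection is only a stand-in so that the sketch runs without a Virtual Warehouse.

```python
import sqlite3
from contextlib import closing


def fetch_rows(conn, sql):
    """Run sql on a DB-API 2.0 connection and return all rows (hypothetical helper)."""
    # closing() calls cursor.close() even if execute() or fetchall() raises.
    with closing(conn.cursor()) as cursor:
        cursor.execute(sql)
        return cursor.fetchall()


# Stand-in connection for demonstration only (assumption). With impyla,
# pass the object returned by impala.dbapi.connect(...) instead.
demo_conn = sqlite3.connect(":memory:")
rows = fetch_rows(demo_conn, "SELECT 1 AS one, 'a' AS letter")
demo_conn.close()
```

With the connection from the code above, the equivalent call would be fetch_rows(conn, "<<INSERT SQL QUERY HERE>>").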