Connect to CDW
The Data Connection Snippet feature now suggests using the cml.data
library to connect to CDW virtual warehouses - these code snippets pop up as suggestions for
every new session in a project. For further information, see Using data connection snippets.
However, if you would still like to use raw Python code to connect, follow the below details.
You can access data stored in the data lake using a Cloudera Data Warehouse cluster from a CML workspace, using the
impyla
Python package.
Configuring the connection
The CDW connection requires a WORKLOAD_PASSWORD that can be configured following the steps described in Setting the workload password, linked below.
The VIRTUAL_WAREHOUSE_HOSTNAME can be extracted from the JDBC URL that can be found in CDW, by selecting the
on a Virtual Warehouse.jdbc:impala//<your-vw-host-name.site>/default;transportMode=http;httpPath=cliservice;socketTimeout=60;ssl=true;auth=browser;
Then, the extracted hostname to assign to the VWH_HOST is: <your-vw-host-name.site>
Connection code
Enter this code in your project file, and run it in a session.
# This code assumes the impyla package to be installed. # If not, please pip install impyla from impala.dbapi import connect import os USERNAME=os.getenv(HADOOP_USER_NAME) PASSWORD=os.getenv(WORKLOAD_PASSWORD) VWH_HOST = "<<VIRTUAL_WAREHOUSE_HOSTNAME>>" VWH_PORT = 443 conn = connect(host=VWH_HOST, port=VWH_PORT, auth_mechanism="LDAP", user=USERNAME, password=PASSWORD, use_http_transport=True, http_path="cliservice", use_ssl=True) dbcursor = conn.cursor() dbcursor.execute("<<INSERT SQL QUERY HERE>>") for row in dbcursor: print(row) #Sample pandas code #from impala.util import as_pandas #import pandas #dbcursor = conn.cursor() #dbcursor.execute("<<INSERT SQL QUERY HERE>>") #tables = as_pandas(cursor) #tables #dbcursor.close()