Set up a data connection to CDP DataHub

You can set up a data connection to the DataHub cluster either through the New Connection dialog or with raw code inside your project. Both approaches are shown below.

  1. Log in to the console web UI.
  2. Depending on which connection you want to use, click either the Connect to Hive or the Connect to Impala tile.
  3. Ensure your Data Hub cluster name is correct in the popup.
  4. Copy the JDBC URL string.
  5. Click the Build a Data Science Project tile to log in to the ML workspace.
  6. In the Machine Learning service, go to Site Administration > Data Connections and select New Connection.
  7. Enter the connection name. Data connection names must be unique within a workspace and within a given project.
  8. Select the connection type:
    1. Hive Virtual Warehouse
    2. Impala Virtual Warehouse
  9. Paste the JDBC URL for the data connection.
  10. (Optional) Enter the Virtual Warehouse Name. This is the name of the warehouse in Cloudera Data Warehouse.
The data connection is available to users by default. To change availability, click the Available switch. This switch determines whether the data connection is displayed in Projects created within the workspace.

Set up a DataHub data connection using raw code

The New Connection dialog is the recommended way to create a data connection. If needed, you can also set up a data connection in your project code by adapting the following code snippet.

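from impala.dbapi import connect

# Example JDBC URL copied from the console:
# jdbc:hive2://my-test-master0.eng-ml-i.svbr-nqvp.int.cldr.work/;ssl=true;transportMode=http;httpPath=my-test/cdp-proxy-api/hive

# The connect() arguments below mirror the fields of the JDBC URL.
conn = connect(
    host="my-test-master0.eng-ml-i.svbr-nqvp.int.cldr.work",  # host name from the URL
    port=443,                                # HTTPS port
    auth_mechanism="LDAP",
    use_ssl=True,                            # ssl=true
    use_http_transport=True,                 # transportMode=http
    http_path="my-test/cdp-proxy-api/hive",  # httpPath=...
    user="csso_me",                          # replace with your own user name
    password="Test@123",                     # replace with your own password
)

# Run a query and print every row of the result.
cursor = conn.cursor()
cursor.execute("select * from 3yearpop")

for row in cursor:
    print(row)

# Release the cursor and the connection when done.
cursor.close()
conn.close()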
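
The JDBC URL copied from the console carries most of the values that connect() needs. As a minimal sketch, the helper below splits such a URL into those keyword arguments. The function name jdbc_url_to_connect_kwargs and the environment variables DATAHUB_USER and DATAHUB_PASSWORD are hypothetical names chosen for this illustration, and the defaults for port (443) and auth_mechanism ("LDAP") are assumptions taken from the snippet above, not values read from the URL.

```python
import os


def jdbc_url_to_connect_kwargs(jdbc_url, user, password, port=443):
    """Translate a jdbc:hive2:// URL into keyword arguments for connect().

    Hypothetical helper for illustration; it only handles URLs shaped like
    the example in this section.
    """
    prefix = "jdbc:hive2://"
    if not jdbc_url.startswith(prefix):
        raise ValueError("expected a URL starting with " + prefix)

    # The part before the first ';' is the host (and path); the remaining
    # parts are key=value options such as ssl=true.
    location, *option_parts = jdbc_url[len(prefix):].split(";")
    authority = location.split("/", 1)[0]
    host, _, url_port = authority.partition(":")
    options = dict(part.split("=", 1) for part in option_parts if "=" in part)

    return {
        "host": host,
        # Use the port from the URL if present, else the assumed default.
        "port": int(url_port) if url_port else port,
        "auth_mechanism": "LDAP",  # assumption: same mechanism as the snippet above
        "use_ssl": options.get("ssl", "false").lower() == "true",
        "use_http_transport": options.get("transportMode", "").lower() == "http",
        "http_path": options.get("httpPath", ""),
        "user": user,
        "password": password,
    }


kwargs = jdbc_url_to_connect_kwargs(
    "jdbc:hive2://my-test-master0.eng-ml-i.svbr-nqvp.int.cldr.work/"
    ";ssl=true;transportMode=http;httpPath=my-test/cdp-proxy-api/hive",
    # Hypothetical environment variables, so credentials stay out of the code.
    user=os.environ.get("DATAHUB_USER", "csso_me"),
    password=os.environ.get("DATAHUB_PASSWORD", ""),
)

# Show the derived settings without printing the password.
print({key: value for key, value in kwargs.items() if key != "password"})
```

The resulting dictionary can then be passed to the driver as connect(**kwargs). Reading the credentials from the environment is one way to keep them out of project code.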