Accessing data with Spark

When you are using Cloudera Data Warehouse (CDW), you can use JDBC connections to access your data.

JDBC is useful in the following cases:

  1. When you need fine-grained access to the data.
  2. When the amount of data sent over the wire is on the order of tens of thousands of rows.

Add the Python code described below to the Session where you want to use the data, and update the code with your data location information.

Permissions

In addition, check with your Administrator that you have the correct permissions to access the data lake. You need a role that has read access only. For more information, see Accessing a Data Lake from CML.

Set up a JDBC Connection

When using a JDBC connection, you read the data through a Virtual Warehouse that runs Hive or Impala. You need to obtain the JDBC connection string and paste it into the script in your session.

  1. In CDW, go to the Hive database containing your data.
  2. From the kebab menu, click Copy JDBC URL.
  3. Paste the URL into the script in your session.
  4. Enter your user name and password in the script. Store these values in environment variables rather than hardcoding them, as shown in the sketch after these steps.
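
The following is a minimal sketch of such a script, using the jaydebeapi package to open the JDBC connection. The environment variable names (HIVE_JDBC_URL, WORKLOAD_USER, WORKLOAD_PASSWORD), the driver JAR path, and the example query are placeholders introduced here for illustration, not names defined by CML or CDW; adjust them to match your project.

    # A minimal sketch: open a JDBC connection to a Hive Virtual Warehouse.
    # HIVE_JDBC_URL, WORKLOAD_USER, and WORKLOAD_PASSWORD are placeholder
    # environment variable names -- set them in your project settings
    # instead of hardcoding credentials in the script.
    import os
    import jaydebeapi  # assumes the package and a JVM are available

    conn = jaydebeapi.connect(
        "org.apache.hive.jdbc.HiveDriver",         # Hive JDBC driver class
        os.environ["HIVE_JDBC_URL"],               # URL copied from CDW
        [os.environ["WORKLOAD_USER"], os.environ["WORKLOAD_PASSWORD"]],
        jars="/path/to/hive-jdbc-standalone.jar",  # placeholder driver JAR path
    )

    cursor = conn.cursor()
    cursor.execute("SELECT * FROM default.my_table LIMIT 10")  # example query
    for row in cursor.fetchall():
        print(row)

    cursor.close()
    conn.close()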