Accessing data with Spark

When you are using CDW, you can use JDBC connections.

JDBC is useful in the following cases:

  1. Use JDBC connections when you have fine-grained access.
  2. If the scale of data sent over the wire is on the order of tens of thousands of rows of data.

Add the Python code as described below, in the Session where you want to utilize the data, and update the code with the data location information.

Permissions

In addition, check with the Administrator that you have the correct permissions to access the data lake. You will need a role that has read access only.

How to obtain the Data Lake directory location

You need this location if you are using a Direct Reader connection.
  1. In the CDP home page, select Management Console.
  2. In Environments, select the environment you are using.
  3. In the tabbed section, select Cloud Storage.
  4. Choose the location where your data is stored.
  5. For managed data tables, copy the location shown for Hive Metastore Warehouse.
  6. For external unmanaged data tables, copy the location shown for Hive Metastore External Warehouse.
  7. Paste the location into the connection script in the designated position. If you are using AWS, the location starts with s3:, and if you are using Azure, it starts with abfs:. If you are using a different location in the data lake, the default path is shown by Hbase Root.

Set up a JDBC Connection

When using a JDBC connection, you read through a virtual warehouse that has Hive or Impala installed. You need to obtain the JDBC connection string, and paste it into the script in your session.

  1. In CDW, go to the Hive database containing your data.
  2. From the kebab menu, click Copy JDBC URL.
  3. Paste it into the script in your session.
  4. You also have to enter your user name and password in the script. You should set up environmental variables to store these values, instead of hardcoding them in the script.