Accessing data with Spark

When your data is in Cloudera Data Warehouse (CDW), you can access it from Spark over a Java Database Connectivity (JDBC) connection.

JDBC is a good fit in the following cases:

  1. Fine-grained access control applies to the data you are reading.
  2. The amount of data sent over the wire is on the order of tens of thousands of rows.

Add the Python code described below to the session where you want to use the data, and update it with your data location information.

Permissions

Check with your administrator that you have the correct permissions to access the data lake. You need a role that has read-only access.

Obtaining the Data Lake directory location

You need this location if you are using a Direct Reader connection.

  1. Select Management Console on the CDP home page.
  2. Select the environment you are using in Environments.
  3. Select Cloud Storage in the tabbed section.
  4. Choose the location where your data is stored.
  5. For managed data tables, copy the location shown for Hive Metastore Warehouse.
  6. For external unmanaged data tables, copy the location shown for Hive Metastore External Warehouse.
  7. Paste the location into the designated position in the connection script. The location starts with s3: if you are using AWS, or with abfs: if you are using Azure. If you are using a different location in the data lake, the default path is shown by Hbase Root.
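
Before pasting the copied location into a script, it can help to check that it has the expected form. The following is a minimal sketch, not part of any Cloudera script: the function name check_location and the example paths are hypothetical, and the accepted prefixes are taken only from the step above (s3 for AWS, abfs for Azure).

```python
from urllib.parse import urlparse

# Prefixes assumed from the steps above: AWS locations start with "s3",
# Azure locations start with "abfs". Variants such as "s3a" also pass.
KNOWN_PREFIXES = ("s3", "abfs")


def check_location(location):
    """Return the trimmed location, or raise if its scheme looks wrong."""
    location = location.strip()
    scheme = urlparse(location).scheme
    if not scheme.startswith(KNOWN_PREFIXES):
        raise ValueError(
            f"Unexpected data lake location {location!r}: "
            "expected it to start with s3 (AWS) or abfs (Azure)"
        )
    return location
```

A script could call check_location on the copied Hive Metastore Warehouse or Hive Metastore External Warehouse value and fail early with a readable message if the wrong text was pasted.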

Setting up a JDBC connection

When using a JDBC connection, you read through a virtual warehouse that has Hive or Impala installed. Obtain the JDBC connection string and paste it into the script in your session, as follows.

  1. In CDW, go to the Hive database containing your data.
  2. From the kebab menu, click Copy JDBC URL.
  3. Paste it into the script in your session.
  4. Enter your user name and password in the script. Store these values in environment variables and read them from the script, instead of hardcoding them.
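
The steps above could be sketched in Python as follows. This is an illustration under stated assumptions, not the official connection script: the environment variable names CDW_USER and CDW_PASSWORD and both function names are hypothetical, the URL is assumed to be a Hive (jdbc:hive2) URL that accepts user and password as ;key=value settings, and the spark.sql.hive.hiveserver2.jdbc.url property assumes that the session uses the Hive Warehouse Connector.

```python
import os


def build_jdbc_url(base_url, user_var="CDW_USER", password_var="CDW_PASSWORD"):
    """Append credentials read from environment variables to a copied JDBC URL."""
    user = os.environ.get(user_var)
    password = os.environ.get(password_var)
    if not user or not password:
        raise RuntimeError(f"Set {user_var} and {password_var} before connecting")
    # Hive JDBC URLs carry session settings as ;key=value pairs after the database name.
    return f"{base_url.rstrip(';')};user={user};password={password}"


def open_session(jdbc_url):
    """Start a Spark session pointed at the virtual warehouse (assumed HWC property)."""
    # Imported lazily so that build_jdbc_url can be used without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName("cdw-jdbc-example")
        .config("spark.sql.hive.hiveserver2.jdbc.url", jdbc_url)
        .getOrCreate()
    )
```

In a session, you would pass the string copied with Copy JDBC URL to build_jdbc_url and hand the result to open_session. Because the credentials come from the environment, the script itself can be shared or committed without exposing them.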