Accessing data with Spark

There are two general methods to access data with Spark: Direct Reader and JDBC. Each has benefits and disadvantages, and the right choice depends on how the data is managed.

  1. JDBC: Use in the following cases:
    1. Your data has fine-grained access control (row-level security or column-level masking).
    2. The amount of data sent over the wire is on the order of tens of thousands of rows.
  2. Direct Reader: Use in the following cases:
    1. Your data has table-level access control, and does not have row-level security or column-level masking (fine-grained access).
    2. You expect less than several GB of output.

For both methods, add the Python or R code, as described below, to the Session where you want to use the data, and update the code with the data location information.

Permissions

Check with your Administrator that you have the correct permissions to access the data lake. You need a role that has read-only access.

How to obtain the Data Lake directory location

  1. In the CDP home page, select Management Console.
  2. In Environments, select the Environment you are using.
  3. In the tabbed section, select Cloud Storage.
  4. Choose the location where your data is stored.
  5. For managed data tables, copy the location shown by Hive Metastore Warehouse.
  6. For external unmanaged data tables, copy the location shown by Hive Metastore External Warehouse.
  7. Paste the location into the script, as shown in the example. If you are using AWS, the location starts with s3:, and if you are using Azure, it starts with abfs:. If you are using a different location in the data lake, the default path is shown by Hbase Root.
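
Before pasting the location into the script, it can help to confirm that the copied value has the prefix you expect. The following is a minimal sketch of such a check. The function name is hypothetical, and accepting the s3a and abfss variants alongside s3 and abfs is an assumption, so adjust the accepted prefixes to match what your environment shows.

```python
# Hypothetical helper, not part of any Cloudera API.
# Assumption: s3a and abfss are accepted as variants of s3 and abfs.
AWS_SCHEMES = ("s3", "s3a")
AZURE_SCHEMES = ("abfs", "abfss")


def cloud_provider_for(location: str) -> str:
    """Return "aws" or "azure" based on the prefix of a storage location."""
    # Everything before the first colon is treated as the scheme.
    scheme = location.split(":", 1)[0].strip().lower()
    if scheme in AWS_SCHEMES:
        return "aws"
    if scheme in AZURE_SCHEMES:
        return "azure"
    raise ValueError(f"Unexpected data lake location prefix: {scheme!r}")
```

Calling the helper with the copied location raises a ValueError if the value does not look like a cloud storage path, which usually means the wrong field was copied.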

Check for an updated version of the jar file

The name of the Hive Warehouse Connector jar file used in the script may need to be updated with the correct version number.

  1. Click Terminal Access, and run: ls /usr/lib/hive_warehouse_connector/
  2. Check the file name and version number of the jar file.
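
The same check can be done from within the session instead of the terminal. This is a minimal sketch, assuming the jar files sit directly in the directory listed above. The function name is hypothetical.

```python
import glob
import os

# Directory taken from the terminal command in the steps above.
HWC_DIR = "/usr/lib/hive_warehouse_connector/"


def list_hwc_jars(directory: str = HWC_DIR) -> list:
    """Return the sorted paths of all .jar files found in the directory."""
    # Hypothetical helper: only lists files, it does not pick a version.
    return sorted(glob.glob(os.path.join(directory, "*.jar")))


# Example use inside a session:
#   for path in list_hwc_jars():
#       print(os.path.basename(path))
```

Compare the printed file name with the jar path in your script and update the version number if they differ.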

Set up a JDBC Connection

When using a JDBC connection, you read through a virtual warehouse that has Hive or Impala installed. You need to obtain the JDBC connection string, and paste it into the script in your session.

  1. In CDW, go to the Hive database containing your data.
  2. From the kebab menu, click Copy JDBC URL.
  3. Paste it into the script in your session.
  4. You must also enter your user name and password in the script. Set up environment variables to store these values instead of hardcoding them in the script.
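
One way to keep the credentials out of the script is to read them from environment variables at run time. The following is a minimal sketch. The variable names HIVE_JDBC_URL, HIVE_USER, and HIVE_PASSWORD and the function name are assumptions chosen for illustration, so use whatever names you defined in your project.

```python
import os

# Hypothetical environment variable names; replace with your own.
REQUIRED_VARS = ("HIVE_JDBC_URL", "HIVE_USER", "HIVE_PASSWORD")


def load_jdbc_settings(env=os.environ) -> dict:
    """Read the JDBC URL, user name, and password from environment variables."""
    # Fail early with a clear message if any value is missing or empty.
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {
        "url": env["HIVE_JDBC_URL"],
        "user": env["HIVE_USER"],
        "password": env["HIVE_PASSWORD"],
    }
```

The returned values can then be passed to the connection settings in your script, so that the user name and password never appear in the code itself.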