Example: Connect a Spark session to the Data Lake

After the Admin sets up the correct permissions, you can access the Data Lake from your project, as this example shows.

Make sure you have access to a Data Lake containing your data.

The s3 or abfs path can be retrieved from the environment’s cloud storage. To read from this location, in the Machine Learning Workspaces UI, select the name of the environment, and in the Action menu, select Manage Access > IdBroker Mappings. Ensure that the user (or group) has been assigned an AWS IAM role that can read this location. For Fine Grained Access Control (Raz) enabled environments, go to Ranger to ensure that the user (or group) has access through an appropriate s3 or adls policy.

  1. Create a project in your ML Workspace.
  2. Create a file named spark-defaults.conf, or update the existing file with the property:
    • For S3: spark.yarn.access.hadoopFileSystems=s3a://STORAGE LOCATION OF ENV>
    • For ADLS: spark.yarn.access.hadoopFileSystems=abfs://STORAGE CONTAINER OF ENV>@STORAGE ACCOUNT OF ENV>
    Use the same location you defined in Setup Data Lake Access.
  3. Start a session (Python or Spark) and start a Spark session.

Setting up the project looks like this:

Now you can run Spark SQL commands. For example, you can:

  • Create a database foodb.

  • List databases and tables.

  • Create a table bartable.

  • Insert data into the table.

  • Query the data from the table.