Use Direct Reader Mode with PySpark
You can use Direct Reader Mode when your data has table-level access control and does not have row-level security or column-level masking (fine-grained access).
Before starting this task, obtain the S3 location of your data lake. Then follow the steps below to set up the connection.
Replace s3a://demo-aws-2/ in the code sample below with the correct S3 bucket location for your environment. This sets the value of DATALAKE_DIRECTORY.
The DATALAKE_DIRECTORY value is passed to the spark.yarn.access.hadoopFileSystems property in the corresponding config statement.

from pyspark.sql import SparkSession
# Change to the appropriate Datalake directory location
DATALAKE_DIRECTORY = "s3a://demo-aws-2/"
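# The configs below enable Direct Reader Mode: "spark" execution mode,
# the Hive ACID datasource extension and its Kryo registrator (needed to
# read managed, transactional tables), access to the data lake filesystem,
# and the Hive Warehouse Connector assembly jar on the classpath.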
spark = SparkSession\
.builder\
.appName("CDW-CML-Spark-Direct")\
.config("spark.sql.hive.hwc.execution.mode","spark")\
.config("spark.sql.extensions","com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension")\
.config("spark.kryo.registrator","com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator")\
.config("spark.yarn.access.hadoopFileSystems", DATALAKE_DIRECTORY)\
.config("spark.jars", "/opt/spark/optional-lib/hive-warehouse-connector-assembly.jar")\
.getOrCreate()
### The following commands test the connection
spark.sql("show databases").show()
spark.sql("describe formatted test_managed").show()
spark.sql("select * from test_managed").show()
spark.sql("describe formatted test_external").show()
spark.sql("select * from test_external").show()
If you plan to read only external tables, you can omit the Hive ACID extension, the Kryo registrator, and the HWC assembly jar from the session configuration:

from pyspark.sql import SparkSession
# Change to the appropriate Datalake directory location
DATALAKE_DIRECTORY = "s3a://demo-aws-2/"
spark = SparkSession\
.builder\
.appName("CDW-CML-Spark-Direct")\
.config("spark.sql.hive.hwc.execution.mode","spark")\
.config("spark.yarn.access.hadoopFileSystems", DATALAKE_DIRECTORY)\
.getOrCreate()
spark.sql("show databases").show()
spark.sql("describe formatted test_external").show()
spark.sql("select * from test_external").show()