Use Direct Reader Mode with PySpark
You can use Direct Reader Mode when your data has table-level access control and does not have row-level security or column-level masking (fine-grained access).
Before starting this task, obtain the S3 location of your data lake. Then follow the steps below to set up the connection.
Replace s3a://demo-aws-2/ in the code sample below with the correct S3 bucket location for your environment. This sets the value of DATALAKE_DIRECTORY.
The DATALAKE_DIRECTORY value is passed to the spark.yarn.access.hadoopFileSystems property in the corresponding config statement.

from pyspark.sql import SparkSession
# Change to the appropriate Datalake directory location
DATALAKE_DIRECTORY = "s3a://demo-aws-2/"
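# The configs below enable Direct Reader Mode: "spark" execution mode,
# the Hive ACID datasource extension and its Kryo registrator (needed to
# read managed, transactional tables), access to the data lake filesystem,
# and the Hive Warehouse Connector assembly jar on the classpath.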
spark = SparkSession\
.builder\
.appName("CDW-CML-Spark-Direct")\
.config("spark.sql.hive.hwc.execution.mode","spark")\
.config("spark.sql.extensions","com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension")\
.config("spark.kryo.registrator","com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator")\
.config("spark.yarn.access.hadoopFileSystems", DATALAKE_DIRECTORY)\
.config("spark.jars", "/opt/spark/optional-lib/hive-warehouse-connector-assembly.jar")\
.getOrCreate()
### The following commands test the connection
spark.sql("show databases").show()
spark.sql("describe formatted test_managed").show()
spark.sql("select * from test_managed").show()
spark.sql("describe formatted test_external").show()
spark.sql("select * from test_external").show()
If you plan to read only external tables, you can omit the Hive ACID extension, the Kryo registrator, and the HWC assembly jar from the session configuration:

from pyspark.sql import SparkSession
# Change to the appropriate Datalake directory location
DATALAKE_DIRECTORY = "s3a://demo-aws-2/"
spark = SparkSession\
.builder\
.appName("CDW-CML-Spark-Direct")\
.config("spark.sql.hive.hwc.execution.mode","spark")\
.config("spark.yarn.access.hadoopFileSystems", DATALAKE_DIRECTORY)\
.getOrCreate()
spark.sql("show databases").show()
spark.sql("describe formatted test_external").show()
spark.sql("select * from test_external").show()