Use Direct Reader Mode with PySpark
Make sure to update the following parameters in the code sample below:
- spark.yarn.access.hadoopFileSystems: Enter the location where your data is stored.
- spark.jars: Update the path to the Hive Warehouse Connector JAR file, if necessary.
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .appName("CDW-CML-Spark-Direct")
    # Direct Reader mode settings
    .config("spark.sql.hive.hwc.execution.mode", "spark")
    .config("spark.sql.extensions", "com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension")
    .config("spark.kryo.registrator", "com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator")
    # Location where your data is stored
    .config("spark.yarn.access.hadoopFileSystems", "s3a://demo-aws-2/")
    # Path to the Hive Warehouse Connector JAR file
    .config("spark.jars", "/usr/lib/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.7.2.2.0-244.jar")
    .getOrCreate()
)
# The following commands test the connection
spark.sql("show databases").show()
spark.sql("describe formatted test_managed").show()
spark.sql("select * from test_managed").show()
spark.sql("describe formatted test_external").show()
spark.sql("select * from test_external").show()
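
If you reuse this setup across several projects, it may help to keep the two environment-specific values apart from the fixed Direct Reader settings. The following is a minimal sketch, not part of the product: direct_reader_conf and build_session are hypothetical helper names, and the check for a URI scheme is an assumption about what a valid storage location looks like. The configuration keys and values are the ones from the sample above.

```python
def direct_reader_conf(filesystem, hwc_jar):
    """Return the Direct Reader settings from the sample as a dict.

    Hypothetical helper: only `filesystem` and `hwc_jar` vary per environment.
    """
    # Assumption: the storage location is a URI such as s3a://bucket/
    if "://" not in filesystem:
        raise ValueError("filesystem should be a URI, for example s3a://bucket/")
    return {
        "spark.sql.hive.hwc.execution.mode": "spark",
        "spark.sql.extensions": "com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension",
        "spark.kryo.registrator": "com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator",
        "spark.yarn.access.hadoopFileSystems": filesystem,
        "spark.jars": hwc_jar,
    }


def build_session(conf, app_name="CDW-CML-Spark-Direct"):
    """Create a SparkSession with every key in `conf` applied."""
    # Imported here so direct_reader_conf can be used without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


conf = direct_reader_conf(
    "s3a://demo-aws-2/",
    "/usr/lib/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.7.2.2.0-244.jar",
)
# spark = build_session(conf)  # run where pyspark and the cluster are available
```

Calling build_session(conf) should produce a session equivalent to the one in the sample, after which the same test queries can be run.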