Test that you can create a Project in your Cloudera AI Workbench and
access data that is stored in the Cloudera Base on premises
cluster.
-
Create a new Project, using the PySpark template.
-
Create a new file called testdata.txt in the project home directory, /home/cdsw (use this exact filename).
-
Add 2-3 lines of any text in the file to serve as sample data.
-
Run the following Spark commands to test the connection.
from pyspark.sql import SparkSession
# Instantiate Spark-on-K8s Cluster
spark = SparkSession\
    .builder\
    .appName("Simple Spark Test")\
    .config("spark.executor.memory", "8g")\
    .config("spark.executor.cores", "2")\
    .config("spark.driver.memory", "2g")\
    .config("spark.executor.instances", "2")\
    .getOrCreate()
# Validate Spark Connectivity
spark.sql("SHOW databases").show()
# "if not exists" lets the test be rerun without failing on an existing table
spark.sql('create table if not exists testcml (abc integer)').show()
spark.sql('insert into table testcml select t.* from (select 1) t').show()
spark.sql('select * from testcml').show()
# Stop Spark Session
spark.stop()
-
Run the following direct HDFS commands to test the connection.
# Run sample HDFS commands
# Requires testdata.txt, containing sample data, in the project home directory (/home/cdsw)
# -p and -f keep the commands from failing if the directory or file already exists
!hdfs dfs -mkdir -p /tmp/testcml/
!hdfs dfs -copyFromLocal -f /home/cdsw/testdata.txt /tmp/testcml/
!hdfs dfs -cat /tmp/testcml/testdata.txt
If any of these commands return errors, ask your Administrator to confirm that your user
ID is set up in the Hadoop Authentication settings to access the Cloudera Base on premises cluster, and that the correct
Ranger permissions have been applied.
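When a check fails, it helps to know which HDFS step failed and what the error message was before contacting your Administrator. The following is a minimal sketch of one way to script the sample file and the HDFS checks in Python. The helper names (write_sample_file, hdfs_commands, run_checks) are hypothetical and not part of any Cloudera API; the sketch assumes the hdfs CLI is on the session's PATH and reuses the /home/cdsw and /tmp/testcml paths from the steps above.

```python
import subprocess
from pathlib import Path


def write_sample_file(directory, name="testdata.txt"):
    """Create the sample data file with three lines of placeholder text."""
    path = Path(directory) / name
    path.write_text("sample line 1\nsample line 2\nsample line 3\n")
    return path


def hdfs_commands(local_file, hdfs_dir="/tmp/testcml"):
    """Build the mkdir / copyFromLocal / cat commands as argument lists.

    hdfs_dir defaults to the directory used in the steps above.
    """
    file_name = Path(local_file).name
    return [
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir],
        ["hdfs", "dfs", "-copyFromLocal", "-f", str(local_file), hdfs_dir + "/"],
        ["hdfs", "dfs", "-cat", hdfs_dir + "/" + file_name],
    ]


def run_checks(commands, runner=subprocess.run):
    """Run each command in order and stop at the first failure.

    Returns a list of (command, return code, stderr text) tuples, so the
    last entry identifies the failing step, if any.
    """
    results = []
    for cmd in commands:
        proc = runner(cmd, capture_output=True, text=True)
        results.append((cmd, proc.returncode, proc.stderr.strip()))
        if proc.returncode != 0:
            break
    return results
```

In a workbench session, calling run_checks(hdfs_commands(write_sample_file("/home/cdsw"))) would run the three checks in order. A non-zero return code in the last tuple, together with its stderr text (for example, a permission-denied message), is the detail to pass on to your Administrator.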