Test that you can create a Project in your Cloudera Machine Learning Workspace and access data that is
stored in the data center cluster.
- Create a new Project, using the PySpark template.
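This step is normally done in the workspace UI (Projects > New Project, selecting the PySpark template). If you prefer to script it, a minimal sketch using the CML API v2 Python client (cmlapi) follows; the workspace URL, API key, project name, and the "PySpark" template value are assumptions to adapt to your environment.
# Minimal sketch, assuming the cmlapi (CML API v2) client is installed and that
# "PySpark" is an available project template in your workspace.
import cmlapi

# Placeholders: replace with your workspace domain and a personal API key.
client = cmlapi.default_client(
    url="https://<your-workspace-domain>",
    cml_api_key="<your-api-key>",
)

project = client.create_project(
    cmlapi.CreateProjectRequest(
        name="connectivity-test",
        description="Scratch project for testing data center cluster connectivity",
        template="PySpark",
    )
)
print(project.id, project.name)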
- Create a new file called testdata.txt (use this exact filename).
- Add 2-3 lines of any text to the file to serve as sample data.
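If you prefer to create the file from a session instead of the workspace editor, a minimal sketch follows; it assumes the default /home/cdsw project home directory, which is also the path used by the HDFS commands later in this procedure.
# Write a few lines of sample data into the project home directory
sample_lines = [
    "line one of sample data",
    "line two of sample data",
    "line three of sample data",
]
with open("/home/cdsw/testdata.txt", "w") as f:
    f.write("\n".join(sample_lines) + "\n")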
- Run the following Spark commands to test the connection.
from pyspark.sql import SparkSession

# Instantiate Spark-on-K8s Cluster
spark = SparkSession\
    .builder\
    .appName("Simple Spark Test")\
    .config("spark.executor.memory", "8g")\
    .config("spark.executor.cores", "2")\
    .config("spark.driver.memory", "2g")\
    .config("spark.executor.instances", "2")\
    .getOrCreate()

# Validate Spark Connectivity
spark.sql("SHOW databases").show()
spark.sql('create table testcml (abc integer)').show()
spark.sql('insert into table testcml select t.* from (select 1) t').show()
spark.sql('select * from testcml').show()
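# Optional cleanup (assumes you have DROP privileges on the default database):
# dropping the test table lets you re-run this test without the create table
# statement above failing because the table already exists
spark.sql('drop table if exists testcml')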
# Stop Spark Session
spark.stop()
- Run the following direct HDFS commands to test the connection.
# Run sample HDFS commands
# Requires the testdata.txt file created earlier (with sample data) in the project home directory
!hdfs dfs -mkdir /tmp/testcml/
!hdfs dfs -copyFromLocal /home/cdsw/testdata.txt /tmp/testcml/
!hdfs dfs -cat /tmp/testcml/testdata.txt
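Optionally, remove the test directory afterwards so the -mkdir step does not fail if you re-run this test.
# Optional cleanup of the HDFS test directory
!hdfs dfs -rm -r /tmp/testcml/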
If you get errors, check with your Administrator to make sure that your user ID is set up in the Hadoop Authentication settings to access the Cloudera-DC cluster, and that the correct Ranger permissions have been applied.
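A quick first check from within a session, assuming your workspace uses Kerberos for Hadoop Authentication and the Kerberos client tools are available in the session image, is to confirm that a ticket was issued for your principal.
# List the Kerberos tickets available to this session
!klist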