Test Your Connectivity to the Cloudera-Data Center Cluster

Verify that you can create a Project in your Cloudera AI Workbench and access data stored in the data center cluster.

Create a new Project, using the PySpark template.
Create a new file called testdata.txt in the project home directory (use this exact filename).
Add 2-3 lines of any text to the file to serve as sample data.
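
If you prefer to create the sample file from a session rather than the file browser, a minimal sketch follows. The function name write_sample_data and the placeholder lines are illustrative assumptions; the default path assumes the project home directory is /home/cdsw, as in the HDFS commands later in this topic.

```python
from pathlib import Path


def write_sample_data(path="/home/cdsw/testdata.txt"):
    """Write three short placeholder lines and return the file path.

    The default path assumes the project home directory is /home/cdsw.
    """
    lines = ["first sample line", "second sample line", "third sample line"]
    target = Path(path)
    # Overwrites any existing file with the same name.
    target.write_text("\n".join(lines) + "\n")
    return target
```

Calling write_sample_data() in a session should leave a three-line testdata.txt in the project home directory.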

Run the following Spark commands to test the connection.

from pyspark.sql import SparkSession

# Instantiate Spark-on-K8s Cluster
spark = (
    SparkSession.builder
    .appName("Simple Spark Test")
    .config("spark.executor.memory", "8g")
    .config("spark.executor.cores", "2")
    .config("spark.driver.memory", "2g")
    .config("spark.executor.instances", "2")
    .getOrCreate()
)

# Validate Spark Connectivity
spark.sql("SHOW DATABASES").show()
# IF NOT EXISTS lets the test be rerun without failing on an existing table
spark.sql("CREATE TABLE IF NOT EXISTS testcml (abc INTEGER)")
spark.sql("INSERT INTO TABLE testcml SELECT t.* FROM (SELECT 1) t")
spark.sql("SELECT * FROM testcml").show()

# Stop Spark Session
spark.stop()

Run the following direct HDFS commands to test the connection.

# Run sample HDFS commands
# Requires the testdata.txt file, containing sample data, in the project home directory
# -p and -f let the commands be rerun when the directory or file already exists
!hdfs dfs -mkdir -p /tmp/testcml/
!hdfs dfs -copyFromLocal -f /home/cdsw/testdata.txt /tmp/testcml/
!hdfs dfs -cat /tmp/testcml/testdata.txt

If you get errors, ask your Administrator to confirm that your user ID is set up in the Hadoop Authentication settings for access to the Cloudera-DC cluster, and that the correct Ranger permissions have been applied.
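
To see which HDFS step failed before contacting your Administrator, you can run the same commands from Python and capture the error text. This is a sketch under stated assumptions, not a Cloudera-provided utility: the helper names build_hdfs_commands and run_hdfs_checks are hypothetical, and the paths are the ones used in this topic. It passes -p and -f so that rerunning does not fail on an existing directory or file.

```python
import subprocess


def build_hdfs_commands(local_file="/home/cdsw/testdata.txt",
                        remote_dir="/tmp/testcml"):
    """Return the three HDFS test commands as argument lists."""
    file_name = local_file.rsplit("/", 1)[-1]
    remote_file = remote_dir.rstrip("/") + "/" + file_name
    return [
        ["hdfs", "dfs", "-mkdir", "-p", remote_dir],
        ["hdfs", "dfs", "-copyFromLocal", "-f", local_file, remote_dir],
        ["hdfs", "dfs", "-cat", remote_file],
    ]


def run_hdfs_checks(commands, runner=subprocess.run):
    """Run each command in order and stop at the first failure.

    Returns (ok, failed_command, stderr_text). The runner argument
    exists only so the logic can be exercised without an hdfs client.
    """
    for cmd in commands:
        result = runner(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            return False, cmd, result.stderr.strip()
    return True, None, ""
```

In a session, `ok, failed, err = run_hdfs_checks(build_hdfs_commands())` returns the first failing command and its error output, which is useful detail to pass along; a permission error on the copy step, for example, would likely point to Ranger policies rather than authentication setup.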