Testing your connectivity to the Cloudera Base on premises cluster

Test that you can create a Project in your Cloudera AI Workbench and access data that is stored in the Cloudera Base on premises cluster.

Create a new Project, using the PySpark template.
Create a new file called testdata.txt (use this exact filename).
Add 2-3 lines of any text in the file to serve as sample data.
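
The sample file can also be created from a session instead of the file editor. The following is a minimal sketch, assuming the project home directory is /home/cdsw (the path used by the HDFS commands in this topic). The function name `write_test_data` and the sample lines are illustrative only.

```python
from pathlib import Path


def write_test_data(directory="/home/cdsw", lines=None):
    """Write a small sample file named testdata.txt and return its path."""
    if lines is None:
        # Any text works; these lines are placeholders.
        lines = ["first sample line", "second sample line", "third sample line"]
    path = Path(directory) / "testdata.txt"
    # One line of text per entry, with a trailing newline at the end of the file.
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path
```

Calling `write_test_data()` with no arguments writes the file to the assumed project home directory. Pass a different directory if your project uses another location.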

Run the following Spark commands to test the connection.

from pyspark.sql import SparkSession

# Instantiate Spark-on-K8s Cluster
spark = SparkSession\
    .builder\
    .appName("Simple Spark Test")\
    .config("spark.executor.memory","8g")\
    .config("spark.executor.cores","2")\
    .config("spark.driver.memory","2g")\
    .config("spark.executor.instances","2")\
    .getOrCreate()

# Validate Spark Connectivity
spark.sql("SHOW databases").show()
spark.sql('create table testcml (abc integer)').show()
spark.sql('insert into table testcml select t.* from (select 1) t').show()
spark.sql('select * from testcml').show()

# Stop Spark Session
spark.stop()
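
If one of the validation statements fails, it can help to see which one failed without stopping at the first error. The following is a sketch only: `run_sql_checks` is a hypothetical helper, not part of PySpark, and it assumes it is called with the `spark` session created above, before `spark.stop()`.

```python
def run_sql_checks(spark, statements):
    """Run each SQL statement and return a list of (statement, error) pairs.

    The error is None when the statement succeeded.
    """
    results = []
    for statement in statements:
        try:
            spark.sql(statement).show()
            results.append((statement, None))
        except Exception as exc:  # broad on purpose: report any failure
            results.append((statement, f"{type(exc).__name__}: {exc}"))
    return results


# Statements taken from the validation step in this topic.
VALIDATION_STATEMENTS = [
    "SHOW databases",
    "create table testcml (abc integer)",
    "insert into table testcml select t.* from (select 1) t",
    "select * from testcml",
]
```

For example, `run_sql_checks(spark, VALIDATION_STATEMENTS)` returns one pair per statement. Any pair whose second element is not `None` identifies the statement to raise with your Administrator.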

Run the following direct HDFS commands to test the connection.

# Run sample HDFS commands
# Requires the testdata.txt file, containing sample data, in the project home directory (/home/cdsw)
!hdfs dfs -mkdir /tmp/testcml/
!hdfs dfs -copyFromLocal /home/cdsw/testdata.txt /tmp/testcml/
!hdfs dfs -cat /tmp/testcml/testdata.txt
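
The `!` prefix only works in an interactive session. In a plain Python script, the same checks could be run through `subprocess`, as in the sketch below. It assumes the `hdfs` client is on the PATH. The helper names `hdfs_commands` and `run_commands` are illustrative. Unlike the commands above, the sketch adds the `-p` flag to `-mkdir` and the `-f` flag to `-copyFromLocal` so that the check can be rerun when the directory and file already exist.

```python
import subprocess


def hdfs_commands(local_file="/home/cdsw/testdata.txt", hdfs_dir="/tmp/testcml"):
    """Build the three HDFS test commands as argument lists."""
    hdfs_dir = hdfs_dir.rstrip("/")
    file_name = local_file.rsplit("/", 1)[-1]
    return [
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir],
        ["hdfs", "dfs", "-copyFromLocal", "-f", local_file, hdfs_dir + "/"],
        ["hdfs", "dfs", "-cat", f"{hdfs_dir}/{file_name}"],
    ]


def run_commands(commands):
    """Run each command; return (command, return code, stderr) for each."""
    results = []
    for command in commands:
        completed = subprocess.run(command, capture_output=True, text=True)
        results.append((command, completed.returncode, completed.stderr.strip()))
    return results
```

For example, `run_commands(hdfs_commands())` returns one tuple per command. A non-zero return code marks the step that failed, and the captured error text usually shows whether the cause is authentication or permissions.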

If any of these commands return errors, ask your Administrator to confirm that your user ID is set up in the Hadoop Authentication settings for access to the Cloudera Base on premises cluster, and that the correct Ranger permissions have been applied.