Test your connectivity to the CDP-DC cluster

Test that you can create a Project in your ML Workspace and access data that is stored in the data center cluster.

  1. Create a new Project, using the PySpark template.
  2. Create a new file called testdata.txt (use this exact filename).
  3. Add 2-3 lines of any text in the file to serve as sample data.
  4. Run the following Spark commands to test the connection.
    from pyspark.sql import SparkSession
                            
    # Instantiate Spark-on-K8s Cluster
    spark = SparkSession\
    .builder\
    .appName("Simple Spark Test")\
    .config("spark.executor.memory","8g")\
    .config("spark.executor.cores","2")\
    .config("spark.driver.memory","2g")\
    .config("spark.executor.instances","2")\
    .getOrCreate()
    
    # Validate Spark Connectivity
    spark.sql("SHOW databases").show()
    spark.sql('create table testcml (abc integer)').show()
    spark.sql('insert into table testcml  select t.* from (select 1) t').show()
    spark.sql('select * from testcml').show()
    
    # Stop Spark Session
    spark.stop()
  5. Run the following direct HDFS commands to test the connection.
    # Run sample HDFS commands
    # Requires an additional testdata.txt file to be created with sample data in project home dir
    !hdfs dfs -mkdir /tmp/testcml/
    !hdfs dfs -copyFromLocal /home/cdsw/testdata.txt /tmp/testcml/
    !hdfs dfs -cat /tmp/testcml/testdata.txt
If you get errors, then check with your Admin to make sure that your user ID is set up in the Hadoop Authentication settings to access the CDP-DC cluster, and that the correct Ranger permissions have been applied.