Accessing Ozone from Spark
In CML, you can connect Spark to the Ozone object store with a script. The following example demonstrates how to do this.
This Scala script counts how many times each word occurs in a text file. The key point in
this example is to use an o3fs:// URI such as the following to refer to the text file:
o3fs://hivetest.s3v.o3service1/spark/jedi_wisdom.txt
Word counting example in Scala
import sys.process._

// Put the input file into Ozone
//"hdfs dfs -put data/jedi_wisdom.txt o3fs://hivetest.s3v.o3service1/spark" !

// Set the following Spark setting in the file "spark-defaults.conf" on the CML session using the terminal
//spark.yarn.access.hadoopFileSystems=o3fs://hivetest.s3v.o3service1.neptunee01.olympus.cloudera.com:9862

// lower bound on the word count
val threshold = 2

// this file must already exist in Ozone (see the put command above);
// add a local version by dropping into the terminal.
val tokenized = sc.textFile("o3fs://hivetest.s3v.o3service1/spark/jedi_wisdom.txt").flatMap(_.split(" "))

// count the occurrences of each word
val wordCounts = tokenized.map((_, 1)).reduceByKey(_ + _)

// filter out words with fewer than threshold occurrences
val filtered = wordCounts.filter(_._2 >= threshold)

// collect the results to the driver and print them as a comma-separated list
println(filtered.collect().mkString(","))
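To check the counting logic before pointing the script at Ozone, it can help to run the same steps on a small in-memory collection. The following is a minimal sketch under stated assumptions, not part of the CML example: the names WordCountSketch and countWords and the sample sentences are hypothetical, and plain Scala collections with groupBy stand in for the RDD and reduceByKey, so no Spark session or Ozone connection is needed.

```scala
// Local sketch of the word-count logic using plain Scala collections.
// WordCountSketch and countWords are hypothetical names chosen for illustration.
object WordCountSketch {

  // Split each line on single spaces, count every word, and keep only
  // the words that occur at least `threshold` times.
  def countWords(lines: Seq[String], threshold: Int): Map[String, Int] = {
    // same tokenization as the Spark script: split on a single space
    val tokenized = lines.flatMap(_.split(" "))

    // groupBy stands in for map((_, 1)).reduceByKey(_ + _) on an RDD
    val wordCounts = tokenized
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }

    // filter out words with fewer than threshold occurrences
    wordCounts.filter { case (_, count) => count >= threshold }
  }

  def main(args: Array[String]): Unit = {
    // hypothetical sample input standing in for the text file
    val sample = Seq("do or do not", "there is no try")
    println(countWords(sample, threshold = 2).mkString(",")) // prints (do,2)
  }
}
```

Once the local result looks right, the same tokenize, count, and filter steps apply unchanged to the RDD returned by sc.textFile with the o3fs:// path.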