Connecting to Ozone filesystem
In Cloudera Machine Learning (CML), you can connect Spark to the Ozone object store with a script.
The script, in Scala, counts the number of word occurrences in a text file. The key point in
this example is to use the following string to refer to the text file:
ofs://omservice1/s3v/hivetest/spark/jedi_wisdom.txt
Word counting example in Scala
import sys.process._
// Put the input file into Ozone
//"hdfs dfs -put data/jedi_wisdom.txt ofs://omservice1/s3v/hivetest/spark" !
// Set the following spark setting in the file "spark-defaults.conf" on the CML session using terminal
//spark.yarn.access.hadoopFileSystems=ofs://omservice1/s3v/hivetest
//count lower bound
val threshold = 2
// this file must already exist in hdfs, add a
// local version by dropping into the terminal.
val tokenized = sc.textFile("ofs://omservice1/s3v/hivetest/spark/jedi_wisdom.txt").flatMap(_.split(" "))
// count the occurrence of each word
val wordCounts = tokenized.map((_ , 1)).reduceByKey(_ + _)
// filter out words with fewer than threshold occurrences
val filtered = wordCounts.filter(_._2 >= threshold)
System.out.println(filtered.collect().mkString(","))