Accessing Ozone from Spark

In CML, you can connect Spark to the Ozone object store with a script. The following example demonstrates how to do this.

This Scala script counts how often each word occurs in a text file. The key point in this example is to refer to the text file by its Ozone path: ofs://omservice1/s3v/hivetest/spark/jedi_wisdom.txt
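
An ofs:// path generally follows the layout ofs://<OM service ID>/<volume>/<bucket>/<key>. As a rough sketch of how the path above breaks down, the snippet below splits it with java.net.URI. The names OzonePath and parseOfs are hypothetical helpers made up for this illustration; they are not part of Spark or Ozone. The snippet is written for a Scala REPL or notebook-style session.

```scala
import java.net.URI

// Hypothetical helper type for illustration only; not an Ozone or Spark API.
final case class OzonePath(omService: String, volume: String, bucket: String, key: String)

// Splits an ofs:// URI into its parts, assuming the layout
// ofs://<OM service ID>/<volume>/<bucket>/<key>.
def parseOfs(uri: String): OzonePath = {
  val u = new URI(uri)
  require(u.getScheme == "ofs", s"not an ofs URI: $uri")
  // The path looks like /volume/bucket/key; drop the leading slash
  // and split into at most three parts so the key keeps its slashes.
  val parts = u.getPath.stripPrefix("/").split("/", 3)
  require(parts.length == 3, s"expected volume, bucket and key in: $uri")
  OzonePath(u.getHost, parts(0), parts(1), parts(2))
}

val ozonePath = parseOfs("ofs://omservice1/s3v/hivetest/spark/jedi_wisdom.txt")
println(ozonePath)
```

Read this way, omservice1 is the Ozone Manager service ID, s3v is the volume, hivetest is the bucket, and spark/jedi_wisdom.txt is the key of the file inside that bucket.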

Word counting example in Scala

   import sys.process._
   
   // Put the input file into Ozone
   //"hdfs dfs -put data/jedi_wisdom.txt ofs://omservice1/s3v/hivetest/spark" !
   
   // Using the terminal in the CML session, add the following Spark setting to the file "spark-defaults.conf"
   //spark.yarn.access.hadoopFileSystems=ofs://omservice1/s3v/hivetest
   
   // Minimum number of occurrences a word needs in order to be reported
   val threshold = 2
   // This file must already exist in Ozone. To add it, upload a local copy
   // from the terminal with the "hdfs dfs -put" command shown above.
   val tokenized = sc.textFile("ofs://omservice1/s3v/hivetest/spark/jedi_wisdom.txt").flatMap(_.split(" "))
   // count the occurrence of each word
   val wordCounts = tokenized.map((_ , 1)).reduceByKey(_ + _)
   // filter out words with fewer than threshold occurrences
   val filtered = wordCounts.filter(_._2 >= threshold)
   // collect the results on the driver and print them as a comma-separated list
   println(filtered.collect().mkString(","))
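
To sanity-check the counting logic without a cluster, the same steps can be sketched on an ordinary Scala collection. This is only an illustration under stated assumptions: the sample lines are stand-ins (the content of jedi_wisdom.txt is not shown here), the names sampleLines, minCount, localTokens, localCounts, and localFiltered are made up for this sketch, and groupBy replaces Spark's reduceByKey, which plain collections do not have.

```scala
// Stand-in input for illustration only; not the actual content of jedi_wisdom.txt.
val sampleLines = Seq("do or do not", "there is no try")
// Same role as the threshold in the Spark script.
val minCount = 2

// Split every line into words, as flatMap(_.split(" ")) does in the Spark job.
val localTokens = sampleLines.flatMap(_.split(" "))
// Count the occurrences of each word; groupBy stands in for reduceByKey.
val localCounts = localTokens.groupBy(identity).map { case (word, ws) => (word, ws.size) }
// Keep only words that occur at least minCount times.
val localFiltered = localCounts.filter(_._2 >= minCount)

println(localFiltered.toSeq.mkString(","))
```

With this sample input only the word "do" reaches the threshold. Once the local result looks right, the same pipeline can be pointed at the ofs:// path through sc.textFile as in the script above.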