Accessing Ozone from Spark

In CML, you can connect Spark to the Ozone object store with a script. The following example demonstrates how to do this.

The Scala script below counts how many times each word occurs in a text file. The key point is that the file is referenced by its Ozone URI: o3fs://hivetest.s3v.o3service1/spark/jedi_wisdom.txt
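
As a rough illustration of how that URI is put together, the sketch below assembles it from its parts, assuming the usual o3fs layout of bucket, volume, and Ozone Manager service ID followed by the key path. The O3fsPath case class is a hypothetical helper written for this example; it is not part of Spark or Ozone.

```scala
// Hypothetical helper (not a Spark or Ozone API): builds an o3fs URI
// in the form o3fs://<bucket>.<volume>.<om-service-id>/<key>.
final case class O3fsPath(bucket: String, volume: String, omService: String, key: String) {
  // Drop a leading slash from the key so the URI has exactly one separator.
  def uri: String = s"o3fs://$bucket.$volume.$omService/${key.stripPrefix("/")}"
}

// Values taken from the URI used in this example.
val input = O3fsPath("hivetest", "s3v", "o3service1", "spark/jedi_wisdom.txt")
println(input.uri) // prints o3fs://hivetest.s3v.o3service1/spark/jedi_wisdom.txt
```

Here hivetest is read as the bucket, s3v as the volume, and o3service1 as the service ID; adjust the three values to match your own cluster.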

Word counting example in Scala

// Needed only if you run the shell command below from the script
import sys.process._

// Put the input file into Ozone (uncomment to run)
// "hdfs dfs -put data/jedi_wisdom.txt o3fs://hivetest.s3v.o3service1/spark".!

// Set the following Spark property in the file "spark-defaults.conf", using a terminal in the CML session
// spark.yarn.access.hadoopFileSystems=o3fs://hivetest.s3v.o3service1.neptunee01.olympus.cloudera.com:9862

// Minimum number of occurrences a word needs to be reported
val threshold = 2
// The input file must already exist in Ozone. If it does not,
// upload it from the terminal with the "hdfs dfs -put" command above.
val tokenized = sc.textFile("o3fs://hivetest.s3v.o3service1/spark/jedi_wisdom.txt").flatMap(_.split(" "))
// count the occurrence of each word
val wordCounts = tokenized.map((_ , 1)).reduceByKey(_ + _)
// filter out words with fewer than threshold occurrences
val filtered = wordCounts.filter(_._2 >= threshold)
// Collect the results on the driver and print them as a comma-separated list
println(filtered.collect().mkString(","))
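
To check the counting logic without a Spark session or access to Ozone, the same tokenize, count, and filter steps can be sketched on a plain Scala collection. The countWords function and the sample lines are assumptions made for this illustration; they are not part of the script above, and groupBy stands in for Spark's reduceByKey.

```scala
// Illustrative local equivalent of the Spark pipeline (no Spark or Ozone required).
def countWords(lines: Seq[String], threshold: Int): Map[String, Int] =
  lines
    .flatMap(_.split(" "))                      // tokenize on single spaces, as in the script
    .groupBy(identity)                          // group identical words together
    .map { case (word, hits) => word -> hits.size } // count the occurrences of each word
    .filter { case (_, count) => count >= threshold } // keep words that meet the threshold

// Made-up sample lines standing in for the contents of the input file.
val sampleLines = Seq("do or do not", "there is no try", "do not give in")
val localCounts = countWords(sampleLines, 2)
println(localCounts.mkString(","))
```

With these sample lines, only "do" (3 occurrences) and "not" (2 occurrences) meet the threshold of 2.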