Accessing Azure Storage account container from spark-shell

You can use spark-shell to query the files that are stored in the Azure Storage account. You should be able to access spark as an hdfs user using the ~ sudo -u hdfs -s command.

~ spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/05/20 16:35:32 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
Spark context available as 'sc' (master = yarn, app id = application_1589987399184_0004).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version

Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_232)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val sampleRDD = sc.textFile("abfs://")
sampleRDD: org.apache.spark.rdd.RDD[String] = abfs:// MapPartitionsRDD[1] at textFile at <console>:24

scala> sampleRDD.collect().foreach(println)
testing the hadoop fs commands on azure.

Hence, after accessing the azure container, the data in sampleRDD would work like any other text file. Now, you can use any spark operation on these flies. For example,

  • Find the word count of a particular word:
    scala> sampleRDD.filter(line => line.contains("azure")).count()
    res4: Long = 1
  • Use wordcount by MapReduce, as popularized by Hadoop. Spark can implement MapReduce flows easily:
    scala> val wordcount = sampleRDD.flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_+_)
    wordcount: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[7] at reduceByKey at <console>:26
    scala> wordcount.collect()
    res7: Array[(String, Int)] = Array((commands,1), (testing,1), (hadoop,1), (on,1), (fs,1), (azure.,1), (the,1))