Accessing Azure Storage account container from spark-shell
You can use spark-shell to query files that are stored in an Azure Storage account. You should be able to run Spark as the hdfs user; switch to that user first with the sudo -u hdfs -s command.
    ~ spark-shell
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    20/05/20 16:35:32 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
    Spark context available as 'sc' (master = yarn, app id = application_1589987399184_0004).
    Spark session available as 'spark'.
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 188.8.131.52.1.1.0-413
          /_/

    Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_232)
    Type in expressions to have them evaluated.
    Type :help for more information.

    scala> val sampleRDD = sc.textFile("abfs://firstname.lastname@example.org/myDir/testingFile.txt")
    sampleRDD: org.apache.spark.rdd.RDD[String] = abfs://email@example.com/myDir/testingFile.txt MapPartitionsRDD at textFile at <console>:24

    scala> sampleRDD.collect().foreach(println)
    testing the hadoop fs commands on azure.
Once the Azure container is accessible, the data in sampleRDD behaves like data loaded from any other text file. You can now apply any Spark operation to these files. For example:
- Count the lines that contain a particular word:
    scala> sampleRDD.filter(line => line.contains("azure")).count()
    res4: Long = 1
- Run a word count in the MapReduce style popularized by Hadoop. Spark can implement MapReduce flows:
    scala> val wordcount = sampleRDD.flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_+_)
    wordcount: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD at reduceByKey at <console>:26

    scala> wordcount.collect()
    res7: Array[(String, Int)] = Array((commands,1), (testing,1), (hadoop,1), (on,1), (fs,1), (azure.,1), (the,1))
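In the word count output, the token azure. keeps its trailing period because the lines are split on single spaces only. If punctuation and case should not produce separate keys, one option is to normalize the tokens before counting. The following is a minimal sketch, not part of the original procedure. It assumes it is pasted into the same spark-shell session, where sc is already defined. The names demoRDD, tokens, and normalizedCounts are hypothetical, and demoRDD is an in-memory stand-in for sampleRDD so that the sketch does not depend on the storage account.

```scala
// Assumes a running spark-shell session, where `sc` is already defined.
// demoRDD is a hypothetical in-memory stand-in for sampleRDD.
val demoRDD = sc.parallelize(Seq("testing the hadoop fs commands on azure."))

// Lowercase each line and split on runs of non-word characters,
// so "azure." and "Azure" would both be counted under "azure".
// The filter drops empty tokens produced by leading punctuation.
val tokens = demoRDD.flatMap(line => line.toLowerCase.split("\\W+")).filter(_.nonEmpty)

// Same map/reduceByKey flow as the word count above,
// sorted by key so the printed order is deterministic.
val normalizedCounts = tokens.map(word => (word, 1)).reduceByKey(_ + _).sortByKey()

normalizedCounts.collect().foreach(println)
// (azure,1)
// (commands,1)
// (fs,1)
// (hadoop,1)
// (on,1)
// (testing,1)
// (the,1)
```

To apply the same steps to the file in the container, replace demoRDD with sampleRDD. Splitting on \W+ is a simple choice that also breaks hyphenated words and contractions apart, so a different pattern may suit other data better.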