Running your first Spark application
The simplest way to run a Spark application is by using the Scala or Python shells.
- To start one of the shell applications, run one of the
following commands:
- Scala:
$ /bin/spark-shell
...
Spark context Web UI available at ...
Spark context available as 'sc' (master = yarn, app id = ...).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version ...
      /_/

Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
- Python:
$ /bin/pyspark
Python 2.7.5 (default, Sep 12 2018, 05:31:16)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version ...
      /_/

Using Python version 2.7.5 (default, Sep 12 2018 05:31:16)
SparkSession available as 'spark'.
>>>
SPARK_HOME defaults to /opt/cloudera/parcels/CDH/lib/spark in parcel installations. In a Cloudera Manager deployment, the shells are also available from /usr/bin.
For a complete list of shell options, run spark-shell or pyspark with the -h flag.
- To run the classic Hadoop word count application, copy an
input file to S3:
hdfs dfs -copyFromLocal input s3a://<bucket_name>/
- Within a shell, run the word count application using the
following code examples, substituting your values for
bucket_name, path/to/input,
and path/to/output:
Scala:
scala> val myfile = sc.textFile("s3a://<bucket_name>/path/to/input")
scala> val counts = myfile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
scala> counts.saveAsTextFile("s3a://<bucket_name>/path/to/output")
Python:
>>> myfile = sc.textFile("s3a://<bucket_name>/path/to/input")
>>> counts = myfile.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda v1,v2: v1 + v2)
>>> counts.saveAsTextFile("s3a://<bucket_name>/path/to/output")
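To see what each stage of the flatMap/map/reduceByKey pipeline above does, the same word count logic can be sketched in plain Python (no Spark required). The sample lines here are hypothetical stand-ins for the contents of path/to/input:

```python
# Sample lines standing in for the input file (hypothetical data).
lines = ["to be or not to be", "to be is to do"]

# flatMap: split each line into words and flatten into a single list.
words = [word for line in lines for word in line.split(" ")]

# map: pair each word with a count of 1.
pairs = [(word, 1) for word in words]

# reduceByKey: sum the counts for each distinct word.
counts = {}
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n

print(counts["to"])  # → 4
```

In Spark the same steps run in parallel across partitions of the input, with reduceByKey shuffling pairs so that all counts for a given word land on the same executor before being summed.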