Running your first Spark application

The simplest way to run a Spark application is by using the Scala or Python shells.

  1. To start one of the shell applications, run one of the following commands:
    • Scala:
      $ SPARK_HOME/bin/spark-shell
      ...
      Spark context Web UI available at ...
      Spark context available as 'sc' (master = yarn, app id = ...).
      Spark session available as 'spark'.
      Welcome to
            ____              __
           / __/__  ___ _____/ /__
          _\ \/ _ \/ _ `/ __/  '_/
         /___/ .__/\_,_/_/ /_/\_\   version ...
            /_/
      
      Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)
      Type in expressions to have them evaluated.
      Type :help for more information.
      
      scala>
    • Python:
      $ SPARK_HOME/bin/pyspark
      Python 2.7.5 (default, Sep 12 2018, 05:31:16) 
      [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
      Type "help", "copyright", "credits" or "license" for more information.
      ...
      Welcome to
            ____              __
           / __/__  ___ _____/ /__
          _\ \/ _ \/ _ `/ __/  '_/
         /___/ .__/\_,_/_/ /_/\_\   version ...
            /_/
      
      Using Python version 2.7.5 (default, Sep 12 2018 05:31:16)
      SparkSession available as 'spark'.
      >>>

    SPARK_HOME defaults to /opt/cloudera/parcels/CDH/lib/spark in parcel installations. In a Cloudera Manager deployment, the shells are also available from /usr/bin.
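
    For example, on a parcel installation you can start the Scala shell by its absolute path (using the default parcel location noted above):
      $ /opt/cloudera/parcels/CDH/lib/spark/bin/spark-shell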

    For a complete list of shell options, run spark-shell or pyspark with the -h flag.
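
    For example, to request specific YARN resources when starting the Scala shell (the option values below are illustrative, not recommendations):
      $ spark-shell --master yarn --num-executors 2 --executor-memory 2g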

  2. To run the classic Hadoop word count application, copy an input file to a storage location that Spark can read. This example copies the file to an S3 bucket through the s3a connector:
    hdfs dfs -copyFromLocal input s3a://<bucket_name>/
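    Assuming the cluster has credentials configured for the bucket, you can verify the copy by listing the bucket contents:
      hdfs dfs -ls s3a://<bucket_name>/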
  3. Within a shell, run the word count application using the following code examples, substituting your own values for <bucket_name>, path/to/input, and path/to/output:
    Scala:
    scala> val myfile = sc.textFile("s3a://<bucket_name>/path/to/input")
    scala> val counts = myfile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    scala> counts.saveAsTextFile("s3a://<bucket_name>/path/to/output")
    Python:
    >>> myfile = sc.textFile("s3a://<bucket_name>/path/to/input")
    >>> counts = myfile.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda v1,v2: v1 + v2)
    >>> counts.saveAsTextFile("s3a://<bucket_name>/path/to/output")
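
    To sanity-check the results from within the shell, print a sample of the computed pairs; this reuses the counts variable defined above (Scala shown):
    scala> counts.take(10).foreach(println)

    The shells also expose a SparkSession as 'spark' (see the startup banner above), so the same word count can be sketched with the DataFrame API; the output path below is illustrative:
    scala> import org.apache.spark.sql.functions.{col, explode, split}
    scala> val words = spark.read.textFile("s3a://<bucket_name>/path/to/input").select(explode(split(col("value"), " ")).as("word"))
    scala> val dfCounts = words.groupBy("word").count()
    scala> dfCounts.write.csv("s3a://<bucket_name>/path/to/output_df")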