Running your first Spark application

The simplest way to run a Spark application is by using the Scala or Python shells.

  1. To start one of the shell applications, run one of the following commands:
    • Scala:
      $ SPARK_HOME/bin/spark-shell
      ...
      Spark context Web UI available at ...
      Spark context available as 'sc' (master = yarn, app id = ...).
      Spark session available as 'spark'.
      Welcome to
            ____              __
           / __/__  ___ _____/ /__
          _\ \/ _ \/ _ `/ __/  '_/
         /___/ .__/\_,_/_/ /_/\_\   version ...
            /_/
      
      Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)
      Type in expressions to have them evaluated.
      Type :help for more information.
      
      scala>
    • Python:
      $ SPARK_HOME/bin/pyspark
      Python 2.7.5 (default, Sep 12 2018, 05:31:16) 
      [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
      Type "help", "copyright", "credits" or "license" for more information.
      ...
      Welcome to
            ____              __
           / __/__  ___ _____/ /__
          _\ \/ _ \/ _ `/ __/  '_/
         /___/ .__/\_,_/_/ /_/\_\   version ...
            /_/
      
      Using Python version 2.7.5 (default, Sep 12 2018 05:31:16)
      SparkSession available as 'spark'.
      >>>

    SPARK_HOME defaults to /opt/cloudera/parcels/CDH/lib/spark in parcel installations. In a Cloudera Manager deployment, the shells are also available from /usr/bin.
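
    For example, on a parcel installation you can start the Scala shell by its absolute path (using the default parcel location noted above):
      $ /opt/cloudera/parcels/CDH/lib/spark/bin/spark-shell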

    For a complete list of shell options, run spark-shell or pyspark with the -h flag.
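
    For example, to request specific YARN resources when starting the Scala shell (the option values below are illustrative, not recommendations):
      $ spark-shell --master yarn --num-executors 2 --executor-memory 2g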

  2. To run the classic Hadoop word count application, copy an input file to a storage location that Spark can read. This example copies the file to an S3 bucket through the s3a connector:
    hdfs dfs -copyFromLocal input s3a://<bucket_name>/
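    Assuming the cluster has credentials configured for the bucket, you can verify the copy by listing the bucket contents:
      hdfs dfs -ls s3a://<bucket_name>/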
  3. Within a shell, run the word count application using the following code examples, substituting your own values for <bucket_name>, path/to/input, and path/to/output:
    Scala:
    scala> val myfile = sc.textFile("s3a://<bucket_name>/path/to/input")
    scala> val counts = myfile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    scala> counts.saveAsTextFile("s3a://<bucket_name>/path/to/output")
    Python:
    >>> myfile = sc.textFile("s3a://<bucket_name>/path/to/input")
    >>> counts = myfile.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda v1,v2: v1 + v2)
    >>> counts.saveAsTextFile("s3a://<bucket_name>/path/to/output")
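
    To sanity-check the results from within the shell, print a sample of the computed pairs; this reuses the counts variable defined above (Scala shown):
    scala> counts.take(10).foreach(println)

    The shells also expose a SparkSession as 'spark' (see the startup banner above), so the same word count can be sketched with the DataFrame API; the output path below is illustrative:
    scala> import org.apache.spark.sql.functions.{col, explode, split}
    scala> val words = spark.read.textFile("s3a://<bucket_name>/path/to/input").select(explode(split(col("value"), " ")).as("word"))
    scala> val dfCounts = words.groupBy("word").count()
    scala> dfCounts.write.csv("s3a://<bucket_name>/path/to/output_df")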