Introduction
You can run Spark applications locally or distributed across a cluster, either by using an interactive shell or by submitting an application. Running Spark applications interactively is commonly performed during the data-exploration phase and for ad hoc analysis.
You can:
- Submit interactive statements through the Scala, Python, or R shell, or through a high-level notebook such as Zeppelin.
- Use APIs to create a Spark application that runs interactively or in batch mode, using Scala, Python, R, or Java.
Because of a limitation in the way Scala compiles code, some applications with nested definitions running in an interactive shell may encounter a Task not serializable exception. Cloudera recommends submitting these applications.
To run applications distributed across a cluster, Spark requires a cluster manager. In CDP, Cloudera supports only the YARN cluster manager. When run on YARN, Spark application processes are managed by the YARN ResourceManager and NodeManager roles. Spark Standalone is not supported.
To launch Spark applications on a cluster, you can use the spark-submit script in the /bin directory on a gateway host. You can also use the API interactively by launching an interactive shell for Scala (spark-shell), Python (pyspark), or SparkR. Note that each interactive shell automatically creates SparkContext in a variable called sc, and SparkSession in a variable called spark. For more information about spark-submit, see the Apache Spark documentation Submitting Applications.
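As a rough illustration of a spark-submit invocation targeting YARN, the following Python sketch assembles the command line without running it. The helper name build_spark_submit, the application file wordcount.py, its input path, and the configuration value are hypothetical and chosen only for illustration. The --master, --deploy-mode, and --conf options are standard spark-submit options.

```python
import shlex


def build_spark_submit(app, app_args=(), deploy_mode="cluster",
                       conf=None, spark_submit="spark-submit"):
    """Return the argument list for a spark-submit call that targets YARN.

    This is a hypothetical helper for illustration; it only builds the
    command and does not execute it.
    """
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")

    # YARN is the only cluster manager the text describes as supported.
    argv = [spark_submit, "--master", "yarn", "--deploy-mode", deploy_mode]

    # Each Spark property is passed as a separate --conf key=value pair.
    for key, value in sorted((conf or {}).items()):
        argv += ["--conf", f"{key}={value}"]

    # The application file comes after all options, followed by its arguments.
    argv.append(app)
    argv.extend(app_args)
    return argv


if __name__ == "__main__":
    # Hypothetical application and input path, for illustration only.
    cmd = build_spark_submit(
        "wordcount.py",
        ["/tmp/input.txt"],
        conf={"spark.executor.memory": "2g"},
    )
    print(shlex.join(cmd))
```

The resulting list could be passed to subprocess.run on a gateway host. Keeping the arguments as a list rather than a single string avoids shell quoting problems.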
Alternatively, you can use Livy to submit and manage Spark applications on a cluster. Livy is a Spark service that allows local and remote applications to interact with Apache Spark over an open-source REST interface. Livy offers additional multi-tenancy and security functionality.
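To give a sense of what submitting through the REST interface might look like, the following Python sketch prepares a batch submission request for Livy's /batches endpoint without sending it. The host name, port, and JAR path are assumptions for illustration, and build_livy_batch_request is a hypothetical helper. Authentication, which a secured cluster would likely require, is omitted.

```python
import json
import urllib.request


def build_livy_batch_request(livy_url, file, class_name=None, args=()):
    """Prepare (but do not send) a POST request that submits a Spark batch.

    Hypothetical helper for illustration. The payload uses the 'file',
    'className', and 'args' fields of Livy's batch submission request.
    """
    payload = {"file": file}
    if class_name:
        payload["className"] = class_name
    if args:
        payload["args"] = list(args)

    return urllib.request.Request(
        url=livy_url.rstrip("/") + "/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Assumed Livy endpoint and application JAR path; adjust for a real cluster.
    request = build_livy_batch_request(
        "http://livy-host.example.com:8998",
        file="/user/example/spark-examples.jar",
        class_name="org.apache.spark.examples.SparkPi",
        args=["10"],
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

Passing the prepared request to urllib.request.urlopen would send it. Building the request separately makes the payload easy to inspect before anything reaches the cluster.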