Building reusable modules in Apache Spark applications
Using existing Scala and Java classes inside the Spark shell requires an effective deployment procedure and dependency management. For simple and reliable reuse of Scala and Java classes and of complete third-party libraries, you can use a module: a self-contained artifact created by Maven that can be shared by multiple users. This topic shows how to use Maven to create a module that bundles all of its dependencies.
Create a Maven Project
- Use Maven to generate the project directory:

      $ mvn archetype:generate -DgroupId=com.mycompany -DartifactId=mylibrary \
          -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
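  For orientation, the maven-archetype-quickstart archetype typically generates a project skeleton along these lines (the exact files depend on the archetype version, so treat this as a sketch rather than a guaranteed layout):

      mylibrary/
          pom.xml
          src/main/java/com/mycompany/App.java
          src/test/java/com/mycompany/AppTest.java

  The pom.xml file is where the dependencies and plug-ins described in the following sections are added.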
Download and Deploy Third-Party Libraries
1. Prepare a location for all third-party libraries that are not available through Maven Central but are required for the project:

       mkdir libs
       cd libs
2. Download the required artifacts.
3. Use Maven to deploy the library JAR.
4. Add the library to the dependencies section of the POM file.
5. Repeat steps 2-4 for each library. For example, to add the JIDT library:

   - Download and decompress the zip file:

         curl http://lizier.me/joseph/software/jidt/download.php?file=infodynamics-dist-1.3.zip > infodynamics-dist-1.3.zip
         unzip infodynamics-dist-1.3.zip

   - Deploy the library JAR (a quick way to confirm the deployment is shown after this list):

         $ mvn deploy:deploy-file \
             -Durl=file:///$HOME/.m2/repository \
             -Dfile=libs/infodynamics.jar \
             -DgroupId=org.jlizier.infodynamics \
             -DartifactId=infodynamics \
             -Dpackaging=jar \
             -Dversion=1.3

   - Add the library to the dependencies section of the POM file:

         <dependency>
           <groupId>org.jlizier.infodynamics</groupId>
           <artifactId>infodynamics</artifactId>
           <version>1.3</version>
         </dependency>
6. Add the Maven assembly plug-in to the plugins section of the pom.xml file (an example configuration is shown after this list).
7. Package the library JARs in a module:

       mvn clean package
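To confirm that the deploy-file step in the JIDT example worked, you can list the artifact in the local repository. The path below follows Maven's standard repository layout for the groupId and version used above; adjust it if you deployed to a different -Durl:

    ls ~/.m2/repository/org/jlizier/infodynamics/infodynamics/1.3/
    # Expect to see infodynamics-1.3.jar (and a generated .pom) in this directory.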
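As a reference for step 6, a minimal assembly plug-in configuration might look like the following sketch. The jar-with-dependencies descriptor is what produces the mylibrary-1.0-SNAPSHOT-jar-with-dependencies.jar file used in the next section; the plug-in version shown here is an assumption, so adjust it to match your build.

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-assembly-plugin</artifactId>
      <!-- Version is an assumption; use the version appropriate for your build. -->
      <version>3.3.0</version>
      <configuration>
        <descriptorRefs>
          <!-- Bundle the project classes together with all dependency JARs. -->
          <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
      </configuration>
      <executions>
        <execution>
          <id>make-assembly</id>
          <!-- Bind to the package phase so that "mvn clean package" builds the bundle. -->
          <phase>package</phase>
          <goals>
            <goal>single</goal>
          </goals>
        </execution>
      </executions>
    </plugin>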
Run and Test the Spark Module
1. Run the Spark shell, providing the module JAR in the --jars option:

       spark-shell --jars target/mylibrary-1.0-SNAPSHOT-jar-with-dependencies.jar
2. In the Environment tab of the Spark Web UI (http://driver_host:4040/environment/), validate that the spark.jars property contains the library. You can also check the property from the shell itself, as shown after this list.
3. In the Spark shell, test that you can import some of the required Java classes from the third-party library. For example, if you use the JIDT library, import MatrixUtils:

       $ spark-shell
       ...
       scala> import infodynamics.utils.MatrixUtils;