Building reusable modules in Apache Spark applications

Using existing Scala and Java classes inside the Spark shell requires an effective deployment procedure and dependency management. For simple and reliable reuse of Scala and Java classes and complete third-party libraries, you can use a module, which is a self-contained artifact created by Maven. This module can be shared by multiple users. This topic shows how to use Maven to create a module containing all dependencies.

Create a Maven Project

  1. Use Maven to generate the project directory:
    $ mvn archetype:generate -DgroupId=com.mycompany -DartifactId=mylibrary \
    -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false

Download and Deploy Third-Party Libraries

  1. Prepare a location for all third-party libraries that are not available through Maven Central but are required for the project:
    mkdir libs
    cd libs
  2. Download the required artifacts.
  3. Use Maven to deploy the library JAR.
  4. Add the library to the dependencies section of the POM file.
  5. Repeat steps 2-4 for each library. For example, to add the JIDT library:
    1. Download and decompress the zip file:
      curl http://lizier.me/joseph/software/jidt/download.php?file=infodynamics-dist-1.3.zip > infodynamics-dist-1.3.zip
      unzip infodynamics-dist-1.3.zip
    2. Deploy the library JAR:
      $ mvn deploy:deploy-file \
      -Durl=file:///$HOME/.m2/repository -Dfile=libs/infodynamics.jar \
      -DgroupId=org.jlizier.infodynamics -DartifactId=infodynamics -Dpackaging=jar -Dversion=1.3
    3. Add the library to the dependencies section of the POM file:
      <dependency>
        <groupId>org.jlizier.infodynamics</groupId>
        <artifactId>infodynamics</artifactId>
        <version>1.3</version>
      </dependency>
      
  6. Add the Maven assembly plug-in to the plugins section in the pom.xml file.
  7. Package the library JARs in a module:
    mvn clean package

Run and Test the Spark Module

  1. Run the Spark shell, providing the module JAR in the --jars option:
    spark-shell --jars target/mylibrary-1.0-SNAPSHOT-jar-with-dependencies.jar
  2. In the Environment tab of the Spark Web UI application (http://driver_host:4040/environment/), validate that the spark.jars property contains the library. For example:

  3. In the Spark shell, test that you can import some of the required Java classes from the third-party library. For example, if you use the JIDT library, import MatrixUtils:
    $ spark-shell
    ...
    scala> import infodynamics.utils.MatrixUtils;