Compiling and running a Java-based job

This example shows how to compile a Java-based Spark job using Maven and run it with spark-submit.

In this task, you compile the following example Spark program written in Java:
/* SimpleApp.java */
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.Dataset;

public class SimpleApp {
  public static void main(String[] args) {
    String logFile = "YOUR_SPARK_HOME/README.md"; // Should be some file on your system
    SparkSession spark = SparkSession.builder().appName("Simple Application").getOrCreate();
    Dataset<String> logData = spark.read().textFile(logFile).cache();

    // The FilterFunction cast selects the Java overload of Dataset.filter,
    // which a bare lambda can leave ambiguous with Scala 2.12 builds.
    long numAs = logData.filter((FilterFunction<String>) s -> s.contains("a")).count();
    long numBs = logData.filter((FilterFunction<String>) s -> s.contains("b")).count();

    System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);

    spark.stop();
  }
}
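
The counting logic of the program can be checked locally before a cluster is available. The following is a minimal sketch in plain Java, not part of the Spark job: the class name LineCountCheck, the helper countContaining, and the sample lines are hypothetical, and java.util.stream stands in for the Spark Dataset.

```java
import java.util.Arrays;
import java.util.List;

/* LineCountCheck.java: hypothetical local check, not part of the Spark job */
public class LineCountCheck {
  // Counts the lines that contain the given substring, mirroring the
  // filter(...).count() calls in SimpleApp without requiring Spark.
  public static long countContaining(List<String> lines, String needle) {
    return lines.stream().filter(s -> s.contains(needle)).count();
  }

  public static void main(String[] args) {
    // Sample input assumed for illustration; SimpleApp reads README.md instead.
    List<String> lines = Arrays.asList("apache spark", "big data", "hello");
    long numAs = countContaining(lines, "a");
    long numBs = countContaining(lines, "b");
    // Prints: Lines with a: 2, lines with b: 1
    System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
  }
}
```

This check needs only a JDK, so it can be compiled with javac and run with java, without Maven or Spark.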

You also need to create a Maven Project Object Model (POM) file, as shown in the following example:

<project>
  <groupId>edu.berkeley</groupId>
  <artifactId>simple-project</artifactId>
  <modelVersion>4.0.0</modelVersion>
  <name>Simple Project</name>
  <packaging>jar</packaging>
  <version>1.0</version>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
  </properties>
  <dependencies>
    <dependency> <!-- Spark dependency -->
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.12</artifactId>
      <version>2.4.0</version>
      <scope>provided</scope>
    </dependency>
  </dependencies>
</project>

Before you begin:
  • Install Apache Spark 2.4.x, JDK 8.x, and Maven.
  • Write the Java Spark program in a .java file, such as SimpleApp.java.
  • Write a pom.xml file. This file defines the build settings and dependencies for your Java code.
  • If the cluster is Kerberized, ensure the required security token is authorized to compile and execute the workload.
  1. Lay out these files according to the canonical Maven directory structure.
    For example:
    $ find .
    ./pom.xml
    ./src
    ./src/main
    ./src/main/java
    ./src/main/java/SimpleApp.java
  2. Package the application using the mvn package command.
    For example:
    # Package a JAR containing your application
    $ mvn package
    ...
    [INFO] Building jar: {..}/{..}/target/simple-project-1.0.jar
    After compilation, Maven creates a new directory named target that contains the build output. Among these files is the JAR file that you use to run the code. In this example, the file is named simple-project-1.0.jar.
  3. Execute and test the workload JAR using the spark-submit command.
    For example:
    # Use spark-submit to run your application
    $ spark-submit \
    --class "SimpleApp" \
    --master yarn \
    target/simple-project-1.0.jar
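
The JAR name and the submit command in the steps above follow from the POM coordinates: by default, Maven names the packaged file artifactId-version.jar and places it under target. The following is a minimal sketch of that relationship in Java. The class SubmitCommand and its methods are hypothetical helpers for illustration, not part of Maven or Spark, and the values are the ones from the example POM.

```java
import java.util.Arrays;
import java.util.List;

/* SubmitCommand.java: hypothetical helper, not part of Maven or Spark */
public class SubmitCommand {
  // Maven's default JAR name is <artifactId>-<version>.jar in the target directory.
  public static String jarPath(String artifactId, String version) {
    return "target/" + artifactId + "-" + version + ".jar";
  }

  // Assembles the spark-submit arguments used in step 3 as a list of tokens.
  public static List<String> build(String mainClass, String master, String jar) {
    return Arrays.asList("spark-submit", "--class", mainClass, "--master", master, jar);
  }

  public static void main(String[] args) {
    String jar = jarPath("simple-project", "1.0");
    // Prints: spark-submit --class SimpleApp --master yarn target/simple-project-1.0.jar
    System.out.println(String.join(" ", build("SimpleApp", "yarn", jar)));
  }
}
```

If you change the artifactId or version in pom.xml, the JAR path passed to spark-submit changes with it. The token list could also be handed to java.lang.ProcessBuilder to launch the job from Java, which is left out here because it requires a Spark installation.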