Run the spark-submit job

After you create an indexer.jar file, you need to run a spark-submit job on a Solr worker node to index your input file.

Before you begin:
  • You have prepared the indexer.jar file and it is available on your local machine.
  • A DDE Data Hub cluster is up and running.
  • You have sufficient rights to SSH into one of the cluster nodes.
  • Your user has a role assigned that provides 'write' rights on S3.
  • You have retrieved the keytab for your environment.
  1. SSH to one of the worker nodes in your Data Hub cluster.
  2. From your local machine, copy your keytab file to the working directory on the worker node:
    scp [***KEYTAB***] [***USER***]@[***IP OF WORKER NODE***]:/[***PATH/TO/WORKING/DIRECTORY***]
    For example:
    scp sampleuser.keytab sampleuser@1.1.1.1:/tmp
  3. On the worker node, create a JAAS configuration file (for example, jaas-client.conf) with the following content:
    Client {
      com.sun.security.auth.module.Krb5LoginModule required
      useKeyTab=true
      useTicketCache=false
      doNotPrompt=true
      debug=true
      keyTab="sampleuser.keytab"
      principal="sampleuser@EXAMPLE.COM";
    };
    Replace sampleuser.keytab with the name of your keytab file and sampleuser@EXAMPLE.COM with your user principal.
  4. From your local machine, copy the indexer JAR file to the working directory on the worker node:
    scp [***INDEXER***].jar [***USER***]@[***IP OF WORKER NODE***]:/[***PATH/TO/WORKING/DIRECTORY***]
    For example:
    scp indexer-1.0-SNAPSHOT.jar sampleuser@1.1.1.1:/tmp
  5. From your local machine, copy the input CSV file to the working directory on the worker node:
    scp [***INPUT FILE***] [***USER***]@[***IP OF WORKER NODE***]:/[***PATH/TO/WORKING/DIRECTORY***]
    For example:
    scp nyc_yellow_taxi_sample_1k.csv sampleuser@1.1.1.1:/tmp
  6. Add the input file to HDFS:
    hdfs dfs -put [***INPUT FILE***]
    For example:
    hdfs dfs -put nyc_yellow_taxi_sample_1k.csv
  7. Create a Solr configuration and a collection:
    solrctl config --create [***CONFIG NAME***] [***BASE CONFIG***] -p immutable=false
    solrctl collection --create [***COLLECTION NAME***] -s [***NUMBER OF SHARDS***] -c [***COLLECTION CONFIG NAME***]
    For example:
    solrctl config --create testConfig managedTemplate -p immutable=false
    solrctl collection --create testcollection -s 2 -c testConfig
  8. Submit your Spark job:
    spark-submit --jars /opt/cloudera/parcels/CDH/jars/[***SPARK-SOLR-*-SHADED.JAR***] \
    --files [***KEYTAB***],[***JAAS CONFIGURATION FILE***] --name [***SPARK JOB NAME***] \
    --driver-java-options="-Djavax.net.ssl.trustStoreType=[***TRUSTSTORE TYPE***] \
    -Djavax.net.ssl.trustStore=[***ABSOLUTE/PATH/TO/TRUSTSTORE/FILE***] \
    -Djavax.net.ssl.trustStorePassword=" \
    --class com.lucidworks.spark.SparkApp [***INDEXER.JAR***] csv \
    -zkHost [***ZOOKEEPER ENSEMBLE***] \
    -collection [***TARGET SOLR COLLECTION***] -csvPath [***INPUT CSV FILE***] \
    -solrJaasAuthConfig=[***JAAS CONFIGURATION FILE***]

    Replace

    [***SPARK-SOLR-*-SHADED.JAR***]
    with the name of the spark-solr shaded JAR file under /opt/cloudera/parcels/CDH/jars/
    [***KEYTAB***]
    with the keytab file of your user
    [***JAAS CONFIGURATION FILE***]
    with the JAAS file you created
    [***SPARK JOB NAME***]
    with the name of the job you want to run
    [***TRUSTSTORE TYPE***]
    with the type of the truststore used. If you use the default jks type, you do not need to specify -Djavax.net.ssl.trustStoreType. In every other case it is mandatory.
    [***ABSOLUTE/PATH/TO/TRUSTSTORE/FILE***]
    with the absolute path to the truststore file
    [***INDEXER.JAR***]
    with the indexer.jar file you created
    [***ZOOKEEPER ENSEMBLE***]
    with the address of the ZooKeeper ensemble used by the Solr cluster
    [***TARGET SOLR COLLECTION***]
    with the name of the Solr collection you created
    [***INPUT CSV FILE***]
    with the name of the file that you want to index into the [***TARGET SOLR COLLECTION***]
    For example:
    spark-submit --jars /opt/cloudera/parcels/CDH/jars/spark-solr-3.9.0.spark3.7.2.18.0-33-shaded.jar \
    --files sampleuser.keytab,jaas-client.conf --name spark-solr \
    --driver-java-options="-Djavax.net.ssl.trustStoreType=bcfks \
    -Djavax.net.ssl.trustStore=/var/lib/cloudera-scm-agent/agent-cert/cm-auto-global_truststore.jks \
    -Djavax.net.ssl.trustStorePassword=" \
    --class com.lucidworks.spark.SparkApp indexer-1.0-SNAPSHOT.jar csv \
    -zkHost sampleuser-leader2.sampleuser.work:2181,sampleuser.work:2181,sampleuser-master7.work:2181/solr-dde \
    -collection testcollection -csvPath nyc_yellow_taxi_sample_1k.csv \
    -solrJaasAuthConfig=jaas-client.conf
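
Steps 3 and 8 can also be combined into a small script. The following is a minimal sketch, not part of the product: it writes the JAAS file and assembles the spark-submit command from shell variables, using the sample values from this procedure. The variable names, the file name jaas-client.conf, and the RUN switch are assumptions introduced only for illustration. By default the script only prints the command, so that you can review it before running it on the worker node.

```shell
#!/usr/bin/env bash
# Sketch only: generates the JAAS file (step 3) and builds the spark-submit
# command (step 8). All values are the sample values from this procedure;
# replace them with the values of your own environment.
set -euo pipefail

KEYTAB="sampleuser.keytab"
PRINCIPAL="sampleuser@EXAMPLE.COM"
JAAS_FILE="jaas-client.conf"            # assumed file name
SHADED_JAR="/opt/cloudera/parcels/CDH/jars/spark-solr-3.9.0.spark3.7.2.18.0-33-shaded.jar"
JOB_NAME="spark-solr"
TRUSTSTORE_TYPE="bcfks"
TRUSTSTORE="/var/lib/cloudera-scm-agent/agent-cert/cm-auto-global_truststore.jks"
INDEXER_JAR="indexer-1.0-SNAPSHOT.jar"
ZK_HOST="sampleuser-leader2.sampleuser.work:2181,sampleuser.work:2181,sampleuser-master7.work:2181/solr-dde"
COLLECTION="testcollection"
CSV_FILE="nyc_yellow_taxi_sample_1k.csv"

# Step 3: write the JAAS configuration file into the current directory.
cat > "$JAAS_FILE" <<EOF
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  useTicketCache=false
  doNotPrompt=true
  debug=true
  keyTab="${KEYTAB}"
  principal="${PRINCIPAL}";
};
EOF

# Step 8: collect the JVM options for the driver in a single string.
DRIVER_OPTS="-Djavax.net.ssl.trustStoreType=${TRUSTSTORE_TYPE}"
DRIVER_OPTS+=" -Djavax.net.ssl.trustStore=${TRUSTSTORE}"
DRIVER_OPTS+=" -Djavax.net.ssl.trustStorePassword="

# Build the command as an array so that every argument stays intact.
CMD=(spark-submit
  --jars "$SHADED_JAR"
  --files "${KEYTAB},${JAAS_FILE}"
  --name "$JOB_NAME"
  --driver-java-options="$DRIVER_OPTS"
  --class com.lucidworks.spark.SparkApp "$INDEXER_JAR" csv
  -zkHost "$ZK_HOST"
  -collection "$COLLECTION"
  -csvPath "$CSV_FILE"
  -solrJaasAuthConfig="$JAAS_FILE")

# Dry run by default: print the command instead of executing it.
# RUN=1 is a hypothetical switch of this sketch that submits the job.
if [ "${RUN:-0}" = "1" ]; then
  "${CMD[@]}"
else
  printf '%q ' "${CMD[@]}"
  printf '\n'
fi
```

To try it, save the script in the working directory on the worker node next to the keytab and the indexer JAR file, run it once to review the printed command, and then run it again with RUN=1 to submit the job.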