Run the spark-submit job

Once you have created the indexer.jar file, you need to run a spark-submit job on a Solr worker node to index your input file. Before you start, make sure the following prerequisites are met:

  • You have prepared the indexer.jar file and it is available on your local machine.
  • A DDE Data Hub cluster (Tech Preview) is up and running.
  • You have sufficient rights to SSH into one of the cluster nodes.
  • Your user has been assigned a role that grants write access to S3.
  • You have retrieved the keytab for your environment.
  1. SSH to one of the worker nodes in your Data Hub cluster.
  2. Copy your keytab file from your local machine to the working directory on the worker node:
    scp <keytab> <user>@<IP_OF_WORKER_NODE>:/<PATH/TO/WORKING/DIRECTORY>
    For example:
    scp sampleuser.keytab sampleuser@1.1.1.1:/tmp
  3. Create a JAAS file with the following content:
    Client {
      com.sun.security.auth.module.Krb5LoginModule required
      useKeyTab=true
      useTicketCache=false
      doNotPrompt=true
      debug=true
      keyTab="sampleuser.keytab"
      principal="sampleuser@EXAMPLE.COM";
    };
    Replace sampleuser.keytab and sampleuser@EXAMPLE.COM with your keytab file name and user principal.
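Before moving on, you can optionally confirm that the keytab and principal in the JAAS file actually authenticate. This is a sketch that assumes the example file name and principal; substitute your own values.

```shell
# Obtain a Kerberos ticket using the keytab referenced in the JAAS file.
# A silent exit means the keytab and principal match the KDC.
kinit -kt sampleuser.keytab sampleuser@EXAMPLE.COM

# List the ticket cache; a ticket for sampleuser@EXAMPLE.COM should appear.
klist
```

If kinit fails here, the spark-submit job in the final step will fail with the same credentials, so it is cheaper to catch the problem now.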
  4. Copy the indexer JAR file to the working directory.
    scp <indexer>.jar <user>@<IP_OF_WORKER_NODE>:/<PATH/TO/WORKING/DIRECTORY>
    For example:
    scp indexer-1.0-SNAPSHOT.jar sampleuser@1.1.1.1:/tmp
  5. Copy the input CSV file to the working directory:
    scp <INPUT_FILE> <user>@<IP_OF_WORKER_NODE>:/<PATH/TO/WORKING/DIRECTORY>
    For example:
    scp nyc_yellow_taxi_sample_1k.csv sampleuser@1.1.1.1:/tmp
  6. Add the input file to HDFS:
    hdfs dfs -put <INPUT_FILE>
    For example:
    hdfs dfs -put nyc_yellow_taxi_sample_1k.csv
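Without an explicit destination, hdfs dfs -put writes to your HDFS home directory. To confirm the upload succeeded (assuming the example file name), you can list it:

```shell
# Verify the input file is present in your HDFS home directory.
hdfs dfs -ls nyc_yellow_taxi_sample_1k.csv
```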
  7. Create a Solr collection:
    solrctl config --create <configName> <baseConfig> -p immutable=false
    solrctl collection --create <collectionName> -s <numShards> -c <collectionConfName>
    For example:
    solrctl config --create testConfig managedTemplate -p immutable=false
    solrctl collection --create testcollection -s 2 -c testConfig
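To check that the collection was created before submitting the job, you can list the collections known to Solr (a quick sanity check; the collection name matches the example above):

```shell
# The new collection should appear in the output.
solrctl collection --list
```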
  8. Submit your Spark job:
    spark-submit \
      --jars /<PATH/TO/WORKING/DIRECTORY>/spark-solr-*-shaded.jar \
      --files <KEYTAB>,<JAAS_CONF_FILE> \
      --name <SPARK_JOB_NAME> \
      --conf "spark.executor.extraJavaOptions=-Djavax.net.ssl.trustStore=<ABSOLUTE/PATH/TO/TRUSTSTORE/FILE> -Djavax.net.ssl.trustStorePassword=" \
      --driver-java-options="-Djavax.net.ssl.trustStore=<ABSOLUTE/PATH/TO/TRUSTSTORE/FILE> -Djavax.net.ssl.trustStorePassword=" \
      --class com.lucidworks.spark.SparkApp <INDEXER_JAR> csv \
      -zkHost <ZOOKEEPER_ENSEMBLE> \
      -collection <TARGET_SOLR_COLLECTION> \
      -csvPath <INPUT_CSV_FILE> \
      -solrJaasAuthConfig=<JAAS_CONF_FILE>
    For example:
    spark-submit \
      --jars /opt/cloudera/parcels/CDH/jars/spark-solr-3.9.0.7.2.2.0-218-shaded.jar \
      --files sampleuser.keytab,jaas-client.conf \
      --name spark-solr \
      --conf "spark.executor.extraJavaOptions=-Djavax.net.ssl.trustStore=/var/lib/cloudera-scm-agent/agent-cert/cm-auto-global_truststore.jks -Djavax.net.ssl.trustStorePassword=" \
      --driver-java-options="-Djavax.net.ssl.trustStore=/var/lib/cloudera-scm-agent/agent-cert/cm-auto-global_truststore.jks -Djavax.net.ssl.trustStorePassword=" \
      --class com.lucidworks.spark.SparkApp indexer-1.0-SNAPSHOT.jar csv \
      -zkHost sampleuser-leader2.sampleuser.work:2181,sampleuser.work:2181,sampleuser-master7.work:2181/solr-dde \
      -collection testcollection \
      -csvPath nyc_yellow_taxi_sample_1k.csv \
      -solrJaasAuthConfig=jaas-client.conf
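After the job completes, a quick document count against the target collection confirms that indexing worked. This is a sketch: the Solr hostname placeholder and the 8985 TLS port are assumptions, so adjust them to your cluster, and authenticate with Kerberos first.

```shell
# Obtain a ticket, then query the collection over its authenticated endpoint.
kinit -kt sampleuser.keytab sampleuser@EXAMPLE.COM

# rows=0 returns only the header; look for a non-zero "numFound" value.
curl --negotiate -u : \
  "https://<SOLR_HOST>:8985/solr/testcollection/select?q=*:*&rows=0"
```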