Run the spark-submit job

After you create an indexer.jar file, you need to run a spark-submit job on a Solr worker node to index your input file.

  Before you begin, verify that:
  • You have prepared the indexer.jar file and it is available on your local machine.
  • You have sufficient rights to SSH into one of the cluster nodes.
  • You have retrieved the keytab for your environment.
  1. SSH to one of the worker nodes in your Cloudera Data Hub cluster.
  2. Copy your keytab file to the working directory:
    scp [***KEYTAB***] [***USER***]@[***IP OF WORKER NODE***]:/[***PATH/TO/WORKING/DIRECTORY***]
    For example:
    scp sampleuser.keytab sampleuser@1.1.1.1:/tmp
  3. Create a JAAS file with the following content:
    Client {
      com.sun.security.auth.module.Krb5LoginModule required
      useKeyTab=true
      useTicketCache=false
      doNotPrompt=true
      debug=true
      keyTab="sampleuser.keytab"
      principal="sampleuser@EXAMPLE.COM";
    };
    Replace the keyTab and principal values (sampleuser.keytab and sampleuser@EXAMPLE.COM) with your own keytab file name and user principal.
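If you prefer to script this step, the JAAS file can be written with a shell heredoc. The file name jaas-client.conf is an assumption here, chosen to match the example in the spark-submit step; adjust the keyTab and principal values to your environment:

```shell
# Write the JAAS configuration to jaas-client.conf in the working directory.
# Replace the keyTab and principal values with your own before using it.
cat > jaas-client.conf <<'EOF'
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  useTicketCache=false
  doNotPrompt=true
  debug=true
  keyTab="sampleuser.keytab"
  principal="sampleuser@EXAMPLE.COM";
};
EOF
```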
  4. Copy the indexer JAR file to the working directory:
    scp [***INDEXER***].jar [***USER***]@[***IP OF WORKER NODE***]:/[***PATH/TO/WORKING/DIRECTORY***]
    For example:
    scp indexer-1.0-SNAPSHOT.jar sampleuser@1.1.1.1:/tmp
  5. Copy the input CSV file to the working directory:
    scp [***INPUT FILE***] [***USER***]@[***IP OF WORKER NODE***]:/[***PATH/TO/WORKING/DIRECTORY***]
    For example:
    scp nyc_yellow_taxi_sample_1k.csv sampleuser@1.1.1.1:/tmp
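If your indexer reads the CSV with Spark's header option, the first line of the file determines the DataFrame column names and, typically, the Solr field names, so it is worth inspecting the header before indexing. A minimal sketch, using a small stand-in file since the NYC taxi sample is not reproduced here:

```shell
# Create a tiny stand-in CSV (the real input would be nyc_yellow_taxi_sample_1k.csv).
cat > sample_input.csv <<'EOF'
vendor_id,pickup_datetime,passenger_count,trip_distance
1,2019-01-01 00:46:40,1,1.5
EOF

# The first line lists the column names that become the Solr field names.
head -1 sample_input.csv
```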
  6. Add the input file to HDFS:
    hdfs dfs -put [***INPUT FILE***]
    For example:
    hdfs dfs -put nyc_yellow_taxi_sample_1k.csv
  7. Create a Solr configuration and a collection:
    solrctl config --create [***CONFIG NAME***] [***BASE CONFIG***] -p immutable=false
    solrctl collection --create [***COLLECTION NAME***] -s [***NUMBER OF SHARDS***] -c [***CONFIG NAME***]
    For example:
    solrctl config --create testConfig managedTemplate -p immutable=false
    solrctl collection --create testcollection -s 2 -c testConfig
  8. Submit your Spark job:
    spark-submit --jars /opt/cloudera/parcels/CDH/jars/[***SPARK-SOLR-*-SHADED.JAR***] \
    --files [***KEYTAB***],[***JAAS CONFIGURATION FILE***] --name [***SPARK JOB NAME***] \
    --driver-java-options="-Djavax.net.ssl.trustStoreType=[***TRUSTSTORE TYPE***] \
    -Djavax.net.ssl.trustStore=[***ABSOLUTE/PATH/TO/TRUSTSTORE/FILE***] \
    -Djavax.net.ssl.trustStorePassword=" --class com.lucidworks.spark.SparkApp [***INDEXER.JAR***] csv -zkHost [***ZOOKEEPER ENSEMBLE***] \
    -collection [***TARGET SOLR COLLECTION***] -csvPath [***INPUT CSV FILE***] \
    -solrJaasAuthConfig=[***JAAS CONFIGURATION FILE***]

    Replace

    [***SPARK-SOLR-*-SHADED.JAR***]
    with the name of the spark-solr shaded JAR file under /opt/cloudera/parcels/CDH/jars/.
    [***KEYTAB***]
    with the keytab file of your user.
    [***JAAS CONFIGURATION FILE***]
    with the JAAS file you created.
    [***SPARK JOB NAME***]
    with the name of the job you want to run.
    [***TRUSTSTORE TYPE***]
    with the type of the truststore used. If you use the default jks type, you do not need to specify -Djavax.net.ssl.trustStoreType; in every other case it is mandatory.
    [***ABSOLUTE/PATH/TO/TRUSTSTORE/FILE***]
    with the absolute path to the truststore file.
    [***INDEXER.JAR***]
    with the indexer JAR file you created.
    [***ZOOKEEPER ENSEMBLE***]
    with the address of the ZooKeeper ensemble used by the Solr cluster.
    [***TARGET SOLR COLLECTION***]
    with the name of the Solr collection you created.
    [***INPUT CSV FILE***]
    with the name of the file that you want to index into the [***TARGET SOLR COLLECTION***].
    For example:
    spark-submit --jars /opt/cloudera/parcels/CDH/jars/spark-solr-3.9.0.spark3.7.2.18.0-33-shaded.jar \
    --files sampleuser.keytab,jaas-client.conf --name spark-solr \
    --driver-java-options="-Djavax.net.ssl.trustStoreType=bcfks \
    -Djavax.net.ssl.trustStore=/var/lib/cloudera-scm-agent/agent-cert/cm-auto-global_truststore.jks \
    -Djavax.net.ssl.trustStorePassword=" --class com.lucidworks.spark.SparkApp indexer-1.0-SNAPSHOT.jar csv \
    -zkHost sampleuser-leader2.sampleuser.work:2181,sampleuser.work:2181,sampleuser-master7.work:2181/solr-dde \
    -collection testcollection -csvPath nyc_yellow_taxi_sample_1k.csv -solrJaasAuthConfig=jaas-client.conf
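With this many placeholders, it can help to assemble the spark-submit invocation from shell variables so every substitution is visible in one place. A sketch using the example values from this step; the script only prints the command so you can review it before running:

```shell
# Example values from this step; replace with your own before submitting.
SHADED_JAR=/opt/cloudera/parcels/CDH/jars/spark-solr-3.9.0.spark3.7.2.18.0-33-shaded.jar
KEYTAB=sampleuser.keytab
JAAS_CONF=jaas-client.conf
TRUSTSTORE=/var/lib/cloudera-scm-agent/agent-cert/cm-auto-global_truststore.jks
INDEXER_JAR=indexer-1.0-SNAPSHOT.jar
ZK_HOST=sampleuser-leader2.sampleuser.work:2181/solr-dde
COLLECTION=testcollection
CSV_PATH=nyc_yellow_taxi_sample_1k.csv

# Assemble the command in one place so every substitution is visible.
CMD="spark-submit --jars $SHADED_JAR \
 --files $KEYTAB,$JAAS_CONF --name spark-solr \
 --driver-java-options=\"-Djavax.net.ssl.trustStoreType=bcfks \
 -Djavax.net.ssl.trustStore=$TRUSTSTORE -Djavax.net.ssl.trustStorePassword=\" \
 --class com.lucidworks.spark.SparkApp $INDEXER_JAR csv \
 -zkHost $ZK_HOST -collection $COLLECTION -csvPath $CSV_PATH \
 -solrJaasAuthConfig=$JAAS_CONF"

# Review the assembled command; to submit the job for real, run: eval "$CMD"
echo "$CMD"
```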