Run the spark-submit job
After you create an indexer.jar file, you need to run a spark-submit job on a Solr worker node to index your input file.
Before you begin, make sure that:
- You have prepared the indexer.jar file and it is available on your local machine.
- You have sufficient rights to SSH into one of the cluster nodes.
- You have retrieved the keytab for your environment.
- SSH to one of the worker nodes in your Data Hub cluster.
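  For example, using the placeholder IP address from the steps below:
  ssh sampleuser@1.1.1.1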
- Copy your keytab file to the working directory:
  scp [***KEYTAB***] [***USER***]@[***IP OF WORKER NODE***]:/[***PATH/TO/WORKING/DIRECTORY***]
  For example:
  scp sampleuser.keytab sampleuser@1.1.1.1:/tmp
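  Optionally, you can check that the keytab works by obtaining a Kerberos ticket with it on the worker node. This is a quick sanity check assuming the example principal and the /tmp working directory; on a Kerberized cluster, the hdfs command further below also needs a valid ticket.
  kinit -kt /tmp/sampleuser.keytab sampleuser@EXAMPLE.COM
  klist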
- Create a JAAS file in the working directory with the following content:
  Client {
   com.sun.security.auth.module.Krb5LoginModule required
   useKeyTab=true
   useTicketCache=false
   doNotPrompt=true
   debug=true
   keyTab="sampleuser.keytab"
   principal="sampleuser@EXAMPLE.COM";
  };
  Replace sampleuser@EXAMPLE.COM with your user principal and sampleuser.keytab with the name of the keytab file you copied to the working directory.
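  For example, you can create the file directly on the worker node with a shell heredoc. This is a minimal sketch: the file name jaas-client.conf and the /tmp working directory are the values used in the examples later in this procedure.
cat > /tmp/jaas-client.conf <<'EOF'
Client {
 com.sun.security.auth.module.Krb5LoginModule required
 useKeyTab=true
 useTicketCache=false
 doNotPrompt=true
 debug=true
 keyTab="sampleuser.keytab"
 principal="sampleuser@EXAMPLE.COM";
};
EOF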
- Copy the indexer JAR file to the working directory:
  scp [***INDEXER***].jar [***USER***]@[***IP OF WORKER NODE***]:/[***PATH/TO/WORKING/DIRECTORY***]
  For example:
  scp indexer-1.0-SNAPSHOT.jar sampleuser@1.1.1.1:/tmp
- Copy the input CSV file to the working directory:
  scp [***INPUT FILE***] [***USER***]@[***IP OF WORKER NODE***]:/[***PATH/TO/WORKING/DIRECTORY***]
  For example:
  scp nyc_yellow_taxi_sample_1k.csv sampleuser@1.1.1.1:/tmp
- Add the input file to HDFS:
  hdfs dfs -put [***INPUT FILE***]
  For example:
  hdfs dfs -put nyc_yellow_taxi_sample_1k.csv
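  Because no destination path is given, the file is copied into your HDFS home directory (/user/[***USER***]), which is also where the relative -csvPath in the spark-submit step resolves. You can verify the upload, assuming the example file name:
  hdfs dfs -ls nyc_yellow_taxi_sample_1k.csv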
- Create a Solr configuration and a collection:
  solrctl config --create [***CONFIG NAME***] [***BASE CONFIG***] -p immutable=false
  solrctl collection --create [***COLLECTION NAME***] -s [***NUMBER OF SHARDS***] -c [***COLLECTION CONFIG NAME***]
  For example:
  solrctl config --create testConfig managedTemplate -p immutable=false
  solrctl collection --create testcollection -s 2 -c testConfig
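  To confirm that the collection was created, you can list the collections known to the cluster:
  solrctl collection --list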
- Submit your Spark job:
  spark-submit \
    --jars /opt/cloudera/parcels/CDH/jars/[***SPARK-SOLR-*-SHADED.JAR***] \
    --files [***KEYTAB***],[***JAAS CONFIGURATION FILE***] \
    --name [***SPARK JOB NAME***] \
    --driver-java-options="-Djavax.net.ssl.trustStoreType=[***TRUSTSTORE TYPE***] \
      -Djavax.net.ssl.trustStore=[***ABSOLUTE/PATH/TO/TRUSTSTORE/FILE***] \
      -Djavax.net.ssl.trustStorePassword=" \
    --class com.lucidworks.spark.SparkApp \
    [***INDEXER.JAR***] csv \
    -zkHost [***ZOOKEEPER ENSEMBLE***] \
    -collection [***TARGET SOLR COLLECTION***] \
    -csvPath [***INPUT CSV FILE***] \
    -solrJaasAuthConfig=[***JAAS CONFIGURATION FILE***]
  Replace:
  - [***SPARK-SOLR-*-SHADED.JAR***] with the name of the shaded.jar file under /opt/cloudera/parcels/CDH/jars/.
  - [***KEYTAB***] with the keytab file of your user.
  - [***JAAS CONFIGURATION FILE***] with the JAAS file you created.
  - [***SPARK JOB NAME***] with the name of the job you want to run.
  - [***TRUSTSTORE TYPE***] with the type of the truststore used. If you use the default jks type, you do not need to specify -Djavax.net.ssl.trustStoreType; in every other case it is mandatory.
  - [***ABSOLUTE/PATH/TO/TRUSTSTORE/FILE***] with the absolute path to the truststore file.
  - [***INDEXER.JAR***] with the indexer.jar file you created.
  - [***ZOOKEEPER ENSEMBLE***] with the address of the ZooKeeper ensemble used by the Solr cluster.
  - [***TARGET SOLR COLLECTION***] with the name of the Solr collection you created.
  - [***INPUT CSV FILE***] with the name of the file that you want to index into the [***TARGET SOLR COLLECTION***].
  For example:
  spark-submit \
    --jars /opt/cloudera/parcels/CDH/jars/spark-solr-3.9.0.spark3.7.2.18.0-33-shaded.jar \
    --files sampleuser.keytab,jaas-client.conf \
    --name spark-solr \
    --driver-java-options="-Djavax.net.ssl.trustStoreType=bcfks \
      -Djavax.net.ssl.trustStore=/var/lib/cloudera-scm-agent/agent-cert/cm-auto-global_truststore.jks \
      -Djavax.net.ssl.trustStorePassword=" \
    --class com.lucidworks.spark.SparkApp \
    indexer-1.0-SNAPSHOT.jar csv \
    -zkHost sampleuser-leader2.sampleuser.work:2181,sampleuser.work:2181,sampleuser-master7.work:2181/solr-dde \
    -collection testcollection \
    -csvPath nyc_yellow_taxi_sample_1k.csv \
    -solrJaasAuthConfig=jaas-client.conf
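After the job finishes, you can spot-check the number of indexed documents with a Kerberos-authenticated Solr query. This is a sketch: the host placeholder is yours to fill in, and 8985 is the usual TLS port for Solr on Data Hub clusters; adjust it if your environment differs.
  curl -k --negotiate -u : \
    "https://[***SOLR SERVER HOSTNAME***]:8985/solr/testcollection/select?q=*:*&rows=0"
The numFound field in the response should match the number of rows in the input CSV file.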