Run the spark-submit job
After you create an indexer.jar file, you need to run a spark-submit job on a Solr worker node to index your input file.
- You have prepared the indexer JAR file and it is available on your local machine.
- A DDE Data Hub cluster (Tech Preview) is up and running.
- You have sufficient rights to SSH into one of the cluster nodes.
- Your user has a role assigned that provides 'write' rights on S3.
- You have retrieved the keytab for your environment.
SSH to one of the worker nodes in your Data Hub cluster.
Copy your keytab file to the working directory:
For example: scp sampleuser.keytab sampleuser@
Create a JAAS file with the following content:
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  useTicketCache=false
  doNotPrompt=true
  debug=true
  keyTab="sampleuser.keytab"
  principal="sampleuser@EXAMPLE.COM";
};
Replace sampleuser@EXAMPLE.COM with your user principal.
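One way to create the JAAS file directly on the worker node is a shell here-document. The following is a minimal sketch, not the only way to do it: the file name jaas-client.conf follows the spark-submit example later in this procedure, the variable names PRINCIPAL, KEYTAB, and JAAS_FILE are illustrative assumptions, and com.sun.security.auth.module.Krb5LoginModule is the standard JDK Kerberos login module.

```shell
# Illustrative defaults (assumptions); override them with your own values.
PRINCIPAL="${PRINCIPAL:-sampleuser@EXAMPLE.COM}"
KEYTAB="${KEYTAB:-sampleuser.keytab}"
JAAS_FILE="${JAAS_FILE:-jaas-client.conf}"

# Write the JAAS configuration. The unquoted EOF delimiter lets the
# shell expand $KEYTAB and $PRINCIPAL inside the here-document.
cat > "$JAAS_FILE" <<EOF
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  useTicketCache=false
  doNotPrompt=true
  debug=true
  keyTab="$KEYTAB"
  principal="$PRINCIPAL";
};
EOF

echo "Wrote $JAAS_FILE for $PRINCIPAL"
```

To use a different principal, set the variables before running the snippet, for example PRINCIPAL=myuser@MY.REALM KEYTAB=myuser.keytab.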
Copy the indexer JAR file to the working directory:
scp <indexer>.jar <user>@<IP_OF_WORKER_NODE>:/<PATH/TO/WORKING/DIRECTORY>
For example: scp indexer-1.0-SNAPSHOT.jar sampleuser@
Copy the input CSV file to the working directory:
For example: scp nyc_yellow_taxi_sample_1k.csv sampleuser@
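At this point the working directory should contain the keytab, the JAAS file, the indexer JAR, and the input CSV. The following sketch checks that before you continue. It is not part of the original procedure: check_inputs is a hypothetical helper name, and the file names in the usage comment are the sample names used in this procedure (jaas-client.conf is the JAAS file name from the final spark-submit example).

```shell
# check_inputs DIR FILE...
# Prints one line per file that is missing or empty in DIR and
# returns 1 if any such file was found, 0 otherwise.
check_inputs() {
  dir=$1
  shift
  missing=0
  for f in "$@"; do
    if [ ! -s "$dir/$f" ]; then
      echo "missing or empty: $f"
      missing=1
    fi
  done
  return "$missing"
}

# Usage on the worker node, from the working directory:
#   check_inputs . sampleuser.keytab jaas-client.conf \
#     indexer-1.0-SNAPSHOT.jar nyc_yellow_taxi_sample_1k.csv
```

The check uses -s rather than -e so that a zero-byte file left behind by an interrupted copy is also reported.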
Add the input file to HDFS:
hdfs dfs -put <INPUT_FILE>
For example: hdfs dfs -put nyc_yellow_taxi_sample_1k.csv
Create a Solr collection:
solrctl config --create <configName> <baseConfig> -p immutable=false
solrctl collection --create <collectionName> -s <numShards> -c <collectionConfName>
For example:
solrctl config --create testConfig managedTemplate -p immutable=false
solrctl collection --create testcollection -s 2 -c testConfig
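The two commands are linked: the value passed to -c when creating the collection must be the config name created by the first command. The sketch below makes that dependency explicit by deriving both commands from one set of variables and printing them for review. The variable names and the solr_setup_commands helper are illustrative assumptions; only the solrctl options already shown in this procedure are used.

```shell
# Illustrative defaults taken from the sample values in this procedure.
CONFIG_NAME="${CONFIG_NAME:-testConfig}"
BASE_CONFIG="${BASE_CONFIG:-managedTemplate}"
COLLECTION="${COLLECTION:-testcollection}"
NUM_SHARDS="${NUM_SHARDS:-2}"

# Print the two solrctl commands in the order they must run:
# the config first, then the collection that references it.
# Nothing is executed here.
solr_setup_commands() {
  echo "solrctl config --create $CONFIG_NAME $BASE_CONFIG -p immutable=false"
  echo "solrctl collection --create $COLLECTION -s $NUM_SHARDS -c $CONFIG_NAME"
}

solr_setup_commands
```

After reviewing the printed lines, you can run them by hand or pipe them to a shell, for example solr_setup_commands | sh -e.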
Submit your Spark job:
spark-submit --jars /opt/cloudera/parcels/CDH/jars/spark-solr-*-shaded.jar --files <KEYTAB>,<JAAS_CONF_FILE> --name <SPARK_JOB_NAME> --driver-java-options="-Djavax.net.ssl.trustStore=<ABSOLUTE/PATH/TO/TRUSTSTORE/FILE>" --class com.lucidworks.spark.SparkApp <INDEXER_JAR> csv -zkHost <ZOOKEEPER_ENSEMBLE> -collection <TARGET_SOLR_COLLECTION> -csvPath <INPUT_CSV_FILE> -solrJaasAuthConfig=<JAAS_CONF_FILE>
Replace
- spark-solr-*-shaded.jar with the name of the shaded JAR file under /opt/cloudera/parcels/CDH/jars/
- <KEYTAB> with the keytab file of your user
- <JAAS_CONF_FILE> with the JAAS file you created
- <SPARK_JOB_NAME> with the name of the job you want to run
- <ABSOLUTE/PATH/TO/TRUSTSTORE/FILE> with the absolute path to the truststore file
- <INDEXER_JAR> with the indexer JAR file you created
- <ZOOKEEPER_ENSEMBLE> with the address of the ZooKeeper ensemble used by the Solr cluster
- <TARGET_SOLR_COLLECTION> with the name of the Solr collection you created
- <INPUT_CSV_FILE> with the name of the file that you want to index into the <TARGET_SOLR_COLLECTION>
For example: spark-submit --jars /opt/cloudera/parcels/CDH/jars/spark-solr- --files sampleuser.keytab,jaas-client.conf --name spark-solr --driver-java-options="" --class com.lucidworks.spark.SparkApp indexer-1.0-SNAPSHOT.jar csv -zkHost,, -collection testcollection -csvPath nyc_yellow_taxi_sample_1k.csv -solrJaasAuthConfig=jaas-client.conf
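Because the command has many substitutions, it can be easier to assemble it from variables and review it before running it. The following is a sketch under stated assumptions, not the documented procedure: find_shaded_jar and print_spark_submit are hypothetical helper names, the truststore path and ZooKeeper host names are placeholders invented for illustration, and passing the truststore through the standard JVM property javax.net.ssl.trustStore is an assumption about what the driver option is meant to carry.

```shell
# Illustrative defaults; replace every value with your own.
JARS_DIR="${JARS_DIR:-/opt/cloudera/parcels/CDH/jars}"
KEYTAB="${KEYTAB:-sampleuser.keytab}"
JAAS_CONF="${JAAS_CONF:-jaas-client.conf}"
JOB_NAME="${JOB_NAME:-spark-solr}"
INDEXER_JAR="${INDEXER_JAR:-indexer-1.0-SNAPSHOT.jar}"
COLLECTION="${COLLECTION:-testcollection}"
CSV_FILE="${CSV_FILE:-nyc_yellow_taxi_sample_1k.csv}"
# Hypothetical placeholders, not values from this procedure:
TRUSTSTORE="${TRUSTSTORE:-/path/to/truststore.jks}"
ZK_ENSEMBLE="${ZK_ENSEMBLE:-zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181/solr}"

# Print the first spark-solr shaded JAR found in the given directory,
# or report an error on stderr and return 1 if there is none.
find_shaded_jar() {
  for jar in "$1"/spark-solr-*-shaded.jar; do
    [ -f "$jar" ] && { echo "$jar"; return 0; }
  done
  echo "no spark-solr shaded JAR in $1" >&2
  return 1
}

# Print the spark-submit command line for review, using the shaded JAR
# path given as the first argument. Nothing is executed here.
print_spark_submit() {
  shaded_jar=$1
  printf '%s ' spark-submit \
    --jars "$shaded_jar" \
    --files "$KEYTAB,$JAAS_CONF" \
    --name "$JOB_NAME" \
    "--driver-java-options=\"-Djavax.net.ssl.trustStore=$TRUSTSTORE\"" \
    --class com.lucidworks.spark.SparkApp "$INDEXER_JAR" csv \
    -zkHost "$ZK_ENSEMBLE" \
    -collection "$COLLECTION" \
    -csvPath "$CSV_FILE" \
    "-solrJaasAuthConfig=$JAAS_CONF"
  echo
}

# Usage on a worker node:
#   print_spark_submit "$(find_shaded_jar "$JARS_DIR")"
```

Resolving the shaded JAR up front turns a missing file into a clear error message before the job is submitted. Note that the same JAAS file name appears twice in the command: once in --files so that it is shipped with the job, and once in -solrJaasAuthConfig.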