Once you have created an indexer.jar file, you need to run a spark-submit job on a
Solr worker node to index your input file.
- You have prepared the indexer.jar file and it is available on your local machine.
- A DDE Data Hub cluster (Tech Preview) is up and running.
- You have sufficient rights to SSH into one of the cluster nodes.
- Your user has a role assigned that provides 'write' rights on S3.
- You have retrieved the keytab for your environment.
- SSH to one of the worker nodes in your Data Hub cluster.
- Copy your keytab file to the working directory:
scp <keytab> <user>@<IP_OF_WORKER_NODE>:/<PATH/TO/WORKING/DIRECTORY>
For example:
scp sampleuser.keytab sampleuser@1.1.1.1:/tmp
- Create a JAAS file with the following content:
Client {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
useTicketCache=false
doNotPrompt=true
debug=true
keyTab="sampleuser.keytab"
principal="sampleuser@EXAMPLE.COM";
};
Replace sampleuser.keytab and sampleuser@EXAMPLE.COM with your keytab file name and user principal.
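If you prefer to create the file directly on the worker node, the JAAS configuration above can be written with a heredoc. This is a sketch that reuses the sample keytab name and principal from the example; replace both with your own values:

```shell
# Sketch: write jaas-client.conf in the current working directory.
# "sampleuser.keytab" and "sampleuser@EXAMPLE.COM" are example values only.
cat > jaas-client.conf <<'EOF'
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  useTicketCache=false
  doNotPrompt=true
  debug=true
  keyTab="sampleuser.keytab"
  principal="sampleuser@EXAMPLE.COM";
};
EOF
```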
- Copy the indexer JAR file to the working directory:
scp <indexer>.jar <user>@<IP_OF_WORKER_NODE>:/<PATH/TO/WORKING/DIRECTORY>
For example:
scp indexer-1.0-SNAPSHOT.jar sampleuser@1.1.1.1:/tmp
- Copy the input CSV file to the working directory:
scp <INPUT_FILE> <user>@<IP_OF_WORKER_NODE>:/<PATH/TO/WORKING/DIRECTORY>
For example:
scp nyc_yellow_taxi_sample_1k.csv sampleuser@1.1.1.1:/tmp
- Add the input file to HDFS:
hdfs dfs -put <INPUT_FILE>
For example:
hdfs dfs -put nyc_yellow_taxi_sample_1k.csv
- Create a Solr collection:
solrctl config --create <configName> <baseConfig> -p immutable=false
solrctl collection --create <collectionName> -s <numShards> -c <configName>
For example:
solrctl config --create testConfig managedTemplate -p immutable=false
solrctl collection --create testcollection -s 2 -c testConfig
- Submit your Spark job:
spark-submit \
  --jars /<PATH/TO/WORKING/DIRECTORY>/spark-solr-*-shaded.jar \
  --files <KEYTAB>,<JAAS_CONF_FILE> \
  --name <SPARK_JOB_NAME> \
  --conf "spark.executor.extraJavaOptions=-Djavax.net.ssl.trustStore=<ABSOLUTE/PATH/TO/TRUSTSTORE/FILE> -Djavax.net.ssl.trustStorePassword=" \
  --driver-java-options="-Djavax.net.ssl.trustStore=<ABSOLUTE/PATH/TO/TRUSTSTORE/FILE> -Djavax.net.ssl.trustStorePassword=" \
  --class com.lucidworks.spark.SparkApp <INDEXER_JAR> csv \
  -zkHost <ZOOKEEPER_ENSEMBLE> \
  -collection <TARGET_SOLR_COLLECTION> \
  -csvPath <INPUT_CSV_FILE> \
  -solrJaasAuthConfig=<JAAS_CONF_FILE>
For example:
spark-submit \
  --jars /opt/cloudera/parcels/CDH/jars/spark-solr-3.9.0.7.2.2.0-218-shaded.jar \
  --files sampleuser.keytab,jaas-client.conf \
  --name spark-solr \
  --conf "spark.executor.extraJavaOptions=-Djavax.net.ssl.trustStore=/var/lib/cloudera-scm-agent/agent-cert/cm-auto-global_truststore.jks -Djavax.net.ssl.trustStorePassword=" \
  --driver-java-options="-Djavax.net.ssl.trustStore=/var/lib/cloudera-scm-agent/agent-cert/cm-auto-global_truststore.jks -Djavax.net.ssl.trustStorePassword=" \
  --class com.lucidworks.spark.SparkApp indexer-1.0-SNAPSHOT.jar csv \
  -zkHost sampleuser-leader2.sampleuser.work:2181,sampleuser.work:2181,sampleuser-master7.work:2181/solr-dde \
  -collection testcollection \
  -csvPath nyc_yellow_taxi_sample_1k.csv \
  -solrJaasAuthConfig=jaas-client.conf
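The on-node steps can be collected into a single script. This is a sketch only: it assumes the keytab, JAAS file, indexer JAR, and CSV are already in the working directory, and every concrete value (host names, JAR versions, collection name) is the example value from this procedure and must be replaced. The script is written to a file and syntax-checked rather than executed, since running it requires a live cluster:

```shell
# Sketch: end-to-end indexing run on the worker node, using the example
# values from this page. Replace them with your own before running.
cat > run-indexer.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail

INPUT=nyc_yellow_taxi_sample_1k.csv
COLLECTION=testcollection

# Stage the input file in HDFS.
hdfs dfs -put "$INPUT"

# Create the collection config and the collection itself.
solrctl config --create testConfig managedTemplate -p immutable=false
solrctl collection --create "$COLLECTION" -s 2 -c testConfig

# Submit the indexing job.
spark-submit \
  --jars /opt/cloudera/parcels/CDH/jars/spark-solr-3.9.0.7.2.2.0-218-shaded.jar \
  --files sampleuser.keytab,jaas-client.conf \
  --name spark-solr \
  --conf "spark.executor.extraJavaOptions=-Djavax.net.ssl.trustStore=/var/lib/cloudera-scm-agent/agent-cert/cm-auto-global_truststore.jks -Djavax.net.ssl.trustStorePassword=" \
  --driver-java-options="-Djavax.net.ssl.trustStore=/var/lib/cloudera-scm-agent/agent-cert/cm-auto-global_truststore.jks -Djavax.net.ssl.trustStorePassword=" \
  --class com.lucidworks.spark.SparkApp indexer-1.0-SNAPSHOT.jar csv \
  -zkHost sampleuser-leader2.sampleuser.work:2181,sampleuser.work:2181,sampleuser-master7.work:2181/solr-dde \
  -collection "$COLLECTION" \
  -csvPath "$INPUT" \
  -solrJaasAuthConfig=jaas-client.conf
EOF
chmod +x run-indexer.sh
# Syntax check only; actually executing the script requires the cluster.
bash -n run-indexer.sh
```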