Preparing to Index Data with Cloudera Search

The examples in this tutorial assume an environment set up with a package-based installation. If you installed Cloudera Search using parcels, adjust the file paths accordingly.

Complete the following steps to prepare for indexing example data with MapReduce or Flume:

Start a SolrCloud cluster containing two servers (this example uses two shards) as described in Deploying Cloudera Search. Follow that procedure only until you have verified the Runtime Solr Configuration, then return here and continue with the next step.

Generate the configuration files for the collection, including the tweet-specific schema.xml:

$ solrctl instancedir --generate $HOME/solr_configs3
$ cp /usr/share/doc/search*/examples/solr-nrt/collection1/conf/schema.xml \
$HOME/solr_configs3/conf
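
Because the cp command overwrites the generated schema.xml, it can be worth confirming that the tweet-specific file really is in place before uploading anything to ZooKeeper. The following is a minimal sketch; schema_in_place is a hypothetical helper name, not part of Cloudera Search.

```shell
# Hypothetical helper: confirm that the tweet-specific schema.xml replaced
# the generated one.
#   $1 = path to the source schema.xml
#   $2 = instance directory (for example $HOME/solr_configs3)
schema_in_place() {
  src=$1
  instancedir=$2
  # cmp -s prints nothing and returns 0 only when both files are byte-identical.
  [ -f "$instancedir/conf/schema.xml" ] && cmp -s "$src" "$instancedir/conf/schema.xml"
}
```

Pass the same source path used in the cp command as the first argument and $HOME/solr_configs3 as the second. The sketch assumes that the search* glob in the source path matches exactly one directory.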

Upload the instance directory to ZooKeeper:

$ solrctl instancedir --create collection3 $HOME/solr_configs3

Create the new collection:

$ solrctl collection --create collection3 -s 2
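
The new collection can also be checked from the shell with solrctl collection --list. The sketch below wraps that command in a hypothetical helper, collection_listed. It assumes that the collection name is the first whitespace-separated field on its output line, which may not hold for every release.

```shell
# Hypothetical helper: succeed if the named collection appears in the
# output of `solrctl collection --list`.
# Assumption: the collection name is the first field on its line.
collection_listed() {
  solrctl collection --list |
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# Usage: collection_listed collection3 && echo "collection3 is listed"
```

A listed collection is not necessarily healthy, so this complements rather than replaces checking the cloud view in the Solr admin UI.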

Verify that the collection is live. For example, on localhost, open http://localhost:8983/solr/#/~cloud.

Prepare the configuration for use with MapReduce:
```
$ cp -r $HOME/solr_configs3 $HOME/collection3
```

Locate input files suitable for indexing, and make sure that the HDFS input directory exists. This example assumes that you run the following commands as $USER and that $USER has access to HDFS.

$ sudo -u hdfs hadoop fs -mkdir -p /user/$USER
$ sudo -u hdfs hadoop fs -chown $USER:$USER /user/$USER
$ hadoop fs -mkdir -p /user/$USER/indir
$ hadoop fs -copyFromLocal \
/usr/share/doc/search*/examples/test-documents/sample-statuses-*.avro \
/user/$USER/indir/
$ hadoop fs -ls /user/$USER/indir
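
The copyFromLocal command relies on the sample-statuses-*.avro glob matching at least one local file. As a rough pre-check before copying, the following sketch counts the matching files in a local directory; count_samples is a hypothetical helper name.

```shell
# Hypothetical helper: print the number of sample Avro files in a local directory.
#   $1 = directory expected to hold sample-statuses-*.avro
count_samples() {
  # -maxdepth 1 keeps the search to the directory itself;
  # tr strips the padding some wc implementations add.
  find "$1" -maxdepth 1 -type f -name 'sample-statuses-*.avro' | wc -l | tr -d ' '
}

# Usage: count_samples <local test-documents directory>
```

A result of 0 suggests that the examples are installed at a different path, for instance in a parcel-based installation.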

Ensure that outdir exists in HDFS and is empty. If outdir does not exist yet, the -rm command reports an error that you can ignore:

$ hadoop fs -rm -r -skipTrash /user/$USER/outdir
$ hadoop fs -mkdir /user/$USER/outdir
$ hadoop fs -ls /user/$USER/outdir
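
When these commands are scripted, the delete step can be guarded so that a first run, where outdir does not exist yet, does not fail. This is a minimal sketch using hadoop fs -test -d; reset_hdfs_dir is a hypothetical helper name.

```shell
# Hypothetical helper: leave an HDFS directory present and empty.
#   $1 = HDFS path, for example /user/$USER/outdir
reset_hdfs_dir() {
  dir=$1
  # `hadoop fs -test -d` returns 0 only when the path is an existing directory,
  # so the delete is skipped when there is nothing to remove.
  if hadoop fs -test -d "$dir"; then
    hadoop fs -rm -r -skipTrash "$dir" || return 1
  fi
  hadoop fs -mkdir "$dir"
}

# Usage: reset_hdfs_dir "/user/$USER/outdir"
```

As in the commands above, -skipTrash deletes the directory permanently, so the path argument deserves a second look before running the helper.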

Collect HDFS/MapReduce configuration details by downloading them from Cloudera Manager or using /etc/hadoop, depending on your installation mechanism for the Hadoop cluster. This example uses the configuration in /etc/hadoop/conf.cloudera.mapreduce1. Substitute the correct Hadoop configuration path for your cluster.
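
Before relying on a configuration directory, a quick check that it holds the usual client files can catch a wrong path early. The sketch below assumes that a usable directory contains core-site.xml, hdfs-site.xml, and mapred-site.xml. That file list is an assumption, and check_hadoop_conf is a hypothetical helper name.

```shell
# Hypothetical helper: report which expected client configuration files
# are missing from a Hadoop configuration directory.
#   $1 = configuration directory, for example /etc/hadoop/conf.cloudera.mapreduce1
check_hadoop_conf() {
  conf_dir=$1
  missing=0
  for f in core-site.xml hdfs-site.xml mapred-site.xml; do
    if [ ! -r "$conf_dir/$f" ]; then
      echo "missing: $conf_dir/$f" >&2
      missing=1
    fi
  done
  # Returns 0 when every file is readable, 1 otherwise.
  return "$missing"
}

# Usage: check_hadoop_conf /etc/hadoop/conf.cloudera.mapreduce1
```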
