Preparing to Index Sample Tweets with Cloudera Search

In this section of the Cloudera Search tutorial, you create a collection for tweets. The remaining examples in the tutorial use this same collection, so complete every step here before continuing.

Configuring Sentry for Tweet Collection

If you have enabled Apache Sentry for authorization, you must have UPDATE permission for both the admin=collections object and the collection you are creating (cloudera_tutorial_tweets in this example). You can also use the wildcard (*) to grant permission to create any collection.

For more information on configuring Sentry and granting permissions, see Configuring Sentry Authorization for Cloudera Search.

To grant your user account (jdoe in this example) the necessary permissions:

  1. Switch to the Sentry admin user (solr in this example) using kinit:
    kinit solr@EXAMPLE.COM
  2. Grant UPDATE privileges to the cloudera_tutorial_role role for the admin=collections object and cloudera_tutorial_tweets collection:
    solrctl sentry --grant-privilege cloudera_tutorial_role 'admin=collections->action=UPDATE'
    solrctl sentry --grant-privilege cloudera_tutorial_role 'collection=cloudera_tutorial_tweets->action=UPDATE'
    The cloudera_tutorial_role role was created in Configuring Sentry for Test Collection. For more information on the Sentry privilege model for Cloudera Search, see Authorization Privilege Model for Cloudera Search.
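
The two grants above follow a fixed pattern, so they can be scripted. The following is a minimal sketch rather than part of the tutorial: tutorial_privileges, grant_tutorial_privileges, and the DRY_RUN variable are hypothetical names introduced here for illustration. Only the solrctl sentry --grant-privilege invocation comes from the steps above.

```shell
#!/bin/sh
# Hypothetical helper (not part of solrctl): print the Sentry privilege
# strings needed to create one collection, one per line.
tutorial_privileges() {
    collection="$1"
    printf '%s\n' 'admin=collections->action=UPDATE'
    printf '%s\n' "collection=${collection}->action=UPDATE"
}

# Hypothetical helper: grant each privilege to a role.
# When DRY_RUN is set to 1 (an assumption of this sketch), the commands
# are printed for review instead of being executed.
grant_tutorial_privileges() {
    role="$1"
    collection="$2"
    tutorial_privileges "$collection" | while IFS= read -r privilege; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            printf "solrctl sentry --grant-privilege %s '%s'\n" "$role" "$privilege"
        else
            # Redirect stdin so the command cannot consume the loop input.
            solrctl sentry --grant-privilege "$role" "$privilege" </dev/null
        fi
    done
}
```

After running kinit as the Sentry admin user, you might set DRY_RUN=1 and call grant_tutorial_privileges cloudera_tutorial_role cloudera_tutorial_tweets to review the two commands, then unset DRY_RUN and call it again to apply them.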

Create a Collection for Tweets

  1. On a host with Solr Server installed, make sure that the SOLR_ZK_ENSEMBLE environment variable is set in /etc/solr/conf/solr-env.sh. For example:
    cat /etc/solr/conf/solr-env.sh
    export SOLR_ZK_ENSEMBLE=zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181/solr

    If you are using Cloudera Manager, this is automatically set on hosts with a Solr Server or Gateway role.

  2. If you are using Kerberos, kinit as the user that has privileges to create the collection:
    kinit jdoe@EXAMPLE.COM

    Replace EXAMPLE.COM with your Kerberos realm name.

  3. Generate the configuration files for the collection, including the tweet-specific schema.xml:
    • Parcel-based Installation:
      solrctl instancedir --generate $HOME/cloudera_tutorial_tweets_config
      cp /opt/cloudera/parcels/CDH/share/doc/search*/examples/solr-nrt/collection1/conf/schema.xml $HOME/cloudera_tutorial_tweets_config/conf
    • Package-based Installation:
      solrctl instancedir --generate $HOME/cloudera_tutorial_tweets_config
      cp /usr/share/doc/search*/examples/solr-nrt/collection1/conf/schema.xml $HOME/cloudera_tutorial_tweets_config/conf
  4. If you are using Apache Sentry for authorization, overwrite solrconfig.xml with solrconfig.xml.secure. If you omit this step, Sentry authorization is not enabled for the collection:
    cp $HOME/cloudera_tutorial_tweets_config/conf/solrconfig.xml.secure $HOME/cloudera_tutorial_tweets_config/conf/solrconfig.xml
  5. Upload the configuration to ZooKeeper:
    solrctl instancedir --create cloudera_tutorial_tweets_config $HOME/cloudera_tutorial_tweets_config
  6. Create a new collection with two shards (specified by the -s parameter) using the named configuration (specified by the -c parameter):
    solrctl collection --create cloudera_tutorial_tweets -s 2 -c cloudera_tutorial_tweets_config
  7. Verify that the collection is live. Open the Solr admin web interface in a browser by accessing the following URL:
    • TLS/SSL Enabled: https://search01.example.com:8985/solr/#/~cloud
    • TLS/SSL Disabled: http://search01.example.com:8983/solr/#/~cloud
    Replace search01.example.com with the name of any host running the Solr Server process. If Kerberos is enabled on your cluster, enter the credentials for the solr@EXAMPLE.COM principal when prompted. Look for the cloudera_tutorial_tweets collection to verify that it exists.
  8. Prepare the configuration for use with MapReduce:
    cp -r $HOME/cloudera_tutorial_tweets_config $HOME/cloudera_tutorial_tweets_mr_config
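
The command sequence in steps 3 through 6 and step 8 varies only by installation type and by whether Sentry is in use. The following sketch prints that sequence for review without running anything. The function names example_schema_path and print_collection_plan, and the argument values parcel, package, yes, and no, are assumptions made for this illustration and are not part of Cloudera Search.

```shell
#!/bin/sh
# Hypothetical helper: print the example schema.xml path (with its glob)
# for an installation type, either "parcel" or "package".
example_schema_path() {
    case "$1" in
        parcel)  prefix=/opt/cloudera/parcels/CDH/share/doc ;;
        package) prefix=/usr/share/doc ;;
        *) echo "unknown install type: $1" >&2; return 1 ;;
    esac
    printf '%s\n' "${prefix}/search*/examples/solr-nrt/collection1/conf/schema.xml"
}

# Hypothetical helper: print (do not run) the commands for steps 3-6 and 8.
# Arguments: install type, collection name, shard count, Sentry in use (yes/no).
print_collection_plan() {
    install_type="$1"
    collection="$2"
    shards="$3"
    sentry="$4"
    config="${collection}_config"
    schema="$(example_schema_path "$install_type")" || return 1

    # Step 3: generate the configuration and copy the tweet schema.
    printf '%s\n' "solrctl instancedir --generate \$HOME/${config}"
    printf '%s\n' "cp ${schema} \$HOME/${config}/conf"
    # Step 4: only needed when Sentry authorization is used.
    if [ "$sentry" = "yes" ]; then
        printf '%s\n' "cp \$HOME/${config}/conf/solrconfig.xml.secure \$HOME/${config}/conf/solrconfig.xml"
    fi
    # Steps 5 and 6: upload the configuration, then create the collection.
    printf '%s\n' "solrctl instancedir --create ${config} \$HOME/${config}"
    printf '%s\n' "solrctl collection --create ${collection} -s ${shards} -c ${config}"
    # Step 8: copy the configuration for use with MapReduce.
    printf '%s\n' "cp -r \$HOME/${config} \$HOME/${collection}_mr_config"
}
```

For the collection in this tutorial on a parcel-based installation with Sentry, the call would be print_collection_plan parcel cloudera_tutorial_tweets 2 yes. Printing the plan first makes it easy to confirm the paths before running the commands by hand.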

Copy Sample Tweets to HDFS

  1. Copy the provided sample tweets to HDFS. These tweets are used to demonstrate the batch indexing capabilities of Cloudera Search:
    • Parcel-based Installation (Security Enabled):
      kinit hdfs@EXAMPLE.COM
      hdfs dfs -mkdir -p /user/jdoe
      hdfs dfs -chown jdoe:jdoe /user/jdoe
      kinit jdoe@EXAMPLE.COM
      hdfs dfs -mkdir -p /user/jdoe/indir
      hdfs dfs -put /opt/cloudera/parcels/CDH/share/doc/search*/examples/test-documents/sample-statuses-*.avro /user/jdoe/indir/
      hdfs dfs -ls /user/jdoe/indir
    • Parcel-based Installation (Security Disabled):
      sudo -u hdfs hdfs dfs -mkdir -p /user/jdoe
      sudo -u hdfs hdfs dfs -chown jdoe:jdoe /user/jdoe
      hdfs dfs -mkdir -p /user/jdoe/indir
      hdfs dfs -put /opt/cloudera/parcels/CDH/share/doc/search*/examples/test-documents/sample-statuses-*.avro /user/jdoe/indir/
      hdfs dfs -ls /user/jdoe/indir
    • Package-based Installation (Security Enabled):
      kinit hdfs@EXAMPLE.COM
      hdfs dfs -mkdir -p /user/jdoe
      hdfs dfs -chown jdoe:jdoe /user/jdoe
      kinit jdoe@EXAMPLE.COM
      hdfs dfs -mkdir -p /user/jdoe/indir
      hdfs dfs -put /usr/share/doc/search*/examples/test-documents/sample-statuses-*.avro /user/jdoe/indir/
      hdfs dfs -ls /user/jdoe/indir
    • Package-based Installation (Security Disabled):
      sudo -u hdfs hdfs dfs -mkdir -p /user/jdoe
      sudo -u hdfs hdfs dfs -chown jdoe:jdoe /user/jdoe
      hdfs dfs -mkdir -p /user/jdoe/indir
      hdfs dfs -put /usr/share/doc/search*/examples/test-documents/sample-statuses-*.avro /user/jdoe/indir/
      hdfs dfs -ls /user/jdoe/indir
  2. Ensure that outdir exists in HDFS and is empty. If outdir does not exist yet, the rm command reports an error that you can ignore:
    hdfs dfs -rm -r -skipTrash /user/jdoe/outdir
    hdfs dfs -mkdir /user/jdoe/outdir
    hdfs dfs -ls /user/jdoe/outdir
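
Because hdfs dfs -rm reports an error when its target is missing, a script that prepares outdir may want to check for the directory first. The following is a minimal sketch under that assumption. The function name reset_hdfs_dir is hypothetical; the hdfs dfs subcommands it calls (-test -d, -rm -r -skipTrash, -mkdir -p, -ls) are standard HDFS shell commands.

```shell
#!/bin/sh
# Hypothetical helper: make sure an HDFS directory exists and is empty.
reset_hdfs_dir() {
    dir="$1"
    # -test -d exits 0 only when the path exists and is a directory,
    # so the delete is skipped on a first run.
    if hdfs dfs -test -d "$dir"; then
        hdfs dfs -rm -r -skipTrash "$dir" || return 1
    fi
    # Recreate the directory, then list it to confirm that it is empty.
    hdfs dfs -mkdir -p "$dir" || return 1
    hdfs dfs -ls "$dir"
}
```

For this tutorial the call would be reset_hdfs_dir /user/jdoe/outdir, run as the user who owns /user/jdoe.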

The sample tweets are now in HDFS and ready to be indexed. Continue to Using MapReduce Batch Indexing to Index Sample Tweets to index the sample tweets, or to Near Real Time (NRT) Indexing Tweets Using Flume to index live tweets from the Twitter public stream.