This is the documentation for CDH 5.0.x. Documentation for other versions is available at Cloudera Documentation.

Indexing a File Containing Tweets with Flume SpoolingDirectorySource

SpoolingDirectorySource monitors a directory on a local disk for new files. Flume automatically ingests data from files placed in this directory and transfers it to Solr: the source passes events through a channel to a sink, in this case a SolrSink.
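The source-channel-sink wiring described above can be sketched as a minimal Flume configuration fragment. This is an illustrative sketch only; the component names (spoolSrc, memoryChannel, solrSink) and the sketch file path are assumptions, not the actual contents of /etc/flume-ng/conf/flume.conf.

```shell
# Write an illustrative flume.conf fragment to a scratch file
# (component names and the file path are assumptions for this sketch).
cat > /tmp/flume-spool-sketch.conf <<'EOF'
agent.sources = spoolSrc
agent.channels = memoryChannel
agent.sinks = solrSink

# SpoolingDirectorySource: watch a directory for new files
agent.sources.spoolSrc.type = spooldir
agent.sources.spoolSrc.spoolDir = /tmp/myspooldir
agent.sources.spoolSrc.channels = memoryChannel

# In-memory channel between source and sink
agent.channels.memoryChannel.type = memory

# SolrSink: load events into Solr
agent.sinks.solrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent.sinks.solrSink.channel = memoryChannel
EOF

cat /tmp/flume-spool-sketch.conf
```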

  1. Stop the Flume agent, then delete all existing documents in Solr:
    $ sudo /etc/init.d/flume-ng-agent stop
    $ solrctl collection --deletedocs collection3
  2. Comment out TwitterSource and HTTPSource in /etc/flume-ng/conf/flume.conf and instead uncomment SpoolingDirectorySource:
    # Comment out "agent.sources = httpSrc"
    # Uncomment "agent.sources = spoolSrc"
  3. Delete any old spool directory and create a new spool directory:
    $ rm -fr /tmp/myspooldir
    $ sudo -u flume mkdir /tmp/myspooldir
  4. Restart the Flume Agent:
    $ sudo /etc/init.d/flume-ng-agent restart
  5. Send a file containing tweets to the SpoolingDirectorySource. Use the copy-then-atomic-move file system trick to ensure no partial files are ingested:
    $ sudo -u flume cp \
    /usr/share/doc/search*/examples/test-documents/sample-statuses-20120906-141433-medium.avro \
    /tmp/myspooldir/.sample-statuses-20120906-141433-medium.avro
    $ sudo -u flume mv /tmp/myspooldir/.sample-statuses-20120906-141433-medium.avro \
    /tmp/myspooldir/sample-statuses-20120906-141433-medium.avro
  6. Check the log for status or errors:
    $ cat /var/log/flume-ng/flume.log
  7. Check the completion status. By default, SpoolingDirectorySource renames a file it has fully ingested by appending a .COMPLETED suffix:
    $ find /tmp/myspooldir
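The copy-then-atomic-move trick from step 5 can be demonstrated in isolation. The scratch paths below (/tmp/demo-spool, demo.avro) are assumptions for illustration, not the real spool directory or sample file.

```shell
# Scratch directory standing in for the spool directory (an assumption)
mkdir -p /tmp/demo-spool
echo 'tweet data' > /tmp/demo-source.avro

# 1. Copy under a dot-prefixed name. SpoolingDirectorySource ignores
#    hidden files, so a partially written copy is never picked up.
cp /tmp/demo-source.avro /tmp/demo-spool/.demo.avro

# 2. rename(2) within one file system is atomic, so the file appears
#    under its final, visible name only once it is complete.
mv /tmp/demo-spool/.demo.avro /tmp/demo-spool/demo.avro

ls /tmp/demo-spool
```

A plain `cp` directly into the spool directory risks the source reading a half-written file; the hidden-name copy plus atomic rename closes that window.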

Use the Cloudera Search GUI to verify that new tweets have been ingested into Solr. For example, if Solr is running on localhost, open http://localhost:8983/solr/collection3/select?q=*%3A*&wt=json&indent=true.
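The same check can be run from a terminal instead of a browser. The query URL is the one given above (`%3A` is the URL encoding of `:`); fetching it with curl is a sketch that assumes Solr is reachable on localhost.

```shell
# The select query from the documentation; q=*%3A* matches all documents
url="http://localhost:8983/solr/collection3/select?q=*%3A*&wt=json&indent=true"
echo "$url"

# With Solr running, the JSON response could be fetched with, e.g.:
#   curl "$url"
```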

Page generated September 3, 2015.