Indexing a File Containing Tweets with Flume SpoolingDirectorySource
SpoolingDirectorySource specifies a directory on a local disk that Flume monitors. Flume automatically transfers data from files in this directory to Solr. SpoolingDirectorySource sends data through a channel to a sink, in this case a SolrSink.
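The wiring below is a minimal sketch of what that source-channel-sink pipeline can look like in flume.conf, assuming the component names used in this tutorial (spoolSrc, a memory channel, and a MorphlineSolrSink); the flume.conf shipped with your installation may name and tune these differently:
# Illustrative excerpt, not the shipped file.
agent.sources  = spoolSrc
agent.channels = memoryChannel
agent.sinks    = solrSink
# SpoolingDirectorySource watches a local directory for new files.
agent.sources.spoolSrc.type     = spooldir
agent.sources.spoolSrc.spoolDir = /tmp/myspooldir
agent.sources.spoolSrc.channels = memoryChannel
# Buffer events in memory between the source and the sink.
agent.channels.memoryChannel.type = memory
# MorphlineSolrSink transforms events and loads them into Solr.
agent.sinks.solrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent.sinks.solrSink.morphlineFile = /etc/flume-ng/conf/morphline.conf
agent.sinks.solrSink.channel = memoryChannel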
- Delete all existing documents in Solr:
$ sudo /etc/init.d/flume-ng-agent stop
$ solrctl collection --deletedocs collection3
- Comment out TwitterSource and HTTPSource in /etc/flume-ng/conf/flume.conf and instead uncomment SpoolingDirectorySource:
# Comment out "agent.sources = httpSrc"
# Uncomment "agent.sources = spoolSrc"
- Delete any old spool directory and create a new spool directory:
$ rm -fr /tmp/myspooldir
$ sudo -u flume mkdir /tmp/myspooldir
- Restart the Flume agent:
$ sudo /etc/init.d/flume-ng-agent restart
- Send a file containing tweets to the SpoolingDirectorySource. Use the copy-then-atomic-move file system trick to ensure no partial files are ingested:
$ sudo -u flume cp \
/usr/share/doc/search*/examples/test-documents/sample-statuses-20120906-141433-medium.avro \
/tmp/myspooldir/.sample-statuses-20120906-141433-medium.avro
$ sudo -u flume mv /tmp/myspooldir/.sample-statuses-20120906-141433-medium.avro \
/tmp/myspooldir/sample-statuses-20120906-141433-medium.avro
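The trick works because a rename within a single file system is atomic, and the leading dot hides the file from the spooling source while cp is still writing it. A hypothetical helper that packages the same pattern for any file (the function name and hard-coded spool path are illustrative, not part of the tutorial):
# Illustrative helper: stage a file into the spool directory atomically.
spool_file() {
  name="$(basename "$1")"
  # Copy under a hidden name, then atomically rename into place.
  sudo -u flume cp "$1" "/tmp/myspooldir/.$name"
  sudo -u flume mv "/tmp/myspooldir/.$name" "/tmp/myspooldir/$name"
}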
- Check the log for status or errors:
$ cat /var/log/flume-ng/flume.log
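To surface problems quickly, it can help to filter the log for error-level entries, for example:
$ grep -i -e error -e exception /var/log/flume-ng/flume.log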
- Check the completion status:
$ find /tmp/myspooldir
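If ingestion succeeded, SpoolingDirectorySource renames each fully processed file with a .COMPLETED suffix (the source's default fileSuffix), so the listing should look roughly like:
/tmp/myspooldir
/tmp/myspooldir/sample-statuses-20120906-141433-medium.avro.COMPLETED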
Use the Cloudera Search GUI to verify that new tweets have been ingested into Solr. For example, on the localhost, open http://localhost:8983/solr/collection3/select?q=*%3A*&wt=json&indent=true.
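You can also run the same query from the command line; the numFound field in the JSON response reports how many documents are indexed:
$ curl 'http://localhost:8983/solr/collection3/select?q=*:*&rows=0&wt=json&indent=true'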