Indexing a File Containing Tweets with Flume HTTPSource

The examples in this tutorial assume an environment set up with a package-based installation. If you installed Cloudera Search using parcels, adjust the file paths accordingly.

HTTPSource lets you ingest data into Solr by POSTing a file over HTTP. HTTPSource passes the data through a channel to a sink, in this case a SolrSink. For more information, see Flume Solr BlobHandler Configuration Options.

Stop the Flume agent, then delete all existing documents in Solr:

$ sudo /etc/init.d/flume-ng-agent stop
$ solrctl collection --deletedocs collection3

Comment out TwitterSource in /etc/flume-ng/conf/flume.conf and uncomment HTTPSource:

# comment out “agent.sources = twitterSrc”
# uncomment “agent.sources = httpSrc”
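
The edit can also be scripted. The following is a minimal sketch, not part of the original tutorial: the function name switch_to_http_source is hypothetical, and it assumes the two lines appear in the file as "agent.sources = twitterSrc" and "#agent.sources = httpSrc". It writes a .bak copy before changing anything.

```shell
#!/bin/sh
# Hypothetical helper: make httpSrc the active source in a Flume config file.
# Assumption: the file contains the lines "agent.sources = twitterSrc" and
# "#agent.sources = httpSrc" (whitespace around "=" may vary).
switch_to_http_source() {
    conf="$1"
    cp "$conf" "$conf.bak"    # keep a backup of the original file
    # First expression: prefix the active twitterSrc line with "#".
    # Second expression: strip the leading "#" from the httpSrc line.
    sed -e 's/^[[:space:]]*agent\.sources[[:space:]]*=[[:space:]]*twitterSrc/#&/' \
        -e 's/^#[[:space:]]*\(agent\.sources[[:space:]]*=[[:space:]]*httpSrc\)/\1/' \
        "$conf.bak" > "$conf"
}
```

Run as a user with write access to the file, the call would be switch_to_http_source /etc/flume-ng/conf/flume.conf. Compare the result with the .bak copy before restarting the agent.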

Restart the Flume Agent:

$ sudo /etc/init.d/flume-ng-agent restart

Send a file containing tweets to the HTTPSource:

$ curl --data-binary \
@/usr/share/doc/search-1.0.0+cdh5.2.8+0/examples/test-documents/sample-statuses-20120906-141433-medium.avro \
'http://127.0.0.1:5140?resourceName=sample-statuses-20120906-141433-medium.avro' \
--header 'Content-Type:application/octet-stream' --verbose
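
When scripting the upload, it can be more convenient to print only the HTTP status code than to read the --verbose output. The following is a sketch under stated assumptions: the function names http_source_url and post_to_http_source are hypothetical, the default endpoint is the one used above, and the file name is assumed to contain no characters that need URL encoding.

```shell
#!/bin/sh
# Hypothetical helper: build the HTTPSource URL for a file, using the file's
# base name as the resourceName parameter. No URL encoding is applied.
http_source_url() {
    file="$1"
    endpoint="${2:-http://127.0.0.1:5140}"    # endpoint used in this tutorial
    printf '%s?resourceName=%s\n' "$endpoint" "$(basename "$file")"
}

# Hypothetical helper: POST the file and print only the HTTP status code.
post_to_http_source() {
    file="$1"
    curl --silent --output /dev/null --write-out '%{http_code}\n' \
        --data-binary "@$file" \
        --header 'Content-Type:application/octet-stream' \
        "$(http_source_url "$file" "$2")"
}
```

Calling post_to_http_source with the path to the .avro file should print 200 if the source accepted the request. Any other code is a cue to check the Flume log.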

Check the log for status or errors:
```
$ cat /var/log/flume-ng/flume.log 
```
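
Because the full log can be long, filtering it for problems is often quicker. This is a minimal sketch: the function name recent_flume_errors is hypothetical, and the pattern assumes that problems are logged with the words "error" or "exception", which may not catch every failure.

```shell
#!/bin/sh
# Hypothetical helper: print the last 20 log lines that mention an error or
# an exception (case-insensitive). Defaults to the log path used above.
recent_flume_errors() {
    log="${1:-/var/log/flume-ng/flume.log}"
    grep -iE 'error|exception' "$log" | tail -n 20
}
```

Calling recent_flume_errors with no argument reads /var/log/flume-ng/flume.log. An empty result means only that no line matched the pattern.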

Use the Cloudera Search GUI at http://localhost:8983/solr/collection3/select?q=*%3A*&wt=json&indent=true to verify that new tweets have been ingested into Solr as expected.
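
The same check can be made from the command line by reading the numFound field of the JSON response. This is a sketch, not part of the original tutorial: the function name num_found is hypothetical, and it pulls the field out with a plain text match instead of a JSON parser.

```shell
#!/bin/sh
# Hypothetical helper: read a Solr JSON select response on stdin and print
# the value of the first "numFound" field.
num_found() {
    grep -oE '"numFound":[[:space:]]*[0-9]+' | head -n 1 | grep -oE '[0-9]+$'
}
```

For example, curl -s 'http://localhost:8983/solr/collection3/select?q=*%3A*&wt=json&rows=0' | num_found prints the number of indexed documents. Setting rows=0 asks Solr for the count without returning the documents themselves.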

