Ingest data into Apache Druid
When Apache Druid (incubating) ingests data, it indexes the data at the same time. You can ingest and index data in any of the following ways:
- Through Hadoop. The data for ingestion must be on HDFS.
- From a file.
- From a Kafka stream. The data must be in a Kafka topic.
For more information about ingestion methods, see the documentation on the druid.io website (link below).
In this task, you ingest data from a file: you create an index task specification in JSON and submit it to Druid over HTTP, as shown in the sketch below.
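The following is a minimal sketch of what such an index task specification might look like. The datasource name, file path, timestamp column, and dimension names are placeholders, and the exact field layout varies between Druid versions, so check the ingestion specification reference for the release you run.

```json
{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "example-datasource",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "json",
          "timestampSpec": { "column": "timestamp", "format": "auto" },
          "dimensionsSpec": { "dimensions": ["channel", "page", "user"] }
        }
      },
      "metricsSpec": [],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "day",
        "queryGranularity": "none",
        "rollup": false
      }
    },
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "local",
        "baseDir": "/path/to/data",
        "filter": "example-data.json"
      },
      "appendToExisting": false
    },
    "tuningConfig": {
      "type": "index",
      "maxRowsInMemory": 25000
    }
  }
}
```

To submit the task over HTTP, you could POST the specification to the Overlord's task endpoint, for example with `curl -X POST -H 'Content-Type: application/json' -d @index-task.json http://localhost:8090/druid/indexer/v1/task`; the hostname, port, and file name shown here are assumptions that depend on your deployment.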