Apache Druid architectural overview
Druid Architecture
Apache Druid (incubating) supports streaming ingestion and batch ingestion data analytics modes. A Druid cluster consists of several Druid node types and components. Each Druid node is optimized to serve particular functions. The following list is an overview of Druid node types:
- Realtime nodes
These nodes ingest and index streaming data that is generated by system events. The nodes construct segments from the data and store the segments until these segments are sent to historical nodes. The realtime nodes do not store segments after the segments are transferred.
- Historical nodes
These nodes are designed to serve queries over immutable, historical data. Historical nodes download immutable, read-optimized Druid segments from deep storage and use memory-mapped files to load them into available memory. Each historical node tracks the segments it has loaded in ZooKeeper and transmits this information to other nodes of the Druid cluster when needed.
- Broker nodes
These nodes form a gateway between external clients and various historical and realtime nodes. External clients send queries to broker nodes. The nodes then break each query into smaller queries based on the location of segments for the queried interval and forwards them to the appropriate historical or realtime nodes. Broker nodes merge query results and send them back to the client. These nodes can also be configured to use a local or distributed cache for caching query results for individual segments.
- Coordinator nodes
These nodes mainly serve to assign segments to historical nodes, handle data replication, and to ensure that segments are distributed evenly across the historical nodes. They also provide a UI to manage different datasources and configure rules to load data and drop data for individual datas sources. The UI can be accessed via Ambari Quick Links.
- Middle
manager nodes
These nodes are responsible for running various tasks related to data ingestion, realtime indexing, and segment archives. Each Druid task is run as a separate JVM.
- Overlord nodes
These nodes handle task management and maintain a task queue that consists of user-submitted tasks. The queue is processed by assigning tasks in order to the middle manager nodes, which actually run the tasks.The overlord nodes also support a UI that provides a view of the current task queue and access to task logs. The UI can be accessed via Ambari Quick Links for Druid.
To use Druid in a real-world environment, the cluster must have access to the following resources to make Druid operational in HDP:
- ZooKeeper:
A Druid instance requires that you select Apache ZooKeeper as a Service when you add Druid to the cluster; otherwise, Ambari does not add Druid to the cluster. ZooKeeper coordinates Druid nodes and manages elections among coordinator and overlord nodes.
- Deep storage:
HDFS or Amazon S3 can be used as the deep storage layer for Druid in HDP. In Ambari, you can select HDFS as a Service for this storage layer. Alternatively, you can set up Druid to use Amazon S3 as the deep storage layer by setting the
druid.storage.type
property tos3
. The cluster relies on the distributed file system to store Druid segments for permanent backup of the data. - Metadata storage:
The metadata store is used to persist information about Druid segments and tasks. MySQL and Postgres are supported metadata stores. You can select the metadata database when you install and configure Druid with Ambari.
- Batch execution engine:
Select YARN + MapReduce2 for the execution resource manager and execution engine, respectively. Druid Hadoop index tasks use MapReduce jobs for distributed ingestion of large amounts of data.
- (Optional) Druid metrics reporting:
If you plan to monitor Druid performance metrics using Grafana dashboards in Ambari, select Ambari Metrics System as a Service.
If you plan to deploy high availability (HA) on a Druid cluster, you need to know which components to install and how to configure the installation so that the Druid instance is primed for an HA environment.