1. HDP Components

The Hortonworks Data Platform (HDP) consists of three layers:

  • Core Hadoop 2: The basic components of Apache Hadoop version 2.x.

    • Hadoop Distributed File System (HDFS): A special-purpose file system designed to provide high-throughput access to data in a highly distributed environment.

    • YARN: A resource negotiator that schedules and manages high-volume distributed data processing across the cluster; this function was previously part of the first version of MapReduce.

    • MapReduce 2 (MR2): A set of client libraries for computation using the MapReduce programming paradigm, plus a History Server that logs job and task information; these were previously part of the first version of MapReduce. (A minimal word-count sketch appears after this list.)

  • Essential Hadoop: A set of Apache components designed to ease working with Core Hadoop.

    • Apache Pig: A platform for creating higher-level data flow programs in Pig Latin, the platform’s native language, which are compiled into sequences of MapReduce programs. (See the PigServer sketch after this list.)

    • Apache Hive: A tool for creating higher-level, SQL-like queries in HiveQL, the tool’s native language, which are compiled into sequences of MapReduce programs. (See the JDBC sketch after this list.)

    • Apache HCatalog: A metadata abstraction layer that insulates users and scripts from how and where data is physically stored.

    • WebHCat (Templeton): A component that provides a set of REST-like APIs for HCatalog and related Hadoop components.

    • Apache HBase: A distributed, column-oriented database that provides random access to and manipulation of data stored in the large blocks that make up HDFS. (See the client sketch after this list.)

    • Apache ZooKeeper: A centralized coordination service for highly distributed systems. ZooKeeper is required for HBase installations. (See the client sketch after this list.)

  • Supporting Components: A set of components that allow you to monitor your Hadoop installation and to connect Hadoop with your larger compute environment.

    • Apache Oozie: A server-based workflow engine optimized for running workflows that execute Hadoop jobs.

    • Apache Sqoop: A component that provides a mechanism for moving data between HDFS and external structured datastores; Sqoop can be integrated with Oozie workflows.

    • Apache Flume: A log aggregator. This component must be installed manually.

    • Apache Mahout: A scalable machine learning library that implements several different approaches to machine learning.

    • Apache Knox: A REST API gateway that provides a single access point for REST interactions with Apache Hadoop clusters.

    • Apache Storm: A distributed, real-time computation system for processing large volumes of data.

    • Apache Spark: An in-memory data processing engine with development APIs that enable rapid execution of streaming, machine learning, and SQL workloads requiring iterative access to datasets.

    • Apache Phoenix: A relational database layer on top of Apache HBase that supports SQL queries over HBase data. (See the JDBC sketch after this list.)

    • Apache Tez: An extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. For additional information, see the Hortonworks website.

    • Apache Falcon: A framework for simplifying and orchestrating data management and pipeline processing in Apache Hadoop. For additional information, see the Hortonworks website.

    • Apache Ranger: The Hadoop cluster security component. Ranger provides centralized security policy administration for authorization, auditing, and data protection requirements.

    • Apache DataFu: A library of user-defined functions for common data analysis tasks.

    • Apache Slider: A YARN-based framework for deploying and managing long-running or always-on data access applications.
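
To make the MapReduce programming paradigm concrete, here is the canonical word-count job written against the Hadoop 2 (MR2) Java API, as a minimal sketch. The input and output paths are taken from the command line; everything else uses only standard Hadoop classes.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Emits (word, 1) for every token in each input line.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Sums the emitted counts for each word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }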
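
Pig Latin scripts can be submitted from Java through Pig's PigServer class. The sketch below registers the same word count as a data flow; the paths input.txt and wordcounts are assumptions of this sketch.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigWordCount {
      public static void main(String[] args) throws Exception {
        // ExecType.MAPREDUCE compiles the script into MapReduce jobs;
        // ExecType.LOCAL runs it in-process for testing.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // Pig Latin: load lines, split into words, group, and count.
        pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery(
            "words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery(
            "counts = FOREACH grouped GENERATE group, COUNT(words);");

        // Triggers compilation and execution; writes results to 'wordcounts'.
        pig.store("counts", "wordcounts");
      }
    }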
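
HiveQL queries are typically submitted through HiveServer2's JDBC driver. A minimal sketch, assuming a HiveServer2 instance at hiveserver.example.com:10000, an existing words table, and credentials appropriate to the cluster's security setup:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuery {
      public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Host, port, and credentials below are assumptions.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver.example.com:10000/default",
                 "hive", "");
             Statement stmt = conn.createStatement()) {

          // HiveQL looks like SQL but is compiled into distributed jobs.
          try (ResultSet rs = stmt.executeQuery(
                   "SELECT word, COUNT(*) AS n FROM words GROUP BY word")) {
            while (rs.next()) {
              System.out.println(rs.getString("word") + "\t" + rs.getLong("n"));
            }
          }
        }
      }
    }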
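
HBase's random-access model is visible in its Java client API. A minimal sketch against the HBase 1.x client, assuming a users table with an info column family already exists and that hbase-site.xml is on the classpath:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseRandomAccess {
      public static void main(String[] args) throws Exception {
        // Reads the cluster location from hbase-site.xml on the classpath.
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             // Table 'users' with family 'info' is an assumption.
             Table table = conn.getTable(TableName.valueOf("users"))) {

          // Random write: one cell keyed by row, family, and qualifier.
          Put put = new Put(Bytes.toBytes("row-42"));
          put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                        Bytes.toBytes("Ada"));
          table.put(put);

          // Random read of the same row, without scanning the table.
          Result result = table.get(new Get(Bytes.toBytes("row-42")));
          byte[] name =
              result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
          System.out.println(Bytes.toString(name));
        }
      }
    }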
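
ZooKeeper exposes a small, file-system-like namespace of znodes that distributed processes use to share state. A minimal sketch, assuming an ensemble reachable at zk.example.com:2181 and a Java 8 client:

    import java.nio.charset.StandardCharsets;
    import java.util.concurrent.CountDownLatch;

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkExample {
      public static void main(String[] args) throws Exception {
        // Block until the session is established; the ensemble
        // address is an assumption of this sketch.
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("zk.example.com:2181", 15000, event -> {
          if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
            connected.countDown();
          }
        });
        connected.await();

        // Store a small piece of shared configuration in a znode.
        byte[] data = "v1".getBytes(StandardCharsets.UTF_8);
        zk.create("/demo-config", data, ZooDefs.Ids.OPEN_ACL_UNSAFE,
                  CreateMode.PERSISTENT);

        // Any client in the cluster can read the same value back.
        byte[] read = zk.getData("/demo-config", false, null);
        System.out.println(new String(read, StandardCharsets.UTF_8));

        zk.close();
      }
    }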
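
Phoenix is reached through a standard JDBC driver whose URL names the cluster's ZooKeeper quorum. A minimal sketch, assuming a quorum host of zk.example.com; the metrics table and its columns are assumptions:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class PhoenixExample {
      public static void main(String[] args) throws Exception {
        // The quorum host in the URL is an assumption of this sketch.
        try (Connection conn =
                 DriverManager.getConnection("jdbc:phoenix:zk.example.com");
             Statement stmt = conn.createStatement()) {

          // DDL and queries are SQL; Phoenix maps them onto HBase tables.
          stmt.execute("CREATE TABLE IF NOT EXISTS metrics ("
              + "host VARCHAR NOT NULL PRIMARY KEY, cpu DECIMAL)");
          // Phoenix uses UPSERT rather than INSERT.
          stmt.executeUpdate("UPSERT INTO metrics VALUES ('node1', 0.75)");
          conn.commit(); // Phoenix connections do not auto-commit by default.

          try (ResultSet rs = stmt.executeQuery(
                   "SELECT host, cpu FROM metrics WHERE cpu > 0.5")) {
            while (rs.next()) {
              System.out.println(rs.getString(1) + " " + rs.getBigDecimal(2));
            }
          }
        }
      }
    }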

