1. Understand the Basics

The Hortonworks Data Platform consists of three layers.

  • Core Hadoop: The basic components of Apache Hadoop.

    • Hadoop Distributed File System (HDFS): A special-purpose file system designed to work with the MapReduce engine. It provides high-throughput access to data in a highly distributed environment.

    • Apache Hadoop YARN: YARN is a general-purpose, distributed application management framework that supersedes the classic Apache Hadoop MapReduce framework for processing data in Hadoop clusters. The fundamental idea of YARN is to split the two major responsibilities of the JobTracker, that is, resource management and job scheduling/monitoring, into separate daemons: a global ResourceManager and a per-application ApplicationMaster (AM). The ResourceManager and the per-node slave, the NodeManager (NM), form the new, generic system for managing applications in a distributed manner. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The per-application ApplicationMaster is, in effect, a framework-specific entity tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the component tasks.

    • MapReduce: A framework for performing high-volume distributed data processing using the MapReduce programming paradigm.

  • Essential Hadoop: A set of Apache components designed to ease working with Core Hadoop.

    • Apache Pig: A platform for creating higher-level data flow programs in Pig Latin, the platform’s native language, that can be compiled into sequences of MapReduce programs.

    • Apache Hive: A tool for creating higher-level SQL-like queries in HiveQL, the tool’s native language, that can be compiled into sequences of MapReduce programs.

    • Tez: A general-purpose, highly customizable framework that simplifies data-processing tasks across both small-scale (low-latency) and large-scale (high-throughput) workloads in Hadoop.

    • Apache HCatalog: A metadata abstraction layer that insulates users and scripts from how and where data is physically stored.

    • Apache HBase: A distributed, column-oriented database that provides the ability to access and manipulate data randomly in the context of the large blocks that make up HDFS.

    • Apache ZooKeeper: A centralized tool for providing services to highly distributed systems. ZooKeeper is necessary for HBase installations.

  • Supporting Components: A set of Apache components designed to ease working with Core Hadoop.

    • Apache Oozie: A server-based workflow engine optimized for running workflows that execute Hadoop jobs.

    • Apache Sqoop: A component that provides a mechanism for moving data between HDFS and external structured datastores. It can be integrated with Oozie workflows.
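The division of responsibilities described for YARN above can be pictured with a small toy model. This is only an illustrative sketch in plain Python, not the YARN API: the class names mirror the daemon names from the text, but every method, the slot-based notion of a "resource", and the first-fit allocation policy are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Per-node daemon in this toy model: runs tasks on one machine."""
    name: str
    free_slots: int  # hypothetical unit of resource; real YARN uses containers
    running: list = field(default_factory=list)

    def launch(self, task: str) -> None:
        # Record the task as running on this node.
        self.running.append(task)


class ResourceManager:
    """Global authority that arbitrates resources among applications."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def allocate(self, requested: int):
        """Grant up to `requested` slots; returns one node entry per slot.

        First-fit policy, chosen only to keep the sketch short.
        """
        granted = []
        for node in self.nodes:
            while node.free_slots > 0 and len(granted) < requested:
                node.free_slots -= 1
                granted.append(node)
        return granted


class ApplicationMaster:
    """Per-application entity: negotiates slots, then drives the tasks."""

    def __init__(self, app_name: str, tasks):
        self.app_name = app_name
        self.pending = list(tasks)

    def run(self, resource_manager: ResourceManager) -> int:
        """Request one slot per pending task and launch what was granted.

        Returns the number of tasks still waiting for resources.
        """
        for node in resource_manager.allocate(len(self.pending)):
            task = self.pending.pop(0)
            node.launch(f"{self.app_name}:{task}")
        return len(self.pending)


# Hypothetical three-slot cluster and a four-task application.
nodes = [NodeManager("node1", free_slots=2), NodeManager("node2", free_slots=1)]
master = ApplicationMaster("wordcount", ["map-0", "map-1", "map-2", "reduce-0"])
still_pending = master.run(ResourceManager(nodes))
```

In this sketch the application asks for four slots but the cluster has only three, so one task stays pending. The point is the split of roles: only the ResourceManager decides who gets resources, while the ApplicationMaster decides what to run on the resources it was granted.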

You must always install Core Hadoop, but you can select the components from the other layers based on your needs. For more information on the structure of the HDP, see Understanding Hadoop Ecosystem.
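The selection rule above can be sketched as a small planning helper. This is a minimal illustration in plain Python, not an HDP or installer tool: the `LAYERS` and `REQUIRES` tables and the `plan_install` function are hypothetical names, and adding ZooKeeper automatically whenever HBase is selected is one assumed way to honor the HBase requirement stated in the list.

```python
# Layers and components as listed in this section (names shortened).
LAYERS = {
    "Core Hadoop": {"HDFS", "YARN", "MapReduce"},
    "Essential Hadoop": {"Pig", "Hive", "Tez", "HCatalog", "HBase", "ZooKeeper"},
    "Supporting Components": {"Oozie", "Sqoop"},
}

# The only dependency stated in the text: HBase needs ZooKeeper.
REQUIRES = {"HBase": {"ZooKeeper"}}


def plan_install(optional):
    """Return the sorted list of components to install.

    Core Hadoop is always included; `optional` holds the components
    chosen from the other layers. Unknown names are rejected.
    """
    known = set().union(*LAYERS.values())
    unknown = set(optional) - known
    if unknown:
        raise ValueError(f"unknown components: {sorted(unknown)}")

    selected = set(LAYERS["Core Hadoop"]) | set(optional)
    # Pull in stated prerequisites (single level is enough for this table).
    for component in list(selected):
        selected |= REQUIRES.get(component, set())
    return sorted(selected)


plan = plan_install({"HBase", "Pig"})
# plan == ['HBase', 'HDFS', 'MapReduce', 'Pig', 'YARN', 'ZooKeeper']
```

Calling `plan_install(set())` returns only the three Core Hadoop components, which corresponds to the smallest installation the text allows.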
