Cloudera Glossary

This is a reference list of terms that arise frequently in discussions of Cloudera products and services. Additional information is available from a number of resources.


Accumulo

See Apache Accumulo.

Apache Accumulo

A sorted, distributed key-value store based on Google's BigTable design. Accumulo is a NoSQL DBMS that operates over HDFS, and supports efficient storage and retrieval of structured data, including queries for ranges. Accumulo tables can be used as input and output for MapReduce jobs. Accumulo features include automatic load-balancing and partitioning, data compression, and fine-grained security labels.

Apache Avro

A serialization system for storing and transmitting data over a network. Avro supports rich data structures, a compact binary encoding, and a container file for sequences of Avro data (often referred to as "Avro data files"). Avro is language-independent and several language bindings are available for it, including Java, C, C++, Python, and Ruby. All components in CDH that produce or consume files support Avro data files as a file format.

Avro provides functionality similar to systems such as Apache Thrift and Protocol Buffers.
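
As an illustration, here is a minimal sketch that writes one record to an Avro data file using the Java generic API; the User schema and the users.avro file name are hypothetical examples:

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class AvroExample {
      public static void main(String[] args) throws Exception {
        // A hypothetical two-field record schema, given inline as JSON.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"age\",\"type\":\"int\"}]}");
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Ada");
        user.put("age", 36);
        // Write an Avro data file (container file) holding the record.
        DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
            new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, new File("users.avro"));
        writer.append(user);
        writer.close();
      }
    }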

Apache Bigtop

A project to develop the packaging and interoperability testing of the Apache Hadoop ecosystem projects.

Apache Crunch

A Java library for writing, testing, and running MapReduce pipelines.
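
A minimal pipeline sketch using the Crunch Java API, assuming hypothetical /data/input and /data/output paths:

    import org.apache.crunch.PCollection;
    import org.apache.crunch.PTable;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.lib.Aggregate;

    public class CrunchExample {
      public static void main(String[] args) {
        Pipeline pipeline = new MRPipeline(CrunchExample.class);
        PCollection<String> lines = pipeline.readTextFile("/data/input");
        // Count occurrences of each distinct line; Aggregate.count
        // emits (line, count) pairs.
        PTable<String, Long> counts = Aggregate.count(lines);
        pipeline.writeTextFile(counts, "/data/output");
        pipeline.done();  // plan and run the underlying MapReduce jobs
      }
    }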

Apache Flume

A distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of text or streaming data from many different sources to a centralized data store.

Apache Giraph

A large-scale, fault-tolerant, graph processing framework that runs on Apache Hadoop.

Apache Hadoop

A free, open source software framework that supports data-intensive distributed applications. The core components of Apache Hadoop are HDFS and the MapReduce and YARN processing frameworks. The term is also used for an ecosystem of projects related to Hadoop that fall under the umbrella of infrastructure for distributed computing and large-scale data processing.

Apache HBase

A scalable, distributed, column-oriented data store. It provides real-time read/write random access to very large datasets hosted on HDFS.
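
For illustration, a minimal sketch of a random write and read with the classic HTable Java client API; the table name, column family, and values are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");  // hypothetical table
        // Write one cell: row key, column family, qualifier, value.
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
        table.put(put);
        // Random read of the same row.
        Result result = table.get(new Get(Bytes.toBytes("row1")));
        System.out.println(Bytes.toString(
            result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"))));
        table.close();
      }
    }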

Apache Hive

A data warehouse system for Hadoop that facilitates summarization and the analysis of large datasets stored in HDFS using an SQL-like language called HiveQL.

Apache Incubator

The Apache Software Foundation gateway for open source projects that aim to become Apache projects. Incubating projects are open source, but may or may not ultimately become full Apache projects.

Apache Kafka

A distributed publish-subscribe messaging system. It provides high throughput for publishing and subscribing, as well as replication to prevent data loss.
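
A minimal publisher sketch using the Kafka Java producer API; the broker address and topic name are hypothetical:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class KafkaExample {
      public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");  // hypothetical broker
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        Producer<String, String> producer = new KafkaProducer<String, String>(props);
        // Publish one message to the "events" topic; any number of
        // subscribers can consume it independently.
        producer.send(new ProducerRecord<String, String>("events", "key1", "hello"));
        producer.close();
      }
    }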

Apache Mahout

A machine-learning library for Hadoop. It enables you to build machine-learning applications that scale to large datasets, simplifying the task of building intelligent applications. The main use cases supported by Mahout are:

  • Recommendation mining - identifies items users will like based on past preferences; for example, online shopping recommendations.
  • Clustering - groups similar items; for example, documents that address similar topics.
  • Classification - learns what members of existing categories have in common, then uses that information to categorize new items.
  • Frequent item-set mining - takes a set of item-groups (such as items in a query session or shopping cart contents) and identifies items that usually appear together.

Apache Maven

A software project management tool. Based on the concept of a project object model, Maven can manage a project's build, reporting, and documentation. CDH artifacts are available in the Cloudera Maven repository.

Apache Oozie

A workflow and coordination service that orchestrates data ingestion, storage, transformation, and analysis actions.

Apache Pig

A data flow language and parallel execution framework that is built on top of MapReduce. Internally, a compiler translates Pig statements into a directed acyclic graph of MapReduce jobs, which are submitted to Hadoop for execution.
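
As an illustration, a minimal sketch that runs Pig Latin statements through the embedded PigServer Java API; the paths and aliases are hypothetical:

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigExample {
      public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        // Each registered statement extends the logical plan; the plan is
        // compiled into a DAG of MapReduce jobs when output is requested.
        pig.registerQuery("lines = LOAD '/data/input' AS (line:chararray);");
        pig.registerQuery("grouped = GROUP lines BY line;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(lines);");
        pig.store("counts", "/data/output");  // triggers job submission
      }
    }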

Apache Sentry

An independent security module that delivers fine-grained, enterprise-grade authorization to data stored in Apache Hadoop. Integrating with the open source SQL query engines Apache Hive and Cloudera Impala, Sentry delivers advanced authorization controls to enable multi-user applications and cross-functional processes for enterprise data sets.

Apache Software Foundation (ASF)

A non-profit corporation that supports various open source software products, including Apache Hadoop and related projects on which Cloudera products are based. Apache projects are developed by teams of collaborators and protected by an ASF license that provides legal protection to volunteers who work on Apache products and protects the Apache brand name.

Apache projects are characterized by a collaborative, consensus-based development process and an open and pragmatic software license. Each project is managed by a self-selected team of technical experts who are active contributors to the project.

Cloudera employees are major contributors to many Apache projects.

Apache Spark

A general framework for distributed computing that offers exceptional performance for both iterative and interactive processing. Spark exposes user-friendly APIs for Java, Scala, and Python.

Apache Sqoop

A tool for efficiently transferring bulk data between Hadoop and external structured data stores such as relational databases. Sqoop imports the contents of tables into HDFS, Apache Hive, and Apache HBase and generates Java classes that enable users to interpret the table's schema. Sqoop can also extract data from Hadoop storage and export records from HDFS to external structured data stores such as relational databases and enterprise data warehouses.

There are two versions of Sqoop: Sqoop and Sqoop 2. Sqoop requires client-side installation and configuration. Sqoop 2 is a web-based service with a client command-line interface. With Sqoop 2, connectors and database drivers are configured on the server.

Apache Thrift

An interface definition language, runtime library, and a code generation engine to build services that can be invoked from many languages. Thrift can be used for serialization and RPC, but within Hadoop is mainly used for RPC.

Apache Whirr

A set of libraries for running applications on cloud services. Whirr can be used to run CDH clusters on services such as Amazon Elastic Compute Cloud (Amazon EC2); a working cluster starts immediately when the appropriate command is issued – it is not necessary to install the CDH packages in the cloud or do any configuration first. This is ideal for running temporary Hadoop clusters as proof-of-concept or training exercises. The cluster and all its data can be destroyed with a single command when it is no longer needed.

Apache ZooKeeper

A centralized service for maintaining configuration information, naming, and providing distributed synchronization and group services.
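
For illustration, a minimal sketch that stores and reads back a piece of shared configuration through the ZooKeeper Java client; the ensemble address and znode path are hypothetical:

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZooKeeperExample {
      public static void main(String[] args) throws Exception {
        final CountDownLatch connected = new CountDownLatch(1);
        // Connect to a ZooKeeper ensemble; wait until the session is live.
        ZooKeeper zk = new ZooKeeper("zkhost:2181", 3000, new Watcher() {
          public void process(WatchedEvent event) {
            if (event.getState() == Event.KeeperState.SyncConnected) {
              connected.countDown();
            }
          }
        });
        connected.await();
        // Store a small piece of shared configuration under a znode.
        zk.create("/config-demo", "value".getBytes(),
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        // Any client in the cluster can now read it back.
        byte[] data = zk.getData("/config-demo", false, null);
        System.out.println(new String(data));
        zk.close();
      }
    }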

authentication

The function of confirming the identity of a person or software program.

authorization

The function of specifying access rights to resources.

Avro

See Apache Avro.

Beeswax

A Hue application that enables you to perform queries on Apache Hive. You can create Hive tables, load data, and run and manage Hive queries.

big data

Data sets whose input/output velocity, variety of data structure, and volume are beyond the capabilities of systems designed for smaller data sets to capture, manage, and process within a tolerable elapsed time. Big data sizes are an expanding target, currently ranging from terabytes to many petabytes of data in a single data set.

BigTable

A compressed, high performance, column-oriented database built on Google File System (GFS). The BigTable design was the inspiration for Apache HBase and Apache Accumulo, but the implementation, unlike those of other Google projects such as Protocol Buffers, is proprietary.

Bigtop

See Apache Bigtop.

CDH

Cloudera's Apache Hadoop distribution, containing core Hadoop (HDFS, MapReduce, YARN) and the following related projects: Apache Avro, Apache Flume, Fuse-DFS, Apache HBase, Apache Hive, Hue, Cloudera Impala, Apache Mahout, Apache Oozie, Apache Pig, Cloudera Search, Apache Sentry, Apache Spark, Apache Sqoop, Apache Whirr, Apache ZooKeeper, DataFu, and Kite.

CDH is free, 100% open source, and is licensed under the Apache 2.0 license. CDH is supported on many Linux distributions.

Cloudera Enterprise

  • Basic Edition offers an enterprise-ready distribution of Apache Hadoop together with Cloudera Manager and other advanced management tools and technical support.
  • Flex Edition offers an enterprise-ready distribution of Apache Hadoop together with Cloudera Manager and other advanced management tools and technical support. In addition, the Flex Edition includes open source indemnification, an optional premium support extension for mission-critical environments, and your choice of one of the following advanced components: Apache Accumulo or Apache HBase for online NoSQL storage and applications; Cloudera Impala for interactive analytic SQL queries; Cloudera Search for interactive search; Cloudera Navigator for data management including data auditing, lineage, and discovery; or Apache Spark for interactive analytics and stream processing.
  • Data Hub Edition offers an enterprise-ready distribution of Apache Hadoop together with Cloudera Manager and other advanced management tools and technical support. In addition, the Data Hub Edition includes open source indemnification, an optional premium support extension for mission-critical environments, and unlimited use of all of the following advanced components: Apache Accumulo and Apache HBase for online NoSQL storage and applications; Cloudera Impala for interactive analytic SQL queries; Cloudera Search for interactive search; Cloudera Navigator for data management including data auditing, lineage, and discovery; and Apache Spark for interactive analytics and stream processing.

Cloudera Express

A free download that contains CDH (an enterprise-ready distribution of Apache Hadoop, Apache HBase, Cloudera Impala, Cloudera Search, and Apache Spark) and Cloudera Manager, which offers robust cluster management capabilities such as automated deployment, centralized administration, monitoring, and diagnostic tools. Cloudera Express enables data-driven enterprises to evaluate Apache Hadoop.

Cloudera Impala

A service that enables real-time querying of data stored in HDFS or Apache HBase. It supports the same metadata and ODBC and JDBC drivers as Apache Hive and a query language based on the Hive Standard Query Language (HiveQL). To avoid latency, Impala circumvents MapReduce to directly access the data through a specialized distributed query engine that is similar to those found in commercial parallel RDBMS.

Cloudera Manager

An end-to-end management application for CDH, Cloudera Impala, and Cloudera Search. Cloudera Manager enables administrators to easily and effectively provision, monitor, and manage Hadoop clusters and CDH installations. Cloudera Manager is available in two versions: Cloudera Express and Cloudera Enterprise.

Cloudera Navigator

The first fully integrated data management tool for the Apache Hadoop platform, Cloudera Navigator provides the following data governance capabilities:
  • Auditing features that allow administrators to configure, collect, and view audit events, to understand who accessed what data and how. Cloudera Navigator also allows administrators to generate reports that list the HDFS access permissions granted to groups.
  • Metadata management features that allow DBAs, data modelers, business analysts, and data scientists to search for, amend the properties of, and tag data objects.

Cloudera Search

The first fully integrated search tool for the Apache Hadoop platform, Cloudera Search integrates Apache Solr (including Apache Lucene, SolrCloud, and Apache Tika) with CDH. Cloudera Search includes additions that make searching more scalable, easier to use, and optimized for both near-real-time and batch-oriented indexing.

cluster, Hadoop

A set of computers or racks of computers that stores data in an HDFS filesystem and runs MapReduce and other processes on that data. A pseudo-distributed cluster is a CDH installation run on a single machine, useful for demonstrations and individual study.

compression

A mechanism to reduce the size of a file so it takes up less disk space for storage and consumes less network bandwidth when transferred. The common compression tools used with Apache Hadoop include gzip, bzip2, Snappy, and LZO.
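
For illustration, a minimal sketch that selects a codec by file extension (.gz for gzip, .bz2 for bzip2, and so on) and reads a compressed file through the Hadoop Java API; the input path is hypothetical:

    import java.io.InputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    public class DecompressExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/data/logs.gz");  // hypothetical input file
        // The factory maps the file extension to the matching codec.
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(path);
        InputStream in = codec.createInputStream(fs.open(path));
        // 'in' now yields the decompressed bytes of the file.
        in.close();
      }
    }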

connector

Usually refers to software for connecting external systems with Apache Hadoop. Some connectors work with Apache Sqoop to enable efficient data transfer between an external system and Hadoop. Other connectors translate ODBC driver calls from business intelligence systems into HiveQL queries.

The JDBC drivers supported by Cloudera Manager are also referred to as connectors.

Crunch

See Apache Crunch.

data science

A discipline that builds on techniques and theories from many fields, including mathematics, statistics, and computer science with the goal of extracting meaning from data and creating data products.

DataFu

A collection of Apache Pig user-defined functions (UDFs) for statistical analysis.

data store

A repository of a set of integrated information objects. Data stores include repositories such as databases and files.

DDL

A category of SQL statements that change schema objects, such as databases and tables, rather than table data. Includes all the CREATE, ALTER, and DROP statements.

distributed computing

A field of computer science that studies distributed systems.

distributed system

A system composed of multiple autonomous computers that communicate through a computer network.

DML

A category of SQL statements that change table data, such as INSERT and LOAD DATA.

Enterprise Data Hub

An enterprise data hub (EDH), built on Apache Hadoop, provides a single central system for the storage and management of all data in the enterprise. An EDH runs the full range of workloads that enterprises require, including batch processing, interactive SQL, enterprise search, and advanced analytics, together with the integrations to existing systems, robust security, governance, and data protection.

Extract, Load, Transform (ELT)

See Extract, Transform, Load (ETL).

Extract, Transform, Load (ETL)

A process that involves extracting data from sources, transforming the data to fit operational needs, and loading the data into the end target, typically a database or data warehouse.

fault tolerant design

A design that enables a system to continue operation, possibly at a reduced level rather than failing completely, when some part of the system fails.

Flume

See Apache Flume.

Fuse-DFS

A service that allows HDFS to be mounted on Linux and accessed using standard filesystem tools.

Giraph

See Apache Giraph.

Hadoop

See Apache Hadoop.

Hadoop Distributed File System (HDFS)

A user space filesystem designed for storing very large files with streaming data access patterns, running on clusters of industry-standard machines. HDFS defines three components:

  • NameNode - maintains the namespace tree for HDFS and a mapping of file blocks to DataNodes where the data is stored. A simple HDFS cluster can have only one primary NameNode, supported by a secondary NameNode that periodically compresses the NameNode edits log file that contains a list of HDFS metadata modifications. This reduces the amount of disk space consumed by the log file on the NameNode, which also reduces the restart time for the primary NameNode. A high availability cluster contains two NameNodes: active and standby.
  • DataNode - stores data in a Hadoop cluster and is the name of the daemon that manages the data. File data is replicated on multiple DataNodes for reliability and so that localized computation can be executed near the data.
  • JournalNode - maintains a directory to log the modifications to the namespace metadata when using the Quorum-based Storage mechanism for providing high availability. In the event of a failover, the standby NameNode ensures that it has applied all of the edits from the JournalNodes before promoting itself to the active state.
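
For illustration, a minimal client sketch that writes a file through the HDFS Java API; the NameNode address comes from the cluster configuration (core-site.xml), and the file path is hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Create a file; its blocks are replicated across DataNodes.
        FSDataOutputStream out = fs.create(new Path("/user/demo/hello.txt"));
        out.writeUTF("hello, HDFS");
        out.close();
        System.out.println(fs.exists(new Path("/user/demo/hello.txt")));
      }
    }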

Hadoop User Group (HUG)

A user group focused on the use of Hadoop technology.

Hadoop World

An industry conference for Apache Hadoop users, contributors, administrators, and application developers.

HBase

See Apache HBase.

HBaseCon

An industry conference for Apache HBase users, contributors, administrators, and application developers.

high availability (HA)

A system design and implementation that keeps a service available at all times in the face of failures, without regard to performance.

Hive

See Apache Hive.

HiveServer

A server process that supports clients that connect to Hive over an Apache Thrift connection. The name also refers to a Thrift protocol used by both Impala and Hive.

HiveServer2

A server process that supports clients that connect to Hive over a network connection. These clients may be native command-line editors or applications and tools that use an ODBC or JDBC driver. The name also refers to a Thrift protocol used by both Impala and Hive.
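
For illustration, a minimal sketch of a JDBC client connecting to HiveServer2; the host, credentials, and table name are hypothetical:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcExample {
      public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Connect to a hypothetical HiveServer2 instance on the default port.
        Connection conn = DriverManager.getConnection(
            "jdbc:hive2://hs2host:10000/default", "user", "");
        Statement stmt = conn.createStatement();
        // The query runs as a batch job on the cluster.
        ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM web_logs");
        while (rs.next()) {
          System.out.println(rs.getLong(1));
        }
        conn.close();
      }
    }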

HiveQL

The name of the SQL dialect used by the Hive component. It uses a syntax that is similar to standard SQL to execute MapReduce jobs on HDFS. HiveQL does not support all SQL functionality: transactions and materialized views are not supported, and support for indexes and subqueries is limited. It also supports features that are not part of standard SQL, such as multitable inserts and CREATE TABLE AS SELECT.

Internally, a compiler translates HiveQL statements into a directed acyclic graph of MapReduce jobs, which are submitted to Hadoop for execution. Beeswax, which is included in Hue, provides a graphical front-end for HiveQL queries.

Hue

A platform for building custom GUI applications for CDH services, and a tool containing the following built-in applications: a job submission application; Apache Pig, Apache HBase, and Sqoop 2 shells; an Apache Pig editor; the Beeswax Hive UI; a Cloudera Impala query editor; a Solr Search application; a Hive metastore manager; an Oozie application editor, scheduler, and submitter; an Apache HBase browser; a Sqoop 2 application; an HDFS file manager; and a MapReduce and YARN job browser.

Impala

See Cloudera Impala.

Incubator

See Apache Incubator.

index

A data structure that improves the speed of data retrieval operations on a database table at the cost of slower writes and increased storage space. Indexes can be created using one or more columns of a database table, providing the basis for both rapid random lookups and efficient access of ordered records.

JDBC driver

A client-side adapter that implements the JDBC Java programming language API for accessing relational database management systems.

JobTracker

See MapReduce v1 (MRv1).

Kerberos

A computer network authentication protocol that works on the basis of "tickets" to allow nodes communicating over an insecure network to prove their identity to one another in a secure manner. See the CDH 4 Security Guide and Cloudera Security for information on which components support Kerberos.

Kite

A collection of libraries, tools, examples, and documentation engineered to simplify the most common tasks when working with CDH. Just like CDH, Kite is 100% free, open source, and licensed under the Apache License v2, so you can use the code any way you choose in your existing commercial code base or open source project.

latency

A measure of time delay experienced in a system.

Linux

A Unix-like computer operating system assembled under the model of free, open source software development and distribution. Linux is a leading operating system on servers, mainframe computers, supercomputers, and embedded systems such as mobile phones, tablets, network routers, televisions, and video game consoles. The major distributions of enterprise Linux are CentOS, Debian, RHEL, SLES, and Ubuntu.

LZO

A free, open source compression library. LZO compression provides a good balance between data size and speed of compression. The LZO compression algorithm is among the most CPU-efficient of the codecs. Its compression ratios are not as good as those of other codecs, but its output is still significantly smaller than the uncompressed files. Further, unlike some other formats, LZO-compressed files are splittable, enabling MapReduce to process splits in parallel.

LZO is published under the GNU General Public License and so is not included in CDH but can be used with CDH components; the Cloudera public Git repository hosts the hadoop-lzo package that provides a version of LZO that can be used with CDH.

Mahout

See Apache Mahout.

MapReduce

A distributed processing framework for processing and generating large data sets and an implementation that runs on large clusters of industry-standard machines.

The processing model defines two types of functions: a map function that processes a key-value pair to generate a set of intermediate key-value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

A MapReduce job partitions the input data set into independent chunks which are processed by the map functions in a parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce functions. Typically both the input and the output of the job are stored in a distributed filesystem.

The implementation provides an API for configuring and submitting jobs, job scheduling and management services, a library of search, sort, index, inverted index, and word co-occurrence algorithms, and the runtime. The runtime system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication.
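
As an illustration, a minimal word count sketch using the Hadoop Java API: the map function emits (word, 1) intermediate pairs, and the reduce function merges the values for each intermediate key by summing them. The input and output paths are supplied as hypothetical command-line arguments.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
      // Map: emit (word, 1) for every word in the input line.
      public static class TokenMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
              word.set(token);
              context.write(word, ONE);
            }
          }
        }
      }

      // Reduce: sum the counts for each word (intermediate key).
      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        protected void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }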

MapReduce v1 (MRv1)

The runtime framework on which MapReduce jobs execute. It defines two daemons:

  • JobTracker - coordinates running MapReduce jobs and provides resource management and job life-cycle management. In YARN, those functions are performed by two separate components.
  • TaskTracker - runs the tasks that the MapReduce jobs have been split into.

MapReduce v2 (MRv2)

See YARN.

Maven

See Apache Maven.

Navigator

See Cloudera Navigator.

ODBC driver

A client-side adapter that implements a standard C programming language API for accessing relational database management systems.

Oozie

See Apache Oozie.

Oryx

An open-source project that provides a simple, real-time, large-scale machine learning and predictive analytics infrastructure. Oryx implements algorithms commonly used in business applications: collaborative filtering and recommendation, classification and regression, and clustering. Using Apache Hadoop, it can continuously build models from a data stream. It also serves queries of those models in real time through an HTTP REST API, and can update models based on new streaming data.

Parquet

An open source, column-oriented binary file format for Hadoop that supports very efficient compression and encoding schemes. Parquet allows compression schemes to be specified on a per-column level, and is future-proofed to allow adding more encodings as they are invented and implemented. Encoding and compression are separated, allowing Parquet consumers to implement operators that work directly on encoded data without paying a decompression and decoding penalty, when possible.

petabyte

10^15 bytes; 1,000 terabytes or 1,000,000 gigabytes.

Pig

See Apache Pig.

Quorum-based Storage

A mechanism for enabling a standby NameNode to keep its state synchronized with the active NameNode, in which both nodes communicate with a group of daemons called JournalNodes.

RegionServer

In Apache HBase, applications store data in labeled tables, which are partitioned horizontally into regions. A RegionServer is responsible for managing one or more regions.

relational database management system (RDBMS)

A database management system based on the relational model. In the relational model, all data is represented in terms of tuples, grouped into relations. Most implementations of the relational model use the SQL data definition and query language.

Sentry

See Apache Sentry.

serialization

The process of converting a data structure or object state into a format that can be stored (for example, in a file or memory buffer, or transmitted across a network connection). Deserialization is the process of converting it back to the original state later in the same or another computer environment. See Apache Avro and Apache Thrift.
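
For illustration, a minimal sketch using Java's built-in object serialization; systems such as Avro and Thrift serve the same purpose with compact, language-independent encodings:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;

    public class SerializationExample {
      public static void main(String[] args) throws Exception {
        // Serialize: object state -> bytes suitable for a file or network.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bytes);
        out.writeObject("some state");
        out.close();

        // Deserialize: bytes -> an equivalent object, possibly restored
        // later or on another machine.
        ObjectInputStream in =
            new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        String restored = (String) in.readObject();
        System.out.println(restored);
      }
    }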

service

Used in two senses in Cloudera documentation:
  • Linux - runs a System V init script in /etc/init.d/ in as predictable an environment as possible, removing most environment variables and setting the current working directory to /.
  • Cloudera Manager - a category of managed functionality, such as HDFS, MapReduce, or HBase, that Cloudera Manager deploys and monitors on a cluster; see the Cloudera Manager documentation for the full definition.

Snappy

A compression library. Snappy aims for very high speeds and reasonable compression rather than maximum compression or compatibility with other compression libraries. Snappy is provided in the Hadoop package along with the other native libraries (such as native gzip compression).

Spark

See Apache Spark.

SQL

A declarative programming language designed for managing data in relational database management systems. It includes features for creating schema objects such as databases and tables, and for querying and modifying data. CDH includes SQL support through Impala for high-performance interactive queries, and Hive for long-running batch-oriented jobs.

Sqoop

See Apache Sqoop.

TaskTracker

See MapReduce v1 (MRv1).

terabyte

10^12 bytes; 1,000 gigabytes.

Thrift

See Apache Thrift.

Whirr

See Apache Whirr.

YARN (Yet Another Resource Negotiator)

A general architecture for running distributed applications. YARN specifies the following components:

  • ResourceManager - manages the global assignment of compute resources to applications.
  • ApplicationMaster - manages the life cycle of applications.
  • NodeManager - launches and monitors the compute containers on machines in the cluster.
  • JobHistory Server - keeps track of completed applications.

The application master negotiates with the resource manager for cluster resources – described in terms of a number of containers, each with a certain memory limit – and then runs application-specific processes in those containers. The containers are overseen by node managers running on cluster nodes, which ensure that the application does not use more resources than it has been allocated.

MapReduce v2 (MRv2) is implemented as a YARN application.

ZooKeeper

See Apache ZooKeeper.
