Overview of Cloudera and the Cloudera Documentation Set

Cloudera Enterprise is a modern platform for machine learning and analytics, optimized for the cloud to be:

  • Unified

    Bring your data warehouse, data science, data engineering, and operational database workloads together on a single integrated platform. The Cloudera Shared Data Experience (SDX) enables these diverse analytic processes to operate against a shared data catalog that preserves business context like security and governance policies and schema. This common services framework persists even in transient cloud environments and makes it easier for IT departments to set and enforce policies while enabling business access to self-service analytics.

  • Hybrid

    Work where and how it’s most convenient, affordable, and effective. Cloudera Enterprise can read directly from and write directly to cloud object stores like Amazon S3 (AWS S3) and Azure Data Lake Store (Microsoft ADLS) as well as on-premises storage environments, or HDFS and Kudu on IaaS (infrastructure as a service). This provides flexibility to work on the data that you want wherever it lives, with no copies and no moves. Cloudera Enterprise also provides the most popular data warehouse and machine learning engines that can run on any compute resource for ultimate deployment flexibility. Cloudera hybrid control means users can self-service by way of a PaaS (platform as a service) offering, or choose more options to configure and manage the platform by way of an IaaS offering, private cloud, or an on-premises deployment.

  • Enterprise-grade

    Cloudera Enterprise is where the scale and performance required for modern data workloads meet the security and governance demanded by today’s IT departments. This platform makes it easy to bring more users (thousands of them) to petabytes of diverse data, and it provides industry-leading engines to process and query data and to develop and serve data models quickly. The platform also provides several layers of fine-grained security and a complete audit capability that prevent unauthorized data access and demonstrate accountability for actions taken.

This Getting Started guide provides a general overview of Cloudera enterprise solutions and their documentation. The same set of integrated enterprise products and tools offered for on-premises deployments is also offered in the cloud with Cloudera Altus.

Data Warehouse

Cloudera’s modern Data Warehouse powers high-performance BI and data warehousing both in on-premises deployments and as a cloud service. Business users can explore and iterate on data quickly, run new reports and workloads, or access interactive dashboards without assistance from the IT department. In addition, IT can eliminate the inefficiencies of “data silos” by consolidating data marts into a scalable analytics platform to better meet business needs. With its open architecture, data can be accessed by more users and more tools, including data scientists and data engineers, providing more value at a lower cost.

  • Apache Impala

    Distributed interactive SQL query engine for BI and SQL analytics on data in cloud object stores (AWS S3, Microsoft ADLS), Apache Kudu (for updating data), or on HDFS.

  • Apache Hive on Spark

    Provides the fastest ETL/ELT at scale so you can prepare data for BI and reporting.

  • SQL Development Workbench (HUE)

    Supports thousands of SQL developers, running millions of queries each week.

  • Workload XM

    Provides unique insights on workloads to support predictable offloads, query analysis and optimizations, and efficient utilization of cluster resources.

  • Cloudera Navigator

    Enables trusted data discovery and exploration, and curation based on usage needs.

Data Science

Only Cloudera offers a modern enterprise platform, tools and expert guidance to help you unlock business value with machine learning and AI. Cloudera’s modern platform for machine learning and analytics, optimized for the cloud, lets you build and deploy AI solutions at scale, efficiently and securely, anywhere you want. Cloudera Fast Forward Labs expert guidance helps you realize your AI future, faster.

  • Cloudera Data Science Workbench (CDSW)

    Accelerate data science from research to production on a collaborative platform for machine learning and AI. CDSW provides on-demand access to runtimes for R, Python, and Scala, plus high-performance integration with Apache Spark and secure connectivity to CDH. For deep learning and other demanding data science techniques, CDSW supports GPU-accelerated computing, so data scientists can use deep learning frameworks like TensorFlow, Apache MXNet, Keras, and more.

  • Apache Spark

    Provides flexible, in-memory data processing, reliable stream processing, and rich machine learning tooling for Hadoop.

  • Cloudera Fast Forward Labs

    Cloudera Fast Forward Labs helps you design and execute your enterprise machine learning strategy, enabling rapid, practical application of emerging machine learning technologies to your business. In addition, Cloudera Professional Services offer proven delivery of scalable, production-grade machine learning systems.

Data Engineering

Cloudera Data Engineering supports the foundational workloads of your big data journey: fast and flexible ETL data processing workloads, and workloads that train machine learning models at scale. These workloads can be deployed on-premises or in the cloud.

  • Apache Spark, Spark Streaming, Spark MLlib, Spark SQL, and Hive on Spark

    Cloudera offers a modern platform for fast, flexible data processing of batch, real-time, and streaming workloads. Because Apache Spark ingests data, performs analytics on it, and then writes the results out to disk in one operation, advanced processing jobs can complete significantly faster than with traditional technology.

  • Altus Data Engineering

    Cloudera Enterprise is the comprehensive platform for data science and data engineering in the public cloud, whether users are launching multiple workloads in a multi-tenant environment or designing jobs that leverage cloud infrastructure for specific jobs like ETL and exploratory data science.

  • Workload XM

    Provides unique insights on workloads to support predictable offloads, query analysis and optimizations, and efficient utilization of cluster resources.

  • Cloudera Navigator

    Provides governance and data management, including auditing, lineage, discovery, and policy lifecycle management.

  • Cloudera Navigator Encrypt & Key Trustee

    Provides advanced key management and transparent encryption of data at rest without requiring changes to your application.

  • HDFS, YARN, MapReduce, Hive, Pig, HUE, Sentry, Flume, Sqoop, Oozie, Kafka, Cloudera Manager, and Cloudera Altus Director

    Provide the basic Hadoop platform, management tools, and cloud deployment tools that support data engineering workloads on-premises and in the cloud.

Operational Database

Cloudera’s operational database delivers a secure, low-latency, high-concurrency experience that can extract the insights you need in real time from constantly changing data. The operational database brings together and processes more data of all types from more sources, including IoT, to drive business insights within a single platform designed for web scale. Real-time, batch, and interactive processing frameworks give developers a variety of tools to ensure they deliver the value your business is looking for. As data sets, data-driven applications, and data users grow, Cloudera’s operational database offers linear scalability in performance at a manageable cost.

  • Apache Spark

    Provides flexible, in-memory data processing, reliable stream processing, and rich machine learning tooling for Hadoop.

  • Apache Kudu

    Kudu is Hadoop-native storage for fast analytics on fast data. It complements the capabilities of HDFS and HBase by providing a simplified architecture for building real-time analytic applications. It is designed to take advantage of next-generation hardware developments from Intel for even faster analytic performance. Combined with Apache Impala, Kudu provides a high-performance analytic database solution; Kudu also integrates with other frameworks within Cloudera Enterprise.

  • Apache HBase

    Provides a high-performance NoSQL database built on Hadoop. Similar to HDFS, it offers flexible data storage to store any type of data in any format. HBase is designed for fast, random read/write access and can be used for real-time data serving when you have many users who need low-latency read/write capabilities. It can also be used for real-time data capture and analysis due to its semi-structured row format, high performance, and ability to store all raw and refined data. Finally, because HBase is an integrated part of the Cloudera Enterprise platform, you can manage it with Cloudera Manager, and it includes security features (including table, column, and cell-level security) that make it compliance-ready.

Run Everything in the Cloud, Multi-Cloud, or on a Hybrid "Cloud / On-Premises" Deployment

Public clouds present a compelling opportunity to make analytics more agile and self-service. However, to reduce risk and costs, it makes sense to pursue hybrid- and multi-cloud environments. Cloudera Enterprise complements public cloud services and preserves your ability to pick and choose. Our solutions offer easy job-focused features and enterprise-grade qualities like unified security and governance. In addition, our cloud solutions efficiently deliver machine learning and analytic capabilities that you can use to leverage the power of your data.

  • Cloudera Altus and Cloudera Data Engineering

    Provide a platform-as-a-service (PaaS) for machine learning and analytics on Amazon Web Services and Microsoft Azure. Targets foundational data processing jobs like ETL and pipeline development, enabling data engineers to focus on their jobs while removing the burden of cluster management.

  • Cloudera Altus Director

    Provision and manage cloud environments for Data Engineering, Data Warehouse, Operational Database, or run CDSW in the cloud. The Cloudera Shared Data Experience provides unified and persistent controls for the data catalog, governance, and security both on-premises and in multiple clouds.

Documentation Overview

The following guides are included in the Cloudera enterprise documentation set:

  • Getting Started

    Provides an introduction to Cloudera solutions and their associated documentation. Also includes a section describing how to create a Proof-of-Concept Installation where you can test applications before you deploy.

  • Enterprise Release Guide

    Comprehensive release notes, requirements, supported versions, packaging and download information, and deprecated items for the Cloudera enterprise solutions.

  • Installation

    Provides instructions for installing Cloudera software, including Cloudera Manager, CDH, and other managed services.

  • Upgrade

    Provides a complete upgrade guide for upgrading CDH and all the supporting platform software, such as the operating system, the JDK, and underlying databases.

  • Cluster Management

    Describes how to configure and manage clusters in a Cloudera enterprise deployment using Cloudera Manager. In addition, this guide shows you how to use Cloudera Manager to monitor the health of your Cloudera deployment, diagnose issues as they occur, and use/view logs and reports to troubleshoot issues related to configuration, operation, and compliance.

  • Security

    Provides information about securing your cluster by using data encryption, user authentication, and authorization techniques.

  • Governance and Metadata Management

    Provides information about using Cloudera Navigator Data Management for comprehensive data governance, compliance, data stewardship, and other data management tasks.

  • Component Guides

    Provides how-to and best-practice information for the CDH components: