Data Engineering clusters

Learn about the default Data Engineering clusters, including cluster definition and template names, included services, and compatible Runtime versions.

Data Engineering provides a complete data processing solution, powered by Apache Spark and Apache Hive. Spark and Hive enable fast, scalable, fault-tolerant data engineering and analytics over petabytes of data.

Data Engineering cluster definition

This Data Engineering template includes a standalone deployment of Spark and Hive, as well as Apache Oozie for job scheduling and orchestration, Apache Livy for remote job submission, and Hue and Apache Zeppelin for job authoring and interactive analysis.
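As an illustrative sketch of how Livy enables remote job submission, the following Python snippet posts a Spark batch job to Livy's REST batch API. The host name, JAR location, main class, and arguments are hypothetical placeholders, not values from this documentation; port 8998 is Livy's default.

    # Minimal sketch: submit a Spark batch job through the Livy REST API.
    # The host, JAR location, and class name below are placeholders.
    import requests

    livy_url = "http://livy-host.example.com:8998/batches"  # hypothetical Livy endpoint
    payload = {
        "file": "s3a://example-bucket/jobs/etl-job.jar",  # hypothetical job artifact
        "className": "com.example.EtlJob",                # hypothetical main class
        "args": ["--date", "2021-01-01"],
    }
    # The X-Requested-By header is required when Livy's CSRF protection is enabled.
    response = requests.post(livy_url, json=payload,
                             headers={"X-Requested-By": "docs-example"})
    response.raise_for_status()
    print(response.json())  # returns the batch id and state, e.g. {"id": 0, "state": "starting", ...}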

Cluster definition names
  • Data Engineering for AWS
  • Data Engineering for Azure
  • Data Engineering for GCP
  • Data Engineering HA for AWS

    See the architectural information below for the Data Engineering HA clusters.

  • Data Engineering HA for Azure (Preview)

    See the architectural information below for the Data Engineering HA clusters.

  • Data Engineering HA for GCP (Preview)
  • Data Engineering Spark3 for AWS
  • Data Engineering Spark3 for Azure
  • Data Engineering Spark3 for GCP
Cluster template names
  • CDP - Data Engineering: Apache Spark, Apache Hive, Apache Oozie
  • CDP - Data Engineering HA: Apache Spark, Apache Hive, Hue, Apache Oozie

    See the architectural information below for the Data Engineering HA clusters.

  • CDP - Data Engineering: Apache Spark3
Included services
  • Data Analytics Studio (DAS)
  • HDFS
  • Hive
  • Hue
  • Livy
  • Oozie
  • Spark
  • YARN
  • Zeppelin
  • ZooKeeper
Compatible Runtime versions
7.1.0, 7.2.0, 7.2.1, 7.2.2, 7.2.6, 7.2.7, 7.2.8, 7.2.9, 7.2.10, 7.2.11

Architecture of the Data Engineering HA for Azure cluster (Preview)

The Data Engineering HA for Azure cluster shape provides failure resilience for several of the Data Engineering HA services, including Knox, Oozie, HDFS, HS2, Hue, YARN, and HMS.

Services that do not yet run in HA mode include Cloudera Manager, DAS, Livy, and Zeppelin.

Component | Failure | User experience
--------- | ------- | ---------------
Knox | One of the Knox services is down | External users will still be able to access all of the UIs, APIs, and JDBC.
Cloudera Manager | The first node in the manager host group is down | Cluster operations (such as repair, scaling, and upgrade) will not work.
Cloudera Manager | The second node in the manager host group is down | No impact.
HMS | One of the HMS services is down | No impact.
Hue | One of the Hue services in the master host group is down | No impact.
HS2 | One of the HS2 services in the master host group is down | External users will still be able to access the Hive service via JDBC. However, if Hue was connected to that particular HS2 instance, it does not fail over to the other host; restart Hue to restore Hive functionality in Hue.
YARN | One of the YARN services is down | No impact.
HDFS | One of the HDFS services is down | No impact.
Nginx | Nginx on one of the manager hosts is down | Fifty percent of UI, API, and JDBC calls will be affected. If the entire manager node is down, there is no impact. This is caused by the forwarding and health checking performed by the network load balancer.
Oozie | One of the Oozie servers in the manager host group is down | No impact for AWS and Azure as of Cloudera Runtime version 7.2.11.

If you create a custom template for DE HA, follow these two rules (illustrated in the sketch after this list):

  1. Oozie must be in a single host group.
  2. Oozie and Hue must not be in the same host group.
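For illustration, a trimmed host template fragment that satisfies both rules is sketched below. The structure loosely follows the Cloudera Manager cluster template format; the refNames, cardinalities, and role config group names shown here are hypothetical.

    {
      "hostTemplates": [
        {
          "refName": "manager",
          "cardinality": 2,
          "roleConfigGroupsRefNames": ["oozie-OOZIE_SERVER-BASE"]
        },
        {
          "refName": "master",
          "cardinality": 2,
          "roleConfigGroupsRefNames": ["hue-HUE_SERVER-BASE"]
        }
      ]
    }

Oozie appears in exactly one host template (rule 1), and no host template lists both Oozie and Hue (rule 2).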

Custom templates

Any custom DE HA template that you create must be forked from the default template of the corresponding version. You must also create a custom cluster definition with the JSON parameter "enableLoadBalancers": true. As with the template, the custom cluster definition must be forked from the default cluster definition. You can modify the instance types and disks in the custom cluster definition, but you must not change the placement of services such as Cloudera Manager, Oozie, and Hue. Currently, custom templates are fully supported only through the CLI.
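As a minimal sketch, the relevant fragment of such a custom cluster definition might look like the following. Only "enableLoadBalancers" is taken from this documentation; the environment name is a placeholder, and everything omitted here should come from the forked default cluster definition.

    {
      "environmentName": "my-de-ha-environment",
      "enableLoadBalancers": true
    }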

The simplest way to change the DE HA definition is to create a custom cluster definition. If you instead click Advanced Options in the Create Data Hub UI, the default definition is not applied in full, which causes issues in the HA setup.