Data Engineering clusters
Learn about the default Data Engineering clusters, including cluster definition and template names, included services, and compatible Runtime version.
Data Engineering provides a complete data processing solution, powered by Apache Spark and Apache Hive. Spark and Hive enable fast, scalable, fault-tolerant data engineering and analytics over petabytes of data.
Data Engineering cluster definition
This Data Engineering template includes a standalone deployment of Spark and Hive, as well as Apache Oozie for job scheduling and orchestration, Apache Livy for remote job submission, and Hue and Apache Zeppelin for job authoring and interactive analysis.
- Cluster definition names
  - Data Engineering for Google Cloud
  - Data Engineering HA - Spark3 for Google Cloud
  - Data Engineering Spark3 for Google Cloud
- Cluster template name
  - Data Engineering: Apache Spark3, Apache Hive, Apache Oozie
  - Data Engineering: HA: Apache Spark3, Apache Hive, Apache Oozie (see the architectural information below for the Data Engineering HA clusters)
  - Data Engineering: Apache Spark3, Apache Hive, Apache Oozie
- Included services
  - HDFS
  - Hive
  - Hue
  - Livy
  - Spark 3
  - YARN
  - Zeppelin
  - ZooKeeper
  - Oozie is supported for Spark 3 as of Runtime version 7.2.18.
  - Hive Warehouse Connector is supported as of Runtime version 7.2.16.
- Compatible runtime version
  - 7.2.16, 7.2.17, 7.2.18
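
The version notes in the list above can be read as a small lookup: given a Runtime version, which of the version-gated features apply. The following is a minimal sketch of that reading, not part of any Cloudera tooling. The names `COMPATIBLE_RUNTIMES`, `FEATURE_MIN_RUNTIME`, `parse_version`, and `supported_features` are hypothetical; only the version numbers come from the list.

```python
# Sketch: feature availability by Runtime version, using only the version
# numbers stated in the list above. All names here are hypothetical.

COMPATIBLE_RUNTIMES = ("7.2.16", "7.2.17", "7.2.18")

# Minimum Runtime version at which each feature is supported.
FEATURE_MIN_RUNTIME = {
    "Oozie for Spark 3": "7.2.18",
    "Hive Warehouse Connector": "7.2.16",
}


def parse_version(version: str) -> tuple:
    """Turn '7.2.18' into (7, 2, 18) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def supported_features(runtime: str) -> list:
    """Return the version-gated features available on the given Runtime."""
    if runtime not in COMPATIBLE_RUNTIMES:
        raise ValueError(f"Runtime {runtime} is not listed as compatible")
    current = parse_version(runtime)
    return [
        feature
        for feature, minimum in FEATURE_MIN_RUNTIME.items()
        if current >= parse_version(minimum)
    ]


if __name__ == "__main__":
    for runtime in COMPATIBLE_RUNTIMES:
        print(runtime, "->", supported_features(runtime))
```

Comparing parsed tuples rather than raw strings keeps the check correct if a later version has a two-digit component that would sort incorrectly as text.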
Topology of the Data Engineering cluster
Topology is a set of host groups that are defined in the cluster template and cluster definition used by Data Engineering. Data Engineering uses the following topology:
Host group | Description | Node configuration |
---|---|---|
Master | The master host group runs the components for managing the cluster resources, including Cloudera Manager (CM), NameNode, and ResourceManager, as well as other master components such as HiveServer2, HMS, and Hue. | 1 node. Runtime versions earlier than 7.2.14: GCP: e2-standard-16; pd-ssd, 100 GB. Runtime versions 7.2.14+ (DE, DE Spark3, and DE HA): GCP: e2-standard-16; pd-ssd, 100 GB |
Worker | The worker host group runs the components that execute processing tasks (such as NodeManager) and store data in HDFS (such as DataNode). | 3 nodes. Runtime versions earlier than 7.2.14: GCP: e2-standard-8; pd-ssd, 100 GB. Runtime versions 7.2.14+ (DE and DE Spark3): GCP: e2-standard-8; pd-ssd, 100 GB. DE HA: GCP: e2-standard-8; pd-ssd, 100 GB |
Compute | The compute host group can optionally be used for running data processing tasks (such as NodeManager). By default, the number of compute nodes is set to 1 for proper configuration of YARN containers. This host group can be scaled down to 0 when there are no compute needs. Additionally, if load-based autoscaling is enabled with the minimum count set to 0, the compute host group is resized to 0 automatically. | 0+ nodes. Runtime versions earlier than 7.2.14: GCP: e2-standard-8; pd-ssd, 100 GB. Runtime versions 7.2.14+ (DE and DE Spark3): GCP: e2-standard-8; pd-ssd, 100 GB. DE HA: GCP: e2-standard-8; pd-ssd, 100 GB |
Gateway | The gateway host group can optionally be used for connecting to cluster endpoints such as Oozie and Beeline. This host group does not run any critical services and resides in the same subnet as the other host groups. If additional software binaries are required, they can be installed using recipes. | 0+ nodes. GCP: e2-standard-8; pd-ssd, 100 GB |
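
As a worked example of the topology table, the sketch below totals nodes, vCPUs, and disk for the default layout: 1 master, 3 workers, 1 compute node (the stated default), and no gateway. It relies on the GCP naming convention that an `e2-standard-N` machine type has N vCPUs. The `HostGroup` class and the `vcpus` and `totals` helpers are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostGroup:
    name: str
    nodes: int         # node count used for this estimate
    machine_type: str  # GCP machine type from the table
    disk_gb: int       # pd-ssd size per node from the table


def vcpus(machine_type: str) -> int:
    """e2-standard-N machine types have N vCPUs, so read N from the name."""
    family, _, size = machine_type.rpartition("-")
    if family != "e2-standard":
        raise ValueError(f"unexpected machine type: {machine_type}")
    return int(size)


# Default layout taken from the table; compute defaults to 1 and gateway to 0.
DEFAULT_TOPOLOGY = [
    HostGroup("master", 1, "e2-standard-16", 100),
    HostGroup("worker", 3, "e2-standard-8", 100),
    HostGroup("compute", 1, "e2-standard-8", 100),
    HostGroup("gateway", 0, "e2-standard-8", 100),
]


def totals(groups):
    """Return (total nodes, total vCPUs, total disk in GB)."""
    nodes = sum(g.nodes for g in groups)
    cpu = sum(g.nodes * vcpus(g.machine_type) for g in groups)
    disk = sum(g.nodes * g.disk_gb for g in groups)
    return nodes, cpu, disk


if __name__ == "__main__":
    print(totals(DEFAULT_TOPOLOGY))  # (5, 48, 500)
```

With the default counts this gives 5 nodes, 16 + 3 × 8 + 8 = 48 vCPUs, and 500 GB of pd-ssd. Scaling the compute group to 0 removes 8 vCPUs and 100 GB from those totals.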

Service configurations

Host group | Service configurations |
---|---|
Master host group | CM, HDFS, Hive (on Tez), HMS, YARN RM, Oozie, Hue, DAS, ZooKeeper, Livy, Zeppelin, and Sqoop |
Gateway host group | Configurations for the services on the master node |
Worker host group | DataNode and YARN NodeManager |
Compute group | YARN NodeManager |
Configurations
Note the following:
- There is a Hive Metastore Service (HMS) running in the cluster that talks to the same database instance as the Data Lake in the environment.
- If you use the CLI to create the cluster, you can optionally pass an argument to create an external database for cluster services such as CM, Oozie, Hue, and DAS. By default, this database is embedded in the master node external volume. If you specify the external database to be of type HA or NON_HA, the database is provisioned in the cloud provider. For all of these database types, the lifecycle is tied to the cluster, so deleting the cluster also deletes the database.
- The HDFS in this cluster is for storing intermediate processing data. For resiliency, store your data in the cloud object stores.
- For high availability requirements, choose the Data Engineering High Availability cluster shape.
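
The external database option described above could be passed on the command line roughly as sketched below. This is an illustration under assumptions, not a verified invocation. The flag names `--cluster-name`, `--environment-name`, `--cluster-definition-name`, and `--datahub-database` are assumptions recalled from the CDP CLI and should be confirmed in the CLI help before use. The function name and the example cluster and environment names are hypothetical. Only the `HA` and `NON_HA` values come from the text.

```python
import shlex

# External database types named in the text; omitting the option keeps the
# default database embedded in the master node external volume.
EXTERNAL_DB_TYPES = {"HA", "NON_HA"}


def create_cluster_command(cluster_name, environment_name, definition_name,
                           database_type=None):
    """Build an argument list for creating a Data Engineering cluster.

    All flag names below are assumptions about the CDP CLI; check the CLI
    help for the exact spelling before running the command.
    """
    args = [
        "cdp", "datahub", "create-gcp-cluster",
        "--cluster-name", cluster_name,
        "--environment-name", environment_name,
        "--cluster-definition-name", definition_name,
    ]
    if database_type is not None:
        if database_type not in EXTERNAL_DB_TYPES:
            raise ValueError(f"unsupported database type: {database_type}")
        # Assumed flag name for selecting the external database type.
        args += ["--datahub-database", database_type]
    return args


if __name__ == "__main__":
    cmd = create_cluster_command(
        "de-cluster",                                # hypothetical cluster name
        "my-environment",                            # hypothetical environment
        "Data Engineering Spark3 for Google Cloud",  # definition name from this page
        database_type="NON_HA",
    )
    print(shlex.join(cmd))
```

Whichever type is chosen, the database shares the cluster's lifecycle, so it should not be used for anything that must outlive the cluster.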
GCP HA (Preview)
Custom templates
Any custom DE HA template that you create must be forked from the default templates of the corresponding version. You must create a custom cluster definition for this with the JSON parameter `"enableLoadBalancers": true`, using the `create-aws/azure/gcp-cluster` CLI command parameter `--request-template`. Support for pre-existing custom cluster definitions will be added in a future release. As with the template, the custom cluster definition must be forked from the default cluster definition. You are allowed to modify the instance types and disks in the custom cluster definition. You must not change the placement of the services like Cloudera Manager, Oozie, and Hue. Currently the custom template is fully supported only via CLI.
The simplest way to change the DE HA definition is to create a custom cluster definition. If you instead click Advanced Options in the Create Data Hub UI, the default definition is not fully used, which causes issues in the HA setup.
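
The forking rules above can be sketched as a small script: copy the default definition, set `enableLoadBalancers` to true, and change only instance types and disks. This is a minimal sketch under assumptions. The JSON layout (a top-level `instanceGroups` list with `name`, `instanceType`, `volumeSize`, and `volumeType` keys, and `enableLoadBalancers` at the top level) is a hypothetical stand-in for the real default definition, which should be exported and inspected first. The function name and file name are also hypothetical.

```python
import copy
import json

# Only instance types and disks may change; these key names are assumptions
# about the definition layout, not confirmed field names.
ALLOWED_OVERRIDE_KEYS = {"instanceType", "volumeSize", "volumeType"}


def fork_ha_definition(default_definition: dict, overrides: dict) -> dict:
    """Return a forked copy of a default DE HA cluster definition.

    overrides maps a host group name to the instance or disk settings to
    change, for example {"worker": {"instanceType": "e2-standard-16"}}.
    """
    custom = copy.deepcopy(default_definition)  # leave the default untouched
    custom["enableLoadBalancers"] = True        # required by the text above
    for group in custom.get("instanceGroups", []):
        changes = overrides.get(group.get("name"), {})
        unexpected = set(changes) - ALLOWED_OVERRIDE_KEYS
        if unexpected:
            raise ValueError(f"not allowed to change: {sorted(unexpected)}")
        group.update(changes)
    return custom


if __name__ == "__main__":
    # Hypothetical, heavily trimmed stand-in for an exported default definition.
    default = {
        "instanceGroups": [
            {"name": "master", "instanceType": "e2-standard-16", "volumeSize": 100},
            {"name": "worker", "instanceType": "e2-standard-8", "volumeSize": 100},
        ]
    }
    custom = fork_ha_definition(default, {"worker": {"volumeSize": 200}})
    print(json.dumps(custom, indent=2))
    # The saved JSON would then be passed to the CLI, for example (the file://
    # prefix and file name are assumptions):
    #   cdp datahub create-gcp-cluster --request-template file://custom-de-ha.json
```

Rejecting any override outside the allowed keys is one way to guard against accidentally editing fields that affect where Cloudera Manager, Oozie, or Hue are placed.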