Cluster configuration best practices

Review the cluster configuration best practices.

ZooKeeper
Learn why ZooKeeper should be installed on a node where it has unobstructed access to the disk.
HDFS
Learn about the various considerations and bottlenecks when planning cluster configuration for the HDFS service.
YARN
The YARN service manages MapReduce and Spark tasks. Applications run in YARN containers, which use Linux Cgroups for resource management and process isolation.
Impala
The Impala service is a distributed, massively parallel processing (MPP) database engine that runs interactive SQL queries over large data sets. Impala performs best when it can operate on data in memory, so it is often configured with a very large heap size.
Spark
Cloudera supports Spark on YARN-managed deployments for a more flexible and consistent resource management approach.
HBase
By default, major compactions run every 7 days, with each one scheduled 7 days after the previous one finishes. As a result, the time of day at which a major compaction starts is not fixed and can overlap with production workloads. This is a problem if you want compactions to run at a known off-peak hour, such as 3 AM.
Search
Cloudera Search is a distributed search engine service based on Apache Solr. Search engines are often expected to provide fast, interactive performance, so it is important to allocate sufficient RAM to the Search service.
Oozie
Writing Oozie XML configuration files can be tedious and error-prone. Cloudera recommends that you use the Oozie editor that is embedded in Hue for creating, scheduling, and executing Oozie workflows.
Kafka
The default Kafka configuration in Cloudera Manager is designed to get development started quickly. Change several of the default settings before you deploy a Cloudera Kafka cluster in production.
Kudu
Review the partitioning guidelines and limitations before deploying the Kudu service on your cluster.
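
The HBase note above, that each major compaction is scheduled 7 days after the previous one finishes, can be illustrated with a small calculation. The following is a minimal Python sketch, not HBase code: the function name, the start date, and the 5-hour run time are hypothetical values chosen for illustration, and the model ignores any jitter or other scheduling details that HBase applies.

```python
from datetime import datetime, timedelta

# Default interval between major compactions, as stated in the HBase entry.
MAJOR_COMPACTION_INTERVAL = timedelta(days=7)


def simulate_compaction_starts(first_start, durations):
    """Return the start time of each compaction in a simplified model.

    Each compaction starts one interval after the previous one finishes,
    so the run time of every compaction pushes the next start later.
    """
    starts = []
    start = first_start
    for duration in durations:
        starts.append(start)
        finish = start + duration
        start = finish + MAJOR_COMPACTION_INTERVAL
    return starts


# Hypothetical scenario: the first compaction starts at 3 AM
# and every compaction takes 5 hours to complete.
starts = simulate_compaction_starts(
    datetime(2024, 1, 1, 3, 0),
    [timedelta(hours=5)] * 4,
)
for s in starts:
    print(s.strftime("%Y-%m-%d %H:%M"))
```

Under these assumptions the start hour moves from 03:00 to 08:00, then 13:00, then 18:00 over four cycles, which shows how a compaction that began off-peak can drift into business hours.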
