Storage Space Planning for Cloudera Manager
Minimum Required Role: Full Administrator
Cloudera Manager tracks metrics of services, jobs, and applications in many background processes. All of these metrics require storage. Depending on the size of your organization, this storage may be local or remote, disk-based or in a database, managed by you or by another team in another location.
Most system administrators are aware of common locations like /var/log/ and the need for these locations to have adequate space. This topic enables you to familiarize yourself with and plan for the storage needs and data storage locations used by the Cloudera Manager Server and the Cloudera Management Service to store metrics and data.
Failing to plan for the storage needs of all components of the Cloudera Manager Server and the Cloudera Management Service can negatively impact your cluster in the following ways:
- The cluster does not have historical operational data to meet internal requirements.
- The cluster is missing critical audit information that was not gathered or retained for the required length of time.
- Administrators are unable to research past events or health status.
- Administrators do not have historical MR1, YARN, or Impala usage data when they need to reference or report on them later.
- There are gaps in metrics collection and charts.
- The cluster experiences data loss due to filling storage locations to 100% of capacity. The resulting damage from such an event can impact many other components.
There is a main theme here: you need to architect your data storage needs well in advance. You need to inform your operations staff about your critical data storage locations for each host so that they can provision your infrastructure adequately and back it up appropriately. Make sure to document the discovered requirements in your build documentation and run books.
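As one way to keep an eye on the locations this topic documents, the following minimal sketch reports how full each path's filesystem is. It is an illustration, not a Cloudera tool: the path list is taken from the defaults named in this topic, while the 80% warning threshold and the function names `check_paths` and `over_threshold` are assumptions chosen for the example.

```python
import shutil

# Default local storage paths named in this topic. Adjust for your deployment.
DEFAULT_PATHS = [
    "/var/lib/cloudera-scm-server/",
    "/var/lib/cloudera-service-monitor/",
    "/var/lib/cloudera-host-monitor/",
    "/var/lib/cloudera-scm-eventserver/",
    "/var/lib/cloudera-scm-headlamp/",
    "/var/lib/cloudera-scm-navigator/",
    "/opt/cloudera/parcel-repo",
    "/opt/cloudera/parcel-cache",
    "/opt/cloudera/parcels",
]


def percent_used(total, used):
    """Return used space as a percentage of total capacity."""
    return 100.0 * used / total if total else 0.0


def check_paths(paths, usage_fn=shutil.disk_usage):
    """Map each path to its percent used, or None if the path cannot be read."""
    report = {}
    for path in paths:
        try:
            usage = usage_fn(path)
        except OSError:
            # The path does not exist on this host (the role may run elsewhere).
            report[path] = None
            continue
        report[path] = percent_used(usage.total, usage.used)
    return report


def over_threshold(report, threshold_pct=80.0):
    """Return the paths at or above the threshold (80% is an assumed default)."""
    return sorted(
        path for path, pct in report.items()
        if pct is not None and pct >= threshold_pct
    )


if __name__ == "__main__":
    report = check_paths(DEFAULT_PATHS)
    for path in over_threshold(report):
        print(f"WARNING: {path} is {report[path]:.1f}% full")
```

Because most of these roles can run on different hosts, a path that is missing on a given host is reported as `None` rather than treated as an error.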
This topic describes both local disk storage and RDBMS storage, and labels each type within the discussions. The distinction matters for storage planning and also informs lifecycle management events such as migrating roles from one host to another and preparing backups.
The following tables provide details about each individual Cloudera Management service with the goal of enabling Cloudera Manager Administrators to make appropriate storage and lifecycle planning decisions.
Cloudera Manager Server
Entity | Cloudera Manager Server Configuration |
---|---|
Default Storage Location |
RDBMS: Use any supported RDBMS to store the core configuration of your Cloudera Manager database and all cluster, service, and role configurations. See Cloudera Manager and Managed Service Datastores. Disk: Cloudera Manager Server Local Data Storage Directory (command_storage_path), by default /var/lib/cloudera-scm-server/, on the host where the Cloudera Manager Server is configured to run. This local path is used by Cloudera Manager for storing data, including command result files. Critical configurations are not stored in this location. |
Storage Configuration Defaults, Minimum, or Maximum | There are no direct storage defaults relevant to this entity. |
Where to Control Data Retention or Size | The size of the Cloudera Manager Server database varies depending on the number of managed hosts and the number of
discrete commands that have been run in the cluster. To configure the size of the retained command results in the Cloudera Manager Administration Console, select
|
and edit the following property:
Sizing, Planning & Best Practices | The Cloudera Manager Server database is the most vital configuration store in a Cloudera Manager deployment. This
database holds the configuration for clusters, services, roles, and other necessary information that defines a deployment of Cloudera Manager and its managed hosts.
You should perform regular, verified, remotely-stored backups of the Cloudera Manager Server database. |
Cloudera Management Service
Entity | Activity Monitor |
---|---|
Default Storage Location | Any supported RDBMS. |
Storage Configuration Defaults / Minimum / Maximum | Default: 14 days' worth of MapReduce (MRv1) jobs and tasks |
Where to Control Data Retention or Size |
You control Activity Monitor storage usage by configuring the number of days or hours of data to retain. Older data are purged. To configure data retention in the Cloudera Manager Administration Console:
|
Sizing, Planning, and Best Practices |
The Activity Monitor only monitors MapReduce jobs, and does not monitor YARN applications. If you no longer use MapReduce (MRv1) in your cluster, the Activity Monitor is not required for Cloudera Manager 5 (or higher) or CDH 5 (or higher). The amount of storage space needed for 14 days' worth of MapReduce activities can vary greatly and directly depends on the size of your cluster and the level of activity that uses MapReduce. It may be necessary to adjust and readjust the amount of storage as you determine the "stable state" and "burst state" of the MapReduce activity in your cluster. For example, consider the following test cluster and usage:
Sizing observations for this cluster:
|
Entity | Service Monitor Configuration |
---|---|
Default Storage Location | /var/lib/cloudera-service-monitor/ on the host where the Service Monitor role is configured to run. |
Storage Configuration Defaults / Minimum / Maximum |
Total: ~12 GiB Minimum (No Maximum) |
Where to Control Data Retention or Size |
Service Monitor data growth is controlled by configuring the maximum amount of storage space it may use. To configure data retention in Cloudera Manager Administration Console:
|
Sizing, Planning, and Best Practices | The Service Monitor gathers metrics about configured roles and services in your cluster and also runs active health tests. These health tests run regardless of idle and use periods, because they are always relevant. The Service Monitor gathers metrics and health test results regardless of the level of activity in the cluster. This data continues to grow, even in an idle cluster. |
Entity | Host Monitor |
---|---|
Default Storage Location | /var/lib/cloudera-host-monitor/ on the host where the Host Monitor role is configured to run. |
Storage Configuration Defaults / Minimum / Maximum | Default + Minimum: 10 GiB Host Time Series Storage |
Where to Control Data Retention or Size |
Host Monitor data growth is controlled by configuring the maximum amount of storage space it may use. See Data Storage for Monitoring Data. To configure data retention in Cloudera Manager Administration Console:
|
Sizing, Planning, and Best Practices | The Host Monitor gathers metrics about host-level items of interest (for example: disk space usage, RAM, CPU usage, and swapping) and also informs host health tests. The Host Monitor gathers metrics and health test results regardless of the level of activity in the cluster. This data continues to grow fairly linearly, even in an idle cluster. |
Entity | Event Server |
---|---|
Default Storage Location | /var/lib/cloudera-scm-eventserver/ on the host where the Event Server role is configured to run. |
Storage Configuration Defaults | 5,000,000 events retained |
Where to Control Data Retention or Minimum / Maximum |
The amount of storage space the Event Server uses is influenced by configuring how many discrete events it may retain. To configure data retention in Cloudera Manager Administration Console,
|
Sizing, Planning, and Best Practices |
The Event Server is a managed Lucene index that collects relevant events that happen within your cluster, such as results of health tests and log events that are created when a log entry matches a set of rules for identifying messages of interest. It makes these events available for searching, filtering, and additional action. You can view and filter events on the tab of the Cloudera Manager Administration Console. You can also poll this data using the Cloudera Manager API. |
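To illustrate polling events through the Cloudera Manager API, the sketch below only builds a request URL; it does not contact a server. Several details are assumptions to verify against the API documentation for your release: the `events` resource path and its `maxResults` parameter, the API version string `v19`, and the ports 7180 (HTTP) and 7183 (TLS). The host name `cm.example.com` is hypothetical.

```python
from urllib.parse import urlencode


def events_url(host, api_version="v19", max_results=50, use_tls=False):
    """Build a URL for listing events from the Cloudera Manager API.

    Assumptions: the resource path, parameter name, API version, and
    ports are illustrative and may differ in your release.
    """
    scheme = "https" if use_tls else "http"
    port = 7183 if use_tls else 7180
    params = urlencode({"maxResults": max_results})
    return f"{scheme}://{host}:{port}/api/{api_version}/events?{params}"


if __name__ == "__main__":
    # Hypothetical host name, used only to show the shape of the URL.
    print(events_url("cm.example.com"))
```

A real client would send this request with administrator credentials and page through the results rather than fetching them all at once.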
Entity | Reports Manager |
---|---|
Default Storage Location |
RDBMS: Any Supported RDBMS. See Installing and Configuring Databases. Disk: /var/lib/cloudera-scm-headlamp/ on the host where the Reports Manager role is configured to run. |
Storage Configuration Defaults |
RDBMS: There are no exposed defaults or configurations to directly cull or purge the size of this data set. Disk: There are no configuration defaults to influence the size of this location. The size of the data in this location depends not only on the size of the HDFS fsimage, but also on the HDFS path complexity. |
Where to Control Data Retention or Minimum / Maximum |
The Reports Manager uses space in two main locations, one local on the host where Reports Manager runs, and the other in the RDBMS provided to it for its historical aggregation. The RDBMS is not required to be on the same host where the Reports Manager runs. |
Sizing, Planning, and Best Practices |
Reports Manager downloads the fsimage from the NameNode every 60 minutes (default) and stores it locally to perform operations against, including indexing the HDFS filesystem structure represented in the fsimage. A larger fsimage, or deeper and more complex paths within HDFS, consumes more disk space. Reports Manager has no control over the size of the fsimage. If your total HDFS usage trends upward notably or you add excessively long paths in HDFS, it may be necessary to revisit and adjust the amount of space allocated to the Reports Manager for its local storage. Periodically monitor, review, and readjust the local storage allocation. |
Cloudera Navigator
By default, during the Cloudera Manager Installation wizard the Navigator Audit Server and Navigator Metadata Server are assigned to the same host as the Cloudera Management Service monitoring roles. This configuration works for a small cluster, but should be updated before the cluster grows. You can either change the configuration at installation time or move the Navigator Metadata Server if necessary.
Entity | Navigator Audit Server |
---|---|
Default Storage Location |
Any Supported RDBMS. |
Storage Configuration Defaults | Default: 90 Days retention |
Where to Control Data Retention or Min/Max |
Navigator Audit Server storage usage is controlled by configuring how many days of data it may retain. Any older data are purged. To configure data retention in the Cloudera Manager Administration Console:
|
Sizing, Planning, and Best Practices |
The size of the Navigator Audit Server database directly depends on the number of audit events the cluster’s audited services generate. Normally the volume of HDFS audits exceeds the volume of other audits (all other components like MRv1, Hive and Impala read from HDFS, which generates additional audit events). The average size of a discrete HDFS audit event is ~1 KB. For a busy cluster of 50 hosts with ~100K audit events generated per hour, the Navigator Audit Server database would consume ~2.5 GB per day. To retain 90 days of audits at that level, plan for a database size of around 250 GB. If other configured cluster services generate roughly the same amount of data as the HDFS audits, plan for the Navigator Audit Server database to require around 500 GB of storage for 90 days of data. Notes:
Use the Product Compatibility Matrix for Cloudera Navigator to map Cloudera Navigator versions to Cloudera Manager versions. |
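The Navigator Audit Server sizing estimate above can be reproduced with simple arithmetic. The following sketch assumes 1 KB means 1,000 bytes and uses a hypothetical function name, `audit_db_bytes`. The raw products come to about 2.4 GB per day and about 216 GB for 90 days; the figures quoted in the table (~2.5 GB and ~250 GB) are rounded up, which leaves some headroom.

```python
def audit_db_bytes(events_per_hour, bytes_per_event=1000, retention_days=90):
    """Return (bytes per day, bytes for the whole retention period)."""
    daily = events_per_hour * bytes_per_event * 24
    return daily, daily * retention_days


if __name__ == "__main__":
    daily, total = audit_db_bytes(100_000)
    print(f"per day: {daily / 1e9:.1f} GB, 90 days: {total / 1e9:.0f} GB")
    # If other services generate about as much audit data as HDFS, double it.
    print(f"with other services: {2 * total / 1e9:.0f} GB")
```

Substitute your own measured event rate and retention period; the event size and rate in the example are the ones given in the table.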
Entity | Navigator Metadata Server |
---|---|
Default Storage Location |
RDBMS: Any Supported RDBMS. See Installing and Configuring Databases. Disk: /var/lib/cloudera-scm-navigator/ on the host where the Navigator Metadata Server role is configured to run. |
Storage Configuration Defaults |
RDBMS: There are no exposed defaults or configurations to directly cull or purge the size of this data set. Disk: There are no configuration defaults to influence the size of this location. You can change the location itself with the Navigator Metadata Server Storage Dir property. The size of the data in this location depends on the amount of metadata in the system (HDFS fsimage size, Hive Metastore size) and activity on the system (for example, the number of MapReduce jobs run and Hive queries executed). |
Where to Control Data Retention or Min/Max |
RDBMS: There is no maximum size of this data and no way to purge data that is old. Disk: There is no maximum size of this data. As data in the cluster grows its metadata is captured and stored in the location specified by the Navigator Metadata Server Storage Dir property. |
Sizing, Planning, and Best Practices |
Memory: Two activities determine Navigator Metadata Server resource requirements:
The Navigator Metadata Server uses Solr to store, index, and query metadata. Indexing happens during extraction. Querying is fast and efficient because the data is indexed. The Navigator Metadata Server memory requirements are based on the amount of data that is stored and indexed. The Solr instance runs in process with Navigator, so you should set the Java heap for the Navigator Metadata Server accordingly. When the Navigator Metadata Server starts up, it logs the number of documents contained in Solr. For example:
2015-11-11 09:24:58,013 INFO com.cloudera.nav.server.NavServerUtil: Found 68813088 documents in solr core nav_elements
2015-11-11 09:24:58,705 INFO com.cloudera.nav.server.NavServerUtil: Found 78813930 documents in solr core nav_relations
To compute the memory required by the Metadata Server during normal operation, multiply the number of documents in nav_elements by 200 bytes. So for the above example, the recommended amount of memory would be (68813088 * 200) or about 14 GB. For upgrade, use the number of documents in nav_elements + nav_relations. If you use the number in the above example, for upgrade you would need ((68813088 + 78813930) * 200) or about 30 GB. RDBMS: The database is used to store policies and authorization data. The dataset is small, but this database is also used during a Solr schema upgrade, where Solr documents are extracted and inserted again in Solr. This has the same space requirements as the use case above, but the space is only used temporarily during product upgrades. Use this matrix to map Cloudera Navigator and Cloudera Manager versions. Disk: This filesystem location contains all the metadata that is extracted from managed clusters. The data is stored in Solr, so this is the location where Solr stores its index and documents. Depending on the size of the cluster, this data can occupy tens of gigabytes. A guideline is to look at the size of the HDFS fsimage and allocate two to three times that size as the initial size.
The data here is incremental and continues to grow as activity is performed on the cluster. The rate of growth can be on the order of tens of megabytes per day. |
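The memory calculation above can be scripted against the startup log. This is a minimal sketch: the regular expression matches the two example log lines shown in the table, the multiplier of 200 is read as bytes per document (which is what makes the quoted 14 GB and 30 GB results work out), and the function names are hypothetical.

```python
import re

# Matches lines such as "... Found 68813088 documents in solr core nav_elements".
DOC_COUNT = re.compile(r"Found (\d+) documents in solr core (\w+)")
BYTES_PER_DOC = 200  # multiplier from this topic, interpreted as bytes

SAMPLE_LOG = """\
2015-11-11 09:24:58,013 INFO com.cloudera.nav.server.NavServerUtil: Found 68813088 documents in solr core nav_elements
2015-11-11 09:24:58,705 INFO com.cloudera.nav.server.NavServerUtil: Found 78813930 documents in solr core nav_relations
"""


def parse_doc_counts(log_text):
    """Return a dict mapping each Solr core name to its document count."""
    return {core: int(count) for count, core in DOC_COUNT.findall(log_text)}


def heap_bytes(counts, upgrade=False):
    """Estimate memory: nav_elements only, plus nav_relations for an upgrade."""
    docs = counts["nav_elements"]
    if upgrade:
        docs += counts["nav_relations"]
    return docs * BYTES_PER_DOC


def initial_disk_bytes(fsimage_bytes):
    """Return the (low, high) initial disk guideline: 2x to 3x the fsimage."""
    return 2 * fsimage_bytes, 3 * fsimage_bytes


if __name__ == "__main__":
    counts = parse_doc_counts(SAMPLE_LOG)
    print(f"normal operation: {heap_bytes(counts) / 1e9:.1f} GB")
    print(f"during upgrade:   {heap_bytes(counts, upgrade=True) / 1e9:.1f} GB")
```

The same function can be pointed at the text of a real Navigator Metadata Server log to size the Java heap before an upgrade.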
General Performance Notes
When possible:
- For entities that use an RDBMS, install the database on the same host as the service.
- Provide a dedicated spindle to the RDBMS or datastore data directory to avoid disk contention with other read/write activity.
Cluster Lifecycle Management with Cloudera Manager
Cloudera Manager clusters that use parcels to provide CDH and other components require adequate disk space in the following locations:
Parcel Lifecycle Path (default) | Notes |
---|---|
Local Parcel Repository Path
/opt/cloudera/parcel-repo |
This path exists only on the host where Cloudera Manager Server (cloudera-scm-server) runs. The Cloudera Manager Server stages all new parcels in this location as it fetches them from any external repositories. Cloudera Manager Agents are then instructed to fetch the parcels from this location when the administrator distributes the parcel using the Cloudera Manager Administration Console or the Cloudera Manager API. Sizing and Planning: The default location is /opt/cloudera/parcel-repo but you can configure another local filesystem location on the host where Cloudera Manager Server runs. See Parcel Configuration Settings. Provide sufficient space to hold all the parcels you download from all configured Remote Parcel Repository URLs (see Parcel Configuration Settings). Cloudera Manager deployments that manage multiple clusters store all applicable parcels for all clusters. Parcels are provided for each operating system, so be aware that heterogeneous clusters (distinct operating systems represented in the cluster) require more space than clusters with homogeneous operating systems. For example, a cluster with both RHEL 5.x and 6.x hosts must hold -el5 and -el6 parcels in the Local Parcel Repository Path, which requires twice the amount of space. Lifecycle Management and Best Practices: Delete any parcels that are no longer in use from the Cloudera Manager Administration Console (never delete them manually from the command line) to recover disk space in the Local Parcel Repository Path and simultaneously across all managed cluster hosts which hold the parcel. Backup Considerations: Perform regular backups of this path, and consider it a non-optional accessory to backing up Cloudera Manager Server.
If you migrate Cloudera Manager Server to a new host or restore it from a backup (for example, after a hardware failure), recover the full content of this path to the new host, in the /opt/cloudera/parcel-repo directory before starting any cloudera-scm-agent or cloudera-scm-server processes. |
Parcel Cache
/opt/cloudera/parcel-cache |
Managed hosts running a Cloudera Manager Agent stage distributed parcels into this path (as .parcel files, unextracted). Do not manually manipulate this directory or its files. Sizing and Planning: Provide sufficient space per host to hold all the parcels you distribute to each host. You can configure Cloudera Manager to remove these cached .parcel files after they are extracted and placed in /opt/cloudera/parcels/. It is not mandatory to keep these temporary files, but keeping them avoids the need to transfer the .parcel file from the Cloudera Manager Server repository should you need to extract the parcel again for any reason. To configure this behavior in the Cloudera Manager Administration Console, select |
Host Parcel Directory
/opt/cloudera/parcels |
Managed cluster hosts running a Cloudera Manager Agent extract parcels from the /opt/cloudera/parcel-cache directory into this path upon parcel activation. Many critical system symlinks point to files in this path and you should never manually manipulate its contents. Sizing and Planning: Provide sufficient space on each host to hold all the parcels you distribute to each host. Be aware that the typical CDH parcel size is slightly larger than 1 GB per parcel. If you maintain various versions of parcels staged before and after upgrading, be aware of the disk space implications. You can configure Cloudera Manager to automatically remove older parcels once they are no longer in use. As an administrator you can always manually delete parcel versions not in use, but configuring these settings can handle the deletion automatically, in case you forget. To configure this behavior in the Cloudera Manager Administration Console, select and configure the following property:
|
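As a rough planning aid for the Local Parcel Repository Path, the sketch below multiplies the number of retained parcel versions by the number of distinct operating system variants. The per-parcel size of 1.1 GB is an assumption standing in for "slightly larger than 1 GB", and the function name is hypothetical; measure your actual parcels before committing to a figure.

```python
ASSUMED_PARCEL_BYTES = 1_100_000_000  # assumption: "slightly larger than 1 GB"


def parcel_repo_bytes(parcel_versions, os_variants, parcel_bytes=ASSUMED_PARCEL_BYTES):
    """Estimate the space needed to hold every version for every OS variant."""
    return parcel_versions * os_variants * parcel_bytes


if __name__ == "__main__":
    # Two retained CDH versions on a cluster mixing -el5 and -el6 hosts.
    homogeneous = parcel_repo_bytes(parcel_versions=2, os_variants=1)
    mixed = parcel_repo_bytes(parcel_versions=2, os_variants=2)
    print(f"one OS: {homogeneous / 1e9:.1f} GB, two OSes: {mixed / 1e9:.1f} GB")
```

The estimate covers only the repository on the Cloudera Manager Server host. Each managed host needs space for its own operating system's parcels in both the parcel cache and the host parcel directory.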
Task | Description |
---|---|
Activity Monitor (One-time) |
The Activity Monitor only works against a MapReduce (MR1) service, not YARN. So if your deployment has fully migrated to YARN and no longer uses a MapReduce (MR1) service, your Activity Monitor database is no longer growing. If you have waited longer than the default Activity Monitor retention period (14 days) to address this point, then the Activity Monitor has already purged it all for you and your database is mostly empty. If your deployment meets these conditions, consider cleaning up by dropping the Activity Monitor database (again, only when you are satisfied that you no longer need the data or have confirmed that it is no longer in use) and the Activity Monitor role. |
Service Monitor and Host Monitor (One-time) |
For those who used Cloudera Manager version 4.x and have now upgraded to version 5.x: The Service Monitor and Host Monitor were migrated from their previously-configured RDBMS into a dedicated time series store used solely by each of these roles respectively. After this happens, there is still legacy database connection information in the configuration for these roles. This was used to allow for the initial migration but is no longer being used for any active work. After the above migration has taken place, the RDBMS databases previously used by the Service Monitor and Host Monitor are no longer used. Space occupied by these databases is now recoverable. If appropriate in your environment (and you are satisfied that you have long-term backups or do not need the data on disk any longer), you can drop those databases. |
Ongoing Space Reclamation |
Cloudera Management Services automatically roll up, purge, or otherwise consolidate aged data in the background. Configure retention and purging limits per role to control how and when this occurs. These configurations are discussed per entity above. Adjust the default configurations to meet your space limitations or retention needs. |
Log Files
All CDH cluster hosts write out separate log files for each role instance assigned to the host. Cluster administrators can monitor and manage the disk space used by these roles and configure log rotation to prevent log files from consuming too much disk space.
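To see which roles consume the most log space on a host, a small script can total the file sizes under each subdirectory of the log root. This is a sketch under assumptions: it uses /var/log as the root (the common location mentioned at the start of this topic), and the function names are hypothetical. It reads file sizes only and does not modify or rotate anything.

```python
import os


def dir_size_bytes(root):
    """Sum the sizes of all regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if not os.path.islink(path):
                    total += os.path.getsize(path)
            except OSError:
                # The file was rotated or removed while we were scanning.
                pass
    return total


def largest_subdirs(root, top=5):
    """Return up to `top` (size_bytes, name) pairs, largest first."""
    sizes = []
    for entry in os.scandir(root):
        if entry.is_dir(follow_symlinks=False):
            sizes.append((dir_size_bytes(entry.path), entry.name))
    return sorted(sizes, reverse=True)[:top]


if __name__ == "__main__":
    log_root = "/var/log"  # assumed log root; adjust for your hosts
    if os.path.isdir(log_root):
        for size, name in largest_subdirs(log_root):
            print(f"{size / 1e6:10.1f} MB  {name}")
```

Run it with sufficient privileges to read the log directories; unreadable directories are skipped by `os.walk` and therefore undercounted.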