Storage Space Planning for Cloudera Manager
This topic helps you plan for the storage needs and data storage locations used by the Cloudera Manager Server and the Cloudera Management Service to store metrics and data.
Minimum Required Role: Full Administrator. This feature is not available when using Cloudera Manager to manage Data Hub clusters.
Cloudera Manager tracks metrics of services, jobs, and applications in many background processes. All of these metrics require storage. Depending on the size of your organization, this storage can be local or remote, disk-based or in a database, managed by you or by another team in another location.
Most system administrators are aware of common locations like /var/log/ and the need for these locations to have adequate space. Failing to plan for the storage needs of all components of the Cloudera Manager Server and the Cloudera Management Service can negatively impact your cluster in the following ways:
- The cluster might not be able to retain historical operational data to meet internal requirements.
- The cluster might miss critical audit information that was not gathered or retained for the required length of time.
- Administrators might be unable to research past events or health status.
- Administrators might not have historical MRv1, YARN, or Impala usage data when they need to reference or report on it later.
- There might be gaps in metrics collection and charts.
- The cluster might experience data loss due to filling storage locations to 100% of capacity. The effects of such an event can impact many other components.
Plan your data storage needs well in advance. Inform your operations staff about the critical data storage locations on each host so that they can provision your infrastructure adequately and back it up appropriately. Document the discovered requirements in your internal build documentation and runbooks.
This topic describes both local disk storage and RDBMS storage. The distinction matters for storage planning and also for lifecycle management events such as migrating roles from one host to another and preparing backups.
The following tables provide details about each individual Cloudera Management service to enable Cloudera Manager administrators to make appropriate storage and lifecycle planning decisions.
Configuration Topic | Cloudera Manager Server Configuration |
---|---|
Default Storage Location | RDBMS: Any Supported RDBMS. Disk: Cloudera Manager Server Local Data Storage Directory ( Default setting: |
Storage Configuration Defaults, Minimum, or Maximum | There are no direct storage defaults relevant to this entity. |
Where to Control Data Retention or Size | The size of the Cloudera Manager Server database varies depending on the number of managed hosts and the number of discrete commands that have been run in the cluster. To configure the size of the retained command results in the Cloudera Manager Administration Console, select and edit the following property: |
Sizing, Planning, and Best Practices | The Cloudera Manager Server database is the most vital configuration store in a Cloudera Manager deployment. This database holds the configuration for clusters, services, roles, and other necessary information that defines a deployment of Cloudera Manager and its managed hosts. Make sure that you perform regular, verified, remotely-stored backups of the Cloudera Manager Server database. |
Configuration Topic | Activity Monitor |
---|---|
Default Storage Location | Any Supported RDBMS. |
Storage Configuration Defaults / Minimum / Maximum | Default: 14 days' worth of MapReduce (MRv1) jobs/tasks |
Where to Control Data Retention or Size | You control Activity Monitor storage usage by configuring the number of days or hours of data to retain. Older data is purged. To configure data retention in the Cloudera Manager Administration Console: |
Sizing, Planning, and Best Practices | The Activity Monitor monitors only MapReduce jobs; it does not monitor YARN applications. The amount of storage space needed for 14 days' worth of MapReduce activities can vary greatly and depends directly on the size of your cluster and the level of activity that uses MapReduce. It might be necessary to adjust and readjust the amount of storage as you determine the "stable state" and "burst state" of the MapReduce activity in your cluster. For example, consider the following test cluster and usage: Sizing observations for this cluster: |
Configuration Topic | Service Monitor Configuration |
---|---|
Default Storage Location | /var/lib/cloudera-service-monitor/ on the host where the Service Monitor role is configured to run. |
Storage Configuration Defaults / Minimum / Maximum | Total: ~12 GiB Minimum (No Maximum) |
Where to Control Data Retention or Size | Service Monitor data growth is controlled by configuring the maximum amount of storage space it can use. To configure data retention in Cloudera Manager Administration Console: |
Sizing, Planning, and Best Practices | The Service Monitor gathers metrics about configured roles and services in your cluster and also runs active health tests. These health tests run regardless of idle and use periods, because they are always relevant. The Service Monitor gathers metrics and health test results regardless of the level of activity in the cluster. This data continues to grow, even in an idle cluster. |
Configuration Topic | Host Monitor Configuration |
---|---|
Default Storage Location | /var/lib/cloudera-host-monitor/ on the host where the Host Monitor role is configured to run. |
Storage Configuration Defaults / Minimum / Maximum | Default (and minimum): 10 GiB Host Time Series Storage |
Where to Control Data Retention or Size | Host Monitor data growth is controlled by configuring the maximum amount of storage space it can use. See Data Storage for Monitoring Data. To configure these data retention configuration properties in the Cloudera Manager Administration Console: |
Sizing, Planning, and Best Practices | The Host Monitor gathers metrics about host-level items of interest (for example, disk space usage, RAM, CPU usage, and swapping) and also informs host health tests. The Host Monitor gathers metrics and health test results regardless of the level of activity in the cluster. This data continues to grow fairly linearly, even in an idle cluster. |
Configuration Topic | Event Server Configuration |
---|---|
Default Storage Location | /var/lib/cloudera-scm-eventserver/ on the host where the Event Server role is configured to run. |
Storage Configuration Defaults | 5,000,000 events retained |
Where to Control Data Retention or Minimum / Maximum | The amount of storage space the Event Server uses is influenced by configuring how many discrete events it can retain. To configure data retention in Cloudera Manager Administration Console, |
Sizing, Planning, and Best Practices | The Event Server is a managed Lucene index that collects relevant events that happen within your cluster, such as results of health tests and log events that are created when a log entry matches a set of rules for identifying messages of interest, and makes them available for searching, filtering, and additional action. You can view and filter events in the Cloudera Manager Administration Console. You can also poll this data using the Cloudera Manager API. |
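
The default local directories named in the Service Monitor, Host Monitor, and Event Server tables above can be checked with a short script. The following is a minimal sketch, not a Cloudera tool: the 80% warning threshold and the function names `check_location`, `nearest_existing`, and `report` are assumptions chosen for illustration. If you have changed the default directories, the paths need to be adjusted.

```python
import shutil
from pathlib import Path

GIB = 1024 ** 3

# Default directories and minimum sizes as quoted in the tables above.
# No size minimum is quoted for the Event Server, so it is None.
STORAGE_LOCATIONS = {
    "/var/lib/cloudera-service-monitor/": 12 * GIB,
    "/var/lib/cloudera-host-monitor/": 10 * GIB,
    "/var/lib/cloudera-scm-eventserver/": None,
}


def check_location(path, minimum_bytes, total, used, warn_percent=80.0):
    """Return warning strings for one location (warn_percent is an assumed threshold)."""
    warnings = []
    percent_used = 100.0 * used / total
    if percent_used >= warn_percent:
        warnings.append(f"{path}: filesystem is {percent_used:.0f}% full")
    if minimum_bytes is not None and total < minimum_bytes:
        warnings.append(f"{path}: filesystem is smaller than the documented minimum")
    return warnings


def nearest_existing(path):
    """Walk up to the closest existing ancestor so the check also works before installation."""
    candidate = Path(path)
    while not candidate.exists() and candidate != candidate.parent:
        candidate = candidate.parent
    return candidate


def report():
    """Print a warning line for every location that fails a check."""
    for path, minimum in STORAGE_LOCATIONS.items():
        usage = shutil.disk_usage(nearest_existing(path))
        for line in check_location(path, minimum, usage.total, usage.used):
            print("WARNING", line)


if __name__ == "__main__":
    report()
```

Run the script on each host that carries one of these roles. It measures the filesystem that holds each directory, so several directories on the same mount report the same usage.
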
Configuration Topic | Reports Manager Configuration |
---|---|
Default Storage Location | RDBMS: Any Supported RDBMS. Disk: |
Storage Configuration Defaults | RDBMS: There are no configurable parameters to directly control the size of this data set. Disk: There are no configurable parameters to directly control the size of this data set. The storage utilization depends not only on the size of the HDFS fsimage, but also on the HDFS file path complexity. Longer file paths contribute to more space utilization. |
Where to Control Data Retention or Minimum / Maximum | The Reports Manager uses space in two main locations: on the Reports Manager host and on its supporting database. Cloudera recommends that the database be on a separate host from the Reports Manager host for process isolation and performance. |
Sizing, Planning, and Best Practices | Reports Manager downloads the fsimage from the NameNode (every 60 minutes by default) and stores it locally to perform operations against, including indexing the HDFS filesystem structure. More files and directories result in a larger fsimage, which consumes more disk space. Reports Manager has no control over the size of the fsimage. If your total HDFS usage trends upward notably or you add excessively long paths in HDFS, it might be necessary to revisit and adjust the amount of local storage allocated to the Reports Manager. Periodically monitor, review, and adjust the local storage allocation. |
Configuration Topic | Navigator Audit Server Configuration |
---|---|
Default Storage Location | Any Supported RDBMS. |
Storage Configuration Defaults | Default: 90 Days retention |
Where to Control Data Retention or Min/Max | Navigator Audit Server storage usage is controlled by configuring how many days of data it can retain. Any older data is purged. To configure data retention in the Cloudera Manager Administration Console: |
Sizing, Planning, and Best Practices | The size of the Navigator Audit Server database directly depends on the number of audit events the cluster’s audited services generate. Normally the volume of HDFS audits exceeds the volume of other audits (all other components like MRv1, Hive and Impala read from HDFS, which generates additional audit events). The average size of a discrete HDFS audit event is ~1 KB. For a busy cluster of 50 hosts with ~100K audit events generated per hour, the Navigator Audit Server database would consume ~2.5 GB per day. To retain 90 days of audits at that level, plan for a database size of around 250 GB. If other configured cluster services generate roughly the same amount of data as the HDFS audits, plan for the Navigator Audit Server database to require around 500 GB of storage for 90 days of data. Notes: To map Cloudera Navigator versions to Cloudera Manager versions, see Product Compatibility Matrix for Cloudera Navigator. |
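
The Navigator Audit Server sizing arithmetic above can be reproduced for other event rates and retention periods. This is a rough sketch only: the function name `audit_db_size_gb` and its parameters are hypothetical, it uses decimal units (1 GB = 1,000,000 KB), and it ignores database overhead such as indexes. For the example cluster the raw arithmetic gives about 216 GB for HDFS audits alone and about 432 GB when other services double the volume, which the guidance above rounds up to around 250 GB and 500 GB.

```python
def audit_db_size_gb(events_per_hour, retention_days=90, avg_event_kb=1.0,
                     service_multiplier=1.0):
    """Estimate audit database size in GB.

    avg_event_kb defaults to the ~1 KB average HDFS audit event quoted above.
    service_multiplier is an assumed scaling factor for audits from other
    services (2.0 means they generate about as much data as HDFS audits).
    """
    kb_total = (events_per_hour * 24 * retention_days
                * avg_event_kb * service_multiplier)
    return kb_total / 1_000_000  # KB to GB, decimal units


hdfs_only = audit_db_size_gb(100_000)
all_services = audit_db_size_gb(100_000, service_multiplier=2.0)
print(f"HDFS audits only: {hdfs_only:.0f} GB")
print(f"All services:     {all_services:.0f} GB")
```

Treat the result as a lower bound and add headroom, as the rounded figures in the table do.
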
Configuration Topic | Navigator Metadata Server Configuration |
---|---|
Default Storage Location | RDBMS: Any Supported RDBMS. Disk: |
Storage Configuration Defaults | RDBMS: There are no exposed defaults or configurations to directly cull or purge the size of this data set. Disk: There are no configuration defaults to influence the size of this location. You can change the location itself with the Navigator Metadata Server Storage Dir property. The size of the data in this location depends on the amount of metadata in the system (HDFS fsimage size, Hive Metastore size) and activity on the system (for example, the number of MapReduce jobs run and Hive queries executed). |
Where to Control Data Retention or Min/Max | RDBMS: The Navigator Metadata Server database should be carefully tuned to support large volumes of metadata. Disk: The Navigator Metadata Server index (an embedded Solr instance) can consume a large amount of disk space at the location specified for the Navigator Metadata Server Storage Dir property. Ongoing maintenance tasks include purging metadata from the system. |
Sizing, Planning, and Best Practices | Memory: See Navigator Metadata Server Tuning. RDBMS: The database is used to store policies and authorization data. The dataset is small, but this database is also used during a Solr schema upgrade, where Solr documents are extracted and inserted again in Solr. This has the same space requirements as the above use case, but the space is only used temporarily during product upgrades. Use the product compatibility matrix to map Cloudera Navigator and Cloudera Manager versions. Disk: This filesystem location contains all the metadata that is extracted from managed clusters. The data is stored in Solr, so this is the location where Solr stores its index and documents. Depending on the size of the cluster, this data can occupy tens of gigabytes. A guideline is to look at the size of the HDFS fsimage and allocate two to three times that size as the initial size. The data here is incremental and continues to grow as activity is performed on the cluster. The rate of growth can be on the order of tens of megabytes per day. |
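
The disk guideline for the Navigator Metadata Server (two to three times the fsimage size initially, plus incremental growth) can be turned into a planning range. The sketch below is illustrative: the function name `metadata_storage_estimate_gb` is hypothetical, and the 4 GB fsimage, 50 MB per day growth rate, and one-year horizon in the example are assumed inputs, not measured or documented values.

```python
def metadata_storage_estimate_gb(fsimage_gb, growth_mb_per_day, days):
    """Return a (low, high) estimate in GB for the Solr storage directory.

    low and high apply the 2x and 3x fsimage guideline; growth is added
    linearly, which is a simplifying assumption.
    """
    growth_gb = growth_mb_per_day * days / 1000  # MB to GB, decimal units
    return (2 * fsimage_gb + growth_gb, 3 * fsimage_gb + growth_gb)


# Assumed inputs: 4 GB fsimage, 50 MB/day growth, planning one year ahead.
low, high = metadata_storage_estimate_gb(fsimage_gb=4, growth_mb_per_day=50, days=365)
print(f"Plan for roughly {low:.1f} to {high:.1f} GB")
```

Measure the actual fsimage size and observed growth on your cluster before relying on the result, and revisit the estimate after metadata purges.
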
General Performance Notes
When possible:
- For entities that use an RDBMS, install the database on a separate host from the service, and consolidate roles that use databases on as few servers as possible.
- Provide a dedicated spindle to the RDBMS or datastore data directory to avoid disk contention with other read/write activity.