Storage Space Planning for Cloudera Manager

This topic helps you plan for the storage needs and data storage locations used by the Cloudera Manager Server and the Cloudera Management Service to store metrics and data.

Minimum Required Role: Full Administrator. This feature is not available when using Cloudera Manager to manage Data Hub clusters.

Cloudera Manager tracks metrics of services, jobs, and applications in many background processes. All of these metrics require storage. Depending on the size of your organization, this storage can be local or remote, disk-based or in a database, managed by you or by another team in another location.

Most system administrators are aware of common locations like /var/log/ and the need for these locations to have adequate space. Failing to plan for the storage needs of all components of the Cloudera Manager Server and the Cloudera Management Service can negatively impact your cluster in the following ways:

  • The cluster might not be able to retain historical operational data to meet internal requirements.
  • The cluster might miss critical audit information that was not gathered or retained for the required length of time.
  • Administrators might be unable to research past events or health status.
  • Administrators might not have historical MRv1, YARN, or Impala usage data when they need to reference or report on it later.
  • There might be gaps in metrics collection and charts.
  • The cluster might experience data loss if storage locations fill to 100% of capacity. Such an event can affect many other components.

Plan your data storage needs well in advance. Inform your operations staff about the critical data storage locations on each host so that they can provision your infrastructure adequately and back it up appropriately. Document the requirements you identify in your internal build documentation and run books.

This topic describes both local disk storage and RDBMS storage. The distinction matters for storage planning and for lifecycle management events such as migrating roles from one host to another and preparing backups.

The following tables provide details about each individual Cloudera Management service to enable Cloudera Manager administrators to make appropriate storage and lifecycle planning decisions.

Table 1. Cloudera Manager Server
Configuration Topic: Cloudera Manager Server Configuration

Default Storage Location

RDBMS:

Any Supported RDBMS.

Disk:

Cloudera Manager Server Local Data Storage Directory (command_storage_path) on the host where the Cloudera Manager Server is configured to run. This local path is used by Cloudera Manager for storing data, including command result files. Critical configurations are not stored in this location.

Default setting: /var/lib/cloudera-scm-server/

Storage Configuration Defaults, Minimum, or Maximum

There are no direct storage defaults relevant to this entity.

Where to Control Data Retention or Size

The size of the Cloudera Manager Server database varies depending on the number of managed hosts and the number of discrete commands that have been run in the cluster. To configure the size of the retained command results in the Cloudera Manager Administration Console, select Administration > Settings and edit the following property:
Command Eviction Age
Length of time after which inactive commands are evicted from the database.

Default is two years.

Sizing, Planning, and Best Practices

The Cloudera Manager Server database is the most vital configuration store in a Cloudera Manager deployment. This database holds the configuration for clusters, services, roles, and other necessary information that defines a deployment of Cloudera Manager and its managed hosts.

Make sure that you perform regular, verified, remotely stored backups of the Cloudera Manager Server database.

Table 2. Cloudera Management Service - Activity Monitor Configuration
Configuration Topic: Activity Monitor

Default Storage Location: Any Supported RDBMS.

Storage Configuration Defaults / Minimum / Maximum

Default: 14 days' worth of MapReduce (MRv1) jobs and tasks

Where to Control Data Retention or Size

You control Activity Monitor storage usage by configuring the number of days or hours of data to retain. Older data is purged.

To configure data retention in the Cloudera Manager Administration Console:
  1. Go to the Cloudera Management Service.
  2. Click the Configuration tab.
  3. Select Scope > Activity Monitor or Cloudera Management Service (Service-Wide).
  4. Select Category > Main.
  5. Locate the following properties or search for them by typing the property name in the Search box:
    Purge Activities Data at This Age
    In Activity Monitor, purge data about MapReduce jobs and aggregate activities when the data reaches this age in hours. By default, Activity Monitor keeps data about activities for 336 hours (14 days).
    Purge Attempts Data at This Age
    In the Activity Monitor, purge data about MapReduce attempts when the data reaches this age in hours. Because attempt data can consume large amounts of database space, you might want to purge it more frequently than activity data. By default, Activity Monitor keeps data about attempts for 336 hours (14 days).
    Purge MapReduce Service Data at This Age
    The number of hours of past service-level data to keep in the Activity Monitor database, such as total slots running. The default is to keep data for 336 hours (14 days).
  6. Enter a Reason for change, and then click Save Changes to commit the changes.
Sizing, Planning, and Best Practices

The Activity Monitor only monitors MapReduce jobs, and does not monitor YARN applications.

If you no longer use MapReduce (MRv1) in your cluster, the Activity Monitor is not required.

The amount of storage space needed for 14 days' worth of MapReduce activities can vary greatly and directly depends on the size of your cluster and the level of activity that uses MapReduce. You might need to adjust the amount of storage repeatedly as you determine the "stable state" and "burst state" of the MapReduce activity in your cluster.

For example, consider the following test cluster and usage:

  • A simulated 1000-host cluster, each host with 32 slots
  • MapReduce jobs with 200 attempts (tasks) per activity (job)

Sizing observations for this cluster:

  • Each attempt takes 10 minutes to complete.
  • This usage results in roughly 20 thousand jobs a day with approximately 5 million total attempts.
  • For a retention period of 7 days, this Activity Monitor database required 200 GB.
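
The job and attempt counts above can be reproduced with simple arithmetic. The following Python sketch is illustrative only: it assumes that every slot is busy around the clock, and the linear extrapolation from 7 days to another retention period is an assumption, not a measured result. All function names are hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def attempts_per_day(hosts, slots_per_host, attempt_minutes):
    """Attempts completed per day if every slot is continuously busy (assumption)."""
    attempts_per_slot = MINUTES_PER_DAY // attempt_minutes
    return hosts * slots_per_host * attempts_per_slot


def jobs_per_day(attempts, attempts_per_job):
    """Number of whole jobs represented by a given number of attempts."""
    return attempts // attempts_per_job


def scale_storage_gb(observed_gb, observed_days, target_days):
    """Linear extrapolation of database size to a different retention period.

    Linear growth is an assumption made for this sketch.
    """
    return observed_gb * target_days / observed_days


if __name__ == "__main__":
    attempts = attempts_per_day(hosts=1000, slots_per_host=32, attempt_minutes=10)
    print("attempts per day:", attempts)
    print("jobs per day:", jobs_per_day(attempts, attempts_per_job=200))
    print("estimated GB for 14 days:", scale_storage_gb(200, 7, 14))
```

With the figures from the test cluster, this yields about 4.6 million attempts and about 23,000 jobs per day, which agrees with the rounded values above. Under the linear assumption, the default 14-day retention would need roughly twice the 200 GB observed for 7 days.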
Table 3. Cloudera Management Service - Service Monitor Configuration
Configuration Topic: Service Monitor Configuration

Default Storage Location: /var/lib/cloudera-service-monitor/ on the host where the Service Monitor role is configured to run.

Storage Configuration Defaults / Minimum / Maximum
  • 10 GiB Services Time Series Storage
  • 1 GiB Impala Query Storage
  • 1 GiB YARN Application Storage

Total: ~12 GiB Minimum (No Maximum)
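
Before assigning the Service Monitor role to a host, you might want to confirm that the filesystem holding its storage directory can accommodate the defaults listed above. The following Python sketch is one way to do that with the standard library. The function names and the dictionary keys are hypothetical, and the check only compares free space against the configured total; it does not account for data the role has already written.

```python
import shutil

GIB = 1024 ** 3

# Default limits listed above, in GiB.
SERVICE_MONITOR_DEFAULTS_GIB = {
    "time_series": 10,
    "impala_queries": 1,
    "yarn_applications": 1,
}


def required_bytes(defaults_gib=SERVICE_MONITOR_DEFAULTS_GIB):
    """Sum the configured storage limits and convert the total to bytes."""
    return sum(defaults_gib.values()) * GIB


def has_enough_space(path, needed_bytes):
    """Return True if the filesystem that holds `path` has enough free space."""
    return shutil.disk_usage(path).free >= needed_bytes
```

For example, calling has_enough_space("/var/lib/cloudera-service-monitor/", required_bytes()) on the intended host reports whether at least 12 GiB is free on that filesystem. The path must already exist for the call to succeed.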

Where to Control Data Retention or Size

Service Monitor data growth is controlled by configuring the maximum amount of storage space it can use.

To configure data retention in the Cloudera Manager Administration Console:

  1. Go to the Cloudera Management Service.
  2. Click the Configuration tab.
  3. Select Scope > Service Monitor or Cloudera Management Service (Service-Wide).
  4. Select Category > Main.
  5. Locate each of the following properties or search for it by typing its name in the Search box.
    Time-Series Storage

    The approximate amount of disk space dedicated to storing time series and health data. When the store has reached its maximum size, it deletes older data to make room for newer data. The disk usage is approximate because the store only begins deleting data when it reaches the limit.

    Note that Cloudera Manager stores time-series data at a number of different data granularities, and these granularities have different effective retention periods. The Service Monitor stores metric data not only as raw data points but also as ten-minute, hourly, six-hourly, daily, and weekly summary data points. Raw data consumes the bulk of the allocated storage space and weekly summaries consume the least. Raw data is retained for the shortest amount of time while weekly summary points are unlikely to ever be deleted.

    Select Cloudera Management Service > Charts Library tab in Cloudera Manager for information about how space is consumed within the Service Monitor. These pre-built charts also show information about the amount of data retained and time window covered by each data granularity.

    Impala Storage

    The approximate amount of disk space dedicated to storing Impala query data. When the store reaches its maximum size, it deletes older data to make room for newer queries. The disk usage is approximate because the store only begins deleting data when it reaches the limit.

    YARN Storage

    The approximate amount of disk space dedicated to storing YARN application data. When the store reaches its maximum size, it deletes older data to make room for newer applications. The disk usage is approximate because Cloudera Manager only begins deleting data when it reaches the limit.

  6. Enter a Reason for change, and then click Save Changes to commit the changes.
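
The summary granularities described under Time-Series Storage can be pictured as a rollup of raw data points into fixed-width time buckets. The following Python sketch illustrates the general idea only; it is not the Service Monitor's actual storage format or algorithm, and the function name and summary fields are assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean


def summarize(points, bucket_seconds):
    """Group (timestamp_seconds, value) points into fixed-width buckets.

    Returns a dict keyed by bucket start time, with min, max, mean, and count
    for each bucket.
    """
    buckets = defaultdict(list)
    for timestamp, value in points:
        bucket_start = timestamp - timestamp % bucket_seconds
        buckets[bucket_start].append(value)
    return {
        start: {
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
            "count": len(values),
        }
        for start, values in sorted(buckets.items())
    }


if __name__ == "__main__":
    # One raw point per minute for 30 minutes, rolled up into ten-minute summaries.
    raw_points = [(minute * 60, minute) for minute in range(30)]
    for start, summary in summarize(raw_points, bucket_seconds=600).items():
        print(start, summary)
```

In this example, 30 raw points collapse into three ten-minute summary points, which shows why coarser granularities consume far less space and can be retained for longer.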
Sizing, Planning, and Best Practices

The Service Monitor gathers metrics about configured roles and services in your cluster and also runs active health tests. These health tests run during both idle and busy periods, because they are always relevant. As a result, the Service Monitor gathers metrics and health test results regardless of the level of activity in the cluster, and this data continues to grow even in an idle cluster.

Table 4. Cloudera Management Service - Host Monitor
Configuration Topic: Host Monitor Configuration

Default Storage Location: /var/lib/cloudera-host-monitor/ on the host where the Host Monitor role is configured to run.

Storage Configuration Defaults / Minimum / Maximum

Default (and minimum): 10 GiB Host Time Series Storage

Where to Control Data Retention or Size

Host Monitor data growth is controlled by configuring the maximum amount of storage space it can use.

See Data Storage for Monitoring Data.

To configure data retention in the Cloudera Manager Administration Console:
  1. Go to the Cloudera Management Service.
  2. Click the Configuration tab.
  3. Select Scope > Host Monitor or Cloudera Management Service (Service-Wide).
  4. Select Category > Main.
  5. Locate each property or search for it by typing its name in the Search box.
    Time-Series Storage

    The approximate amount of disk space dedicated to storing time series and health data. When the store reaches its maximum size, it deletes older data to make room for newer data. The disk usage is approximate because the store only begins deleting data when it reaches the limit.

    Note that Cloudera Manager stores time-series data at a number of different data granularities, and these granularities have different effective retention periods. Host Monitor stores metric data not only as raw data points but also as summaries of ten minute, one hour, six hour, one day, and one week increments. Raw data consumes the bulk of the allocated storage space and weekly summaries consume the least. Raw data is retained for the shortest amount of time, while weekly summary points are unlikely to ever be deleted.

    See the Cloudera Management Service > Charts Library tab in Cloudera Manager for information on how space is consumed within the Host Monitor. These pre-built charts also show information about the amount of data retained and the time window covered by each data granularity.

  6. Enter a Reason for change, and then click Save Changes to commit the changes.
Sizing, Planning, and Best Practices

The Host Monitor gathers metrics about host-level items of interest (for example, disk space usage, RAM, CPU usage, and swapping) and also informs host health tests. The Host Monitor gathers metrics and health test results regardless of the level of activity in the cluster. This data continues to grow fairly linearly, even in an idle cluster.

Table 5. Cloudera Management Service - Event Server
Configuration Topic: Event Server Configuration

Default Storage Location: /var/lib/cloudera-scm-eventserver/ on the host where the Event Server role is configured to run.

Storage Configuration Defaults: 5,000,000 events retained

Where to Control Data Retention or Minimum / Maximum

The amount of storage space the Event Server uses depends on how many discrete events it is configured to retain.

To configure data retention in the Cloudera Manager Administration Console:
  1. Go to the Cloudera Management Service.
  2. Click the Configuration tab.
  3. Select Scope > Event Server or Cloudera Management Service (Service-Wide).
  4. Select Category > Main.
  5. Edit the following property:
    Maximum Number of Events in the Event Server Store

    The maximum size of the Event Server store, in events. When this size is exceeded, events are deleted, oldest first, until the size of the store is below this threshold.

  6. Enter a Reason for change, and then click Save Changes to commit the changes.
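
The oldest-first deletion behavior described for the Maximum Number of Events in the Event Server Store property can be pictured with a small in-memory model. The following Python sketch is only an illustration of that retention policy. It is not how the Event Server's Lucene index is implemented, and the class and method names are hypothetical.

```python
from collections import deque


class BoundedEventStore:
    """Keeps at most `max_events` events, discarding the oldest first."""

    def __init__(self, max_events):
        self.max_events = max_events
        self._events = deque()

    def add(self, event):
        """Append an event, then evict from the oldest end while over the limit."""
        self._events.append(event)
        while len(self._events) > self.max_events:
            self._events.popleft()

    def events(self):
        """Return the retained events, oldest first."""
        return list(self._events)


if __name__ == "__main__":
    store = BoundedEventStore(max_events=3)
    for event_id in range(1, 6):
        store.add(event_id)
    print(store.events())
```

Because the limit is expressed in events rather than bytes, the disk space used at the limit still depends on the size of the individual events.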
Sizing, Planning, and Best Practices

The Event Server is a managed Lucene index that collects relevant events that happen within your cluster, such as the results of health tests and log events, which are created when a log entry matches a set of rules for identifying messages of interest. The Event Server makes these events available for searching, filtering, and additional action. You can view and filter events on the Diagnostics > Events tab of the Cloudera Manager Administration Console. You can also poll this data using the Cloudera Manager API.
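
As a rough illustration of polling events through the Cloudera Manager API, the following Python sketch builds a query URL and, optionally, sends it with HTTP basic authentication. It assumes that the API exposes an events endpoint that accepts query, maxResults, and resultOffset parameters. The host name, port, API version, and query string shown are placeholders, so check the API reference for your Cloudera Manager version before relying on any of them.

```python
import base64
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def events_url(base_url, api_version, query, max_results=100, offset=0):
    """Build an events query URL. The path and parameter names are assumptions."""
    params = urlencode(
        {"query": query, "maxResults": max_results, "resultOffset": offset}
    )
    return f"{base_url.rstrip('/')}/api/{api_version}/events?{params}"


def fetch_events(url, username, password):
    """Send the request with HTTP basic authentication and decode the JSON body."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    request = Request(url, headers={"Authorization": f"Basic {token}"})
    with urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Hypothetical host, API version, and query string; no request is sent here.
    print(events_url("http://cm.example.com:7180", "v19", "severity==CRITICAL"))
```

Keeping the URL construction separate from the network call makes it possible to review the request before any credentials are sent.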

Table 6. Cloudera Management Service - Reports Manager
Configuration Topic: Reports Manager Configuration

Default Storage Location

RDBMS:

Any Supported RDBMS.

Disk:

/var/lib/cloudera-scm-headlamp/ on the host where the Reports Manager role is configured to run.

Storage Configuration Defaults

RDBMS:

There are no configurable parameters to directly control the size of this data set.

Disk:

There are no configurable parameters to directly control the size of this data set. The storage utilization depends not only on the size of the HDFS fsimage, but also on the HDFS file path complexity. Longer file paths contribute to more space utilization.

Where to Control Data Retention or Minimum / Maximum

The Reports Manager uses space in two main locations: on the Reports Manager host and on its supporting database. Cloudera recommends that the database be on a separate host from the Reports Manager host for process isolation and performance.

Sizing, Planning, and Best Practices

Reports Manager downloads the fsimage from the NameNode (every 60 minutes by default) and stores it locally to perform operations against, including indexing the HDFS filesystem structure. More files and directories result in a larger fsimage, which consumes more disk space.

Reports Manager has no control over the size of the fsimage. If your total HDFS usage trends upward notably or you add excessively long paths in HDFS, it might be necessary to revisit and adjust the amount of local storage allocated to the Reports Manager. Periodically monitor, review, and adjust the local storage allocation.
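
One way to monitor the local storage periodically, as suggested above, is to measure how much space the Reports Manager directory currently occupies. The following Python sketch sums the sizes of regular files under a directory by using only the standard library. The function name is hypothetical, and the result can differ slightly from the output of tools such as du, which report allocated blocks rather than file lengths.

```python
import os


def directory_size_bytes(root):
    """Sum the sizes of all regular files under `root`, skipping symbolic links."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            try:
                total += os.path.getsize(path)
            except OSError:
                # The file was removed or became unreadable during the walk.
                continue
    return total
```

Running directory_size_bytes("/var/lib/cloudera-scm-headlamp/") on the Reports Manager host at regular intervals and recording the result gives a simple trend line for deciding when to adjust the allocation.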

Table 7. Cloudera Navigator - Navigator Audit Server
Configuration Topic: Navigator Audit Server Configuration

Default Storage Location: Any Supported RDBMS.

Storage Configuration Defaults

Default: 90 days of retention

Where to Control Data Retention or Min/Max

Navigator Audit Server storage usage is controlled by configuring how many days of data it can retain. Any older data is purged.

To configure data retention in the Cloudera Manager Administration Console:

  1. Go to the Cloudera Management Service.
  2. Click the Configuration tab.
  3. Select Scope > Navigator Audit Server or Cloudera Management Service (Service-Wide).
  4. Select Category > Main.
  5. Locate the Navigator Audit Server Data Expiration Period property or search for it by typing its name in the Search box.
    Navigator Audit Server Data Expiration Period
    In Navigator Audit Server, purge audit data of various auditable services when the data reaches this age in days. By default, Navigator Audit Server keeps data about audits for 90 days.
  6. Click Save Changes to commit the changes.
Sizing, Planning, and Best Practices

The size of the Navigator Audit Server database directly depends on the number of audit events the cluster’s audited services generate. Normally the volume of HDFS audits exceeds the volume of other audits, because all other components, such as MRv1, Hive, and Impala, read from HDFS, which generates additional audit events.

The average size of a discrete HDFS audit event is ~1 KB. For a busy cluster of 50 hosts with ~100K audit events generated per hour, the Navigator Audit Server database would consume ~2.5 GB per day. To retain 90 days of audits at that level, plan for a database size of around 250 GB. If other configured cluster services generate roughly the same amount of data as the HDFS audits, plan for the Navigator Audit Server database to require around 500 GB of storage for 90 days of data.
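
The estimate above follows from multiplying the event rate, the event size, and the retention period. The following Python sketch reproduces that arithmetic. It assumes that 1 KB means 1,000 bytes and that the event rate is constant, and the function name is hypothetical.

```python
def audit_db_bytes(events_per_hour, bytes_per_event, retention_days):
    """Estimate audit database size, assuming a constant event rate."""
    return events_per_hour * 24 * bytes_per_event * retention_days


if __name__ == "__main__":
    per_day = audit_db_bytes(100_000, 1_000, retention_days=1)
    ninety_days = audit_db_bytes(100_000, 1_000, retention_days=90)
    print(f"per day: {per_day / 1e9:.1f} GB")
    print(f"90 days: {ninety_days / 1e9:.0f} GB")
```

The raw arithmetic gives about 2.4 GB per day and about 216 GB for 90 days. The figures of ~2.5 GB and around 250 GB quoted above round these values up, which leaves some headroom.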

Notes:

  • Individual Hive and Impala queries themselves can be very large. Since the query itself is part of an audit event, such audit events consume space in proportion to the length of the query.
  • The amount of space required increases as activity on the cluster increases. In some cases, Navigator Audit Server databases can exceed 1 TB for 90 days of audit events. Benchmark your cluster periodically and adjust accordingly.

To map Cloudera Navigator versions to Cloudera Manager versions, see Product Compatibility Matrix for Cloudera Navigator.

Table 8. Cloudera Navigator - Navigator Metadata Server
Configuration Topic: Navigator Metadata Server Configuration

Default Storage Location

RDBMS:

Any Supported RDBMS.

Disk:

/var/lib/cloudera-scm-navigator/ on the host where the Navigator Metadata Server role is configured to run.

Storage Configuration Defaults

RDBMS:

There are no exposed defaults or configurations to directly cull or purge this data set.

Disk:

There are no configuration defaults to influence the size of this location. You can change the location itself with the Navigator Metadata Server Storage Dir property. The size of the data in this location depends on the amount of metadata in the system (HDFS fsimage size, Hive Metastore size) and activity on the system (the number of MapReduce jobs run, Hive queries executed, and so on).

Where to Control Data Retention or Min/Max

RDBMS:

The Navigator Metadata Server database should be carefully tuned to support large volumes of metadata.

Disk:

The Navigator Metadata Server index (an embedded Solr instance) can consume a large amount of disk space at the location specified for the Navigator Metadata Server Storage Dir property. Ongoing maintenance tasks include purging metadata from the system.

Sizing, Planning, and Best Practices

Memory:

See Navigator Metadata Server Tuning.

RDBMS:

The database is used to store policies and authorization data. The dataset is small, but this database is also used during a Solr schema upgrade, where Solr documents are extracted and inserted again in Solr. This has the same space requirements as the use case above, but the space is only used temporarily during product upgrades.

Use the product compatibility matrix to map Cloudera Navigator and Cloudera Manager versions.

Disk:

This filesystem location contains all the metadata that is extracted from managed clusters. The data is stored in Solr, so this is the location where Solr stores its index and documents. Depending on the size of the cluster, this data can occupy tens of gigabytes. A guideline is to look at the size of the HDFS fsimage and allocate two to three times that size as the initial size. The data here is incremental and continues to grow as activity is performed on the cluster. The rate of growth can be on the order of tens of megabytes per day.
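
The guideline above can be turned into a rough starting estimate. The following Python sketch applies the two-to-three-times factor to an fsimage size and then adds a constant daily growth. The function names, the example fsimage size of 5 GiB, and the growth rate of 50 MB per day are assumptions chosen for illustration; measure your own cluster before committing to an allocation.

```python
GIB = 1024 ** 3
MB = 1000 ** 2


def initial_allocation_bytes(fsimage_bytes, low_factor=2, high_factor=3):
    """Return the (low, high) initial allocation range from the fsimage size."""
    return fsimage_bytes * low_factor, fsimage_bytes * high_factor


def projected_bytes(initial_bytes, growth_bytes_per_day, days):
    """Project usage after `days`, assuming constant daily growth."""
    return initial_bytes + growth_bytes_per_day * days


if __name__ == "__main__":
    # Hypothetical inputs: a 5 GiB fsimage and 50 MB of growth per day.
    low, high = initial_allocation_bytes(5 * GIB)
    print(f"initial allocation: {low / GIB:.0f} to {high / GIB:.0f} GiB")
    after_year = projected_bytes(high, 50 * MB, days=365)
    print(f"after one year: {after_year / GIB:.1f} GiB")
```

Treat the result as a first estimate and revise it as you observe the actual growth rate.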

General Performance Notes

When possible:

  • For entities that use an RDBMS, install the database on a separate host from the service, and consolidate roles that use databases on as few servers as possible.

  • Provide a dedicated spindle to the RDBMS or datastore data directory to avoid disk contention with other read/write activity.