Configuring snapshots

Learn about snapshots and configuring snapshot behavior in Cloudera Surveyor. Snapshots determine how frequently the data presented on the UI is refreshed.

About snapshots

Cloudera Surveyor automatically collects snapshots of Kafka cluster data at configured intervals. These snapshots capture the current state of topics, partitions, consumer groups, and other cluster metadata. Most data presented on the UI is based on these snapshots.

Because the UI relies on snapshots, most changes in Kafka clusters appear on the UI only after the next scheduled snapshot is taken. For example, if snapshots occur every 10 minutes, a new topic created in Kafka might not be visible on the UI for up to 10 minutes.

The snapshot system provides multiple configuration settings to control data collection intervals, timeouts, and resource usage. Configuration enables you to fine-tune data collection behavior. For example, more frequent snapshots provide fresher data but consume more resources, while less frequent snapshots reduce resource usage but may show older data.

Configuration levels and time format

Most snapshot settings can be configured at two levels:

  • Global settings – Apply to all clusters by default. Configured using surveyorConfig.surveyor.* properties.

  • Per-cluster overrides – Override global settings for specific clusters. Configured using clusterConfigs.clusters[*].* properties.

Per-cluster settings take precedence over global settings, allowing you to customize behavior for individual clusters while maintaining sensible defaults for all others.

All duration and interval values throughout the snapshot configuration are specified in ISO 8601 duration format. For example, PT5M for 5 minutes, PT1H for 1 hour, and PT30S for 30 seconds.
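As a quick reference, the following fragment sketches how common durations are written in this notation. The property shown is the global snapshot interval described later in this document, and the values are illustrative only. Compound values such as PT1H30M are valid ISO 8601 durations, but whether Cloudera Surveyor accepts every form is an assumption you should verify for your version.

```yaml
# Illustrative values only; each property takes a single duration.
surveyorConfig:
  surveyor:
    globalSnapshotInterval: PT10M   # 10 minutes
    # Other durations in the same notation:
    #   PT30S    = 30 seconds
    #   PT5M     = 5 minutes
    #   PT1H     = 1 hour
    #   PT1H30M  = 1 hour 30 minutes (compound form; assumed to be accepted)
```

In ISO 8601, the leading P marks a duration and T introduces the time components. PT5M therefore means 5 minutes, whereas P5M would mean 5 months.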

Configuring snapshot intervals

Configure how frequently Cloudera Surveyor collects snapshots from registered Kafka clusters.

The snapshot interval directly determines how fresh the data presented on the UI is.

Configure the snapshot interval globally for all clusters using surveyorConfig.surveyor.globalSnapshotInterval, or on a per-cluster basis using clusterConfigs.clusters[*].snapshotInterval. High-activity clusters might benefit from more frequent snapshots, while stable clusters can use longer intervals to reduce resource usage.

The following example sets a global snapshot interval of 10 minutes and adds per-cluster overrides for a production and a development cluster.

#...
surveyorConfig:
  surveyor:
    globalSnapshotInterval: PT10M

clusterConfigs:
  clusters:
    - clusterName: "production-cluster"
      bootstrapServers: "prod-kafka-1:9092,prod-kafka-2:9092"
      snapshotInterval: PT5M
    - clusterName: "development-cluster"
      bootstrapServers: "dev-kafka:9092"
      snapshotInterval: PT30M
    - clusterName: "staging-cluster"
      bootstrapServers: "staging-kafka:9092"

The production-cluster uses a shorter interval for more frequent updates, the development-cluster uses a much longer interval, and the staging-cluster has no override, so it uses the global default of 10 minutes.

Configuring snapshot reliability settings

Configure advanced snapshot settings to optimize reliability and performance for clusters with specific network conditions, resource constraints, or availability requirements.

Beyond basic snapshot intervals, Cloudera Surveyor provides additional properties to handle various operational scenarios. Use these settings to fine-tune snapshotting behavior. Timeouts prevent hanging on slow clusters, time-to-live (TTL) settings maintain data availability during temporary failures, and resource management settings control system load.

  • Configure snapshot timeout settings with surveyorConfig.surveyor.globalSnapshotTimeout globally or with clusterConfigs.clusters[*].snapshotTimeout per cluster to handle slow or unresponsive clusters.

    Snapshot timeouts prevent Cloudera Surveyor from waiting indefinitely for slow clusters. Per-cluster overrides are useful for clusters that consistently need more time. For example:

    #...
    surveyorConfig:
      surveyor:
        globalSnapshotTimeout: PT2M
    
    clusterConfigs:
      clusters:
        - clusterName: "slow-cluster"
          bootstrapServers: "slow-kafka:9092"
          snapshotTimeout: PT5M

  • Configure snapshot TTL settings with surveyorConfig.surveyor.globalSnapshotTtl globally or with clusterConfigs.clusters[*].snapshotTtl per cluster to handle intermittent failures.

    TTL settings determine how long to keep the last successful snapshot when subsequent snapshots fail. Per-cluster overrides are useful for clusters with intermittent connectivity issues. For example:

    #...
    surveyorConfig:
      surveyor:
        globalSnapshotTtl: PT15M
    
    clusterConfigs:
      clusters:
        - clusterName: "unstable-cluster"
          bootstrapServers: "unstable-kafka:9092"
          snapshotTtl: PT30M

  • Configure global resource management settings with surveyorConfig.surveyor.maxGlobalSnapshotParallelism and surveyorConfig.surveyor.snapshotMaxJitter to control system-wide snapshot processing.

    These properties control system-wide resources and apply to all clusters. They are global-only and cannot be configured per cluster. The maxGlobalSnapshotParallelism property limits the number of threads for processing snapshots across all clusters. The snapshotMaxJitter property adds an initial delay to distribute load when managing many clusters. For example:

    #...
    surveyorConfig:
      surveyor:
        maxGlobalSnapshotParallelism: 4
        snapshotMaxJitter: PT30S
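
The settings in this section can be combined with snapshot intervals in a single configuration. The following is a minimal sketch that reuses the property names shown above. The cluster name and host are hypothetical, and the values are not recommendations. It assumes, as a rule of thumb rather than a documented requirement, that each timeout is shorter than the matching snapshot interval and each TTL is longer than it, so that a single failed snapshot does not immediately leave the UI without data.

```yaml
#...
surveyorConfig:
  surveyor:
    globalSnapshotInterval: PT10M        # collect a snapshot every 10 minutes
    globalSnapshotTimeout: PT2M          # stop waiting for a snapshot after 2 minutes
    globalSnapshotTtl: PT15M             # keep the last successful snapshot for 15 minutes
    maxGlobalSnapshotParallelism: 4      # global-only: threads for processing snapshots
    snapshotMaxJitter: PT30S             # global-only: initial delay to distribute load

clusterConfigs:
  clusters:
    - clusterName: "example-cluster"     # hypothetical cluster
      bootstrapServers: "example-kafka:9092"
      snapshotInterval: PT5M             # overrides globalSnapshotInterval
      snapshotTimeout: PT1M              # overrides globalSnapshotTimeout
      snapshotTtl: PT10M                 # overrides globalSnapshotTtl
```

Clusters that set no per-cluster values fall back to the global settings, as described in Configuration levels and time format.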