Cloudera Storage Optimizer overview

Learn about Cloudera Storage Optimizer and its benefits.

Cloudera Storage Optimizer is a data lifecycle management feature that automatically reduces storage usage by converting infrequently accessed data from replicated storage (RATIS 3×) to space-efficient erasure coding (EC). It analyzes access patterns and applies configurable policies to identify and convert cold data, reducing the storage footprint of that data by 45% to 60% while maintaining data durability and availability. The result is more usable capacity for the same licensed capacity.

Advantages

Cloudera Storage Optimizer continuously monitors your data access patterns and automatically transitions cold data to a more space-efficient storage format, which provides the following advantages:

  • Storage reduction: Converts 3× replicated data to EC format, which carries only 40 to 50 percent storage overhead.

    For example, consider 100 TB of data of which 70% is cold (Data Under Management unchanged):

    • Before (3× replication): 100 TB × 3 = 300 TB storage used
    • After conversion (hot data on 3× replication, cold data on EC):
      • Hot data of 30 TB at 3× replication: 30 TB × 3 = 90 TB
      • Cold data of 70 TB at EC of approximately 1.5×: 70 TB × 1.5 = 105 TB
      • Total storage used after conversion: 90 + 105 = 195 TB
      • Total storage freed after conversion: 300 - 195 = 105 TB (approximately 35 percent of storage usage saved)
  • Cost savings: Reduces storage costs for cold data by 45 to 60 percent.
  • Automated management: No manual intervention is required once configured.
  • Compliance ready: Maintains full data durability and audit trails.
  • Performance optimized: Hot data remains in high-performance RATIS format.
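
The arithmetic in the storage reduction example can be reproduced with a short Python sketch. It is illustrative only: the function names are hypothetical, and the 6 data + 3 parity layout is just one erasure coding layout that yields the approximate 1.5× factor used above. The scheme that Cloudera Storage Optimizer actually applies is not stated here.

```python
def ec_overhead_factor(data_blocks: int, parity_blocks: int) -> float:
    """Raw bytes stored per logical byte for an EC layout with the given block counts."""
    return (data_blocks + parity_blocks) / data_blocks


def estimate_storage(total_tb: float, cold_tb: float,
                     replication_factor: float = 3.0,
                     ec_factor: float = 1.5) -> tuple[float, float, float]:
    """Return (before, after, freed) raw storage in TB.

    'before' assumes all data is replicated; 'after' assumes only the
    cold portion is converted to EC and the hot portion stays replicated.
    """
    hot_tb = total_tb - cold_tb
    before = total_tb * replication_factor
    after = hot_tb * replication_factor + cold_tb * ec_factor
    return before, after, before - after


# Assumption: a 6 data + 3 parity layout, which gives a factor of 1.5.
factor = ec_overhead_factor(6, 3)

before, after, freed = estimate_storage(total_tb=100, cold_tb=70, ec_factor=factor)
print(f"Before: {before:.0f} TB, after: {after:.0f} TB, "
      f"freed: {freed:.0f} TB ({freed / before * 100:.0f}%)")
# prints: Before: 300 TB, after: 195 TB, freed: 105 TB (35%)
```

Changing `cold_tb` shows how the overall saving depends on the share of cold data: the 45 to 60 percent range applies to the converted data itself, while the cluster-wide saving is smaller whenever some data stays hot.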

Limitations

Cloudera Storage Optimizer supports only one-way conversion of keys from RATIS to EC. After keys are converted to EC, Cloudera Storage Optimizer cannot convert them back to RATIS.

How does Cloudera Storage Optimizer work?

Cloudera Storage Optimizer operates through a five-stage pipeline that runs daily to optimize your storage automatically. The pipeline builds on Ozone's existing replication pipeline to provide intelligent data management.

The five-stage processing pipeline works as follows:
  • Audit log aggregation: The system collects and processes Ozone audit logs to understand data access patterns. Every read, write, and delete operation is tracked and aggregated into structured data using Cloudera Hive-on-Tez.
  • Metadata import: Cloudera Storage Optimizer fetches comprehensive metadata about all storage objects from the Ozone Recon API, including file sizes, creation dates, and current replication types.
  • Access pattern analysis: By joining metadata with access statistics, the system creates a complete view of your data usage patterns, identifying which files are truly cold. This analysis leverages Cloudera's data engineering capabilities.
  • Candidate selection: Configurable policies determine which files should be converted. For example, files larger than 32 MB that have not been accessed in the last 30 days become conversion candidates.
  • Distributed conversion: MapReduce jobs perform the actual conversion from RATIS to Erasure Coding, processing data in parallel for maximum efficiency using YARN resources. All of this happens in the background without impacting your applications or users.
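
The access pattern analysis and candidate selection stages can be pictured with a minimal Python sketch. It is not the product's implementation, which runs on Hive-on-Tez and MapReduce as described above. The `KeyMetadata` class, the `select_candidates` function, the key paths, and the rule of falling back to the creation date when no access was logged are all assumptions made for illustration. The 32 MB and 30-day thresholds come from the example policy.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

MIN_SIZE_BYTES = 32 * 1024 * 1024   # example policy: larger than 32 MB
COLD_AFTER = timedelta(days=30)     # example policy: not accessed in 30 days


@dataclass
class KeyMetadata:
    """Hypothetical per-key record, as it might look after the metadata import."""
    path: str
    size_bytes: int
    created: datetime
    replication_type: str  # "RATIS" or "EC"


def select_candidates(metadata, last_access, now):
    """Join metadata with access statistics and apply the example policy."""
    candidates = []
    for key in metadata:
        # Join step. Assumption: a key with no logged access is aged from its creation date.
        last_touched = last_access.get(key.path, key.created)
        is_cold = now - last_touched >= COLD_AFTER
        is_large = key.size_bytes > MIN_SIZE_BYTES
        # Conversion is one-way, so keys already in EC are skipped.
        if key.replication_type == "RATIS" and is_large and is_cold:
            candidates.append(key)
    return candidates


now = datetime(2024, 6, 1)
mb = 1024 * 1024
keys = [
    KeyMetadata("/vol1/bucket1/big-old", 64 * mb, datetime(2024, 1, 1), "RATIS"),
    KeyMetadata("/vol1/bucket1/big-recent", 64 * mb, datetime(2024, 1, 1), "RATIS"),
    KeyMetadata("/vol1/bucket1/small-old", 1 * mb, datetime(2024, 1, 1), "RATIS"),
    KeyMetadata("/vol1/bucket1/already-ec", 64 * mb, datetime(2024, 1, 1), "EC"),
]
last_access = {
    "/vol1/bucket1/big-old": datetime(2024, 3, 1),
    "/vol1/bucket1/big-recent": datetime(2024, 5, 25),
}

print([k.path for k in select_candidates(keys, last_access, now)])
# prints: ['/vol1/bucket1/big-old']
```

Only the large, cold, RATIS-replicated key is selected. The recently read key, the small key, and the key already stored in EC are left untouched.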