Apache Kudu background maintenance tasks
Kudu relies on running background tasks for many important maintenance activities.
These tasks include flushing data from memory to disk, compacting data to improve performance,
freeing up disk space, and more.
Maintenance manager
The maintenance manager schedules and runs background tasks. At any given point in time, the maintenance manager prioritizes the next task based on the improvements needed at that moment, such as relieving memory pressure, improving read performance, or freeing up disk space.
Flushing data to disk
Flushing data from memory to disk relieves memory pressure and can improve read performance by switching from a write-optimized, row-oriented in-memory format in the MemRowSet to a read-optimized, column-oriented format on disk.
Compacting on-disk data
Kudu constantly performs several compaction tasks in order to maintain consistent read and write performance over time.
Write-ahead log garbage collection
Kudu maintains a write-ahead log (WAL) per tablet that is split into discrete fixed-size segments. A tablet periodically rolls the WAL to a new log segment when the active segment reaches a size threshold (configured by the --log_segment_size_mb property).
Tablet history garbage collection and the ancient history mark
Kudu uses a multiversion concurrency control (MVCC) mechanism to ensure that snapshot scans can proceed isolated from new changes to a table. Because MVCC retains records of old changes to serve those scans, old historical data should be garbage-collected (removed) periodically to free up disk space. While Kudu never removes rows or data that are visible in the latest version of the data, Kudu does remove records of old changes that are no longer visible.
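
The retention rule for tablet history can be illustrated with a minimal Python sketch. This is a simplified model under stated assumptions, not Kudu's implementation: the `Version` type, the `max_age_sec` parameter, and both function names are hypothetical, and the ancient history mark is assumed to be the current time minus a retention window. The sketch keeps every version newer than the mark plus the newest version at or before it, so the latest visible data is never removed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    """One committed version of a row's value (hypothetical model, not a Kudu type)."""
    timestamp: int  # commit time, in seconds
    value: str


def ancient_history_mark(now: int, max_age_sec: int) -> int:
    """Assumed definition: history older than this point is no longer needed."""
    return now - max_age_sec


def collect_garbage(versions: list[Version], ahm: int) -> list[Version]:
    """Return the versions that must be retained, oldest first.

    Keeps every version newer than the mark, plus the newest version at or
    before the mark, because a snapshot scan taken at the mark still reads it.
    """
    ordered = sorted(versions, key=lambda v: v.timestamp)
    at_or_before = [v for v in ordered if v.timestamp <= ahm]
    after = [v for v in ordered if v.timestamp > ahm]
    return at_or_before[-1:] + after


history = [Version(100, "a"), Version(200, "b"), Version(900, "c")]
ahm = ancient_history_mark(now=1000, max_age_sec=500)  # mark is 500
kept = collect_garbage(history, ahm)
print([v.value for v in kept])  # prints ['b', 'c']
```

In this example the version written at time 100 is dropped because the version at time 200 supersedes it for every snapshot at or after the mark, while the latest version ("c") is always retained.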