Introduction
HDFS supports efficient writes of large data sets to durable storage, and also provides reliable access to the data. This works well for batch jobs that write large amounts of persistent data.
Emerging classes of applications are driving use cases for writing smaller amounts of temporary data. Using DataNode memory as storage serves applications that need to write relatively small intermediate data sets with low latency.
Writing block data to memory reduces durability, as data can be lost due to process restart before it is saved to disk. HDFS attempts to save replica data to disk in a timely manner to reduce the window of possible data loss.
DataNode memory is referenced using the RAM_DISK storage type and the LAZY_PERSIST storage policy.
Using DataNode memory as HDFS storage involves the following steps:
1. Shut down the DataNode.
2. Mount a portion of DataNode memory for use by HDFS.
3. Assign the RAM_DISK storage type to the DataNode, and enable short-circuit reads.
4. Set the LAZY_PERSIST storage policy on the HDFS files and directories that will use memory as storage.
5. Restart the DataNode.
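The steps above can be sketched as a small Python helper that only builds the relevant command lines and configuration values; it does not run anything. This is a minimal illustration, not a definitive procedure. The mount point /mnt/dn-tmpfs, the size 16g, the disk directory /grid/0/dn, and the HDFS path /apps/tmp are hypothetical values chosen for the example, and the helper function names are invented here. The sketch assumes a tmpfs mount is used for the memory volume and that the RAM_DISK volume is declared by tagging an entry in dfs.datanode.data.dir.

```python
import shlex


def tmpfs_mount_command(mount_point, size):
    """Command that mounts a tmpfs volume of the given size (assumed approach)."""
    return ["mount", "-t", "tmpfs", "-o", f"size={size}", "tmpfs", mount_point]


def ram_disk_data_dirs(disk_dirs, ram_dir):
    """Value for dfs.datanode.data.dir with one volume tagged as RAM_DISK."""
    return ",".join(list(disk_dirs) + [f"[RAM_DISK]{ram_dir}"])


def set_lazy_persist_command(hdfs_path):
    """Command that applies the LAZY_PERSIST storage policy to a path."""
    return ["hdfs", "storagepolicies", "-setStoragePolicy",
            "-path", hdfs_path, "-policy", "LAZY_PERSIST"]


def render(cmd):
    """Format an argument list as a shell-safe string for display."""
    return " ".join(shlex.quote(arg) for arg in cmd)


if __name__ == "__main__":
    # Hypothetical example values; adjust for a real cluster.
    print(render(tmpfs_mount_command("/mnt/dn-tmpfs", "16g")))
    print("dfs.datanode.data.dir =",
          ram_disk_data_dirs(["/grid/0/dn"], "/mnt/dn-tmpfs"))
    # Short-circuit reads are enabled through this client property.
    print("dfs.client.read.shortcircuit = true")
    print(render(set_lazy_persist_command("/apps/tmp")))
```

Stopping and restarting the DataNode is left out of the sketch because the mechanism depends on how the cluster is managed.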
If you update a storage policy setting on a file or directory, you must use the HDFS mover data migration tool to actually move blocks as specified by the new storage policy.
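As a rough illustration of that two-part sequence, the following Python sketch builds the command that changes the policy and the mover command that then migrates the blocks. The path /apps/tmp and the function names are hypothetical, and the sketch only prints the commands rather than executing them.

```python
def set_policy_command(hdfs_path, policy):
    """Command that records a new storage policy on a file or directory."""
    return ["hdfs", "storagepolicies", "-setStoragePolicy",
            "-path", hdfs_path, "-policy", policy]


def mover_command(hdfs_paths):
    """Command that runs the HDFS mover over the given paths."""
    if not hdfs_paths:
        raise ValueError("at least one HDFS path is required")
    return ["hdfs", "mover", "-p", *hdfs_paths]


def policy_change_plan(hdfs_path, policy):
    """Ordered commands: set the policy first, then migrate existing blocks."""
    return [set_policy_command(hdfs_path, policy), mover_command([hdfs_path])]


if __name__ == "__main__":
    # Hypothetical path; the policy name comes from the text above.
    for cmd in policy_change_plan("/apps/tmp", "LAZY_PERSIST"):
        print(" ".join(cmd))
```

Setting the policy alone changes only metadata, which is why the plan always ends with the mover step.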
Memory as storage is one aspect of the YARN resource management capabilities, which include CPU scheduling, CGroups, node labels, and archival storage.