Hadoop archive components
You can use the Hadoop archiving tool to create Hadoop Archives (HAR). The Hadoop Archive is integrated with the Hadoop file system interface. Files in a HAR are exposed transparently to users. File data in a HAR is stored in multipart files, which are indexed to retain the original separation of data.
Hadoop archiving tool
Hadoop Archives can be created using the Hadoop archiving tool. The archiving tool uses MapReduce to efficiently create Hadoop Archives in parallel. The tool can be invoked using the command:
hadoop archive -archiveName name -p <parent> <src>* <dest>
A list of files is generated by traversing the source directories recursively, and then the list is split into map task inputs. Each map task creates a part file (about 2 GB, configurable) from a subset of the source files and outputs the metadata. Finally, a reduce task collects metadata and generates the index files.
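The planning step described above can be pictured with a small Python sketch. It is a toy model under stated assumptions, not the archiving tool's implementation: the real tool runs as a MapReduce job, and the names IndexEntry and plan_archive, the tiny part_size, and the exact packing rule are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class IndexEntry:
    """Hypothetical metadata record for one archived file."""
    path: str    # original path of the file
    part: str    # part file that holds its bytes
    offset: int  # byte offset of the file inside that part
    length: int  # file size in bytes


def plan_archive(files, part_size):
    """Pack (path, size) pairs into parts and return one index entry per file.

    Assumed rule: start a new part when the next file would push the
    current part past part_size. A file larger than part_size gets a
    part of its own.
    """
    entries = []
    part_no, used = 1, 0
    for path, size in files:
        if used > 0 and used + size > part_size:
            part_no += 1
            used = 0
        entries.append(IndexEntry(path, f"part-{part_no}", used, size))
        used += size
    return entries


# Sizes are in bytes and deliberately small; the real default is about 2 GB.
files = [("foo/file-1", 700), ("foo/file-2", 900), ("foo/sub/file-3", 600)]
index = plan_archive(files, part_size=2000)
for entry in index:
    print(entry)
```

In this sketch the first two files share part-1 and the third starts part-2, which mirrors how each map task writes one part file while the collected entries become the index.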
HAR file system
Most archival systems, such as tar, are tools for archiving and de-archiving. They generally sit outside the file system layer, so they are not transparent to the application writer: an archive must be expanded before its contents can be used.
The Hadoop Archive is integrated with the Hadoop file system interface. The HarFileSystem implements the FileSystem interface and provides access via the har:// scheme. This exposes the archived files and directory tree structures transparently to users. Files in a HAR can be accessed directly without expanding them.
For example, the following command copies an HDFS file to a local directory:
hdfs dfs -get hdfs://namenode/foo/file-1 localdir
Suppose a Hadoop Archive bar.har is created from the foo directory. With the HAR, the command to copy the original file becomes:
hdfs dfs -get har://namenode/bar.har/foo/file-1 localdir
Users only need to change the URI paths. Alternatively, users may choose to create a symbolic link (from hdfs://namenode/foo to har://namenode/bar.har/foo in the example above), and then even the URIs do not need to be changed. In either case, HarFileSystem will be invoked automatically to provide access to the files in the HAR. Because of this transparent layer, HAR is compatible with the Hadoop APIs, MapReduce, the FS shell command-line interface, and higher-level applications such as Pig, Zebra, Streaming, Pipes, and DistCp.
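The URI change in the example above is mechanical, which a short Python sketch can make concrete. The helper to_har_uri is hypothetical and is not part of Hadoop. It assumes the archive sits at the file system root and simply reproduces the URI shape used in this example; the authority format that a particular Hadoop release accepts is not verified here.

```python
from urllib.parse import urlsplit


def to_har_uri(hdfs_uri, archive_name):
    """Rewrite an hdfs:// URI into the har:// form used in the example.

    Hypothetical helper: assumes the archive is stored at the root of the
    file system and that the original path is preserved beneath it.
    """
    parts = urlsplit(hdfs_uri)
    if parts.scheme != "hdfs":
        raise ValueError(f"expected an hdfs:// URI, got {hdfs_uri!r}")
    # parts.path keeps its leading slash, e.g. "/foo/file-1".
    return f"har://{parts.netloc}/{archive_name}{parts.path}"


print(to_har_uri("hdfs://namenode/foo/file-1", "bar.har"))
```

Only the scheme and the archive name are added; the host and the path of the original file carry over unchanged.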
HAR format data model
The Hadoop Archive data format has the following layout:
foo.har/_masterindex //stores hashes and offsets
foo.har/_index //stores file statuses
foo.har/part-[1..n] //stores actual file data
The file data is stored in multipart files, which are indexed in order to retain the original separation of data. Moreover, the file parts can be accessed in parallel by MapReduce programs. The index files also record the original directory tree structures and file status.
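To illustrate how an index can preserve the original separation of data, the following Python sketch keeps file contents back to back in in-memory "part files" and recovers each file through an index lookup. This is a simplified model: the dictionaries, the (part, offset, length) tuple, and the function names are assumptions for illustration and do not reflect the on-disk format of _index or _masterindex.

```python
# Toy stand-ins for part files: the bytes of several files stored back to back.
parts = {
    "part-1": b"alphabravo",
    "part-2": b"charlie",
}

# Toy index (assumed shape): original path -> (part name, offset, length).
index = {
    "foo/file-1": ("part-1", 0, 5),
    "foo/file-2": ("part-1", 5, 5),
    "foo/sub/file-3": ("part-2", 0, 7),
}


def read_archived(path):
    """Return the bytes of one archived file without unpacking the archive."""
    part, offset, length = index[path]
    return parts[part][offset:offset + length]


def list_dir(prefix):
    """List archived paths under a directory, recovered from the index alone."""
    base = prefix.rstrip("/") + "/"
    return sorted(p for p in index if p.startswith(base))


print(read_archived("foo/file-2"))
print(list_dir("foo/sub"))
```

Because each lookup touches only one slice of one part, readers working on different files are independent of each other, which is what allows the parts to be processed in parallel.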