Creating a Hadoop Archive

The Hadoop archiving tool can be invoked using the following command:

hadoop archive -archiveName name -p <parent> <src>* <dest>

Here, -archiveName specifies the name of the archive to create; the name should end with a .har extension. The -p <parent> argument specifies the parent path to which each <src> path is relative. <src>* is one or more source paths under <parent>, and <dest> is the HDFS directory in which the archive is created.

Example

hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo

This example creates an archive using /user/hadoop as the relative archive directory. The directories /user/hadoop/dir1 and /user/hadoop/dir2 will be archived in the /user/zoo/foo.har archive.
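To make the path resolution concrete, the following sketch prints the absolute HDFS paths that the example command would archive. The function name har_sources is hypothetical and is not part of Hadoop; it only joins strings and does not contact HDFS.

```shell
# har_sources is a hypothetical helper: it joins the -p parent with each
# relative source path, mirroring how the example's sources are resolved.
# Usage: har_sources <parent> <src>...
har_sources() {
    parent=${1%/}   # drop a trailing slash, if any
    shift
    for src in "$@"; do
        printf '%s/%s\n' "$parent" "$src"
    done
}

har_sources /user/hadoop dir1 dir2
# prints:
# /user/hadoop/dir1
# /user/hadoop/dir2
```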

Archiving does not delete the source files. To reduce namespace usage after creating an archive, you must delete the source files manually.
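A manual cleanup could be sketched as follows. The function delete_archived_sources is a hypothetical helper, not part of Hadoop, and the sketch assumes that the finished archive appears as a directory in HDFS. It uses only hdfs dfs -test -d and hdfs dfs -rm -r, and it leaves the sources in place if the archive is not found.

```shell
# delete_archived_sources is a hypothetical helper, not part of Hadoop.
# Usage: delete_archived_sources <archive-path> <source-path>...
delete_archived_sources() {
    archive=$1
    shift
    # Refuse to delete anything unless the archive directory exists in HDFS
    # (assumption: the .har archive is visible as a directory).
    if ! hdfs dfs -test -d "$archive"; then
        echo "archive $archive not found; sources left in place" >&2
        return 1
    fi
    # Recursively remove the original source directories.
    hdfs dfs -rm -r "$@"
}
```

For the example above, the call would be delete_archived_sources /user/zoo/foo.har /user/hadoop/dir1 /user/hadoop/dir2. Checking only that the archive exists does not prove that its contents are complete, so inspect the archive before deleting anything you cannot recreate.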

Although the hadoop archive command can be run from the host file system, the archive file is created in the HDFS file system from directories that exist in HDFS. If you reference a directory on the host file system rather than in HDFS, you will get the following error:

The resolved paths set is empty. Please check whether the srcPaths exist, where srcPaths = [</directory/path>]
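One way to catch this before running the archive command is to test each source path in HDFS first. The sketch below is an illustration only: check_har_sources is a hypothetical helper, not part of Hadoop, and it relies on hdfs dfs -test -e returning a non-zero status for a path that does not exist in HDFS.

```shell
# check_har_sources is a hypothetical helper, not part of Hadoop.
# Usage: check_har_sources <parent> <src>...
# Returns 0 if every <parent>/<src> exists in HDFS, 1 otherwise.
check_har_sources() {
    parent=${1%/}   # drop a trailing slash, if any
    shift
    missing=0
    for src in "$@"; do
        # -test -e succeeds only if the path exists in HDFS.
        if ! hdfs dfs -test -e "$parent/$src"; then
            echo "not found in HDFS: $parent/$src" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For the earlier example, check_har_sources /user/hadoop dir1 dir2 would report any source directory that exists only on the host file system or not at all.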

To create the HDFS directories used in the preceding example, use the following series of commands:

hdfs dfs -mkdir /user/zoo
hdfs dfs -mkdir /user/hadoop
hdfs dfs -mkdir /user/hadoop/dir1
hdfs dfs -mkdir /user/hadoop/dir2
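As a shorter alternative, the same directories can be created with a single command by using the -p option of hdfs dfs -mkdir, which creates missing parent directories along the path. The wrapper name make_example_dirs is hypothetical and is used here only to group the call.

```shell
# make_example_dirs is a hypothetical wrapper around the example's mkdir calls.
make_example_dirs() {
    # -p creates missing parent directories, so /user/hadoop does not
    # need to be created in a separate step.
    hdfs dfs -mkdir -p /user/zoo /user/hadoop/dir1 /user/hadoop/dir2
}
```

Running make_example_dirs once should leave the same four directories in HDFS as the series of commands above.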
