Creating a Hadoop Archive
The Hadoop archiving tool can be invoked using the following command:
hadoop archive -archiveName name -p <parent> <src>* <dest>
where -archiveName specifies the name of the archive to create. The archive name should have a .har extension. The <parent> argument specifies the parent directory against which the <src> paths are resolved; files are stored in the HAR under their paths relative to <parent>. The <dest> argument is the HDFS directory in which the archive is created.
Example
hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo
This example creates an archive using /user/hadoop as the parent directory. The directories /user/hadoop/dir1 and /user/hadoop/dir2 are archived in the /user/zoo/foo.har archive.
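The way the arguments combine can be sketched in plain shell. The har_paths function below is a hypothetical helper, not part of Hadoop: it only prints the HDFS paths that the example command would read and create, and it does not contact a cluster.

```shell
#!/bin/sh
# Hypothetical helper (not a Hadoop command): shows how the arguments of
#   hadoop archive -archiveName NAME -p PARENT SRC... DEST
# combine into source paths and the archive path.
har_paths() {
    name=$1
    parent=$2
    shift 2
    # Every remaining argument except the last is a source, relative to PARENT.
    while [ "$#" -gt 1 ]; do
        echo "source: ${parent%/}/$1"
        shift
    done
    # The last argument is the destination directory that will hold the archive.
    echo "archive: ${1%/}/$name"
}

har_paths foo.har /user/hadoop dir1 dir2 /user/zoo
# source: /user/hadoop/dir1
# source: /user/hadoop/dir2
# archive: /user/zoo/foo.har
```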
Archiving does not delete the source files. If you want to delete the input files after creating an archive, to reduce namespace usage, you must delete the source files manually.
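One possible way to script the manual cleanup is sketched below. The function name delete_sources_if_archived is hypothetical. The check is deliberately minimal: it only confirms that the archive can be listed through the har:// scheme and does not compare its contents with the sources.

```shell
#!/bin/sh
# Hypothetical helper: delete the source paths only if the archive is listable.
# Usage: delete_sources_if_archived ARCHIVE_PATH SRC_PATH...
delete_sources_if_archived() {
    archive=$1
    shift
    # Listing through the har:// scheme fails if the archive is missing or unreadable.
    if hdfs dfs -ls "har://$archive" >/dev/null; then
        hdfs dfs -rm -r "$@"
    else
        echo "archive $archive is not readable; sources kept" >&2
        return 1
    fi
}
```

For the preceding example, the call would be delete_sources_if_archived /user/zoo/foo.har /user/hadoop/dir1 /user/hadoop/dir2.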
Although the hadoop archive command can be run from the host file system, the archive file is created in the HDFS file system from directories that exist in HDFS. If you reference a directory on the host file system rather than in HDFS, you will get the following error:
The resolved paths set is empty. Please check whether the srcPaths exist, where srcPaths = [</directory/path>]
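To catch this situation before running the archive job, the source paths can be checked in HDFS first. The following is a minimal sketch; check_hdfs_sources is a hypothetical helper name, and it relies only on the hdfs dfs -test -e existence check.

```shell
#!/bin/sh
# Hypothetical helper: verify that each source exists in HDFS under PARENT.
# Usage: check_hdfs_sources PARENT SRC...
check_hdfs_sources() {
    parent=$1
    shift
    missing=0
    for src in "$@"; do
        # hdfs dfs -test -e returns 0 only if the path exists in HDFS.
        if ! hdfs dfs -test -e "${parent%/}/$src"; then
            echo "not found in HDFS: ${parent%/}/$src" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For the preceding example, check_hdfs_sources /user/hadoop dir1 dir2 returns a nonzero status and names each missing path if either directory is absent from HDFS.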
To create the HDFS directories used in the preceding example, use the following series of commands:
hdfs dfs -mkdir /user/zoo
hdfs dfs -mkdir /user/hadoop
hdfs dfs -mkdir /user/hadoop/dir1
hdfs dfs -mkdir /user/hadoop/dir2
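As an alternative sketch, the same directories can be created in one command with the -p option, which also creates missing parent directories such as /user and does not fail if a directory already exists. The wrapper function make_example_dirs is a hypothetical name used only to keep the example self-contained.

```shell
#!/bin/sh
# Hypothetical wrapper: create the example directories in a single call.
make_example_dirs() {
    # -p creates any missing parents, so /user/hadoop needs no separate command.
    hdfs dfs -mkdir -p /user/zoo /user/hadoop/dir1 /user/hadoop/dir2
}
```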