Copying data with Hadoop DistCp
DistCp (distributed copy) is a tool for large inter-cluster and intra-cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which copies a partition of the files specified in the source list.
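To make the idea of "a partition of the files per map task" concrete, here is a deliberately simplified shell sketch. It is not how DistCp is implemented: it assigns paths to tasks round-robin, whereas DistCp's default strategy aims to give each map roughly the same number of bytes. The function name `partition_round_robin` and the sample paths are hypothetical.

```shell
# Toy illustration only: split a source list across N "map tasks".
# Reads one path per line on stdin and prints "<task index><TAB><path>".
# Round-robin is an assumption made for simplicity; it is not DistCp's
# actual partitioning strategy.
partition_round_robin() {
  awk -v n="$1" '{ printf "%d\t%s\n", (NR - 1) % n, $0 }'
}

# Three hypothetical source paths divided between two tasks.
printf '%s\n' /myDir/a.txt /myDir/b.txt /myDir/c.txt | partition_round_robin 2
```

Each task index in the first column corresponds to one map task, which would copy only the paths listed next to it.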
The following examples show DistCp commands used with object stores:
- Copying between directories in an object store
  ~ hadoop distcp abfs://abfscontainer@abfstorageacc.dfs.core.windows.net/myDir/testingFile.txt \
      abfs://abfscontainer@abfstorageacc.dfs.core.windows.net/test/
  20/05/21 08:48:09 INFO mapreduce.Job: Job job_1589987399184_0005 completed successfully
  ~ hadoop fs -ls abfs://abfscontainer@abfstorageacc.dfs.core.windows.net/test/
  Found 1 items
  -rw-r--r-- 1 hdfs hdfs 41 2020-05-21 08:48 abfs://abfscontainer@abfstorageacc.dfs.core.windows.net/test/testingFile.txt
- Copying between two different object stores
  ~ hadoop distcp abfs://abfscontainer@abfstorageacc.dfs.core.windows.net/myDir/testingFile.txt \
      abfs://mycontainer@mystoragehastoexist.dfs.core.windows.net/newtest/
  20/05/21 08:53:26 INFO mapreduce.Job: Job job_1589987399184_0007 completed successfully
  ~ hadoop fs -ls abfs://mycontainer@mystoragehastoexist.dfs.core.windows.net/newtest/
  Found 1 items
  -rw-r--r-- 1 hdfs hdfs 41 2020-05-21 08:53 abfs://mycontainer@mystoragehastoexist.dfs.core.windows.net/newtest/testingFile.txt
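The copy-then-list pattern used in both examples can be wrapped in a small script. The following is a minimal sketch, not an official utility: the helper names `run` and `distcp_and_list` and the `DRY_RUN` variable are hypothetical, and the default of 4 map tasks is an arbitrary choice. The `-m` option is DistCp's setting for the maximum number of simultaneous copies. With `DRY_RUN` left at its default of 1, the script only prints the commands it would run.

```shell
# Print the command when DRY_RUN=1 (the default in this sketch),
# otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

# Hypothetical helper, not part of Hadoop.
# $1 = source URI, $2 = destination URI,
# $3 = maximum number of map tasks (optional; 4 is an arbitrary default).
distcp_and_list() {
  run hadoop distcp -m "${3:-4}" "$1" "$2" || return 1
  # List the destination to confirm that the file arrived.
  run hadoop fs -ls "$2"
}

distcp_and_list \
  "abfs://abfscontainer@abfstorageacc.dfs.core.windows.net/myDir/testingFile.txt" \
  "abfs://abfscontainer@abfstorageacc.dfs.core.windows.net/test/"
```

To perform the copy for real, set `DRY_RUN=0` on a host where the `hadoop` client is configured for the storage accounts involved.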
For more information about DistCp commands, see the DistCp documentation.