Moving data from HDFS to Ozone using the distcp command

Use the hadoop distcp command to copy content from the source HDFS cluster to Ozone.

You must consider the following before running the distcp command:
  • Execute the distcp command from the destination cluster.
  • Ensure that the distcp user can run a MapReduce job on YARN. If the user cannot, adjust the following YARN configurations so that the distcp user is permitted to submit jobs:
    • allowed.system.users
    • banned.users
    • min.user.id
  • If the source directories have a high file count, you can create a manual copy listing as specified in the following example.
    > hdfs dfs -ls hdfs://<hdfs-nameservice>/user/john.doe/application1/* > src_files

    You can then read the copy listing file and submit its entries, one at a time, as input to a distcp job.
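
Reading the copy listing and submitting its entries one at a time can be sketched as a small shell loop. This is a minimal sketch under stated assumptions, not a supported tool: it assumes that src_files was produced by hdfs dfs -ls as shown above, that paths contain no whitespace, and that each source path is copied to the same relative path under the destination root. The names SRC_ROOT, DST_ROOT, DRY_RUN, listing_paths, and copy_listing are illustrative and not part of distcp. How distcp places the copied data depends on whether the target already exists, so review the printed commands before running them.

```shell
# Sketch only: one distcp job per entry of a copy listing.
# SRC_ROOT, DST_ROOT and DRY_RUN are illustrative names (assumptions).
SRC_ROOT="${SRC_ROOT:-hdfs://<hdfs-nameservice>/user/john.doe}"
DST_ROOT="${DST_ROOT:-ofs://<ozone.service.id>/user/john.doe}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands instead of running them

# Print the path column (last field) of `hdfs dfs -ls` output.
# "Found N items" header lines have fewer than 8 fields and are skipped.
listing_paths() {
  awk 'NF >= 8 { print $NF }' "$1"
}

# Submit every listed path to its own distcp job, keeping the relative path.
copy_listing() {
  listing_paths "$1" | while IFS= read -r src; do
    rel=${src#"$SRC_ROOT"}          # strip the source root prefix
    dst="${DST_ROOT}${rel}"
    if [ "$DRY_RUN" = "1" ]; then
      echo "hadoop distcp -direct $src $dst"
    else
      # </dev/null keeps the job from consuming the loop's input
      hadoop distcp -direct "$src" "$dst" </dev/null || echo "FAILED: $src" >&2
    fi
  done
}

# Run only when a listing file is passed as the first argument.
if [ "$#" -gt 0 ]; then copy_listing "$1"; fi
```

Saved as a script, the sketch would first be invoked with the listing file as its only argument to print the planned commands, and then with DRY_RUN=0 in the environment to submit the jobs.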

For example, to transfer the data of user john.doe from the /user/john.doe/application1/ directory to Ozone, run the distcp command as follows:
> hadoop distcp -direct hdfs://<hdfs-nameservice>/user/john.doe/application1 ofs://<ozone.service.id>/user/john.doe/
As another example, to copy the /tmp directory of a Kerberized CDP HDFS cluster ns1 to the directory /dst in bucket b1 of volume v1 on a Kerberized CDP Ozone cluster ozone1, run the following command on a host in the destination cluster:
hadoop distcp \
-Ddfs.checksum.combine.mode=COMPOSITE_CRC \
-Dozone.client.checksum.type=CRC32C \
-Dozone.om.kerberos.principal.pattern='*' \
 hdfs://ns1/tmp/ \
 ofs://ozone1707264383/v1/b1/dst/
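
The options in the command above can be read more easily in an annotated form. The following is a sketch of the same command wrapped in a hypothetical function, distcp_kerberized, with a DRY_RUN switch so that the command line can be inspected before it is submitted. Both names are illustrative and not part of distcp. The comments describe the usual purpose of each property and should be confirmed against the documentation for your release.

```shell
# Hypothetical wrapper (assumption): DRY_RUN=1, the default, only prints the command.
DRY_RUN="${DRY_RUN:-1}"

distcp_kerberized() {
  src="$1"   # source, for example hdfs://ns1/tmp/
  dst="$2"   # target, for example ofs://<ozone.service.id>/v1/b1/dst/

  # dfs.checksum.combine.mode=COMPOSITE_CRC
  #   HDFS reports a file-level CRC that does not depend on block layout,
  #   so it can be compared with a checksum from another file system.
  # ozone.client.checksum.type=CRC32C
  #   Ozone writes CRC32C checksums, the same CRC family HDFS uses by default.
  # ozone.om.kerberos.principal.pattern=*
  #   The client accepts any Ozone Manager principal; quoted so that the
  #   shell does not expand the asterisk.
  set -- hadoop distcp \
    -Ddfs.checksum.combine.mode=COMPOSITE_CRC \
    -Dozone.client.checksum.type=CRC32C \
    "-Dozone.om.kerberos.principal.pattern=*" \
    "$src" "$dst"

  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"      # show the assembled command line
  else
    "$@"           # submit the distcp job
  fi
}
```

Calling distcp_kerberized with the source and target URIs prints the assembled command by default; setting DRY_RUN=0 submits it.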