Migration prerequisites

Review the prerequisites for migrating HDFS data from HDP to CDP.

You must meet the following general prerequisites before starting the migration process:

  • You can log in as the “hdfs” user to Ambari in the HDP cluster.

  • Set up line-of-sight from the HDP cluster to the AWS endpoint.

  • Ensure adequate network connectivity between the HDP cluster and the AWS S3 endpoint.

  • You can add and configure cloud credentials in one of two ways:

    • Add them to the core-site.xml configuration.

    • Provide them as runtime parameters when running the job (for example, a DistCp job or a Spark job).

  • Make sure the HDFS service ports are configured.

  • Because HDFS data is migrated to CDP SaaS by pushing it directly to S3, the key network requirement is that the YARN / HDFS nodes have network access to the AWS S3 endpoints (port 80, or port 443 for TLS).

  • Allocate compute resources to run the YARN (DistCp) jobs that transfer the HDFS data to the S3 bucket.

  • Any user who runs the DistCp job must not be a banned user and must be part of a supergroup on each host in the cluster.

  • Because there is no superuser on each host, the administrator must unban the “hdfs” user in the HDP cluster. For more information, see Unbanning hdfs user in HDP cluster.
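The network prerequisite above (access from the YARN / HDFS nodes to the AWS S3 endpoints on port 80 or 443) can be spot-checked from a cluster node before any job is submitted. The following is a minimal sketch, not an official tool: the function names are hypothetical, and the endpoint shown in the usage comment (s3.us-east-1.amazonaws.com) is an assumption that should be replaced with the endpoint for the region of your bucket.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


def check_endpoint(host: str, ports=(443, 80), timeout: float = 5.0) -> dict:
    """Map each port to True/False depending on whether it is reachable."""
    return {port: can_reach(host, port, timeout) for port in ports}


# Example usage (assumed endpoint; run on each YARN / HDFS node):
#   print(check_endpoint("s3.us-east-1.amazonaws.com"))
```

A successful TCP connection only shows that the port is open along the network path. It does not validate credentials or bucket permissions, so treat it as a first check rather than proof that the migration will succeed.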

After you meet the migration prerequisites, set up security for communications between the HDP cluster and the destination cloud services.

The Hadoop DistCp tool is used to migrate the HDFS data (and metadata) from an on-premises HDP cluster to the cloud storage (AWS S3) used in the CDP environment.
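To illustrate the option of providing cloud credentials as runtime parameters, the sketch below assembles a DistCp command line that passes the S3A access key and secret key as -D properties. It assumes the S3A connector and its standard property names (fs.s3a.access.key and fs.s3a.secret.key), which are the same names you would use in core-site.xml. The helper name, NameNode address, bucket name, and key values are hypothetical placeholders.

```python
import shlex


def build_distcp_command(source, target, access_key, secret_key, extra_args=()):
    """Build the argument list for a DistCp run with S3A credentials
    supplied at run time instead of in core-site.xml."""
    return [
        "hadoop", "distcp",
        # Generic -D options must come before DistCp-specific options.
        f"-Dfs.s3a.access.key={access_key}",
        f"-Dfs.s3a.secret.key={secret_key}",
        *extra_args,
        source,   # for example, an hdfs:// path on the HDP cluster
        target,   # for example, an s3a:// path in the CDP environment
    ]


# Hypothetical values for illustration only.
cmd = build_distcp_command(
    "hdfs://namenode.example.com:8020/data/warehouse",
    "s3a://example-bucket/data/warehouse",
    access_key="EXAMPLE_ACCESS_KEY",
    secret_key="EXAMPLE_SECRET_KEY",
    extra_args=["-update"],
)
print(shlex.join(cmd))
```

The printed command can be run as the user who meets the prerequisites above, or passed as a list to subprocess.run. Keys given on the command line can be visible in process listings and shell history, so this form may be better suited to testing than to long-running production use.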