Migrating Projects

You can migrate individual projects from CDSW 1.10.x to Cloudera AI in both cloud and on premises environments.

The Cloudera AI command-line utility seamlessly migrates projects across different environments, such as from CDSW to Cloudera AI. The migration includes project files stored on NFS, project settings, models, jobs, and applications. The utility currently supports migration from CDSW 1.10 to Cloudera AI public cloud version 2.0.40 onwards, as well as from CDSW 1.10 to Cloudera AI on premises versions 2.0.39 (1.5.1-CHF1) onwards.

Project migration requires a third machine (user laptop or bastion host) with connectivity to both CDSW and Cloudera AI. The project is first downloaded from CDSW to this intermediate machine using the export command, and then it is uploaded to the Cloudera AI Workbench using the import command. The utility uses the cdswctl client for login and creation of an ssh session and tunnel. It uses rsync for migration of project files through this tunnel. Project artifacts (models/jobs/applications) are migrated using APIs.

Authentication is carried out using the API key provided during migration, and only authorized users are allowed to migrate projects. Data in transit remains encrypted as long as the workbench is accessed over an HTTPS connection.

Prerequisites

A third machine (user laptop or bastion host) shall have the following configuration:

Unix-like system (macOS or Linux).
Connectivity to both CDSW and Cloudera AI.
Rsync should be installed.
Python version >=3.10.
Sufficient disk space to hold project contents. For stronger security, the disk and/or file system should be encrypted.
Set up custom CA certificates if required (Check FAQ section).
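
The machine requirements above can be checked before starting. The following is a minimal sketch, not part of the utility: the function name `check_prerequisites` and the free-space threshold are illustrative assumptions, and it covers only the operating system, Python version, rsync, and disk space items.

```python
import platform
import shutil
import sys
from pathlib import Path


def check_prerequisites(output_dir: Path, required_free_bytes: int) -> list[str]:
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    # Unix-like system (macOS reports "Darwin").
    if platform.system() not in ("Linux", "Darwin"):
        problems.append(f"unsupported operating system: {platform.system()}")
    # Python version >= 3.10.
    if sys.version_info < (3, 10):
        problems.append("Python 3.10 or later is required")
    # rsync must be installed and reachable on PATH.
    if shutil.which("rsync") is None:
        problems.append("rsync was not found on PATH")
    # Enough free space on the file system that will hold the project contents.
    free = shutil.disk_usage(output_dir).free
    if free < required_free_bytes:
        problems.append(f"only {free} bytes free in {output_dir}")
    return problems


if __name__ == "__main__":
    # 10 GiB is an arbitrary placeholder; size it to the projects being migrated.
    for problem in check_prerequisites(Path.home(), 10 * 1024**3):
        print("WARNING:", problem)
```

Connectivity to both workbenches and custom CA certificates are not covered by this sketch and still need to be verified separately.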

Migration prerequisites:

All custom runtimes shall be configured on the target workbench prior to running project migration.
All the users/collaborators/teams should be migrated to the target workbench, and appropriate access should be provided to these users.
The user migrating the project should be an Admin/Owner/Collaborator of the project. Note that the user migrating the project will become owner of the project in the target Cloudera AI Workbench.
Migration of projects created in a team context is supported, provided the team is already created in the target workbench. Each team member has the right to migrate such projects.
Please note that settings or configurations outside the project are not migrated. These include:
- User quota and custom quota
- Security configurations
- SMTP settings
- Public ssh keys
- Ephemeral storage settings
- Kerberos configuration
- Environment variables
- Engine/Runtime configuration

The steps enumerated below apply to both the source and target workbenches:

Make sure that you have an rsync-enabled runtime image (cloudera/ml-runtime-workbench-python3.9-standard-rsync) added to your Cloudera AI Workbench runtime catalog. The image is hosted on DockerHub. If the rsync image is not readily accessible, it can be built from a Dockerfile hosted here and pushed to any registry. If not, ask your workbench admin to prepare the Dockerfile.
Make sure that your default ssh public keys are added under your user settings. If not, add them in User Settings > Remote Editing > SSH public keys for session access. The default ssh public key is available on your machine at ~/.ssh/<something>.pub. The SSH public key must be uploaded to both the source (CDSW or Cloudera AI) and target workbenches.
If an ssh key pair does not already exist in your system, create one.
It is recommended to avoid setting a passphrase for the ssh key, because there are multiple instances in which an ssh connection is established. If a passphrase is set, automation using the utility would be tedious.
The Legacy API key must be available and noted down. You will need this API key during migration. To generate a new key, head over to User Settings > API Keys > Legacy API key.
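
To confirm that a public key exists on the intermediate machine before uploading it, something like the following sketch could be used. The helper name `find_public_keys` is hypothetical, and the sketch only lists files; it does not check whether a key is protected by a passphrase.

```python
from pathlib import Path


def find_public_keys(ssh_dir: Path) -> list[Path]:
    """Return the *.pub files in ssh_dir sorted by name (empty if the directory is missing)."""
    if not ssh_dir.is_dir():
        return []
    return sorted(ssh_dir.glob("*.pub"))


if __name__ == "__main__":
    keys = find_public_keys(Path.home() / ".ssh")
    if not keys:
        print("No public key found in ~/.ssh; create an ssh key pair first.")
    for key in keys:
        print(key)
```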

Legacy engine migration

Projects using the legacy engine can be migrated to engine-based projects by removing the legacy engine to ML Runtime mapping in the cmlutils/constants.py file. To do this, LEGACY_ENGINE_MAP in the constants.py file shall be an empty map. The workloads that use the legacy engine or custom engine images will be migrated to the default engine image in the destination cluster.
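
As a sketch of that edit, assuming the map is a plain Python dictionary and leaving the rest of the file untouched, the relevant line in constants.py would look roughly like this:

```python
# cmlutils/constants.py (excerpt; all other contents of the file stay as they are)

# Empty map: no legacy engine is mapped to an ML Runtime.
LEGACY_ENGINE_MAP = {}
```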

Steps for project migration

Install the utility on a third machine or bastion host: python3 -m pip install git+https://github.com/cloudera/cmlutils@main
To export the project, create the export-config.ini file inside the <home-dir>/.cmlutils directory.
Inside the export-config.ini file, create a section for each project, where you can include project-specific configurations. For common configurations shared across projects, place them in the DEFAULT section.
```
[DEFAULT]
url=<Source-Workspace-url>
output_dir=~/Documents/temp_dir
ca_path=~/Documents/custom-ca-source.pem
       
[Project-A]
username=user-1
apiv1_key=umxma76ilel6pgm36zacrx2bywakflvz
       
[Project-B]
username=user-2
apiv1_key=fmxma76ilel6pgm36zacrx2bywaklopq
```
Configuration used:
- username: username of the user who is migrating the project. (Mandatory)
- url: Source workbench URL (Mandatory)
- apiv1_key: Source API v1/Legacy API key (Mandatory)
- output_dir: temporary directory on the local machine where the project data/metadata would be stored. (Mandatory)
- ca_path: path to a CA (Certificate Authority) bundle to use if Python cannot pick up the CA certificates from the system and SSL certificate verification fails. This issue is generally seen on macOS. (Optional)
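
To illustrate how values in the DEFAULT section combine with project sections, here is a minimal sketch using Python's standard `configparser` module. Whether the utility itself parses the file this way is an assumption; the helper `missing_keys`, the example URL, and the placeholder key are hypothetical.

```python
import configparser

MANDATORY_KEYS = ("username", "url", "apiv1_key", "output_dir")


def missing_keys(config_text: str) -> dict[str, list[str]]:
    """Map each project section to the mandatory keys it lacks after DEFAULT inheritance."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    result = {}
    for project in parser.sections():  # sections() never includes DEFAULT
        section = parser[project]
        missing = [key for key in MANDATORY_KEYS if not section.get(key)]
        if missing:
            result[project] = missing
    return result


EXAMPLE = """
[DEFAULT]
url=https://cdsw.example.com
output_dir=~/Documents/temp_dir

[Project-A]
username=user-1
apiv1_key=placeholder-key

[Project-B]
username=user-2
"""

print(missing_keys(EXAMPLE))  # {'Project-B': ['apiv1_key']}
```

Project-A inherits url and output_dir from DEFAULT, so only Project-B, which has no apiv1_key anywhere, is reported.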
If you wish to skip certain files or directories during export, create an .exportignore file at the root of the CDSW project (/home/cdsw). The .exportignore file follows the same semantics as .gitignore.
Run the following project export command
```
cmlutil project export -p "Project-A"
```
or
```
cmlutil project export -p "Project-B"
```
Note:
The project name shall match one of the section names in the export-config.ini file.
A folder with the project name is created inside the output directory (~/Documents/temp_dir). If the project folder already exists, then the data is overwritten.
All the project files, artifacts and logs corresponding to the project are downloaded in the project folder.
Create the import-config.ini file inside the <home-dir>/.cmlutils directory.
Inside the import-config.ini file, create a section for each project, where you can include project-specific configurations. Place common configurations shared across projects in the DEFAULT section.
Example file:
```
[DEFAULT]
url=<Destination-Workspace-url>
output_dir=~/Documents/temp_dir
ca_path=~/Documents/custom-ca-target.pem
       
[Project-A]
username=user-1
apiv1_key=abcma76ilel6pgm36zacrx2bywakflvz
       
[Project-B]
username=user-2
apiv1_key=xyzma76ilel6pgm36zacrx2bywaklopq
```
Configuration used:
- username: username of the user who is migrating the project. (Mandatory)
- url: Target workbench URL (Mandatory)
- apiv1_key: Target API v1/Legacy API key (Mandatory)
- output_dir: temporary directory on the local machine from where the project will be uploaded. (Mandatory)
- ca_path: path to a CA (Certificate Authority) bundle to use if Python cannot pick up the CA certificates from the system and SSL certificate verification fails. This issue is generally seen on macOS. (Optional)
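
Because the import uploads what the export downloaded, it can help to check that the two configuration files agree before running the import. The sketch below assumes that each project should have a section in both files and that output_dir should resolve to the same directory in both; the function names are hypothetical and not part of the utility.

```python
import configparser


def load_projects(config_text: str) -> dict[str, str]:
    """Return {project section name: resolved output_dir} for one config file."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return {name: parser[name].get("output_dir", "") for name in parser.sections()}


def compare_configs(export_text: str, import_text: str) -> list[str]:
    """List mismatches between export-config.ini and import-config.ini contents."""
    exported = load_projects(export_text)
    imported = load_projects(import_text)
    issues = []
    for name, import_dir in imported.items():
        if name not in exported:
            issues.append(f"{name}: no matching section in export-config.ini")
        elif exported[name] != import_dir:
            issues.append(f"{name}: output_dir differs between export and import")
    return issues
```

An empty list means every project to be imported has an export section with the same output_dir.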
Run the following project import command
```
cmlutil project import -p "Project-A"
```
or
```
cmlutil project import -p "Project-B"
```
Note:
The project name shall match one of the section names in the import-config.ini file.
The project is created in the destination workbench, if it does not exist already. Projects with the same name are overwritten.

Post migration guidelines

In the target workbench, the user's public SSH key will be different from the source. Remember to update the SSH key in all external locations, such as the GitHub repository.
After the migration, the Model API key and endpoint URL are different from the source. Ensure that you update all applications that utilize these APIs.
All the models, jobs, and applications are created in a paused or stopped state in the destination workbench, so all of them shall be restarted after migration. Before starting the models, jobs, or applications in the destination workbench, stop the corresponding workloads in the source workbench to avoid data corruption if both access the same data.
Any configuration parameters outside the project configuration shall be copied manually after the migration.

Batch migration

The Cloudera AI Utility is primarily designed to facilitate the migration of individual projects. However, there is a wrapper script available that enables batch migration of multiple projects. Two Python scripts are available, one for export and another for import.
The batch migration script reads the list of project names from the export-config.ini and import-config.ini files. Each section defined here corresponds to a specific project, with the section name corresponding to the project name. You can include project-specific configurations within each respective section, while configurations shared across multiple projects can be placed inside the DEFAULT section.
The BatchSize variable provided inside the script controls the number of projects that can be exported or imported simultaneously. To prevent system errors like running out of memory, it is essential to select an appropriate batch size. Each export or import operation of a project generates a distinct session on the workbench, utilizing 1 CPU and 0.5 GB of memory. Therefore, the batch size shall be determined considering the available resources on both the source and target workbenches.
Before initiating the batch migration, ensure that enough disk space is available on the host machine for downloading all or a batch of projects.
In case of failure during batch migration, the script can be rerun. However, to speed up the rerun, it is recommended to remove the sections of all projects that have already been exported or imported from the configuration file.
Logs for each project are collected inside the individual project directory.
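
The batch behavior described above can be sketched as follows. This is not the wrapper script shipped with the utility: the function names, the `runner` parameter, and the use of a thread pool are illustrative assumptions. The sizing helper only encodes the stated cost of 1 CPU and 0.5 GB of memory per session.

```python
import configparser
import subprocess
from concurrent.futures import ThreadPoolExecutor


def max_batch_size(free_cpus: float, free_memory_gb: float) -> int:
    """Largest batch that fits when each session uses 1 CPU and 0.5 GB of memory."""
    return max(0, int(min(free_cpus / 1.0, free_memory_gb / 0.5)))


def run_batch(project_names, action, batch_size, runner=subprocess.run):
    """Run `cmlutil project <action> -p <name>` with at most batch_size running at once.

    Returns {project name: exit code} so that failed projects can be rerun.
    """
    def run_one(name):
        completed = runner(["cmlutil", "project", action, "-p", name])
        return name, completed.returncode

    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        return dict(pool.map(run_one, project_names))


# Example usage (not executed here): export every project listed in the config file.
#   parser = configparser.ConfigParser()
#   parser.read("/path/to/home/.cmlutils/export-config.ini")
#   results = run_batch(parser.sections(), "export", batch_size=max_batch_size(4, 2))
#   failed = [name for name, code in results.items() if code != 0]
```

Apply the sizing helper to whichever of the source and target workbenches has fewer free resources, since each export or import opens a session there.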