
Upgrading CML workspaces version 1.4.1 to 1.5.0 on OCP

When you upgrade from Private Cloud version 1.4.1 to version 1.5.0, you must manually upgrade Machine Learning workspaces that run on OpenShift Container Platform and use internal Network File System.

In OpenShift Container Platform Private Cloud 1.5.0, the internal Network File System implementation changed from using a Network File System provisioner for each workspace to using a CephFS volume.

On both Cloudera Embedded Container Service and OpenShift Container Platform, internal workspaces on Private Cloud 1.4.0/1.4.1 use the Network File System server provisioner as the storage provisioner. This server provisioner still works in 1.5.0; however, it is deprecated and will be removed in 1.5.1.

Existing workspaces created on 1.4.1 use the older storage provisioner and need to be upgraded for 1.5.0. You can do one of the following:
  • Migrate the workspace to CephFS before 1.5.1 is released, or
  • Create a new 1.5.0 workspace and migrate the workloads to that workspace now.

The manual steps below are required if an existing workspace backed by internal Network File System (created on Private Cloud 1.4.1 or earlier) needs to be migrated to CephFS RWX (ReadWriteMany).

  1. Update OpenShift Container Platform Private Cloud to version 1.5.0.
  2. Each existing Machine Learning workspace can now be upgraded, although this is optional. If you want to continue using your existing workspaces without upgrading them, this procedure is not required. This applies to all existing workspaces (both internal and external Network File System).
  3. If you want to upgrade a workspace, first determine whether the workspace is backed by internal or external Network File System (an optional check is shown after the sub-steps below).
    1. If the existing workspace is backed by external Network File System, you can upgrade the workspace from the UI. There is no need to follow the rest of this procedure.
    2. If the existing workspace is backed by internal Network File System, follow this procedure to migrate to CephFS after the workspace upgrade.
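    One optional way to check (a suggestion, not part of the official procedure) is to inspect the storage class of the workspace's projects-pvc claim, which this procedure references later; for an internal workspace this is the per-workspace provisioner class, similar to the longhorn-nfs-sc-workspace1 example shown later in this procedure:
    kubectl get pvc -n <workspace-namespace> projects-pvc -o jsonpath='{.spec.storageClassName}'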
  4. Upgrade the workspace from the Cloudera Machine Learning UI.
  5. Get the Kubeconfig for your Private Cloud cluster.
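    For example (assuming the Kubeconfig file has been downloaded to your local machine; the path below is a placeholder), point kubectl at it and verify connectivity:
    export KUBECONFIG=<path-to-kubeconfig-file>
    kubectl get nodes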
  6. Suspend the workspace manually so that there are no ongoing read/write operations to the underlying Network File System. Stop all your running workloads: sessions, jobs, applications, deployments, and so forth. Also, scale down the ds-vfs and s2i-client deployments with these commands:
    1. kubectl scale -n <workspace-namespace> --replicas=0 deployment ds-vfs
    2. kubectl scale -n <workspace-namespace> --replicas=0 deployment s2i-client
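    Optionally, confirm that both deployments have scaled to zero before continuing:
    kubectl get deployments -n <workspace-namespace> ds-vfs s2i-client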
  7. Create a backup volume for the upgrade process. The backup can be taken either in the cluster itself or on an external Network File System outside the cluster. Substitute your workspace details where indicated with angle brackets. Start by creating a backup.yaml file, add the appropriate content below to the file, and apply it using the command: kubectl apply -f ./backup.yaml
    1. Internal backup:
      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: projects-pvc-backup
        namespace: <existing-workspace-namespace>
      spec:
        accessModes:
        - ReadWriteMany
        resources:
          requests:
            storage: 1Ti
        storageClassName: ocs-storagecluster-cephfs
    2. External backup:
      apiVersion: v1
      kind: PersistentVolume
      metadata:
        name: projects-pvc-backup
      spec:
        capacity:
          storage: 1Ti
        accessModes:
          - ReadWriteMany
        persistentVolumeReclaimPolicy: Retain
        mountOptions:
          - nfsvers=3
        nfs:
          server: <your-external-nfs-address>
          path: <your-external-nfs-export-path>
        volumeMode: Filesystem
      
      ---
      
      kind: PersistentVolumeClaim
      apiVersion: v1
      metadata:
        name: projects-pvc-backup
        namespace: <existing-workspace-namespace>
      spec:
        accessModes:
          - ReadWriteMany
        resources:
          requests:
            storage: 1Ti
        storageClassName: ""
        volumeName: projects-pvc-backup
        volumeMode: Filesystem
      
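    Optionally, confirm that the backup claim was created (depending on the storage class, it may not bind until the migration job in the next step mounts it):
    kubectl get pvc -n <existing-workspace-namespace> projects-pvc-backup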
  8. Create a migrate.yaml file and add the following content to it. This Kubernetes job copies the existing workspace's Network File System data to the backup volume created in the previous step. Run the job using the command: kubectl apply -f ./migrate.yaml
    apiVersion: batch/v1
    kind: Job
    metadata:
      namespace: <existing-workspace-namespace>
      name: projects-pvc-backup
    spec:
      completions: 1
      parallelism: 1
      backoffLimit: 10
      template:
        metadata:
          name: projects-pvc-backup
          labels:
            name: projects-pvc-backup
        spec:
          restartPolicy: Never
          containers:
            - name: projects-pvc-backup
              image: docker-private.infra.cloudera.com/cloudera_base/ubi8/cldr-ubi-minimal:8.6-751-fips-03062022
              tty: true
              command: [ "/bin/sh" ]
              args: [  "-c", "microdnf install rsync && rsync -P -a /mnt/old/ /mnt/new && chown -R 8536:8536 /mnt/new;" ]
              volumeMounts:
                - name: old-vol
                  mountPath: /mnt/old
                - name: new-vol
                  mountPath: /mnt/new
          volumes:
            - name: old-vol
              persistentVolumeClaim:
                claimName: projects-pvc 
            - name: new-vol
              persistentVolumeClaim:
                claimName: projects-pvc-backup
    
  9. Monitor the previous job for completion. Logs can be retrieved using:
    kubectl logs -n <workspace-namespace> -l job-name=projects-pvc-backup
    You can check for job completion with:
    kubectl get jobs -n <workspace-namespace> -l job-name=projects-pvc-backup
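    Alternatively, you can block until the job finishes (a convenience command, assuming a generous timeout is acceptable):
    kubectl wait --for=condition=complete -n <workspace-namespace> job/projects-pvc-backup --timeout=3600s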
  10. Delete the existing Network File System volume for the workspace. If the deletion hangs because of finalizers on the claim, the patch command clears them so that the deletion can complete.
    kubectl delete pvc -n <workspace-namespace> projects-pvc
    kubectl patch pvc -n <workspace-namespace> projects-pvc -p '{"metadata":{"finalizers":null}}'
  11. Change the underlying storage for the workspace from the Network File System provisioner to CephFS RWX (ReadWriteMany).
    1. Get the release name for the workspace using: helm list -n <workspace-namespace>. For example, in the output below, mlx-workspace1 is the release name.
      helm list -n workspace1
      NAME          	NAMESPACE 	REVISION	UPDATED                                	STATUS  	CHART                   	APP VERSION
      mlx-workspace1	workspace1	4       	2023-01-04 08:07:47.075343142 +0000 UTC	deployed	cdsw-combined-2.0.35-b93
    2. Save the existing Helm values.
      helm get values <release-name> -n <workspace-namespace> -o yaml > old.yaml
    3. Modify the ProjectsPVCStorageClassName in the old.yaml file to ocs-storagecluster-cephfs and add ProjectsPVCSize: 1Ti.

      For example:

      ProjectsPVCStorageClassName: longhorn-nfs-sc-workspace1 should be changed to ProjectsPVCStorageClassName: ocs-storagecluster-cephfs. Also, add this line to the file: ProjectsPVCSize: 1Ti.
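      The relevant lines in old.yaml should then read:

      ProjectsPVCStorageClassName: ocs-storagecluster-cephfs
      ProjectsPVCSize: 1Ti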

    4. Get the GitSHA from old.yaml: grep GitSHA old.yaml

      For example: GitSHA: 2.0.35-b93

    5. Get the release chart cdsw-combined-<GitSHA>.tgz. The chart is available in the dp-mlx-control-plane-app pod, in the folder /app/service/resources/mlx-deploy/. Contact Cloudera support to download the chart if needed.
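      For example (the pod name and its namespace below are placeholders for your environment), the chart can be copied out of the pod with kubectl cp:
      kubectl cp <control-plane-namespace>/<dp-mlx-control-plane-app-pod>:/app/service/resources/mlx-deploy/cdsw-combined-<GitSHA>.tgz ./cdsw-combined-<GitSHA>.tgz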
    6. Delete the jobs and StatefulSets (these are recreated by the Helm upgrade in the next step):
      kubectl --namespace <workspace-namespace> delete jobs --all
      kubectl --namespace <workspace-namespace> delete statefulsets --all
    7. Do a Helm upgrade to the same release.
      helm upgrade <release-name> <path to release chart (sub-step 5)> --install -f ./old.yaml --wait --namespace <workspace-namespace> --debug --timeout 1800s
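      After the upgrade completes, an optional sanity check is to confirm the release status and that the workspace pods come back up:
      helm status <release-name> -n <workspace-namespace>
      kubectl get pods -n <workspace-namespace>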
  12. Scale down the ds-vfs and s2i-client deployments with the commands:
    kubectl scale -n <workspace-namespace> --replicas=0 deployment ds-vfs
    kubectl scale -n <workspace-namespace> --replicas=0 deployment s2i-client
  13. Copy the data from the backup into the upgraded workspace. To do this, create a migrate2.yaml file and add the following content to it. Run the job using the command: kubectl apply -f ./migrate2.yaml
    apiVersion: batch/v1
    kind: Job
    metadata:
      namespace: <existing-workspace-namespace>
      name: projects-pvc-backup2
    spec:
      completions: 1
      parallelism: 1
      backoffLimit: 10
      template:
        metadata:
          name: projects-pvc-backup2
          labels:
            name: projects-pvc-backup2
        spec:
          restartPolicy: Never
          containers:
            - name: projects-pvc-backup2
              image: docker-private.infra.cloudera.com/cloudera_base/ubi8/cldr-ubi-minimal:8.6-751-fips-03062022
              tty: true
              command: [ "/bin/sh" ]
              args: [ "-c", "microdnf install rsync && rsync -P -a /mnt/old/ /mnt/new && chown -R 8536:8536 /mnt/new;" ]
              volumeMounts:
                - name: old-vol
                  mountPath: /mnt/old
                - name: new-vol
                  mountPath: /mnt/new
          volumes:
            - name: old-vol
              persistentVolumeClaim:
                claimName: projects-pvc-backup 
            - name: new-vol
              persistentVolumeClaim:
                claimName: projects-pvc 
    
  14. Monitor the job above for completion. Logs can be retrieved using:
    kubectl logs -n <workspace-namespace> -l job-name=projects-pvc-backup2
    You can check for job completion with:
    kubectl get jobs -n <workspace-namespace> -l job-name=projects-pvc-backup2
  15. Scale up the ds-vfs and s2i-client deployments using the commands:
    kubectl scale -n <workspace-namespace> --replicas=1 deployment ds-vfs
    kubectl scale -n <workspace-namespace> --replicas=1 deployment s2i-client
  16. The upgraded workspace is ready to use. If you want to delete the backup, remove the existing backup volume for the workspace using these commands:
    kubectl delete pvc -n <workspace-namespace> projects-pvc-backup
    kubectl patch pvc -n <workspace-namespace> projects-pvc-backup -p '{"metadata":{"finalizers":null}}'