The scheduler state can sometimes go out of sync with the cluster state. This
may leave pods stuck in the Pending or ApplicationRejected state, with pod
events showing placement rule related errors. To recover, you may need to
refresh the YuniKorn configuration.
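To confirm the symptom before starting, you can list pods stuck in Pending and
inspect the events of an affected pod (the pod name and namespace below are
placeholders):
kubectl get pods -A --field-selector status.phase=Pending
kubectl describe pod <pod-name> -n <namespace>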
- Run the following commands to scale down the YuniKorn pods:
kubectl scale deployment yunikorn-admission-controller --replicas=0 -n yunikorn
kubectl scale deployment yunikorn-scheduler --replicas=0 -n yunikorn
The yunikorn-scheduler and yunikorn-admission-controller pods are managed by
deployments of the same name in the yunikorn namespace, so scaling these
deployments down to 0 terminates the pods.
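As an optional check, confirm that the scheduler and admission controller pods
have terminated before continuing:
kubectl get pods -n yunikorn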
- Run the following command to delete the yunikorn-configs ConfigMap:
kubectl delete cm yunikorn-configs -n yunikorn
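As an optional check, confirm that the ConfigMap is gone:
kubectl get cm -n yunikorn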
- Run the following commands to restart the resource-pool-manager pod:
kubectl scale deployment cdp-release-resource-pool-manager --replicas=0 -n <cdp-namespace>
kubectl scale deployment cdp-release-resource-pool-manager --replicas=1 -n <cdp-namespace>
The resource-pool-manager pod is managed by the cdp-release-resource-pool-manager
deployment in your CDP control plane namespace, so scaling that deployment down
to 0 and back up to 1 restarts the pod.
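As an optional check, wait for the replacement pod to become ready before
continuing:
kubectl rollout status deployment cdp-release-resource-pool-manager -n <cdp-namespace>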
- Run the following commands to scale the YuniKorn pods back up:
kubectl scale deployment yunikorn-scheduler --replicas=1 -n yunikorn
kubectl scale deployment yunikorn-admission-controller --replicas=1 -n yunikorn
As in the first step, these commands act on the yunikorn-scheduler and
yunikorn-admission-controller deployments in the yunikorn namespace, this time
scaling them back up to 1.
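As a final check, you can confirm that the YuniKorn pods are running again and
that the yunikorn-configs ConfigMap has been regenerated by the refresh:
kubectl get pods -n yunikorn
kubectl get cm yunikorn-configs -n yunikorn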
The preceding steps refresh the YuniKorn configuration for the applicable
control plane. After the YuniKorn restart, Pending pods are picked up and
recovered automatically, but pods left in the ApplicationRejected state may
need to be redeployed. If a pod is managed by a deployment, you can simply
delete it and the deployment recreates it. If a pod is unmanaged, you must
delete it and redeploy it yourself.
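For example, to redeploy a rejected pod (the pod name, namespace, and manifest
file below are placeholders), delete it and, if it is unmanaged, re-create it
from its original manifest:
kubectl delete pod <pod-name> -n <namespace>
kubectl apply -f <pod-manifest>.yaml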