The scheduler state can sometimes fall out of sync with the cluster state. This
may leave pods in the Pending and ApplicationRejected states, with pod events
showing Placement Rule related errors. To recover, you may need to refresh the
YuniKorn configuration.
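Before refreshing the configuration, it can help to confirm the symptoms. The
following is a minimal sketch, not part of the documented procedure: the helper
names list_pending_pods and show_pod_events are hypothetical, and the pod name
and namespace are supplied by you.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for inspecting the symptoms described above.

# List pods in every namespace whose phase is still Pending.
list_pending_pods() {
  kubectl get pods --all-namespaces --field-selector=status.phase=Pending
}

# Show the events recorded for one pod, so that any Placement Rule
# related errors can be read from the event messages.
# Arguments: <pod-name> <namespace>
show_pod_events() {
  local pod="$1" namespace="$2"
  kubectl get events -n "$namespace" --field-selector "involvedObject.name=${pod}"
}
```

For example, after sourcing the file, run list_pending_pods and then
show_pod_events with the name and namespace of one of the listed pods.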
- Run the following commands to scale down the YuniKorn pods:
kubectl scale deployment yunikorn-admission-controller --replicas=0 -n yunikorn
kubectl scale deployment yunikorn-scheduler --replicas=0 -n yunikorn
The yunikorn-scheduler and yunikorn-admission-controller pods are managed by the
yunikorn-scheduler and yunikorn-admission-controller deployments in the yunikorn
namespace, so you can scale down these deployments to 0.
- Run the following command to delete the yunikorn-configs ConfigMap:
kubectl delete cm yunikorn-configs -n yunikorn
- Run the following commands to restart the resource-pool-manager pod:
kubectl scale deployment cdp-release-resource-pool-manager --replicas=0 -n <cdp-namespace>
kubectl scale deployment cdp-release-resource-pool-manager --replicas=1 -n <cdp-namespace>
The resource-pool-manager pod is managed by the cdp-release-resource-pool-manager
deployment in your CDP control plane namespace, so you can scale that deployment
down to 0 and then scale it back up to 1.
- Run the following commands to scale up the YuniKorn pods:
kubectl scale deployment yunikorn-scheduler --replicas=1 -n yunikorn
kubectl scale deployment yunikorn-admission-controller --replicas=1 -n yunikorn
The yunikorn-scheduler and yunikorn-admission-controller pods are managed by the
yunikorn-scheduler and yunikorn-admission-controller deployments in the yunikorn
namespace, so you can scale up these deployments to 1.
The preceding steps refresh the YuniKorn configuration for the applicable
control plane.
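The steps above can be collected into one function, sketched here under stated
assumptions. The function name refresh_yunikorn_config is hypothetical, and the
CDP control plane namespace is passed as its only argument. The two closing
kubectl rollout status calls are an addition that is not part of the documented
steps; they only wait for the YuniKorn deployments to report ready.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the documented refresh steps.
# Argument: <cdp-namespace>, the CDP control plane namespace.
refresh_yunikorn_config() {
  local cdp_namespace="$1"
  if [ -z "$cdp_namespace" ]; then
    echo "usage: refresh_yunikorn_config <cdp-namespace>" >&2
    return 1
  fi

  # Scale down the YuniKorn pods.
  kubectl scale deployment yunikorn-admission-controller --replicas=0 -n yunikorn || return 1
  kubectl scale deployment yunikorn-scheduler --replicas=0 -n yunikorn || return 1

  # Delete the yunikorn-configs ConfigMap.
  kubectl delete cm yunikorn-configs -n yunikorn || return 1

  # Restart the resource-pool-manager pod.
  kubectl scale deployment cdp-release-resource-pool-manager --replicas=0 -n "$cdp_namespace" || return 1
  kubectl scale deployment cdp-release-resource-pool-manager --replicas=1 -n "$cdp_namespace" || return 1

  # Scale up the YuniKorn pods.
  kubectl scale deployment yunikorn-scheduler --replicas=1 -n yunikorn || return 1
  kubectl scale deployment yunikorn-admission-controller --replicas=1 -n yunikorn || return 1

  # Addition (not in the documented steps): wait for both deployments to be ready.
  kubectl rollout status deployment/yunikorn-scheduler -n yunikorn --timeout=300s || return 1
  kubectl rollout status deployment/yunikorn-admission-controller -n yunikorn --timeout=300s
}
```

Each command stops the function on failure, so a failed step is not followed by
the later ones. The 300 second timeout is an arbitrary illustrative value.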
After YuniKorn restarts, Pending pods are picked up and recovered automatically,
but pods left in the ApplicationRejected state may need to be redeployed. If the
pod is managed by a deployment, you can simply delete the pod, and the
deployment recreates it. If the pod is unmanaged, you must delete and redeploy
the pod.
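One way to tell the two cases apart is to check whether the pod has an owner
reference. The following is a minimal sketch under that assumption; the
function name recover_rejected_pod is hypothetical. A pod created by a
deployment is owned by a ReplicaSet, so any non-empty owner list is treated
here as "managed".

```shell
#!/usr/bin/env bash
# Hypothetical helper: delete a rejected pod only if a controller owns it.
# Arguments: <pod-name> <namespace>
recover_rejected_pod() {
  local pod="$1" namespace="$2" owners

  # Read the kinds of all owners of the pod (empty for an unmanaged pod).
  owners="$(kubectl get pod "$pod" -n "$namespace" \
    -o jsonpath='{.metadata.ownerReferences[*].kind}')" || return 1

  if [ -n "$owners" ]; then
    # Managed pod: deleting it lets the owning controller create a replacement.
    kubectl delete pod "$pod" -n "$namespace"
  else
    # Unmanaged pod: nothing recreates it, so it must be redeployed manually.
    echo "Pod ${pod} has no owner; delete it and redeploy it from its original manifest." >&2
    return 1
  fi
}
```

For an unmanaged pod whose original manifest is still available, kubectl
replace --force -f with that manifest file deletes and recreates the pod in one
command.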