ECS Day Two Operations Guide
Overview
Prerequisites
Basic operations
Collecting diagnostic data
Proactive monitoring
Environment health checks
Host-level tasks
Starting, stopping, restarting, and refreshing Embedded Container Service Clusters
Adding hosts to an Embedded Container Service Cluster
Installing NVIDIA GPU software in ECS
Decommissioning ECS Hosts
ECS Server High Availability
Enable ECS Server HA Post ECS Installation
Install iptables on the new ECS master nodes
Adding hosts to the containerized cluster
Adding Role Instances to Docker Server
Adding Role Instances to Containerized Cluster
Starting Docker Server on Nodes
Starting ECS Server on Nodes
Refreshing ECS
Checking Nodes and Pods in the UI
Enable ECS Server HA and promote agents Post ECS Installation
Enabling ECS Server deployment for High Availability
Preparing the cluster for High Availability
High-level steps to enable an ECS High Availability cluster
Verifying DNS setup
Installing Load Balancer
Promoting ECS Agents to ECS Servers
Refreshing ECS
Create an environment-wide backup
Creating backup of Control Plane and restoring it
Troubleshooting Backup and Restore Manager
CDP Control Plane UI or the Backup and Restore Manager becomes inaccessible after a failed restore event
Timeout error appears in Backup and Restore Manager
Stale configurations in Cloudera Manager after a restore event
Timeout error during backup of OCP clusters
Managing certificates
Adjusting the expiration time of ECS cluster certificates