Scenario: Heartbeat interval and timeout rate
Learn how to control the maximum interval between heartbeats and the point at which EFM times out an operation, so that unhealthy agents do not stall your bulk operations.
- Scenario
You are running a bulk operation, and some of your agents have unstable network access. When the router goes down, it takes 1 minute to recover. You know that the agents are configured with a 10-second heartbeat interval.
- Analysis
When not all agents are healthy, your bulk operation slows down drastically because some agents do not respond. The operation is stuck on the EFM side, which cannot send anything to the unresponsive agent. To control this situation, you can define the following properties:
efm.monitor.maxHeartbeatInterval=10s
efm.operation.monitoring.inQueuedStateTimeoutHeartbeatRate=2.0
The following chart shows how this situation looks on a timeline:
- Solution
You can set maxHeartbeatInterval=12s: an online agent must send a heartbeat within 10 seconds, and the extra 2 seconds give you some buffer. The 1-minute outage can cause an agent to miss 6-7 heartbeats, so you set inQueuedStateTimeoutHeartbeatRate=8. With these values, EFM waits for 96 seconds (12 seconds × 8) before timing out the operation.
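The arithmetic behind these values can be checked with a short sketch. It assumes, as the 96-second figure suggests, that the timeout is maxHeartbeatInterval multiplied by inQueuedStateTimeoutHeartbeatRate. The function names are hypothetical helpers for illustration and are not part of EFM.

```python
def queued_state_timeout_seconds(max_heartbeat_interval_s, timeout_heartbeat_rate):
    """Timeout implied by the two settings (assumed: interval multiplied by rate)."""
    return max_heartbeat_interval_s * timeout_heartbeat_rate


def missed_heartbeat_range(outage_s, agent_heartbeat_interval_s):
    """Return (fewest, most) heartbeats an agent can miss during an outage.

    The count depends on where the outage starts relative to the heartbeat
    schedule, so one extra heartbeat is allowed for in the upper bound.
    """
    fewest = int(outage_s // agent_heartbeat_interval_s)
    return fewest, fewest + 1


# Values from the scenario: 60 s outage, agents heartbeat every 10 s.
outage_s = 60
fewest, most = missed_heartbeat_range(outage_s, 10)

# Values from the solution: 12 s maximum interval, rate of 8.
timeout_s = queued_state_timeout_seconds(12, 8)

print(f"Missed heartbeats: {fewest}-{most}")
print(f"Timeout: {timeout_s} s, outage covered: {timeout_s > outage_s}")
```

If the outage in your environment is longer, the same calculation shows how far the rate has to be raised so that the timeout still exceeds the recovery time.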