Scenario: Heartbeat interval and timeout rate
Learn how you can control the interval between heartbeats and EFM operation timeouts, and resolve the issue of unhealthy agents.
- Scenario
You are running a bulk operation and some of your agents have unstable network access. When the router goes down, it takes one minute to recover. You know that agents are configured with a 10-second heartbeat interval.
- Analysis
When some agents are unhealthy, your bulk operation slows down drastically because those agents do not respond. The operation is stuck on the EFM side, unable to send updates to the agent. To manage this situation, you can define efm.monitor.maxHeartbeatInterval=10s and efm.operation.monitoring.inQueuedStateTimeoutHeartbeatRate=2.0.
The following chart shows you how this situation looks over time:
- Solution
You can set maxHeartbeatInterval=12. If the agent is online, you should receive a heartbeat within 10 seconds, and the extra 2 seconds allow some buffer time. The 1-minute outage could cause you to miss 6-7 heartbeats, so you set inQueuedStateTimeoutHeartbeatRate=8. With these values, EFM waits for 12 × 8 = 96 seconds before timing out the operation, which is longer than the 60-second outage.
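The arithmetic behind these values can be checked with a short Python sketch. It assumes, based on the 96-second figure above, that the queued-state timeout is the product of maxHeartbeatInterval and inQueuedStateTimeoutHeartbeatRate. The function names are hypothetical helpers for illustration and are not part of EFM.

```python
def queued_state_timeout_seconds(max_heartbeat_interval_s, timeout_heartbeat_rate):
    """Assumed relationship: timeout = maxHeartbeatInterval * heartbeat rate."""
    return max_heartbeat_interval_s * timeout_heartbeat_rate


def missed_heartbeats_range(outage_s, agent_heartbeat_interval_s):
    """Return (min, max) heartbeats that can be lost during an outage.

    The exact count depends on where the outage starts relative to the
    agent's heartbeat schedule, so one extra heartbeat may be lost.
    """
    full_intervals = int(outage_s // agent_heartbeat_interval_s)
    return full_intervals, full_intervals + 1


def survives_outage(max_heartbeat_interval_s, timeout_heartbeat_rate, outage_s):
    """True if the assumed timeout is longer than the outage."""
    timeout = queued_state_timeout_seconds(
        max_heartbeat_interval_s, timeout_heartbeat_rate
    )
    return timeout > outage_s


if __name__ == "__main__":
    outage_s = 60  # the router takes one minute to recover

    # Values from the Analysis section: 10 s interval, rate 2.0
    print(queued_state_timeout_seconds(10, 2.0), survives_outage(10, 2.0, outage_s))

    # Values from the Solution section: 12 s interval, rate 8
    print(queued_state_timeout_seconds(12, 8), survives_outage(12, 8, outage_s))

    # Heartbeats lost during the outage with a 10 s agent heartbeat
    print(missed_heartbeats_range(outage_s, 10))
```

Under this assumption, the values from the Analysis section give a timeout shorter than the one-minute outage, while the values from the Solution section leave the operation queued long enough for the router to recover.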