Scenario: Heartbeat interval and timeout rate

Learn how you can control the interval between heartbeats and EFM operation timeouts, and resolve the issue of unhealthy agents.

Scenario

You are running a bulk operation and some of your agents have unstable network access. When the router goes down, it takes one minute to recover. You know that the agents are configured with a 10-second heartbeat interval.

Analysis

When some agents are unhealthy, your bulk operation slows down drastically because those agents do not respond. The operation is stuck on the EFM side, unable to send updates to the agent. To manage this situation, you can define efm.monitor.maxHeartbeatInterval=10s and efm.operation.monitoring.inQueuedStateTimeoutHeartbeatRate=2.0.

The following chart shows you how this situation looks over time:


Solution

You can set maxHeartbeatInterval=12. If the agent is online, you must receive a heartbeat within 10 seconds, so 12 seconds leaves some buffer time. The 1-minute outage could cause you to miss 6-7 heartbeats, so you set inQueuedStateTimeoutHeartbeatRate=8. EFM then waits 12 x 8 = 96 seconds before timing out the operation, which is longer than the outage.
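
The arithmetic behind these values can be checked with a short Python sketch. It assumes, as the 96-second figure above implies, that the queued-state timeout is the maximum heartbeat interval multiplied by the timeout heartbeat rate. The function names are hypothetical helpers written for this illustration and are not part of EFM.

```python
import math


def queued_state_timeout(max_heartbeat_interval_s, timeout_heartbeat_rate):
    """Seconds EFM waits before timing out a queued operation.

    Assumption: timeout = maxHeartbeatInterval * inQueuedStateTimeoutHeartbeatRate.
    """
    return max_heartbeat_interval_s * timeout_heartbeat_rate


def missed_heartbeats(outage_s, heartbeat_interval_s):
    """Lower and upper bound on heartbeats lost during an outage.

    The exact count depends on where the outage starts relative to the
    heartbeat schedule, so a range is returned rather than one number.
    """
    lower = math.floor(outage_s / heartbeat_interval_s)
    return lower, lower + 1


def survives_outage(max_heartbeat_interval_s, timeout_heartbeat_rate, outage_s):
    """True if the timeout is longer than the outage."""
    return queued_state_timeout(max_heartbeat_interval_s, timeout_heartbeat_rate) > outage_s


if __name__ == "__main__":
    outage_s = 60            # the router takes one minute to recover
    agent_heartbeat_s = 10   # heartbeat interval configured on the agents

    print("missed heartbeats:", missed_heartbeats(outage_s, agent_heartbeat_s))

    # Values from the Analysis section, then values from the Solution section.
    for interval_s, rate in [(10, 2.0), (12, 8.0)]:
        timeout_s = queued_state_timeout(interval_s, rate)
        ok = survives_outage(interval_s, rate, outage_s)
        print(f"interval={interval_s}s rate={rate} timeout={timeout_s}s survives outage: {ok}")
```

Under this assumption, the values from the Analysis section give a 20-second timeout, which is shorter than the one-minute outage. The values from the Solution section give 96 seconds, which covers the outage with some margin.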