Scenario: Heartbeat interval and timeout rate

Learn how to control the interval between heartbeats and the point at which EFM times out an operation, so that unhealthy agents do not stall your bulk operations.

Scenario

You are running a bulk operation and some of your agents have unstable network access. When the router goes down, you need 1 minute to recover. The agents are configured with a 10-second heartbeat interval.

Analysis

When not all agents are healthy, your bulk operation slows down drastically because some agents do not respond. The operation is stuck on the EFM side, and EFM is unable to send anything to the agent. To control this situation, you can define efm.monitor.maxHeartbeatInterval=10s and efm.operation.monitoring.inQueuedStateTimeoutHeartbeatRate=2.0.

The following chart shows what this situation looks like on a timeline:


Solution

You can set maxHeartbeatInterval=12 because an online agent must send a heartbeat within 10 seconds, and the extra 2 seconds give you some buffer. The 1-minute outage could cause you to miss 6-7 heartbeats, so you set inQueuedStateTimeoutHeartbeatRate=8. This means that EFM waits for 12 × 8 = 96 seconds before timing out the operation, which is longer than the outage.
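The arithmetic behind these values can be checked with a short sketch. It assumes, based on the 96-second figure above, that the timeout is the product of maxHeartbeatInterval and inQueuedStateTimeoutHeartbeatRate. The function names are hypothetical helpers for illustration and are not part of EFM.

```python
def queued_state_timeout(max_heartbeat_interval_s, heartbeat_rate):
    """Seconds before a queued operation times out.

    Assumption: timeout = maxHeartbeatInterval * inQueuedStateTimeoutHeartbeatRate,
    as implied by the 12 * 8 = 96 seconds example in this scenario.
    """
    return max_heartbeat_interval_s * heartbeat_rate


def covers_outage(max_heartbeat_interval_s, heartbeat_rate, outage_s):
    """True if the timeout is longer than the expected network outage."""
    return queued_state_timeout(max_heartbeat_interval_s, heartbeat_rate) > outage_s


if __name__ == "__main__":
    outage_s = 60  # the router needs 1 minute to recover

    # Values from the Analysis section: 10s interval, rate 2.0
    print(queued_state_timeout(10, 2.0), covers_outage(10, 2.0, outage_s))

    # Values from the Solution section: 12s interval, rate 8
    print(queued_state_timeout(12, 8), covers_outage(12, 8, outage_s))
```

Under this assumption, the 10s and 2.0 pair gives a 20-second timeout, which is shorter than the 1-minute outage, while the 12 and 8 pair gives 96 seconds and leaves room for the router to recover.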