Watchdog sends its own pings and records the round-trip time to different components and clustered connections. If the time since the last transfer on a connection exceeds a configured threshold, Watchdog begins closing that stale connection. Here is a breakdown:
- A check of each connection is performed on every watchdog_delay interval.
- During this check, one of two things can occur:
  - If the time since the last transfer exceeds max-inactivity-time, a stop service command is given to terminate the connection and broadcast unavailable presence.
  - If the time since the last transfer is lower than max-inactivity-time but exceeds watchdog_timeout, Watchdog will try to send a ping (of type watchdog_ping_type). This ping may be one of two varieties (set in init.properties):
    - whitespace ping, which will yield the time of the last data transfer in any direction.
    - xmpp ping, which will yield the time of the last received xmpp stanza.
- In the second case, the connection remains open, and another check begins after the watchdog_delay time has expired.
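The decision made on each check can be sketched in a few lines of Python. This is only an illustration of the rules listed above, not the server's actual implementation: the function name `watchdog_action` and the returned strings are hypothetical, and all times are assumed to be expressed in the same unit (milliseconds here).

```python
def watchdog_action(idle_ms, watchdog_timeout_ms, max_inactivity_ms):
    """Return the action a single watchdog check would take.

    Hypothetical helper that mirrors the rules described above.
    All three arguments are assumed to share one unit (milliseconds).
    """
    if idle_ms > max_inactivity_ms:
        # Idle longer than max-inactivity-time: stop the service.
        return "stop"
    if idle_ms > watchdog_timeout_ms:
        # Idle longer than watchdog_timeout but within the limit: send a ping.
        return "ping"
    # Recently active: nothing to do until the next check.
    return "none"
```

With watchdog_timeout=120000 and a max-inactivity-time of 180000, a connection idle for 160000 ms would get a ping, and one idle for 220000 ms would be stopped.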
For example, let's draw this out to get a visual representation.
This line represents how often the check is performed. Each - is 10 seconds, so the check is done every 60 seconds (watchdog_delay=60000)
-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+------
     |     |     |     |     |     |     |     |     |     |     |
This line represents client activity. Here the client sent a message at 20 seconds and then went idle.
-+------------------------------------------------------------------
The following line represents the logic, with the settings watchdog_timeout=120000 and c2s/max-inactivity-time[L]=180000 (timeout at 120 seconds and max inactivity timeout at 180 seconds):
 1   2     3     4     5     6
-*---*-----*-----*-----*-----*
1 - 20 seconds - at this point the "last transfer" or "last received" time is updated.
2 - 60 seconds - watchdog runs - it checks the connection and says "ok, last client transfer was 40s ago" - that is lower than both inactivity (so don't disconnect) and timeout (so don't send a ping).
3 - 120 seconds - 2nd check - last transfer was 100s ago - still lower than both values - do nothing.
4 - 180 seconds - 3rd check - last transfer was 160s ago - lower than inactivity but greater than timeout - a ping is sent.
5 - 240 seconds - 4th check - last transfer was 220s ago - the client still has not responded, watchdog compares the idle time to max-inactivity-time and finds that it is greater - the connection is terminated.
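The timeline above can be reproduced with a short simulation. This is a sketch under stated assumptions rather than a model of the real server: the function `simulate` is hypothetical, times are in seconds, the client never sends anything after its last transfer, and a ping that gets no answer does not refresh the last transfer time.

```python
def simulate(last_transfer_s, delay_s, timeout_s, max_inactivity_s):
    """Return a list of (check_time, idle_time, action) tuples.

    Hypothetical model: the client stays silent after last_transfer_s,
    and an unanswered ping does not update the last transfer time.
    The simulation ends at the check that stops the connection.
    """
    events = []
    now = delay_s  # the first check happens one watchdog_delay after start
    while True:
        idle = now - last_transfer_s
        if idle > max_inactivity_s:
            events.append((now, idle, "stop"))
            return events
        action = "ping" if idle > timeout_s else "none"
        events.append((now, idle, action))
        now += delay_s


# Settings from the example: delay 60s, timeout 120s, max inactivity 180s,
# with the last client transfer at 20s.
for check_time, idle, action in simulate(20, 60, 120, 180):
    print(f"{check_time:>3}s  idle={idle:>3}s  action={action}")
```

The four checks it prints (at 60, 120, 180 and 240 seconds) correspond to points 2 through 5 of the walkthrough.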
It is possible that a broken connection is detected while the ping is being sent, in which case the connection is severed at step 4 instead of waiting for step 5. NOTE: This MAY cause the JVM to throw an exception.
NOTE: Global settings may not be ideal for every setup. Since each component has its own setting for max-inactivity-time, you may find it necessary to design custom watchdog settings, or to edit the inactivity times to better suit your needs. Below is a short list of components with their default settings:
bosh/max-inactivity-time[L]=600
c2s/max-inactivity-time[L]=86400
cl-comp/max-inactivity-time[L]=180
s2s/max-inactivity-time[L]=900
ws2s/max-inactivity-time[L]=86400
Again, remember: for Watchdog to work properly, max-inactivity-time MUST be longer than the watchdog_timeout setting.
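A quick sanity check of this rule might look like the sketch below. The helper `check_watchdog_settings` is hypothetical and not part of the server; it assumes the caller has already converted every value to the same unit, and the component values in the usage line are made up for illustration.

```python
def check_watchdog_settings(watchdog_timeout, max_inactivity_by_component):
    """Return the names of components that violate the rule.

    Hypothetical helper: a component is reported when its
    max-inactivity-time is not longer than watchdog_timeout.
    All values are assumed to be in the same unit.
    """
    return sorted(
        name
        for name, limit in max_inactivity_by_component.items()
        if limit <= watchdog_timeout
    )


# Illustrative values only: c2s passes, cl-comp is too short.
problems = check_watchdog_settings(120000, {"c2s": 180000, "cl-comp": 90000})
print(problems)
```

An empty list means every listed component satisfies the rule.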