Why KeepAlive Heartbeats Should Be on a Dedicated Socket Stream
This may sound trivial, even common sense, yet this is not the first time I’ve seen the issue, even from experienced game developers who have worked on more than one networking project.
It may be more of an oversight. Or possibly it is so trivial (once the issue has been explained) that the developers simply assumed keep-alive heartbeat messages would be asynchronous and on a separate thread, without ever checking.
My pages and blog posts tend to be more about anti-patterns than design patterns. To me, learning from my own and others’ mistakes is more educational, because there is a story to tell: “this is what happened, and these are the consequences I paid for the mistake.”
In any case, any programmer who has worked (at least a little) on networked games will most likely tell you that it is pure common sense for heartbeat messages (a.k.a. keepalive, ping-pong, etc.) to be at least asynchronous (bidirectional, so that incoming and outgoing messages do not block each other) and preferably on a separate thread (non-blocking). IMHO it’s the opposite: it should be on a separate thread and optionally asynchronous or, to be more specific, on a separate stream dedicated to keep-alive/heartbeat (not necessarily a separate thread).
Normally (in my opinion, based on experience) a heartbeat/keep-alive message should be:
- On a separate thread: on the client side, the heartbeat should be scheduled (sleep) and sent on a separate thread, while the server side should poll for it or be triggered by it (commonly a worker thread if polling, or some kind of I/O Completion Port trigger that wakes a sleeping thread) to process it.
- From the client side: it should be up to the client to inform the server that it wants to stay alive. One of the main purposes of heartbeat messages (some would say the only purpose) is that if the server does not hear from a client within the expected window of time, it disconnects that client. On a WAN-facing server this is more obvious, because you want to avoid DoS (unless you have some kind of tar-pit mechanism that does not consume resources) and release the resources allocated to the socket(s) before you run out of heap.
- Sent only when the client is idle (optional): why bother sending a keep-alive message to the server if other messages (such as positional or action messages) are already being sent? Again, the purpose of keep-alive is to let the server know that the client is not sleeping (i.e. hung, even though the TCP/IP socket is still open). This is a bit more tedious, since it requires tracking when the last message was sent to the server.
- Preferably asynchronous: on both the client side sending the heartbeat and the server side receiving it. If the sender (client) requires an acknowledgment from the server, you want it asynchronous so that the incoming and outgoing streams do not block each other. This is somewhat optional, since it usually depends on the design (e.g. whether you want to measure ping time to help with dead reckoning, or want a dashboard or HUD that shows the user the round-trip time). You should not count on it, though: I’ve seen situations where the network was set up with a synchronous hub, which forced all messages to be unidirectional even though the code was capable of handling bidirectional traffic. Sometimes you are limited by hardware…
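To make the "separate thread, only when idle" points concrete, here is a minimal Python sketch of a client-side sender. The class name `IdleHeartbeat`, the `send` callable and the `b"HEARTBEAT"` payload are all made up for illustration; a real client would plug in its own socket write and message format.

```python
import threading
import time


class IdleHeartbeat:
    """Sends a heartbeat only if nothing else was sent for `interval` seconds."""

    def __init__(self, send, interval, clock=time.monotonic):
        self._send = send          # hypothetical callable that writes one message
        self._interval = interval
        self._clock = clock
        self._lock = threading.Lock()
        self._last_sent = clock()
        self._stop = threading.Event()

    def note_sent(self):
        """Call after every normal (non-heartbeat) message goes out."""
        with self._lock:
            self._last_sent = self._clock()

    def tick(self):
        """Send a heartbeat if the connection has been idle; return True if sent."""
        with self._lock:
            now = self._clock()
            if now - self._last_sent < self._interval:
                return False
            self._last_sent = now
        self._send(b"HEARTBEAT")
        return True

    def run(self, poll=0.1):
        """Thread body: wake up every `poll` seconds until stop() is called."""
        while not self._stop.wait(poll):
            self.tick()

    def stop(self):
        self._stop.set()
```

It would be started with something like `threading.Thread(target=hb.run, daemon=True).start()`, with the main thread calling `hb.note_sent()` after each regular message. The injectable `clock` is only there so the idle logic can be tested without sleeping.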
As in this definition of KeepAlive (that page is about HTTP servers, but the concept is the same), a keep-alive mechanism should at least have an interval that is negotiated at the beginning of the connection (i.e. the server informs the client that it expects a keep-alive message every n seconds). I usually also allow some mercy time on the server side, because I’ve seen cases where the client squeezed its KeepAlive in at the last millisecond of the window, network delay caused the server to disconnect it, and the KeepAlive message arrived milliseconds later.
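The interval plus mercy time can be expressed as a tiny server-side policy object. This is only a sketch under my own naming (`KeepAlivePolicy`, `interval`, `grace` are not from any library), and it leaves out how the interval is actually communicated to the client.

```python
from dataclasses import dataclass


@dataclass
class KeepAlivePolicy:
    interval: float  # seconds between heartbeats, announced by the server at connect time
    grace: float     # extra "mercy" time that absorbs network delay

    def deadline(self, last_seen):
        """Latest moment the next keep-alive may arrive before we give up."""
        return last_seen + self.interval + self.grace

    def is_expired(self, last_seen, now):
        """True if the client missed its window, mercy time included."""
        return now > self.deadline(last_seen)
```

With `KeepAlivePolicy(interval=10.0, grace=2.0)`, a heartbeat that shows up half a second late is still accepted, while one that is more than two seconds late is not.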
Because keep-alive/heartbeat messages are so small, it may be argued that the client should send one every interval even if other messages were sent, just to indicate that it is still connected; some may counter-argue that this is wasteful. Whichever way it is done is irrelevant. The only thing that matters is that it is cleanly integrated and can be trusted to behave as designed. Meaning:
- Client sends a keep-alive on schedule
- Server receives keep-alive message within the expected time-window so it won’t forcefully disconnect the client.
In my experience, those who counter-argue that the client should not have to send a keep-alive as long as any kind of message is reaching the server have probably made the naive assumption that the server does not process the message stream in serialized form.
The problem I’ve seen with serializing the stream is the following scenario (and it does not matter whether it is on the server or the client side, the effect is the same):
- The client is sending messages and the server updates its lastKeepAliveTime (or resets its nextExpectedKeepAliveTime; the implementation is irrelevant, for there are many ways to do this).
- The server model (although asynchronous) is serialized on each client’s incoming message stream, and processes requests in the order received.
- One of the messages happens to be of a type that has a bug, or that does not comply with the time slice agreed upon in the design, so it blocks for some time. For example, the message queries an SQL database or another server for data, which has the potential to block (a short time-out is the key to resolving this, but it makes handling quite difficult, for there will be more error cases than successes).
- A few more messages of the same type stack up in the message queue, and the client goes idle.
- The server has now been blocked for a while, busy processing these (incoming) messages.
- Client sends a KeepAlive message because it hasn’t sent any messages in a while.
- That KeepAlive goes into the tail of the incoming message-queue
- The scheduler kicks in (on a separate thread) because it is time to check whether the client has sent a KeepAlive recently. Keep in mind that the client is sending both normal messages and KeepAlive messages on the same stream.
- Because the KeepAlive message is still at the back of the queue, the server (on that other thread) decides to disconnect the client, even though the client was disciplined about sending its KeepAlive.
All in all, dedicating a separate incoming stream (which means two or more ports/sockets per client) would cleanly keep the heartbeat thread from wrongly disconnecting the client on the server side.
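The scenario and the dedicated-stream fix can be compared in a toy model. This is not real networking code: it is a small simulation with made-up numbers (a 10 second timeout, two queries that each block for 12 seconds), and it assumes the server refreshes its last-seen time only when it starts handling a message.

```python
def first_disconnect(messages, timeout):
    """Return the time the watchdog disconnects the client, or None.

    messages: list of (name, arrived_at, cost) in queue order. The server
    handles them one at a time and refreshes last_seen only when it starts
    handling a message, so time spent waiting in the queue does not count.
    Only the period up to the last message in the list is checked.
    """
    clock = 0.0      # when the server becomes free again
    last_seen = 0.0  # last time the server registered life from the client
    for name, arrived_at, cost in messages:
        start = max(clock, arrived_at)  # cannot start before earlier work is done
        if start - last_seen > timeout:
            return last_seen + timeout  # watchdog fired before this message was read
        last_seen = start
        clock = start + cost
    return None


TIMEOUT = 10.0
slow = [("sql_query", 0.0, 12.0), ("sql_query", 0.5, 12.0)]
keepalive = [("keepalive", 5.0, 0.0)]

# Shared stream: the keep-alive arrives at t=5 but sits behind the slow queries.
shared = first_disconnect(slow + keepalive, TIMEOUT)
# Dedicated stream: the keep-alive has its own queue and is read on arrival.
dedicated = first_disconnect(keepalive, TIMEOUT)
print(shared, dedicated)  # prints: 10.0 None
```

On the shared stream the client is dropped at t=10 even though its keep-alive reached the server at t=5; on the dedicated stream the same keep-alive is seen in time.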
Again, all the issues I’ve mentioned seem quite trivial and common sense, yet they get ignored for one reason or another. Having a separate stream for keep-alive/heartbeat messages would resolve them, but that is sometimes difficult (actually, more like “risky”) to integrate late in development: it is harder to add code late in a project, and multi-threaded code in particular can introduce new and more complex bugs. It also requires negotiation between the client team and the server team (if they are separate, rather than a single “network team”), because both sides have to agree to dedicate a separate socket to the heartbeat messages.
Advantages and Disadvantages of Separate Socket
There are some (possibly quite a few) disadvantages to having a separate socket for keep-alive/heartbeat messages, as well as advantages.
I’ve seen edge cases in the past (on the client side) in which DirectX had hung (blocked indefinitely) on the main thread but the separate KeepAlive thread had not. The server kept receiving heartbeat messages, so it would not disconnect the client, yet no transaction messages were coming in.
From the server’s perspective, as long as heartbeat messages are being processed, it cannot (will not) disconnect the client. With a stateless server that trusts the client to tell it what the client is up to, the server has no idea what to expect, so it cannot reason along the lines of “I’ve not received a positional update in a while, so I’d better disconnect it”, for it does not know whether the client is paused or not.
These kinds of situations are case-by-case and depend on the design more than anything else. If the design specifies that the server is not stateless, it could possibly make more intelligent decisions based on conditions. It always is (and should be) based on what the design demands, but as a programmer it is up to you to inform the architects or designers of the consequences of such edge cases.
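As one possible illustration of a "more intelligent decision" on a non-stateless server, the sketch below tracks the heartbeat stream and the main stream separately. Everything here is an assumption for the sake of the example: the `paused` flag (the client would have to report it), the two timeouts, and the three labels are not part of any particular design.

```python
from dataclasses import dataclass


@dataclass
class ClientState:
    last_heartbeat: float  # last time a heartbeat was processed
    last_activity: float   # last time a main-stream message was processed
    paused: bool = False   # assumed: the client tells the server when it pauses


def classify(state, now, heartbeat_timeout, activity_timeout):
    """Return "dead", "hung" or "alive" for one client."""
    if now - state.last_heartbeat > heartbeat_timeout:
        return "dead"  # no heartbeat at all: disconnect
    if not state.paused and now - state.last_activity > activity_timeout:
        return "hung"  # heartbeats arrive, but the main stream has gone silent
    return "alive"
```

What the server does with a "hung" client (disconnect it, flag it, ask it to confirm) is exactly the kind of decision that belongs to the design rather than to the programmer.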
Or what about a DoS case where the attacker has figured out that as long as they keep feeding the heartbeat socket (while also keeping the main socket open), the server will not disconnect them, causing resource starvation?
What about a client-side bug in which (due to the complexities of multi-threading) the client disconnects from the main socket but keeps the heartbeat socket alive, then reconnects and opens another pair of main and heartbeat sockets (a total of 3 sockets, counting the zombie/dangling socket left by the bug)? Of course, the server should be disciplined enough to disconnect the zombie socket whenever the main socket is disconnected, but you’d probably not catch the bug until you’ve run into it (depending on the implementation, of course; done right, you’d never encounter it).
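One way to enforce that discipline is to keep each client's sockets together as a pair, so that neither can outlive the other. The following is a minimal sketch with hypothetical names (`SessionRegistry`, `client_id`); it only relies on the socket objects having a `close()` method, and it assumes the server can tell which client a connection belongs to.

```python
class SessionRegistry:
    """Tracks the (main, heartbeat) socket pair that belongs to each client id."""

    def __init__(self):
        self._sessions = {}

    def register(self, client_id, main_sock, heartbeat_sock):
        # A reconnect replaces the old pair; close leftovers so none dangle.
        self.drop(client_id)
        self._sessions[client_id] = (main_sock, heartbeat_sock)

    def drop(self, client_id):
        """Close both sockets of a client; call this when either one disconnects."""
        pair = self._sessions.pop(client_id, None)
        if pair is None:
            return False
        for sock in pair:
            try:
                sock.close()
            except OSError:
                pass  # already closed by the peer
        return True

    def __len__(self):
        return len(self._sessions)
```

Calling `drop()` from the main socket's disconnect handler removes the zombie heartbeat socket, and `register()` covers the case where the client reconnects before the server noticed the old pair was gone.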
I’ve seen implementations that gave each socket its own thread on the server side. That can cause some hair-pulling experiences if the decision to allow multiple sockets per client is made as an afterthought, too late in the project (near production release).
Heartbeat/keep-alive messages are a trivial concept, yet if they are not thought out and designed early in the project, they can backfire. So don’t forget to include at least a write-up on them in the design stage of your networking structure, for both client and server, so that it won’t bite you late in the project. At the end of the day, both client-side and server-side programmers should be working out what they can do to make sure clients won’t get disconnected for the wrong reason, and that will require some cooperation and negotiation.