Issue information

Issue ID
#3977
Status
Done
Severity
None
Started
Hercules Elf Bot
Dec 27, 2009 19:09
Last Post
Ind
Apr 16, 2013 6:52
Confirmation
Yes (0)
No (1)

Hercules Elf Bot - Dec 27, 2009 19:09

Originally posted by [b]theultramage[/b]
http://www.eathena.ws/board/index.php?autocom=bugtracker&showbug=3977

eAthena implements a custom application-level connection keepalive/timeout mechanism. It keeps track of when the remote side last sent some data, and if that was longer ago than a timeout value, it closes the socket. The purpose of this feature is not documented; my guess is that it is meant to disconnect players who lag out without the OS ever closing the socket, and to drop connections to servers that freeze.

A consequence of this feature is that servers and clients need to send out artificial traffic (pings) often, otherwise they get forcibly disconnected. This puts extra requirements on the code and complicates the protocol (and packet traces).

I traced this feature back to r924, where MouseJstr added it to deal with "some NAT based routers that are not dropping the TCP connection when the aliased machine goes offline abnormally". If this is the only purpose, then he's handling a TCP-level issue as an application-level problem.

My suggestion is to remove this timeout/keepalive mechanism and instead delegate connection-timeout management to TCP and the OS. If we can control a socket's TCP timeout (and I hear we can), then all that's needed is a few setsockopt() calls.
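For reference, the setsockopt() calls in question would look something like this on Linux. The helper name is hypothetical; `SO_KEEPALIVE` is portable, while the `TCP_KEEPIDLE`/`TCP_KEEPINTVL`/`TCP_KEEPCNT` knobs that control the timing are Linux-specific (other platforms have their own equivalents or only the system-wide defaults):

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: enable kernel TCP keepalive on fd.
 * idle:  seconds of silence before the first probe is sent
 * intvl: seconds between successive probes
 * cnt:   unanswered probes before the kernel drops the connection */
int enable_tcp_keepalive(int fd, int idle, int intvl, int cnt) {
    int on = 1;
    if (setsockopt(fd, SOL_SOCKET,  SO_KEEPALIVE,  &on,    sizeof on)    != 0) return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,  &idle,  sizeof idle)  != 0) return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof intvl) != 0) return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,   &cnt,   sizeof cnt)   != 0) return -1;
    return 0;
}
```

With something like `enable_tcp_keepalive(fd, 60, 10, 3)`, a dead peer (including one behind a NAT box that never sends an RST) would be detected by the kernel after roughly 60 + 3*10 seconds, with no ping packets in the application protocol.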

One thing I'm wondering about is how such a setup would behave if the remote side goes into an infinite loop. Currently, the connection times out after 1 minute. Would a TCP-level mechanism do the same? We want to avoid a situation where the charserver freezes and the mapserver keeps shoveling data into the send buffer until all memory runs out.
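The "send buffer fills until memory runs out" scenario is bounded on the kernel side: with a non-blocking send, once the peer stops reading and the kernel buffers fill, send() fails with EAGAIN instead of queueing forever, and the application can decide to drop the session rather than grow its own buffers. A minimal sketch of that behavior (the helper name is made up for illustration):

```c
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/* Keep sending zero-filled chunks until either max_bytes have been
 * accepted or the kernel buffers fill up (frozen peer). Returns the
 * number of bytes accepted before blocking, or -1 on a real error. */
long fill_until_blocked(int fd, long max_bytes) {
    char chunk[4096] = {0};
    long total = 0;
    while (total < max_bytes) {
        ssize_t n = send(fd, chunk, sizeof chunk, MSG_DONTWAIT);
        if (n < 0) {
            if (errno == EAGAIN || errno == EWOULDBLOCK)
                return total;   /* buffer full: the peer has stopped reading */
            return -1;          /* genuine socket error */
        }
        total += n;
    }
    return total;  /* never blocked within the cap: peer keeps up */
}
```

The catch, which the question above is really about, is that this only limits kernel memory; if the application keeps its own unbounded write queue per session (as eAthena's WFIFO buffers do), a frozen peer still needs an explicit timeout or queue cap at the application level.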

Hercules Elf Bot - Dec 28, 2011 2:41

Originally posted by [b]arp[/b]
This is quite interesting; I am of the opinion that the original fix is fine and should _not_ be removed, even though, as the issue suggests, it solves the problem at the wrong level. My reasoning is this: if you ship a device that uses a certain chip, and discover after shipping that the chip has a problem that causes unexpected behavior but can be solved more easily or feasibly at another level, then that is the right way to go. Right now, removing it means that if, for example, you have 10 machines behind a NAT router and all of them drop unexpectedly, those 10 machines will have their connections kept alive by the router _until_ they come back online.

Unhelpfully, the issue uses some pretty vague definitions, and I don't think that this is a problem in either context: server<->server, or server->nat<-client.

As far as I understand, this would be unnecessary effort for a problem that (at the moment) does not exist. As such, this is a no-op.

I'm voting to close this.

Ind - Apr 16, 2013 6:52

I agree with the opinion above. Although there is some overhead, this gives the server some awareness it would otherwise not have; without it, how the hell would we know we're not stuck?