Login server & Map server issue for large server

Issue information

Issue ID: #8475
Status: Invalid
Severity: None
Started: Jattington, Dec 22, 2014 0:49
Last Post: Haru, Jan 5, 2015 0:04
Confirmation: N/A

Jattington - Dec 22, 2014 0:49

Our login server crashes fairly often at the moment

We put it in debugger and this is what we got

[url="http://pastebin.com/AmKP2BFm"]http://pastebin.com/AmKP2BFm[/url]

Anyone know what's up?

We run a server with 1000 concurrent players, not including vendors.

This post has been edited by Jattington on Dec 22, 2014 20:02

Haru - Dec 22, 2014 2:49

The backtrace is missing debug symbols, so there isn't much that we can do with this info.[quote]
Reading symbols from /home/ggro/production/login-server...(no debugging symbols found)...done.[/quote]You should recompile Hercules in debug mode, so that symbols are available. If it's compiled in debug mode, you could try and disable LTO, since in some systems it prevents gdb from getting detailed symbols.

Dastgir - Dec 22, 2014 8:56

Please provide the debug info Haru mentioned so we can find the exact cause.
(Though judging by the logs, it seems some memory allocation is failing while running queries.)

Jattington - Dec 22, 2014 20:02

Thank you for the replies guys, we've had some other issues with the map server too.

I will update w/ the login server debug info.

Update 2:

Our map server just froze/crashed and this is the error we got:
[code=auto:0]
#0 buffered_vfprintf (s=s@entry=0x7ffff64881c0 <_IO_2_1_stderr_>, format=format@entry=0x617e59 "%s ", args=args@entry=0x7fffffffd818) at vfprintf.c:2319
#1 0x00007ffff61147de in _IO_vfprintf_internal (s=s@entry=0x7ffff64881c0 <_IO_2_1_stderr_>, format=format@entry=0x617e59 "%s ", ap=ap@entry=0x7fffffffd818) at vfprintf.c:1289
#2 0x00000000005d9fe9 in VFPRINTF (file=0x7ffff64881c0 <_IO_2_1_stderr_>, fmt=0x617e59 "%s ", argptr=argptr@entry=0x7fffffffd818) at showmsg.c:497
#3 0x00000000005da237 in FPRINTF (file=<optimized out>, fmt=fmt@entry=0x617e59 "%s ") at showmsg.c:589
#4 0x00000000005d95c7 in vShowMessage_ (flag=flag@entry=MSG_ERROR, string=0x5fd088 "Server received crash signal! Attempting to save all online characters!\n", ap=ap@entry=0x7fffffffda48) at showmsg.c:690
#5 0x00000000005d9284 in ShowError (string=<optimized out>) at showmsg.c:808
#6 0x00000000004c823e in do_abort () at map.c:5469
#7 0x00000000005dcb75 in sig_proc (sn=11) at core.c:127
#8 <signal handler called>
#9 main (argc=<optimized out>, argv=<optimized out>) at core.c:258
[/code]

We disabled LTO & compiled in debug mode.

This post has been edited by Jattington on Dec 22, 2014 22:24

Jattington - Dec 22, 2014 22:50

Using htop, this is what we got for memory allocation:

24761 ********** 20 0 597932 424736 5680 R 41.3 0.3 7:53.90 map-server

with 800 players online

Haru - Dec 23, 2014 7:39

The map server backtrace shows a segmentation fault (signal number 11 is SIGSEGV), and then an issue of some kind in the signal handler routine (what's the message it showed before you typed 'bt'?). But a segmentation fault in main(), at line 258 of core.c, doesn't make much sense (checking against a clean Hercules; not sure what you have at line 258 in your case).

Memory usage (assuming kilobytes, since it should be htop's default) seems a bit high for 800 players, but still reasonable. CPU usage looks really, unreasonably, high (but that might depend on your CPU).

Now, since you mentioned having a thousand players online, concerning your login server issues -- could it be that you increased FD_SETSIZE and you're running a linux OS? If that's the case, that's unhealthy and not recommended to do (i.e. it will cause memory corruption and crashes)
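To illustrate why a raised FD_SETSIZE can corrupt memory: on Linux with glibc, fd_set is a fixed-size bitmap whose size comes from the C library headers, and POSIX leaves FD_SET undefined for descriptors outside 0..FD_SETSIZE-1. A minimal sketch of a guard follows; checked_fd_set is a hypothetical helper for illustration, not part of Hercules.

```c
#include <stdbool.h>
#include <sys/select.h>

/*
 * Hypothetical helper (not Hercules code): adds fd to the set only when it
 * fits inside the fixed-size fd_set bitmap. Calling FD_SET with an fd at or
 * above FD_SETSIZE would write outside the bitmap, so it is refused here.
 */
bool checked_fd_set(int fd, fd_set *set)
{
    if (fd < 0 || fd >= FD_SETSIZE)
        return false;   /* out of range: do not touch the set */
    FD_SET(fd, set);
    return true;
}
```

A caller that gets false back would have to close the connection or use an API that is not bound to fd_set.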

HotshotGG - Dec 23, 2014 10:16

Hey Haru, I actually thought it was related to something along those lines. We have an extended error log which we want to get out, but unfortunately our developer is MIA. What is the appropriate workaround for increasing connections to the server? Do we start working with a Windows-based OS (which allowed 4000 connections)? I remember something being said about it being able to hold more connections.

Edit: One more thing to note is that the login server stays up, not sure if that rules out the problem.

Thanks for replying btw! :lol:

This post has been edited by HotshotGG on Dec 23, 2014 10:17

Haru - Dec 23, 2014 10:42

I don't really recommend using Windows on a production server. It would probably be safer to use some *BSD flavor (FreeBSD would be a great candidate), where you can safely edit FD_SETSIZE. There's some general info about FD_SETSIZE, e.g. [url="https://books.google.it/books?id=ptSC4LpwGA0C&pg=PA166&lpg=PA166&source=bl&ots=Ks1GOjarQm&sig=fdcK3GNGr0-rkGO8dJe6Sr7IpFc&hl=en&sa=X&ei=hUWZVJ_cOInkaJrjgOgE&redir_esc=y#v=onepage&q&f=false"]in this book[/url].

In any case, general consensus is that select() should never be used, and poll() is better and does not use fd_set (so it is not affected by this issue). I have never tried using poll() on Hercules because I don't have any benchmark data, to ensure it won't impact performance negatively. If you want to give it a try, poll() is pretty much a drop-in replacement for select(), so the source code edits should be trivial. (If it works and performs well, I accept pull requests!)
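As a rough idea of what a poll()-based wait could look like, here is a minimal sketch. It is not the Hercules socket code; wait_readable and MAX_WATCHED are hypothetical names chosen for the example, and the cap is arbitrary rather than tied to FD_SETSIZE.

```c
#include <poll.h>
#include <stddef.h>

#define MAX_WATCHED 4096 /* hypothetical cap for this sketch only */

/*
 * Waits up to timeout_ms for any of the n descriptors in fds to become
 * readable (or to report hang-up/error, so the caller's read can notice).
 * Stores those descriptors in ready (capacity n) and returns how many were
 * stored (0 if none, e.g. on timeout), or -1 on error or if n is too large.
 */
int wait_readable(const int *fds, size_t n, int timeout_ms, int *ready)
{
    struct pollfd pfds[MAX_WATCHED];
    size_t i;
    int count = 0;
    int rc;

    if (n > MAX_WATCHED)
        return -1;

    for (i = 0; i < n; i++) {
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;
        pfds[i].revents = 0;
    }

    rc = poll(pfds, (nfds_t)n, timeout_ms);
    if (rc <= 0)
        return rc; /* 0 = timeout, -1 = error (see errno) */

    for (i = 0; i < n; i++) {
        if (pfds[i].revents & (POLLIN | POLLHUP | POLLERR))
            ready[count++] = fds[i];
    }
    return count;
}
```

Because poll() takes an array of struct pollfd instead of a bitmap, the descriptor values themselves are not limited by FD_SETSIZE; only the process's open-file limit applies.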

HotshotGG - Dec 23, 2014 11:10

Great, thanks for the update Haru.

Lassander - Dec 24, 2014 13:26

I'd like to add, since Hotshot/Jattington hasn't mentioned it, that he's using Harmony, so that might explain some of the memory/CPU issues. It might also mean that core.c is altered.

Also, after talking to an unnamed person on staff: they don't think it's running Windows, and that debug log says it's running on a RHEL-like system. I'm not sure why it'd say that if it was on Windows. (Though I can see why for other RPM-based operating systems.)

This post has been edited by Lassander on Dec 24, 2014 13:45

Cookie - Dec 24, 2014 21:48

Core.c is unaltered. Yes, Harmony itself might be a source of the high CPU usage. However, there still is an issue of FD maximums which is even shown in the map console output. Of course we're not on a Windows based system. The actual crashes themselves were from former developers (prior to myself joining the team) adding a definition to socket.h with a define that overrides the system FD_SETSIZE macro which causes instability. I actually didn't remember that this was added at all. So, when I realized that... I removed the definition, recompiled and we haven't crashed since then. Although now we're receiving the proper error of 24 and "too many files" on the map-server console when we go past like 1,030+ players.

I've spoken with Trojal and others. I had already implemented the methods to increase the maximum number of file descriptors (for example, ulimit -n, limits.conf nofile, and the sysctl maximums). After doing so and rebooting, the servers still only respect the soft limit imposed by the FD_SETSIZE in the getrlimit areas of the socket. I don't know that changing to a BSD-based system would be the best option, and Trojal and I spoke about rewriting the socket code (which I'm sure would incorporate the select -> poll change Haru pointed out).
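For reference, this is roughly how a process can inspect and raise its own open-file limit with getrlimit/setrlimit. It is a generic sketch, not the getrlimit code in the Hercules socket layer; raise_nofile_soft_limit is a hypothetical name.

```c
#include <sys/resource.h>

/*
 * Hypothetical helper: raises the soft RLIMIT_NOFILE limit of the calling
 * process up to its hard limit. On success returns 0 and stores the
 * resulting limits in *out; returns -1 if getrlimit or setrlimit fails.
 */
int raise_nofile_soft_limit(struct rlimit *out)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;

    if (rl.rlim_cur != rl.rlim_max) {
        rl.rlim_cur = rl.rlim_max; /* soft limit may go up to the hard limit */
        if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
            return -1;
    }

    *out = rl;
    return 0;
}
```

Raising this limit only controls how many descriptors the process may open. It does not make select() safe for descriptors at or above FD_SETSIZE.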

Anyway, any help would be appreciated. :) Socket counterparts of *athena aren't my specialty so I'm always open for input or guidance on the issue. I feel this will benefit the entire Hercules emulator itself as we can share benchmark data, core code changes, and what-not. We're quite a large server and I'd be glad to guinea pig / contribute my code back to the project (@Haru).

-[member=Cookie]

This post has been edited by Cookie on Dec 24, 2014 21:52

Cookie - Jan 4, 2015 16:35

Here's my workaround for those on RHEL/Linux distributions:

Manually setting the FD_SETSIZE in the OS system headers (included by the emulator code) to a higher number of 8192 (for example) resolved the crashes. Keep in mind, you'll need to restart the server as well as make clean, recompile the server and ensure OS security settings are altered from their defaults as described below.

Additionally, ulimit will need to have hard and soft changed to a higher number (I used 65535) for the user of the running emulator process. You can do so for the current session with ulimit -Hn 65535, ulimit -Sn 65535, or ulimit -n 65535. To permanently set these values (which is ideal instead of manually setting for each session), edit /etc/security/limits.conf. You'll need to manually set for the current session after that file is edited. Furthermore, I manually set fs.file-max to a higher number, too. You'll want to sysctl -p to reload the file after editing. (Sidenote: There's all sorts of other settings you could Google around for and set in sysctl to optimize other core parts of the OS. I'd only advise doing so if you know what you're doing. For a server our size, I researched a lot initially)

Lastly, I'd advise after recompiling and running the server to verify the soft and hard limits actually being set in the *-server processes. Locate the process number by using netstat -tulpn | grep "port of one of the servers". Using the PID, you can cat /proc/<pid>/limits and read it from there.

Since then, all of the related crashes have ceased. The key here was actually making sure each place was changed properly, and also that you're not setting the FD_SETSIZE in the emulator code, which I've heard of others doing (it isn't ideal and doesn't make much sense when system header files with the OS-level default of 1024 are included throughout the code). The ulimit and sysctl file max limits are just to ensure the process isn't restricted by security measures.

Honestly, when I have time, I plan to re-write the socket layer (if Trojal doesn't beat me to the punch) and share the code with the project (perhaps packaged as an optional macro) using poll() instead of select(). Ideally, a lot of the layer would need to be optimized and re-written anyway. I think it would help the project and give extra options because I'm under the belief all of these changes to the OS / security shouldn't be necessary for an application to fully function especially as we're cross-platform.

Thanks everyone for all the ideas and advice on the issue! Sadly, I originally knew of this workaround and was hesitant to use it at first. :P Guess I should follow my knowledge and gut in the future, lol.

This post has been edited by Cookie on Jan 4, 2015 16:42

csnv - Jan 4, 2015 17:14

[quote name="Cookie" timestamp="1420389351"]
Manually setting the FD_SETSIZE in the OS system headers (included by the emulator code) to a higher number of 8192 (for example) resolved the crashes. Keep in mind, you'll need to restart the server as well as make clean, recompile the server and ensure OS security settings are altered from their defaults as described below.[/quote]
Changing the limit of FD_SETSIZE in the headers has no effect on the system unless you recompile every program in the OS (kernel included). You could just leave it redefined in the Hercules source code.

All the limit changes in the OS environment are actually common knowledge when you deal with FD_SETSIZE. It was assumed the bug happened even with those limits correctly configured. It's not a workaround, it's the way file descriptor limits are modified in every Linux distribution.

This post has been edited by csnv on Jan 4, 2015 17:23

Haru - Jan 5, 2015 0:04

[quote name="csnv" timestamp="1420391671"]
All the limit changes in the OS environment are actually common knowledge when you deal with FD_SETSIZE. It was assumed the bug happened even with those limits correctly configured. It's not a workaround, it's the way file descriptor limits are modified in every Linux distribution.[/quote]In fact, by increasing FD_SETSIZE, you're triggering implementation-defined behavior. In other words, by doing so, you're signing up for an appointment with [url="http://catb.org/jargon/html/N/nasal-demons.html"]nasal demons[/url] to happen when you least expect it. Or worse, they may be hiding there, smashing your stack, overrunning buffers, playing with values in your program's memory, and you'll never know.

On some systems, there is a guarantee that it won't happen, because the developers decided they'd handle that case. On some others, there isn't.

As long as the user knows their own system well, and they're sure it won't cause any harm, then the suggested workaround is perfectly fine. For what it's worth, the Linux kernel (still as of version 3.x) defines __FD_SETSIZE to 1024 internally, overriding whatever value is set in the system headers.

Since the issue has been solved, I'll wontfix it. Please reopen if needed.