Issue information

Issue ID
#6123
Status
Fixed
Severity
High
Started
Hercules Elf Bot
Jun 27, 2012 13:55
Last Post
Hercules Elf Bot
Jul 22, 2012 23:30
Confirmation
Yes (1)
No (0)

Hercules Elf Bot - Jun 27, 2012 13:55

Originally posted by [b]Cookie[/b]
Program terminated with signal 11, Segmentation fault.
#0 0x0000000000542857 in battle_delay_damage_sub (tid=<value optimized out>,
tick=1672996460, id=2015079, data=139870876310216) at battle.c:206
206 if( id == dat->src->id &&
Missing separate debuginfos, use: debuginfo-install openssl-1.0.0-20.el6_2.5.x86_64 pcre-7.8-3.1.el6.x86_64
(gdb) bt full
#0 0x0000000000542857 in battle_delay_damage_sub (tid=<value optimized out>,
tick=1672996460, id=2015079, data=139870876310216) at battle.c:206
dat = 0x7f3639e59ac8
target = 0x7f363d886cec
#1 0x0000000000586daf in do_timer (tick=1672996460) at timer.c:370
tid = 4528
diff = 0
__FUNCTION__ = "do_timer"
#2 0x000000000058432c in main (argc=1, argv=0x7fff65b18cb8) at core.c:300
next = <value optimized out>
I've dug through the source (I've got decent C experience), and I'm completely stumped on why this is happening. Here's the core dump and I'll provide anything else necessary.

Hercules Elf Bot - Jun 27, 2012 14:02

Originally posted by [b]Ind[/b]
dat->src most likely null for some super weird reason o_O the following probably fixes
[code]
if( id == dat->src->id &&[/code]
to
[code]
if( dat->src && id == dat->src->id &&[/code]
i wouldn't call this a final fix though since dat->src being broken means something else is breaking it

This post has been edited by Ind on Jun 27, 2012 14:15

Hercules Elf Bot - Jun 27, 2012 14:55

Originally posted by [b]Cookie[/b]
Ind, my friend, you're awesome. Thanks for the quick fix. I'll apply it and keep digging around on my end. I'm curious as to what else is breaking it. /swt I just moved to rAthena... I figure I'd break something. lol.

[b]EDIT:[/b]
Not sure if this is relate-able, however, I'm receiving these errors in the console:
[code]
[Error]: map_freeblock_timer: block_free_lock(2) is invalid.
[Error]: map_freeblock_timer: block_free_lock(1) is invalid.
[Error]: map_freeblock_timer: block_free_lock(3) is invalid.
[/code]
I'm going to look dig into the source more. The above map_free error isn't crashing me, fyi.

Thanks a lot. :)

This post has been edited by Cookie on Jun 27, 2012 15:23

Hercules Elf Bot - Jun 27, 2012 15:26

Originally posted by [b]Ind[/b]
these errors means the block lock reference counter is mismatching at some point; its not related to the snippet i provided you above however since it doesn't keep any locks unchanged

Hercules Elf Bot - Jun 27, 2012 15:37

Originally posted by [b]Cookie[/b]
Thanks for the info. :)

Hercules Elf Bot - Jun 28, 2012 0:39

Originally posted by [b]Cookie[/b]
[CODE]
Program terminated with signal 11, Segmentation fault.
#0 0x000000000054285c in battle_delay_damage_sub (tid=<value optimized out>,
tick=1713031938, id=2014330, data=140004230941104) at battle.c:206
206 if( dat->src && id == dat->src->id &&
Missing separate debuginfos, use: debuginfo-install openssl-1.0.0-20.el6_2.5.x86_64 pcre-7.8-3.1.el6.x86_64
(gdb) bt full
#0 0x000000000054285c in battle_delay_damage_sub (tid=<value optimized out>,
tick=1713031938, id=2014330, data=140004230941104) at battle.c:206
dat = 0x7f554673c9b0
target = 0x7f554650ecfc
#1 0x0000000000586dbf in do_timer (tick=1713031943) at timer.c:370
tid = 1060
diff = -5
__FUNCTION__ = "do_timer"
#2 0x000000000058433c in main (argc=1, argv=0x7ffffa0448d8) at core.c:300
next = <value optimized out>
(gdb)
[/CODE]

Any ideas? :)

With installed debug infos, here's a better error dump.
[CODE]

Program terminated with signal 11, Segmentation fault.
#0 0x000000000054285c in battle_delay_damage (tick=0, amotion=2014330,
src=0x7f554650ecfc, target=0x7f554673c9b0, attack_type=1713032137,
skill_id=1713031938, skill_lv=-100374869, damage=2376, dmg_lv=4194584792,
ddelay=1) at battle.c:263
263 if (src->type != BL_PC && amotion > 1000)
Missing separate debuginfos, use: debuginfo-install openssl-1.0.0-20.el6_2.5.x86_64 pcre-7.8-3.1.el6.x86_64
(gdb) bt full
#0 0x000000000054285c in battle_delay_damage (tick=0, amotion=2014330,
src=0x7f554650ecfc, target=0x7f554673c9b0, attack_type=1713032137,
skill_id=1713031938, skill_lv=-100374869, damage=2376, dmg_lv=4194584792,
ddelay=1) at battle.c:263
dat = <value optimized out>
sc = <value optimized out>

[/CODE]
I've altered 263 to reflect -
[code]
if (dat->src && src->type != BL_PC && amotion > 1000)
[/code]
I will see what happens. Let me know if you have any ideas of what it is.

Before I added the above change, I received this crash:
[code]
Program terminated with signal 11, Segmentation fault.
#0 mob_log_damage (md=0x7f0898dfaafc, src=0x7f089726ce34, damage=153)
at mob.c:2157
2157 char_id = hd->master->status.char_id;
Missing separate debuginfos, use: debuginfo-install openssl-1.0.0-20.el6_2.5.x86_64 pcre-7.8-3.1.el6.x86_64
(gdb) bt full
#0 mob_log_damage (md=0x7f0898dfaafc, src=0x7f089726ce34, damage=153)
at mob.c:2157
hd = 0x7f089726ce34
char_id = 0
flag = 1
#1 0x00000000004b1910 in mob_damage (md=0x7f0898dfaafc, src=0x7f089726ce34,
damage=<value optimized out>) at mob.c:2268
No locals.
#2 0x000000000048f055 in status_damage_ (src=0x7f089726ce34,
target=0x7f0898dfaafc, hp=<value optimized out>, sp=0, walkdelay=576,
flag=0, skill=0) at status.c:1190
status = 0x7f0898dfad9c
sc = 0x7f0898dfadf4
__FUNCTION__ = "status_damage_"
#3 0x00000000005429ad in battle_delay_damage_sub (tid=<value optimized out>,
tick=1718001439, id=110027904, data=139674899129880) at battle.c:212
dat = 0x7f0898bf2618
target = 0x7f0898dfaafc
#4 0x0000000000586e6f in do_timer (tick=1718001446) at timer.c:370
tid = 4634
diff = -7
__FUNCTION__ = "do_timer"
#5 0x00000000005843ec in main (argc=1, argv=0x7fff480916e8) at core.c:300
---Type <return> to continue, or q <return> to quit---
next = <value optimized out>

[/code]

Side note, I'm running a server with 200+ players. So, it's hard to reproduce some of this unless it happens live. :x

This post has been edited by Cookie on Jun 28, 2012 2:07

Hercules Elf Bot - Jun 28, 2012 2:36

Originally posted by [b]Ind[/b]
I think the lock mismatch issue is what is giving your problems; you should check anywhere in your custom code where it uses map_freeblock_lock/unlock -- the issue being the mismatch causes the system to delete units before they're done being used by some functions, which ends up causing these weird crashes, in my opinion.

Hercules Elf Bot - Jun 28, 2012 2:40

Originally posted by [b]Ind[/b]
and by the way this what you did here
[code]
if (dat->src && src->type != BL_PC && amotion > 1000)[/code]
should at be
[code]
if (src && src->type != BL_PC && amotion > 1000)[/code]

Hercules Elf Bot - Jun 28, 2012 2:43

Originally posted by [b]Ind[/b]
you should go after the lock problem though, because it'll break code over many other places oo not only these (as you see mob_log_damage also joined the crew)

Hercules Elf Bot - Jun 28, 2012 2:45

Originally posted by [b]Cookie[/b]
[font=tahoma,geneva,sans-serif][size=3]I haven't really touched the source, to be honest, since I moved to rAthena. :/ Nothing major, besides re-adding bindatcmd to the atcommand.c. What should I chase in my NPC scripts? I've got a helluva of modifications npc-wise.[/size][/font]

Hercules Elf Bot - Jun 28, 2012 3:01

Originally posted by [b]Ind[/b]
ah i thought you had a highly customized source or something, as soon as I read your last post I jumped to the code editor and have gone through all of the map_freeblock_lock/unlock instances and found 2 mismatches. one related to a status change of a elemental (sorc summon) skill and other to RETURN_TO_ELDICASTES and ALL_GUARDIAN_RECALL skills. The fixes are available in [rev=16349] please let me know if the crashes stop or continue.

Hercules Elf Bot - Jun 28, 2012 7:02

Originally posted by [b]Cookie[/b]
[code]
#0 0x000000000054290c in battle_delay_damage_sub (tid=<value optimized out>,
tick=1736408851, id=2009430, data=140350851725744) at battle.c:206
206 if( dat->src && id == dat->src->id &&
Missing separate debuginfos, use: debuginfo-install openssl-1.0.0-20.el6_2.5.x86_64 pcre-7.8-3.1.el6.x86_64
(gdb) bt full
#0 0x000000000054290c in battle_delay_damage_sub (tid=<value optimized out>,
tick=1736408851, id=2009430, data=140350851725744) at battle.c:206
dat = 0x7fa5faa939b0
target = 0x7fa5e3e57714
#1 0x0000000000586e6f in do_timer (tick=1736408851) at timer.c:370
tid = 1782
diff = 0
__FUNCTION__ = "do_timer"
#2 0x00000000005843ec in main (argc=1, argv=0x7fff04e0f9f8) at core.c:300
next = <value optimized out>
[/code]

Same thing, again. Any ideas?

This post has been edited by Cookie on Jun 28, 2012 7:06

Hercules Elf Bot - Jun 28, 2012 9:03

Originally posted by [b]Ind[/b]
[s]that last gdb report, was it updated to 16349?[/s] i was told over irc it was.

This post has been edited by Ind on Jun 28, 2012 9:24

Hercules Elf Bot - Jun 28, 2012 9:36

Originally posted by [b]Ind[/b]
I've used my editor to go through every instance of map_freeblock_stuff within the source as it can be seen in the screen below, I've gone through everyone of them more than once and didn't find any possible mismatching case. scary stuff.
[attachment=2447:Screen Shot 2012-06-28 at 6.35.02 AM.png]

Hercules Elf Bot - Jun 28, 2012 11:06

Originally posted by [b]Cookie[/b]
[code]
[Error]: map_freeblock_timer: block_free_lock(1) is invalid.
[/code]

On the console again.

Hercules Elf Bot - Jun 28, 2012 11:06

Originally posted by [b]Cookie[/b]
Sorry, it double posted.

Just a FYI, I also did a search like you did on my source (while I do have a few edits they don't affect these areas) and all checked out... it seems. Didn't find anything returning without being unlocked, et cetera. :/ Still getting that console error... scary stuff (as Ind says, XD)

This post has been edited by Cookie on Jun 28, 2012 11:17

Hercules Elf Bot - Jun 28, 2012 12:29

Originally posted by [b]Wildcard[/b]
probably r15314:
the check removed there
[code]map_id2bl(id) == dat->src[/code]
used to ensure that the source pointer is not dangling. delayed reflect, source logging out, boom. null checks won't help there.

Pretty sure that's it, but better look again to make sure :D

This post has been edited by Wildcard on Jun 28, 2012 12:31

Hercules Elf Bot - Jun 28, 2012 13:27

Originally posted by [b]Cookie[/b]
Makes sense. I'll test this out. :)

Hercules Elf Bot - Jun 28, 2012 13:56

Originally posted by [b]Cookie[/b]
[code][Error]: map_freeblock_timer: block_free_lock(1) is invalid.
[/code]

We're getting more of them in the console all showing (1).

[code]
if ( target && dat && target->prev != NULL && !status_isdead(target) ) {
if( dat->src && map_id2bl(id) == dat->src && id == dat->src->id &&
[/code]
Edit I made.

This post has been edited by Cookie on Jun 28, 2012 13:59

Hercules Elf Bot - Jun 28, 2012 14:56

Originally posted by [b]Ind[/b]
[quote name='Wildcard' timestamp='1340886598' post='11346']
probably r15314:the check removed there[code]map_id2bl(id) == dat->src[/code]used to ensure that the source pointer is not dangling. delayed reflect, source logging out, boom. null checks won't help there.Pretty sure that's it, but better look again to make sure :D
[/quote]
he has pointers broken in different areas and many map_freeblock_invalid messages within the console. as for r15314 xazax and i have tested several scenarios prior to getting there, it'd not be enough to crash. when the pointer gets freed screwed or whatever the dat->src->id would give some negative/random value, never null.

This post has been edited by Ind on Jun 28, 2012 14:57

Hercules Elf Bot - Jun 28, 2012 15:58

Originally posted by [b]Wildcard[/b]
[quote name='Ind' timestamp='1340895383' post='11356']
he has pointers broken in different areas and many map_freeblock_invalid messages within the console. as for r15314 xazax and i have tested several scenarios prior to getting there, it'd not be enough to crash. when the pointer gets freed screwed or whatever the dat->src->id would give some negative/random value, never null.
[/quote]

not sure about the freeblock_lock stuff, but a dangling pointer should not be dereferenced, period. it [b]may[/b] crash at any given time, and not crash at others, which makes it unpredictable. the segfault clearly resulted from it. it's also different between OSes/compilers/whatnot, and generally unpredictable.

This post has been edited by Wildcard on Jun 28, 2012 15:59

Hercules Elf Bot - Jun 28, 2012 20:08

Originally posted by [b]Ind[/b]
[quote name='Wildcard' timestamp='1340899139' post='11357']
[quote name='Ind' timestamp='1340895383' post='11356']he has pointers broken in different areas and many map_freeblock_invalid messages within the console. as for r15314 xazax and i have tested several scenarios prior to getting there, it'd not be enough to crash. when the pointer gets freed screwed or whatever the dat->src->id would give some negative/random value, never null.[/quote]not sure about the freeblock_lock stuff, but a dangling pointer should not be dereferenced, period. it [b]may[/b] crash at any given time, and not crash at others, which makes it unpredictable. the segfault clearly resulted from it. it's also different between OSes/compilers/whatnot, and generally unpredictable.
[/quote]
thats what the whole lock stuff is for, when it breaks (or say it wasn't available) these pointer checks would be required all over the place.

Hercules Elf Bot - Jun 28, 2012 21:02

Originally posted by [b]Wildcard[/b]
[quote name='Ind' timestamp='1340914138' post='11361']
thats what the whole lock stuff is for, when it breaks (or say it wasn't available) these pointer checks would be required all over the place.
[/quote]

said lock is not active when that first check happens, though (obviously, since it's "asyncronous", evoked by a timer) dat->src does get free'd whenever the disconnect code is done cleaning up. if it has not yet been free'd, it would not contain garbage data, either. maybe I'm still missing something but the more I look at it, the more I think not :D

Hercules Elf Bot - Jun 29, 2012 5:52

Originally posted by [b]Ind[/b]
oh, you're absolutely right. whops /me shame

Hercules Elf Bot - Jul 1, 2012 11:21

Originally posted by [b]Ind[/b]
Should be fully fixed in [rev=16359]

Hercules Elf Bot - Jul 3, 2012 17:05

Originally posted by [b]Wildcard[/b]
..I'm not sure. My point was we need [i]more[/i] of these mapid lookups, not less :D
From my understanding, for [b]any[/b] bl somehow passed to a timer, we need to perform a lookup, or run the risk of dereferencing a dangling pointer, and subsequently the possibility of an access violation/segfault.
For that matter, we might as well never pass a bl* directly to a timer, but instead always just save the ID and perform said lookup, just for clarity's sake. As of r16359, I can make both dat->src and dat->target point to garbage memory, which is then dereferenced and should mean it can potentially crash at any time. Pending confirmation from users here. Seems to be easier to reproduce on *nixes. Another curious thing I noticed regarding this, the old code (pre-15314) also checked against the case that the map ID had been re-assigned. Not entirely sure of the implications here, though.

Bottom line, lets see if there are any more crashes :D

Hercules Elf Bot - Jul 8, 2012 21:36

Originally posted by [b]Cookie[/b]
Wildcard, you're right. Without the map_id2bl(id) == dat->src in the if statement, it still crashes. XD Didn't take long for my server to BOOM!

Hercules Elf Bot - Jul 8, 2012 22:21

Originally posted by [b]Cookie[/b]
[code]
/debug/lib64/libnss_files-2.12.so.debug...done.
done.
Loaded symbols for /lib64/libnss_files-2.12.so
Core was generated by `./map-server_sql'.
Program terminated with signal 11, Segmentation fault.
#0 0x0000000000542b74 in battle_delay_damage_sub (tid=<value optimized out>,
tick=2654204114, id=110012417, data=139743418977048) at battle.c:204
204 if ( dat && dat->target && map_id2bl(id) == dat->src && dat->target->prev != NULL && !status_isdead(dat->target) ) {
Missing separate debuginfos, use: debuginfo-install openssl-1.0.0-20.el6_2.5.x86_64 pcre-7.8-3.1.el6.x86_64
(gdb) bt full
#0 0x0000000000542b74 in battle_delay_damage_sub (tid=<value optimized out>,
tick=2654204114, id=110012417, data=139743418977048) at battle.c:204
dat = 0x7f188cd90b18
#1 0x00000000005870cf in do_timer (tick=2654204123) at timer.c:370
tid = 2608
diff = -9
__FUNCTION__ = "do_timer"
#2 0x000000000058464c in main (argc=1, argv=0x7fff96141868) at core.c:300
next = <value optimized out>

[/code]

Any ideas? I applied the revision above too.

This post has been edited by Cookie on Jul 8, 2012 22:24

Hercules Elf Bot - Jul 9, 2012 12:10

Originally posted by [b]Ind[/b]
that is modified. use the latest code available in battle.c

Hercules Elf Bot - Jul 22, 2012 22:39

Originally posted by [b]Cookie[/b]
Reproduced the issue again on Divinity. With around 400 online, someone reproduced the issue. I've added the check that prevents this in [rev=16476] thanks to WildCard! :)

Hercules Elf Bot - Jul 22, 2012 23:30

Originally posted by [b]Ind[/b]
pointers are never deferenced in rathena (unless you manually var = null), they end up with random/corrupted memory (never null after being afreed) -- I'm not saying it couldn't have fixed your crashing, however something else should be crashing it (within the bounds of the damage delay system). also r16476 broke the official behavior on reflect shield.

This post has been edited by Ind on Jul 22, 2012 23:37