Issue information

Issue ID
#3250
Status
Fixed
Severity
Critical
Started
Hercules Elf Bot
Jun 15, 2009 0:18
Last Post
Hercules Elf Bot
Jun 15, 2009 0:18
Confirmation
N/A

Hercules Elf Bot - Jun 15, 2009 0:18

Originally posted by [b]theultramage[/b]
http://www.eathena.ws/board/index.php?autocom=bugtracker&showbug=3250

An ensemble produces a skill unit group that is shared between both performers. In case either of them cannot or will not continue the performance, the land skill needs to be removed and the partner's SC_DANCING status terminated.

We have identified a defect in the ensemble termination mechanism where the server will attempt to remove the land skill while it is already processing its removal, causing a server crash, unless the server is compiled without --enable-debug / DEBUG_MEMMGR and does not erase deallocated data.

What follows is a root cause analysis of the crash, then a revision history examination, and finally a listing of related bugreports.
CODE
scenario: bard + dancer play an ensemble, bard is the initiator, dancer runs out of sp

execution trace:
1. dancer gets status_change_end(SC_DANCING)
2. dancer's sce->val2 = 0 (disassociates the group_id)
3. skill_delunitgroup starts erasing land units, one by one (there are 81), using skill_delunit()
4. when i = 40, step 5 runs for the bard
5. whenever skill_delunit() runs, it calls skill_unit_effect(type:4) on people standing on the cell, which invokes skill_unit_onleft
6. //Check if you just stepped out of your ensemble skill to cancel dancing. [Skotlex]
7. skill_stop_dancing() sets sce->val2 = 0 and calls status_change_end(SC_DANCING) - which does nothing important
8. at the end of skill_stop_dancing(), there is
    if (group)
        skill_delunitgroup(NULL, group);
   which wipes out the rest of the group and ers_free's it
9. at this point control returns to skill_delunit() from step 4, where the data has already been trashed
10. an attempt is made to free the unit, which causes a segfault.
The execution trace's symmetric when the bard's the one that runs out of sp.

This crash is ancient. I have managed to trace it to as early as r6000, crashing along the same execution path.
However, in r5000 it goes somewhat differently. Back then, skill_delunitgroup() didn't have a 'src' argument, and always executed with the group's owner as source. In our scenario that would be the bard, who would still have SC_DANCING active, and therefore this code would run:
CODE
            if (sc_data && sc_data[SC_DANCING].timer != -1)
            {
                sc_data[SC_DANCING].val2 = 0; //This prevents status_change_end attempting to redelete the group. [Skotlex]
                status_change_end(src,SC_DANCING,-1);
            }
Notice how the comment talks about the exact circumstances of this crash?
However, on recent eathena, in our scenario this function executes with the dancer as 'src' instead of the bard, skipping over this code. However, if the initiator is the one disrupting the ensemble, then this code will again be skipped, producing the abovementioned scenario too. So, it is quite possible that this thing was broken long ago, r2679, or perhaps never worked correctly at all.

People who reported this crash:
2006-09-02 [858x W32 Sql] Map-server_sql Shutdown - user reports skill_delunitgroup group not found on loki's veil
2006-09-10 [svn All] Combo Skill (dancer+bard) - random person reports skill_delunitgroup errors on ensembles
2007-02-08 [svn 9798] Party Problems - Zephyrus mentions skill_delunitgroup group not found on loki's veil
2008-01-17 Unhandled Exception Occured - same crash
2008-01-19 Server Crashing - src_id: -572662307 skill_id: -8739
2008-02-09 Server Crashing (including Backtrace) - crashing by jobchanging
2008-03-11 Skill_delunitgroup Errors - same crash
2008-11-23 map server crash bug - crashing by logging out

The two near the end are pretty interesting, since they show it might be possible to reproduce by logging out or jobchanging, not just running out of sp.

This post has been edited by theultramage: Jun 15 2009, 05:06 AM