19 Nov, 2013, snoopy89 wrote in the 1st comment:
Votes: 0
I've been trying to get the debugger working, but I don't know what to do. Can anyone help me?
19 Nov, 2013, Runter wrote in the 2nd comment:
Votes: 0
What did you try?

This might be relevant info even if you aren't using CircleMUD. http://www.circlemud.org/cdp/gdb/gdb_8.h...

Or more broadly just read about gdb.
19 Nov, 2013, snoopy89 wrote in the 3rd comment:
Votes: 0
The only thing I can get working is the gdb command itself. Am I supposed to add something to the codebase?
19 Nov, 2013, Kaz wrote in the 4th comment:
Votes: 0
You must:
a) Ensure your codebase is compiled with the -g flag, preferably -ggdb (this makes things like function names visible in the debugger).
b) Run gdb from the directory in which the executable expects to run (IIRC, DIKU derivatives run from the ../area directory).
c) Tell gdb which executable to debug, then run it with any arguments, e.g. "gdb ../src/merc", followed by "run 4000" at the (gdb) prompt.

Do stuff until a crash.

d) type backtrace, and marvel at the numbers (a full example session is sketched below).
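For a stock Merc-style layout the whole session might look like the following (the binary path ../src/merc and port 4000 are assumptions; substitute whatever your codebase uses):

cd area
gdb ../src/merc
(gdb) run 4000
[play until it crashes]
Program received signal SIGSEGV, Segmentation fault.
(gdb) backtrace

To get the -g/-ggdb flag in, edit the compiler-flags line near the top of src/Makefile (usually called CFLAGS or C_FLAGS), then do a make clean and rebuild so everything is recompiled with debugging symbols.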
19 Nov, 2013, Zeno wrote in the 5th comment:
Votes: 0
20 Nov, 2013, Splork wrote in the 6th comment:
Votes: 0
This post brought back a memory of an email that had been buried in gmail for the past 9 years. It may not be helpful to everyone, but it may be to someone, so here it is. The email is from one of our creators named Frobisher, who was also the developer of wintin95 and wintin.net. Obviously the directories and files point to our game, but the general idea is much the same for any MUD.



I thought it might be useful to summarize how gdb can be used to look at
the cause of a crash.

Everyone who writes code for Sloth should learn how to do this.
Otherwise, how can you check that a crash wasn't caused by your own
code? So here's how you do it.

The core dumps are kept in directories on game.slothmud.org under
/home/sloth/live/corefiles. The directory names reflect the time and
date of the crash, for example crash.0409230504 happened on the 23rd of
September 2004 (040923) at 0504 Central Sloth time. To start looking at
the crash, cd to the relevant directory and look at its contents:

[sloth@game corefiles]$ cd crash.0409230504
[sloth@game crash.0409230504]$ ls
core.19381.gz lastlog.txt.gz sloth.gz
If the files are gzipped like this, you'll need to decompress them:

[sloth@game crash.0409230504]$ gunzip *
[sloth@game crash.0409230504]$ ls
core.19381 lastlog.txt sloth

core.19381 is a dump of the memory of the Sloth process at the moment
the crash happened. The number on the end is the process ID; it changes
from crash to crash, and you don't need to worry about it.

sloth is a copy of the sloth binary (the compiled code) that was running
at the time.

lastlog.txt is (with any luck) a copy of the last few lines of the log
file, as sent to the coders mailing list at the time of the crash. More
often than not, though, it seems to be empty.
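
(If no core files are being produced at all, the usual culprit is the
shell's core size limit. A standard 'ulimit -c unlimited' in the startup
script, before the binary is launched, sorts that out on Linux:

ulimit -c unlimited

That's generic advice, nothing Sloth-specific.)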

To analyse the crash, you start gdb by typing 'gdb sloth corefile':

[sloth@game crash.0409230504]$ gdb sloth core.*
GNU gdb Red Hat Linux (5.3post-0.20021129.18rh)
Copyright 2003 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you
are
welcome to change it and/or distribute copies of it under certain
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for
details.
This GDB was configured as "i386-redhat-linux-gnu"…

warning: exec file is newer than core file.
Core was generated by `/home/sloth/live/bin/sloth -c
/home/sloth/live/lib/sloth.config 6101'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /lib/tls/libm.so.6…done.
Loaded symbols for /lib/tls/libm.so.6
Reading symbols from /lib/tls/libc.so.6…done.
Loaded symbols for /lib/tls/libc.so.6
Reading symbols from /lib/libcrypt.so.1…done.
Loaded symbols for /lib/libcrypt.so.1
Reading symbols from /usr/lib/libz.so.1…done.
Loaded symbols for /usr/lib/libz.so.1
Reading symbols from /lib/ld-linux.so.2…done.
Loaded symbols for /lib/ld-linux.so.2
#0 0x080caf08 in close_socket (d=0x11577608) at comm.c:2701

warning: Source file is more recent than executable.

2701 d->snoop.snooping->desc->snoop.snoop_by = 0;
(gdb)

You don't need to worry about most of the above; as long as it looks
something like this, you're OK. Note that gdb warns that the source code
is more recent than the executable. Gdb is going to attempt to tell you
which lines of code caused the crash (we'll see how in a minute). To do
this, it refers to the source files at /home/sloth/sloth. What it's
saying here is that those source files have changed since this copy of
the code was compiled, so it might not always get things right. Usually
this won't be a problem, and it's generally too much hassle to fix, so
the best thing to do is to press on, but bear it in mind.

Gdb is an interactive program, so you get a prompt at the bottom of the
screen. You can type 'help' for commands, but most of the commands are
used for debugging programs that are actually running at the time. We're
just looking at a core file, and we don't need many commands.

Just above the prompt in the output above, you can see that it says:

#0 0x080caf08 in close_socket (d=0x11577608) at comm.c:2701
2701 d->snoop.snooping->desc->snoop.snoop_by = 0;

(I have left out the warning here for clarity)

Gdb is telling you here what line of code was being executed when the
crash happened. If you look much further above, you'll see it says
'Program terminated with signal 11, Segmentation fault.'.

'Segmentation fault' means that the code tried to refer to a piece of
memory that wasn't available. Typically this happens because a pointer
has a bad value. 98 crashes out of 100 are segmentation faults, and 90
times out of 100 the pointer has an obviously silly value, like 0 or -1.
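
As a tiny standalone illustration (a made-up sketch, not Sloth code),
dereferencing a null pointer is exactly this kind of bug:

#include <stddef.h>

struct snoop_data
{
    struct snoop_data *snoop_by;  /* stand-in for the kind of field involved below */
};

int main(void)
{
    struct snoop_data *s = NULL;  /* the 'obviously silly value' */

    s->snoop_by = NULL;           /* writes through address 0: segmentation fault */
    return 0;
}

Compile that with -ggdb, run it, and the core file it leaves behind will
show the same 'Program terminated with signal 11, Segmentation fault'
message as the session above.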

You can get a better idea of the context by asking gdb for a backtrace.
At any one time, the running code is in some function or other. That
function has been called by some other function, which was called by
some other function - and so on, all the way back to main(). A backtrace
shows you the complete sequence of calls, from main() through to the
currently running line of code. You ask for a backtrace by typing bt:

(gdb) bt
#0 0x080caf08 in close_socket (d=0x11577608) at comm.c:2701
#1 0x080c78e9 in game_loop () at comm.c:1091
#2 0x080c6be2 in run_the_game (port=6101) at comm.c:769
#3 0x080c6b24 in main (argc=4, argv=0xbfffea14) at comm.c:731
#4 0x42015704 in __libc_start_main () from /lib/tls/libc.so.6
(gdb)

Each function call in this list is called a 'frame'. Frame 0 is always
the one containing the currently running code.

You'll see that gdb is telling us file names and line numbers for each
function call, and it's also showing the arguments that were passed in
to each function. All useful stuff.

We can look in more detail at a particular frame by switching to it with
the 'frame' command. Let's look at frame 0:

(gdb) frame 0
#0 0x080caf08 in close_socket (d=0x11577608) at comm.c:2701
2701 d->snoop.snooping->desc->snoop.snoop_by = 0;

You can get a bit more context above and below the line in question with
list:

(gdb) list
2696 sprintf(buf, "maxdesc is now %d", maxdesc);
2697
2698 /* Forget snooping */
2699 if (d->snoop.snooping)
2700 {
2701 d->snoop.snooping->desc->snoop.snoop_by = 0;
2702 d->snoop.snooping = 0;
2703 }
2704
2705 if (d->character)

You can use the print command to find the value of variables at the time
of the crash. Let's start to pull apart
d->snoop.snooping->desc->snoop.snoop_by:

(gdb) print d
$2 = (struct descriptor_data *) 0x11577608
(gdb) print d->snoop.snooping
$3 = (struct char_data *) 0xff40668

These two look reasonable - valid-looking pointer values. We can have a
look at some of the values in the char_data structure to make sure it
looks OK:

(gdb) print d->snoop.snooping->player.name
$6 = 0xff40520 "Leizu"
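
If you want to see every field of the structure at once, gdb will also
dereference the pointer and print the whole thing (the output can be
long, so it isn't shown here):

(gdb) print *d->snoop.snooping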

Let's have a look at d->snoop.snooping->desc:

(gdb) print d->snoop.snooping->desc
$4 = (struct descriptor_data *) 0x0

OK, that's why we crashed. d->snoop.snooping->desc points to 0, which is
an invalid memory location. So when the code says
d->snoop.snooping->desc->snoop.snoop_by = 0, it tries to assign 0 to
part of a structure sitting at memory location 0, realises that there
isn't any memory at that address, and crashes.
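
The obvious defensive fix - and this is just a sketch, not the actual
change that went into Sloth - is to guard the desc pointer the same way
the code already guards snooping:

/* Forget snooping */
if (d->snoop.snooping)
{
    if (d->snoop.snooping->desc)  /* guard the dereference that crashed */
        d->snoop.snooping->desc->snoop.snoop_by = 0;
    d->snoop.snooping = 0;
}

Of course, the real question is how a snooping character ended up with a
null desc in the first place: a guard like this stops the crash, but it
may just be hiding the underlying bug.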

From here on in, it's detective work. You don't need to know anything in
particular, or be a member of some super-secret clan or something; you
just need an inquisitive mind and a decent tool for browsing and
searching the source code: something that makes it easy to do text
searches (grep, for example - see below) so that you can work out where
things are defined in the code and where they are used. The Sloth code
is too big for one person to remember it all now, so when any of us
debug a crash, unless it happens to be something we're working on at the
time, we almost always start off by reading the structure definitions
and the code concerned, and working out from there what's going on. The
more you read the code, the more you'll understand it.
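
For example, to find where snoop_by is defined and everywhere it is
touched, something like this (run from the source directory; exact
paths will vary) does the job:

grep -n 'snoop_by' *.h
grep -n 'snoop_by' *.c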

Don't expect to be able to work out the cause of every crash. Sometimes
the thing that crashes the code overwrites a lot of the useful
information, and when you try to do a backtrace, you either get
something that clearly doesn't make sense, or you get a pile of question
marks from gdb. Other times, it's just too hard to work out what's gone
wrong. And sometimes the cause of the crash happened a long time before
the crash itself, so there's only limited value in looking at the crash
itself. One example of this is memory corruption. Memory corruption is
typically caused by referring to memory after it has been freed, or by
freeing something that wasn't allocated in the first place, or by
running over the end of a piece of allocated memory. These errors are
particularly difficult to track down, because the effects often don't
surface for a long time. If you find a backtrace that ends in a call to
malloc_consolidate(), you are looking at a memory corruption problem and
it's not worth examining the crash file any further.
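
For what it's worth, a trashed backtrace usually looks something like
this (an illustrative sketch, not output from a real Sloth core):

(gdb) bt
#0 0x00000000 in ?? ()
#1 0x41414141 in ?? ()
#2 0x41414141 in ?? ()
(gdb)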

I hope that you will find this helpful. Although it's a long mail, it's
easy to get started. Just do gdb sloth core.*, type bt and look at what
comes back. If you get into the habit of doing this regularly, and
reading around the lines of code that caused the crash, you'll soon get
to the point where you can start to spot patterns and kill a few bugs.

regards

Frobisher