You must:
a) Ensure your codebase is compiled with the -g flag, preferably -ggdb (this makes things like function names appear in the debugger).
b) Run gdb from the directory in which you want to run the executable (IIRC, DIKU derivatives run from the ../area directory).
c) Tell gdb to run your executable, with any arguments. The executable is named when you start gdb and the arguments go to its run command, e.g. start with "gdb ../src/merc" and then type "run 4000" at the gdb prompt.
This post brought back a memory of an email that had been buried in my Gmail for the past 9 years. It may not help everyone, but it may help someone, so here it is. The email is from one of our creators, Frobisher, who was also the developer of wintin95 and wintin.net. Obviously the directories and files point to our game, but the general idea is much the same for any codebase.
I thought it might be useful to summarize how gdb can be used to look at the cause of a crash.
Everyone who writes code for Sloth should learn how to do this. Otherwise, how can you check that a crash wasn't caused by your own code? So here's how you do it.
The core dumps are kept in directories on game.slothmud.org under /home/sloth/live/corefiles. The directory names reflect the time and date of the crash, for example crash.0409230504 happened on the 23rd of September 2004 (040923) at 0504 Central Sloth time. To start looking at the crash, cd to the relevant directory and look at its contents:
[sloth@game corefiles]$ cd crash.0409230504
[sloth@game crash.0409230504]$ ls
core.19381.gz  lastlog.txt.gz  sloth.gz

If the files are gzipped like this, you'll need to decompress them:
[sloth@game crash.0409230504]$ gunzip *
[sloth@game crash.0409230504]$ ls
core.19381  lastlog.txt  sloth
core.19381 is a snapshot of the memory of the running Sloth process at the time the crash happened. The number on the end changes: you don't need to worry about it.
sloth is a copy of the sloth binary (the compiled code) that was running at the time.
lastlog.txt is (with any luck) a copy of the last few lines of the log file, as sent to the coders mailing list at the time of the crash. More often than not, though, it seems to be empty.
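The directory naming scheme described earlier (crash. followed by YYMMDDhhmm) can be decoded mechanically. Here is a minimal sketch in C, assuming names always have exactly that form and that two-digit years mean 20xx. The names parse_crash_dir and struct crash_time are hypothetical, made up for this illustration, and are not part of the Sloth code.

```c
#include <stdio.h>

/* Hypothetical holder for the fields encoded in a crash directory name. */
struct crash_time {
    int year, month, day, hour, minute;
};

/* Decode a name such as "crash.0409230504" (crash.YYMMDDhhmm).
 * Returns 0 on success, -1 if the name does not fit the pattern.
 * Assumption: two-digit years are in the 2000s. */
int parse_crash_dir(const char *name, struct crash_time *out)
{
    int yy, mo, dd, hh, mi;
    char extra;

    /* Exactly five two-digit fields must convert; a sixth conversion
     * (%c) succeeding would mean there is trailing junk. */
    if (sscanf(name, "crash.%2d%2d%2d%2d%2d%c",
               &yy, &mo, &dd, &hh, &mi, &extra) != 5)
        return -1;

    /* Basic range checks only; days per month are not validated. */
    if (yy < 0 || mo < 1 || mo > 12 || dd < 1 || dd > 31 ||
        hh < 0 || hh > 23 || mi < 0 || mi > 59)
        return -1;

    out->year = 2000 + yy;
    out->month = mo;
    out->day = dd;
    out->hour = hh;
    out->minute = mi;
    return 0;
}
```

Calling parse_crash_dir("crash.0409230504", &t) fills t with 2004, 9, 23, 5 and 4, matching the example above. The time is still Central Sloth time; the sketch does no timezone conversion.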
To analyse the crash you start gdb by typing 'gdb sloth corefile':
[sloth@game crash.0409230504]$ gdb sloth core.*
GNU gdb Red Hat Linux (5.3post-0.20021129.18rh)
Copyright 2003 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux-gnu"…
warning: exec file is newer than core file.
Core was generated by `/home/sloth/live/bin/sloth -c /home/sloth/live/lib/sloth.config 6101'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /lib/tls/libm.so.6…done.
Loaded symbols for /lib/tls/libm.so.6
Reading symbols from /lib/tls/libc.so.6…done.
Loaded symbols for /lib/tls/libc.so.6
Reading symbols from /lib/libcrypt.so.1…done.
Loaded symbols for /lib/libcrypt.so.1
Reading symbols from /usr/lib/libz.so.1…done.
Loaded symbols for /usr/lib/libz.so.1
Reading symbols from /lib/ld-linux.so.2…done.
Loaded symbols for /lib/ld-linux.so.2
#0 0x080caf08 in close_socket (d=0x11577608) at comm.c:2701
warning: Source file is more recent than executable.
You don't need to worry about most of the above: as long as it looks something like this, you're OK. Note that gdb warns that the source code is more recent than the executable. Gdb is going to attempt to tell you which lines of code caused the crash; we'll see how in a minute. To do this, it refers to the source files at /home/sloth/sloth. What it's saying here is that those source files have changed since this copy of the code was compiled, so the source lines it shows might not always match what was actually running. Usually this won't be a problem, and it's generally too much hassle to fix, so the best thing to do is to press on but bear it in mind.
Gdb is an interactive program, so you get a prompt at the bottom of the screen. You can type 'help' for commands, but most of the commands are used for debugging programs that are actually running at the time. We're just looking at a core file, and we don't need many commands.
Just above the prompt in the output above, you can see that it says:
#0 0x080caf08 in close_socket (d=0x11577608) at comm.c:2701
2701 d->snoop.snooping->desc->snoop.snoop_by = 0;
(I have left out the warning here for clarity)
Gdb is telling you here what line of code was being executed when the crash happened. If you look further up, you'll see it says 'Program terminated with signal 11, Segmentation fault.'
'Segmentation fault' means that the code tried to refer to a piece of memory that wasn't available. Typically this happens because a pointer has a bad value. 98 crashes out of 100 are segmentation faults, and 90 times out of 100 the pointer has an obviously silly value, like 0 or -1.
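To see where 'signal 11' comes from, here is a small sketch in C for a POSIX system such as Linux. It writes through a null pointer in a child process and reports which signal killed the child. The function name signal_from_null_write is made up for this illustration. On Linux the result is SIGSEGV, which is signal number 11, the same number gdb printed above.

```c
#include <signal.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: fork a child that writes through a null pointer,
 * then return the number of the signal that terminated it.
 * Returns -1 if fork/waitpid fails or the child was not killed by a signal. */
int signal_from_null_write(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: switch off core dumps so the demo leaves no core file. */
        struct rlimit no_core = { 0, 0 };
        setrlimit(RLIMIT_CORE, &no_core);

        /* volatile stops the compiler from optimising the bad write away. */
        volatile int *volatile p = 0;
        *p = 0;            /* address 0 is not mapped: the kernel sends SIGSEGV */
        _exit(0);          /* not reached on a normal system */
    }

    /* Parent: wait for the child and inspect how it ended. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

Comparing the return value with SIGSEGV confirms that the child died of a segmentation fault. In the game the same thing happens to the whole process, and the kernel writes out the core file that gdb reads.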
You can get a better idea of the context by asking gdb for a backtrace. At any one time, the running code is in some function or other. That function has been called by some other function, which was called by some other function - and so on, all the way back to main(). A backtrace shows you the complete sequence of calls, from main() through to the currently running line of code. You ask for a backtrace by typing bt:
(gdb) bt
#0 0x080caf08 in close_socket (d=0x11577608) at comm.c:2701
#1 0x080c78e9 in game_loop () at comm.c:1091
#2 0x080c6be2 in run_the_game (port=6101) at comm.c:769
#3 0x080c6b24 in main (argc=4, argv=0xbfffea14) at comm.c:731
#4 0x42015704 in __libc_start_main () from /lib/tls/libc.so.6
(gdb)
Each function call in this list is called a 'frame'. Frame 0 is always the one containing the currently running code.
You'll see that gdb is telling us file names and line numbers for each function call, and it's also showing the arguments that were passed in to each function. All useful stuff.
We can look in more detail at a particular frame by switching to it with the 'frame' command. Let's look at frame 0:
(gdb) frame 0
#0 0x080caf08 in close_socket (d=0x11577608) at comm.c:2701
2701 d->snoop.snooping->desc->snoop.snoop_by = 0;
You can get a bit more context above and below the line in question with list:
(gdb) list
2696      sprintf(buf, "maxdesc is now %d", maxdesc);
2697
2698      /* Forget snooping */
2699      if (d->snoop.snooping)
2700      {
2701        d->snoop.snooping->desc->snoop.snoop_by = 0;
2702        d->snoop.snooping = 0;
2703      }
2704
2705      if (d->character)
You can use the print command to find the value of variables at the time of the crash. Let's start to pull apart d->snoop.snooping->desc->snoop.snoop_by:
These two look reasonable - valid looking pointer values. We can have a look at some of the values in the char_data structure to make sure it looks ok:
OK, that's why we crashed. d->snoop.snooping->desc is 0, a null pointer, which is an invalid memory location. So when the code says d->snoop.snooping->desc->snoop.snoop_by = 0, it tries to assign 0 to a part of a structure sitting at memory location 0, finds that there isn't memory at that address, and crashes.
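The pointer chain is easier to follow with the structures written out. The sketch below uses cut-down, hypothetical versions of the structures (the real Sloth definitions have many more fields and may differ), and forget_snooping is a made-up function name wrapping the lines from comm.c. It shows one possible guard, on the assumption that a snooped character can legitimately have no descriptor.

```c
#include <stddef.h>

struct char_data;

/* Hypothetical, cut-down structures; only the fields in the chain. */
struct snoop_data {
    struct char_data *snooping;   /* character this descriptor is snooping */
    struct char_data *snoop_by;   /* character snooping this descriptor */
};

struct descriptor_data {
    struct snoop_data snoop;
};

struct char_data {
    struct descriptor_data *desc; /* assumed NULL when there is no connection */
};

/* The "Forget snooping" step with each link checked before it is followed. */
void forget_snooping(struct descriptor_data *d)
{
    if (d->snoop.snooping) {
        /* In the crash, this desc pointer was 0, so the original
         * unguarded assignment wrote to address 0. */
        if (d->snoop.snooping->desc)
            d->snoop.snooping->desc->snoop.snoop_by = NULL;
        d->snoop.snooping = NULL;
    }
}
```

A guard like this only stops the crash at this line. It doesn't explain how the snooped character came to have no descriptor while still being snooped, which is the question the detective work has to answer.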
From here on in, it's detective work. You don't need to know anything in particular, or be a member of some super-secret clan or something, you just need to have an inquisitive mind and a decent tool for browsing and searching the source code: a tool that makes it easy for you to do text searches (grep ?) so that you can work out where things are defined in the code and where they are used. The Sloth code is too big for one person to remember it all now, so when any of us debug a crash, unless it happens to be something we're working on at the time, we almost always start off by reading the structure definitions and the code concerned, and working out from there what's going on. The more you read the code, the more you'll understand it.
Don't expect to be able to work out the cause of every crash. Sometimes the thing that crashes the code overwrites a lot of the useful information, and when you try to do a backtrace, you either get something that clearly doesn't make sense, or you get a pile of question marks from gdb. Other times, it's just too hard to work out what's gone wrong. And sometimes the cause of the crash happened a long time before the crash itself, so there's only limited value in looking at the crash itself.

One example of this is memory corruption. Memory corruption is typically caused by referring to memory after it has been freed, or by freeing something that wasn't allocated in the first place, or by running over the end of a piece of allocated memory. These errors are particularly difficult to track down, because the effects often don't surface for a long time. If you find a backtrace that ends in a call to malloc_consolidate(), you are looking at a memory corruption problem and it's not worth examining the crash file any further.
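Two of the causes listed above, using memory after it has been freed and running over the end of an allocation, can be made less likely with small habits in the code. The sketch below is only an illustration in C. The helpers free_and_clear and copy_string are hypothetical names and are not taken from the Sloth source.

```c
#include <stdlib.h>
#include <string.h>

/* Free *pp and reset it to NULL. A later use of the pointer then
 * crashes at once with an obviously silly value (0), instead of
 * quietly corrupting memory that now belongs to something else. */
void free_and_clear(char **pp)
{
    free(*pp);   /* free(NULL) is defined to do nothing */
    *pp = NULL;
}

/* Copy src into a new heap buffer sized from the string itself,
 * including the terminating '\0', so the copy cannot run over the end.
 * Returns NULL if the allocation fails. */
char *copy_string(const char *src)
{
    size_t len = strlen(src);
    char *dst = malloc(len + 1);
    if (dst != NULL)
        memcpy(dst, src, len + 1);
    return dst;
}
```

After char *s = copy_string("sloth"), calling free_and_clear(&s) leaves s set to NULL, and calling it a second time is harmless. The point of the design is to turn a delayed, hard-to-trace corruption into the easy kind of crash walked through earlier.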
I hope that you will find this helpful. Although it's a long mail, it's easy to get started. Just do gdb sloth core.*, type bt and look at what comes back. If you get into the habit of doing this regularly, and reading around the lines of code that caused the crash, you'll soon get to the point where you can start to spot patterns and kill a few bugs.