I've been researching this ever since gcc4 became part of the official Linux distro I use, and I'm still no closer to an answer. Hopefully you guys can give me some thoughts before I pull out what's left of my hair.
My mud compiles just fine. But when I run it, odd and unusual things happen. For instance, when a player tries to connect the mud *may* crash, or it may not. Doing a stack trace with gdb shows me zilch. Everything looks exactly as it should be. I have not altered nanny( ) in any way to change how sockets are handled. I'll either get a crash in comm.c at check_reconnect at line 2887 which is:
if( ch->deleted ) continue;
An info locals in gdb shows nothing useful either: ch = (CHAR_DATA) 0xa7000000 and that's it. ch->deleted is set to FALSE or TRUE before this. The only reason I can think of is that ch is not completely initialized and is running out of memory…but shrug. I haven't touched the functions new_character or clear_char, both of which are in comm.c.
*Or* I get the bug "Char_from_room: ch not found" even though char_from_room isn't even called during the connection process. Now things are getting worse. Things that I haven't touched are now randomly breaking. str_dup( ) now causes a buffer overflow (and that's a bitch since the code is heavy with str_dup( ) uses). Strings and buffers aren't being flushed properly and will randomly display characters or alter character profiles. For example, ch->long_descr gets filled with garbage from the buffer, so I have a room full of mobs and players like - (&*^*&ed345feqw is here.
Lists are not filling properly, like new_obj->next = obj; obj = new_obj; will result in both obj *and* new_obj->next containing exactly the same fields as if obj was set equal to new_obj somewhere beforehand, but I'll be damned if I can find where.
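For reference, here is a minimal sketch of what that head insertion should leave behind. The struct and the helper names (OBJ_DATA fields, push_front, is_self_loop) are hypothetical stand-ins, not the mud's real definitions. After new_obj->next = obj; obj = new_obj; the head and new_obj are the same node, and new_obj->next should be the old head. If obj and new_obj->next look identical afterwards, one possibility (an assumption, not something shown in the thread) is that new_obj was already the old head, for example because the allocator handed out the same node twice, which makes the node point at itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the mud's object struct. */
typedef struct obj_data {
    int vnum;
    struct obj_data *next;
} OBJ_DATA;

/* Push new_obj onto the front of the list and return the new head.
 * Equivalent to: new_obj->next = obj; obj = new_obj; */
OBJ_DATA *push_front(OBJ_DATA *head, OBJ_DATA *new_obj)
{
    new_obj->next = head;
    return new_obj;
}

/* True if a node points at itself, which is what you get when the
 * "new" node was already the head of the list. */
int is_self_loop(const OBJ_DATA *node)
{
    return node != NULL && node->next == node;
}
```

Dropping a check like is_self_loop( ) right after the insertion, or an assert that new_obj != obj just before it, would show whether the same node is being handed out twice.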
None of this happened before I switched to gcc4. I always tried to keep my code neat and clean so even when I switched to gcc4 my mud had no problems compiling…execution is another matter.
Any thoughts or ideas are appreciated. I've been at this for months now and I'm at my wits end. The only conclusion I can come up with is that after 10+ years of code changes and additions maybe the code is just getting too big. I dunno.
I showed everything that info locals gave me. Backtrace shows:
Program received signal SIGSEGV, Segmentation fault.
0x080792b5 in check_reconnect (d=0xa7a098dc, name=0xa7a0b839 "Theodryck", fConn=0 '\0') at comm.c:2887
2887 if ( ch->deleted )
(gdb) bt
#0 0x080792b5 in check_reconnect (d=0xa7a098dc, name=0xa7a0b839 "Theodryck", fConn=0 '\0') at comm.c:2887
#1 0x08076e5f in nanny (d=0xa7a098dc, argument=0xa7a0b839 "Theodryck") at comm.c:1999
#2 0x08074e51 in game_loop_unix (control=7) at comm.c:925
#3 0x08074790 in main (argc=2, argv=0xbfa54284) at comm.c:488
I'll either get a crash in comm.c at check_reconnect at line 2887 which is:
if( ch->deleted ) continue;
an info locals in gdb shows nothing as well: ch = (CHAR_DATA) 0xa7000000 and that's it.
I can't say for certain, but that number looks suspicious. What are the odds that your player data structure just happened to land on such an even memory boundary? That usually happens when ch gets overwritten, or mistakenly assigned (as in ch = AFF_FOO instead of ch->flags = AFF_FOO).
Judging from the rest of the posting, there are also issues in the shared-string code. I would probably suggest running the game in gdb and setting breakpoints in places like the functions that load mobiles and instantiate them into the game world. See if those look like they're actually getting set up properly.
That's exactly what I've been doing. For the past several weeks. So far gdb hasn't really been all that helpful but it's all I have to work with right now. And thanks for the input guys. I'm pretty sure another pair of eyes will help me find the error.
I found it. Well, part of it at least. sigaction wasn't being properly initialized (the struct comes from bits/sigaction.h, which is included via signal.h). Every time the mud hit that part of the code it would either pass through just fine or trip SIGSEGV (segmentation fault). That, combined with accept() being passed arg 3 (the address length) as the wrong type (size_t rather than socklen_t), made any connections unpredictable and unstable. So far no more crashes when a player tries to log in. But just as sure as I type this, someone is probably trying to log into the mud and it's crashing.
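For anyone hitting the same thing, here is a minimal sketch of the two fixes described above. The function names (on_signal, install_handler, accept_connection) are made up for illustration and are not the mud's actual code; the sketch assumes a plain IPv4 listening socket.

```c
#define _POSIX_C_SOURCE 200809L  /* expose sigaction() under strict -std modes */
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Hypothetical handler; the mud's real handler is not shown in the thread. */
static void on_signal(int signo)
{
    (void)signo;
}

/* Install a handler with every field of struct sigaction set explicitly,
 * so nothing is left holding stack garbage. */
int install_handler(int signo)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sigemptyset(&sa.sa_mask);
    sa.sa_handler = on_signal;
    sa.sa_flags = 0;
    return sigaction(signo, &sa, NULL);
}

/* Accept one connection. The third argument to accept() is a pointer
 * to socklen_t, not size_t or int. */
int accept_connection(int control)
{
    struct sockaddr_in peer;
    socklen_t peer_len = sizeof peer;

    memset(&peer, 0, sizeof peer);
    return accept(control, (struct sockaddr *)&peer, &peer_len);
}
```

Compiling with -Wall -Wextra should flag a mismatched pointer type passed to accept(), which is one way to catch this kind of thing without a crash.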
16 Jul, 2009, Theodryck wrote in the 11th comment:
Still chasing down string and buffer errors. I know it's glibc combined with gcc4, but tracking it is a pain in the ass. See, my mud compiles with 0 errors or warnings. But the newer standard libraries that gcc4 uses are different in some ways from the old ones (like the addition of glibc), and not knowing exactly which ways is the pain, since now I'm spending most of my time looking through header files and docs. It's like looking for a needle in a field of haystacks.
17 Jul, 2009, Theodryck wrote in the 13th comment:
Yeah, I've been using valgrind along with gdb. Valgrind is how I found that sigaction was not properly initialized. But according to valgrind there aren't any memory leaks. 0 possible and 0 definite mem loss is what I get. The only thing it shows is possible uninitialized variables (like sigaction).
But you're saying that you're getting memory errors – it is extremely unlikely that valgrind would miss those. Are you sure that the problem is being reproduced while running in valgrind? By the way, uninitialized variables can cause crashes too. Memory leaks won't (usually) cause crashes.
17 Jul, 2009, Theodryck wrote in the 15th comment:
I never said memory errors. I said I thought ch wasn't being properly initialized and was running out of memory. That's an initialization problem, not a memory problem. And I was mostly correct about that since the descriptor wasn't properly being handled therefore a stable connection was never achieved. So as a result passing certain parameters of ch (like ch->deleted) would crash the mud since it would have garbage rather than the expected value or type.
You were talking about "string and buffer errors" and something happening that causes segmentation faults. Those sure sound like memory errors to me, but if you'd like to call them something else that's fine too… Regardless, yes, trying to use uninitialized values is very likely to cause problems, but valgrind will also (usually) spot usage of uninitialized memory.
17 Jul, 2009, Theodryck wrote in the 17th comment:
Well yeah, I can see how you'd call it that. But I would think of it more as a memory allocation error rather than a leak, which is what valgrind is most suited for. As of right now I'm trying to track down why str_dup is filling the buffer with random garbage rather than what it's supposed to, and why linked lists aren't filling the way they should. Again, I know it's related to glibc and the updated header files but I haven't found it just yet.
No, valgrind is very good at finding all problems relating to memory, including but not limited to leaks, accessing invalid/unallocated memory, use of uninitialized memory, etc.
I'd be slightly surprised if this were a true bug in glibc – chances are it's due to misuse of some std lib function. I'd suggest going through all the problems that valgrind reports and making sure that none are related to your problem.
I would also suggest using gdb to set a breakpoint where this happens so that you can examine exactly what is going on.
My gut is telling me to ask this - so I'll ask. You say the MUD compiles? Have you done a COMPLETE compile? As in a "make clean" then make? The kinds of problems you're describing sound more like the results of mixing object files from previous compiles with ones from the new environment.
If glibc itself was a flaming mess, a lot more people would have noticed by now. Including other MUD developers :)