/
short-1/
From erwin@pip.dknet.dk Thu Jul 25 16:20:14 1996
Date: Thu, 25 Jul 1996 16:20:12 +0200 (MET DST)
From: erwin@pip.dknet.dk
X-Sender: erwin@shire.shire.dk
To: MERC/Envy Mailing List <merc-l@webnexus.com>
Subject: Re: Lagout bugs (Merc 2.2 +)
In-Reply-To: <9607241909.AA15457@quix.robins.af.mil>
Message-ID: <Pine.LNX.3.91.960725141310.158A-100000@shire.shire.dk>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: O
X-Status: 

On Wed, 24 Jul 1996, Mr. Jolt Cola wrote:

> >      Also, I was wondering if anyone has written a "polling" process for
> > linux that determines if a mud is locked up, and kills the process.  Any
> > ideas along these lines would be appreciated :)
> 
> You don't need another process. Use what the kernel provides, an alarm.
> You can do something like this.
> 
> (Pseduoish code)

Worked great :)

> If you do this you have to beef up your system calls.
> You need to check things like select() to see if errno returns EINTR
> because SIGALRM can arrive at any time.

Here is what I have done:

I started with aborting after X real time seconds have passed. Later, I 
wanted to add something so that the server would abort if something 
hanged while loading the area etc. files. However, using real seconds was 
a bit risky there - what if there was a big load on the machine? Setting 
the threshold high would work, but that would also mean that it would 
take longer to abort if there was no loads.

So, instead I measure user CPU time used by the process, and abort if a 
certain number of user CPU seconds have passed without checkpointing.

While booting the database, this threshold value is set to something 
higher than usually.

While testing, it took loading of 10,000 mobs with a single command (FOR) 
before the server aborted. This was on a 486DX4-100.

To avoid having to rewrite those system calls to look for EINTR (though I 
have already done so, so that profiling would work correctly) I use the 
SA_RESTART flag.

I don't know about the compatbility of this code; according to Steven's 
book, sigaction is POSIX.1 defined, so this should work under those 
systems. The below code was tested on a Linux 2.0.8 machine.

I placed all my code in update.c. Before calling boot_db() in comm.c, 
call init_alarm_handler(). Also, add alarm_update() to one of the loops. 
I'm not sure how expensive a getrusage() call is - you might want to call 
alarm_update() less frequently, and increase the RUNNING_ABORT_THRESHOLD. 
Experiment :)

To find a good BOOT_DB_THRESHOLD, alarm_update() prints the number of CPU 
seconds the MUD has used so far (but only the first time it is called 
after a reboot); note that this value will not be correct if you exec() 
the MUD using e.g. the copyover/hot reboot code, as those values are not 
cleared when exec'ing. You might want to set it higher - it all depends 
on the speed of your machine.

If you have some command that you KNOW takes a lot of CPU time, and you 
don't want it aborted, set the global variable disable_timer_abort to 
TRUE before starting, and set it to FALSE when finished. Things that 
might require that are something like an ASAVE WORLD in OLC.

Oh, and I assume that getrusage *is* reentrant. It's not listen in 
Steven's book, in the table in 10.6 Reentrant functions. I don't know 
though if this is a system call, if this data is located in user space? 
Later I found out that the times function was one I could use better, but 
then it was too late :)



#define BOOT_DB_ABORT_THRESHOLD			  15
#define RUNNING_ABORT_THRESHOLD			   5
#define ALARM_FREQUENCY                           20


/* Include files necessary */

#include <sys/resource.h>
#include <signal.h>

/* Interval in pulses after which to abort */
int	abort_threshold = BOOT_DB_ABORT_THRESHOLD;
bool	disable_timer_abort = FALSE;
int	last_checkpoint;


/* find number of CPU seconds spent in user mode so far */
int get_user_seconds ()
{
	struct rusage rus;
	getrusage (RUSAGE_SELF, &rus);
	return rus.ru_utime.tv_sec;
}

/* Update the checkpoint */
void alarm_update ()
{
	last_checkpoint = get_user_seconds();
	if (abort_threshold == BOOT_DB_ABORT_THRESHOLD)
	{
		abort_threshold = RUNNING_ABORT_THRESHOLD;
		fprintf (stderr, "Used %d user CPU seconds.\n", last_checkpoint);
	}
}

/* Set the virtual (CPU time) timer to the standard setting, ALARM_FREQUENCY */
void reset_itimer ()
{
	struct itimerval itimer;
	itimer.it_interval.tv_usec = 0; /* miliseconds */
	itimer.it_interval.tv_sec  = ALARM_FREQUENCY;
	itimer.it_value.tv_usec = 0;
	itimer.it_value.tv_sec = ALARM_FREQUENCY;

	/* start the timer - in that many CPU seconds, alarm_handler will be called */	
	if (setitimer (ITIMER_VIRTUAL, &itimer, NULL) < 0)
	{
		perror ("reset_itimer:setitimer");
		exit (1);
	}
}

const char * szFrozenMessage = "Alarm_handler: Not checkpointed recently, aborting!\n";

/* Signal handler for alarm - suggested for use in MUDs by Fusion */
void alarm_handler (int signo)
{
	int usage_now = get_user_seconds();

	/* Has there gone abort_threshold CPU seconds without alarm_update? */
    if (!disable_timer_abort && (usage_now - last_checkpoint > abort_threshold ))
	{
		/* For the log file */
		write(STDERR_FILENO, szFrozenMessage, strlen(szFrozenMessage+1));
		abort();
	}
	
	/* The timer resets to the values specified in it_interval automatically */
}

/* Install signal alarm handler */
void init_alarm_handler()
{
	struct sigaction sa;
	
	sa.sa_handler = alarm_handler;
	sa.sa_flags = SA_RESTART; /* Restart interrupted system calls */
	sigemptyset(&sa.sa_mask);

	if (sigaction(SIGVTALRM, &sa, NULL) < 0) /* setup handler for virtual timer */
	{
		perror ("init_alarm_handler:sigaction");
		exit (1);
	}
	last_checkpoint = get_user_seconds();
	reset_itimer(); /* start timer */
}