Suggestions on running long jobs
Reducing the impact on other users
To minimize the impact on the responsiveness of the system, run the
program with a high 'nice' level (the highest is usually 19), which
corresponds to a low priority. See the web page description of nice
and nohup, and the man pages on 'nice' and 'renice', for details.
You can also change the nice level of a running process using the
program 'top'. With a high nice level, the program should not interfere
with someone actively using the system, but it will still get all of the
CPU when the machine would otherwise be idle.
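A program can also lower its own priority at startup instead of relying
on the user to launch it with nice. The following is a minimal sketch,
assuming a POSIX system that provides setpriority() and getpriority() in
<sys/resource.h>. The helper name lower_priority and the -100 error value
are illustrative choices, not part of any existing library.

```c
#include <errno.h>
#include <sys/resource.h>

/* Hypothetical helper: set the calling process's nice level and return
 * the level actually in effect afterwards, or -100 on error. */
int lower_priority(int level)
{
    /* With PRIO_PROCESS, a "who" of 0 means the calling process. */
    if (setpriority(PRIO_PROCESS, 0, level) != 0)
        return -100;

    /* getpriority() can legitimately return -1, so errno is cleared
     * first and checked afterwards to tell an error from a real value. */
    errno = 0;
    int now = getpriority(PRIO_PROCESS, 0);
    if (now == -1 && errno != 0)
        return -100;
    return now;
}
```

Calling lower_priority(19) early in main() has roughly the same effect as
starting the program with 'nice -n 19'. An unprivileged process can
normally raise its nice level but not lower it again afterwards.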
Also consider the memory usage of the program. If the program uses
a significant portion of the physical memory, say to within 10-15 Mbytes
of the physical memory size, then the performance of the machine will
probably be severely degraded when someone is running X Windows and a
few applications (X plus a few applications can easily take up more
than 10-15 Mbytes). The degradation occurs despite the high nice level
because the machine has to swap memory in and out to disk. Keep this in
mind when writing simulations: once the machine starts swapping,
performance drops sharply. If at all possible, design the simulation so
that it never needs a large portion of physical memory at any given time.
If you really need to use large blocks of memory, another idea is to
stop the program during normal working hours and let it run at night,
when there is less impact on other users. You can stop a program without
killing it by sending it the stop signal (i.e. kill -STOP pid). The
program can be restarted later by sending it the continue signal
(i.e. kill -CONT pid). This could be automated with a cron job that
stops the process in the morning and continues it at night.
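The stop-by-day, run-by-night idea can also be expressed in C with the
kill() system call, which is what the kill command uses underneath. This
is only a sketch: the working hours of 08:00 to 18:00 and the function
names job_should_run and apply_schedule are assumptions made for the
example.

```c
#include <signal.h>
#include <sys/types.h>

/* Assumed working hours (24-hour clock): 08:00 up to, but not
 * including, 18:00. Adjust to suit the site. */
enum { WORK_START_HOUR = 8, WORK_END_HOUR = 18 };

/* Return 1 if the job may run at the given hour (0-23), or 0 if it
 * should be stopped because the hour falls inside working hours. */
int job_should_run(int hour)
{
    return !(hour >= WORK_START_HOUR && hour < WORK_END_HOUR);
}

/* Send CONT or STOP to pid depending on the hour.
 * Returns the result of kill(): 0 on success, -1 on error. */
int apply_schedule(pid_t pid, int hour)
{
    return kill(pid, job_should_run(hour) ? SIGCONT : SIGSTOP);
}
```

A small wrapper that takes the current hour from localtime() and calls
apply_schedule() could then be run hourly from cron. Sending CONT to a
process that is already running, or STOP to one that is already stopped,
is harmless.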
Improving the odds of getting output from the program
The other area I'll mention is getting results from a simulation. If the
program will take a long time to finish, it is important to have some
mechanism for obtaining partial results. Otherwise, if the power fails
5 minutes before the end of a simulation that has been running for a
week, you may have to start over from the beginning unless you have some
idea of where it left off. The most likely cause of a program terminating
early is that the machine needs to be rebooted, either for maintenance or
because of software problems. To guard against a power failure, you can
have the program periodically dump some state information to a file. When
the machine is shut down, the program will be sent the TERM signal, so it
can catch this signal and dump state information at that point before
terminating. You could also have the program catch other signals, such as
HUP, so that it dumps state and continues. That lets you get state
information from the program at any time with 'kill -HUP pid', without
killing the program.
I have some example code which implements this signal catching/state dumping
concept. The example code consists of two source files and a header file.
The functions in state_lib.c could be placed in a library and linked
with simulation programs to allow registering and dumping state values;
its header file is state_lib.h.
A simple example showing how the library routines are used is in
test_term.c.
A compressed tar file containing all the source files, as well as a
makefile, is also provided: state_lib.tgz.
© 1997 Information Coding Laboratory
Send comments to www@code.ucsd.edu
Last Updated: $Date: 1997/11/21 23:04:52 $