Saturday, May 06, 2006

Unix Admin Tools

I'm not really big into system administration.

But yesterday, one server I help manage had a hung Perl process that was sucking up all of the system's resources. I only discovered it after I made some Apache configuration changes. Unfortunately, Apache wouldn't restart, and a dozen web sites went down thanks to the offending process.

An hour or two later, I was working on another system along with two coworkers. Every program we started seemed to take forever to complete. In this case, the culprit turned out to be two long-running MATLAB processes a student was running overnight. Normally we might not have noticed, but since the three of us were pulling a night shift during an experiment, the slowdown on a crucial data processing server was rather annoying.

As a result of all this, I learned some nifty new Unix commands, thanks to my sysadmin friends.

top provides a listing of system processes sorted by CPU utilization. On many systems the listing refreshes automatically every few seconds until you press q or hit Ctrl-C. This is a great tool for finding the processes (and users) that are hogging the system.
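For scripts, or for a quick one-off look, a similar CPU-sorted listing can be approximated without an interactive display. This is only a sketch: it assumes a ps that accepts the POSIX -A and -o options with the pid, user, pcpu, and comm fields, which most modern systems do, and the variable name top_cpu is just for illustration.

```shell
# One-off snapshot of the five busiest processes, using POSIX ps fields.
# tail drops the header row so that only data rows are sorted.
top_cpu=$(ps -A -o pid,user,pcpu,comm | tail -n +2 | sort -k3,3 -nr | head -n 5)

# Columns: PID, user, %CPU, command name.
echo "$top_cpu"
```

Unlike the live listing, this shows a single moment in time, so a process that only spikes briefly may not appear.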

uptime outputs some general system information, including 1) the current time, 2) how long the system has been up since it was last rebooted, 3) the number of users currently logged on, and 4) load averages for the last 1, 5, and 15 minutes. Here's a quick sample:

2:39am up 352 day(s), 12:07, 10 users, load average: 2.68, 2.94, 3.30

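It's the load averages that can be most telling when debugging hung processes or sluggish performance. The load average is roughly the average number of processes that were running or waiting for the CPU over that interval. On a single-CPU machine, a value of 1 means the CPU is fully busy but nothing is waiting in the wings. If it's greater than 1, processes are queuing up, and anything you run will take longer to finish. If it's less than 1, then your CPU is sitting idle and lonely part of the time, so you can give it more to do. On a multi-CPU machine, compare the number against the CPU count rather than against 1.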
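A load average is easier to judge when it is divided by the number of CPUs in the machine. Here is a rough sketch of that arithmetic. It assumes uptime prints a "load average:" field like the sample above, and that getconf understands _NPROCESSORS_ONLN (common, but not guaranteed everywhere). The variable names are made up for the example.

```shell
# Number of online CPUs; fall back to 1 if getconf does not know the name.
cpus=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Extract the 1-minute figure: the first number after "load average:".
# LC_ALL=C keeps the decimal separator a dot.
load1=$(LC_ALL=C uptime | sed 's/.*load average[s]*: *//' | awk '{ gsub(",", "", $1); print $1 }')

# awk handles the floating-point division that plain sh cannot.
per_cpu=$(awk -v l="$load1" -v c="$cpus" 'BEGIN { printf "%.2f", l / c }')

echo "1-minute load $load1 across $cpus CPU(s): $per_cpu per CPU"
```

A per-CPU figure that stays well above 1 suggests work is queuing up; one well below 1 suggests spare capacity.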

If you have a job that will take a long time to run and you're sharing the system with other users (or perhaps it also processes critical real-time data), you can use the nice command to give your job a lower priority. Remember, we're all friends here.
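As a small illustration of lowering a job's priority with nice: running nice with no arguments prints the niceness it is running at, so it makes a handy stand-in for a real long-running job. The value 10 is an arbitrary choice for this sketch; higher niceness means lower priority, and the usual ceiling is 19.

```shell
# Niceness of the current shell (0 unless it was itself started with nice).
base=$(nice)

# Start a command at lower priority. The inner "nice" only reports the
# niceness it was given; a real job would go in its place.
lowered=$(nice -n 10 nice)

echo "shell niceness: $base, job niceness: $lowered"
```

In practice the inner command would be the actual job, for example something like nice -n 10 ./process_data.sh, where process_data.sh is a hypothetical script name.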

Finally, if you forget to play nice when you start your process, you can use the renice command later by simply passing it the new priority and the process ID.
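A sketch of renice on an already-running process follows. The sleep command is only a placeholder for a real job, and 10 is an arbitrary value. One caveat: POSIX defines the -n value as an increment, while some implementations treat it as an absolute niceness; for a process starting at 0 the two agree. Ordinary users can generally only raise niceness, not lower it.

```shell
# Start a placeholder job at default priority and remember its process ID.
sleep 5 &
job_pid=$!

# Lower its priority after the fact.
renice -n 10 -p "$job_pid"
renice_status=$?

# Show the niceness the job has now, then clean up the placeholder.
ps -o pid,nice,comm -p "$job_pid"
kill "$job_pid"
```

On a shared machine you would normally look up the process ID first, for instance from the process listing, and pass that to renice instead of $job_pid.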

Hopefully, if you're on a system with many users, you'll find these commands useful to figure out what's going on.

And for those of you with Macs, you can use all of these commands too!