Sysadmin 101
Introduction
Here, I've written down bits of wisdom I've collected over the years on the general art of system administration. Comments, suggestions, and well-written diatribes are welcome. Send all that to feedback@openesque.com.
A guiding principle:
It should be straightforward for any reasonably skilled sysadmin (skilled in your particular OS and applications) to find whatever they need to maintain your systems. If this isn't the case, then your documentation or organization is lacking.
Overview
Less is more. Document what you do. Do what you document.
Sysadminning combines all the work and frustration of programming with none of the glamour. The best compliment most sysadmins will ever get is complete anonymity. If nobody knows who you are, you're either doing it right, or you're not doing anything at all. Or, maybe both.
The essentials
There are only two kinds of computer systems: production systems and scratch monkeys.
Don't play games with your production systems. Track whatever stable path exists on your particular operating system. Don't install anything you don't absolutely need; uninstall everything you've discovered that you don't absolutely need. Turn off every service you don't absolutely need. This does not mean to gut your system; rather, it means to avoid the fluff. You don't want to throw your tools away. Everything installed on your system is a potential source of a crash, impaired performance, or security compromise.
Get a hard-cover, numbered record book. National (Avery Dennison) 56-231 300-page record books are my personal favorites. You can find 'em at Staples: they're expensive (~ $30), but they're sturdy, and provide an essential forensics trail for figuring out what went wrong and when it did. Write down EVERYTHING you do to your systems. Err on the side of overdocumenting; you'll invariably neglect to write down the one little thing you should have written down.
Treat it just like a lab notebook: write the date whenever it changes, and note the time you do each thing. Write your log entries only on the right side (odd-numbered pages); leave the even pages blank for scratch notes, quick figuring, etc. Don't grab a legal pad; scrawl on the left instead. That way, you know where your old scrawls are. And, when the phones are ringing and people are screaming, you only have to look for one book with all your notes in it.
Put critical passwords (e.g., root on all your boxes), passphrases for RSA/DSA keys and PGP/GPG, etc. in your Rolodex. Date every card when you create it. When you change passwords, IP addresses, etc., create a new card, mark the old one OBSOLETE, date when it became obsolete, and put it immediately behind the new one.
NEVER throw away a card; if you try to recover an old system from backups, you may need that two-year-old root password.
When the Rolodex starts getting full, get a second one just for obsolete cards.
Scalability of paper
Someone is bound to note that this system doesn't scale well to thousands of servers. True enough. But neither do sysadmins, unless they really have their acts together. If you have documented procedures for day-to-day operation and you follow them, you need not write down the same thing day after day. You need only document anomalies and deviations from procedure.
It should be straightforward for any reasonably skilled sysadmin (skilled in your particular OS and applications) to find whatever they need to maintain your systems. If this isn't the case, then your documentation or organization is lacking. You will be unavailable to the office some day, for some reason.
Backups
Do them. Like clockwork. Skip payroll before you skip backups.
Practice recovering from your backups periodically: every few months or so. Prove you can do it. Do this on a scratch monkey if at all possible; don't break your production system testing your restore from backup!
If you have a PFY (a junior admin), this is a perfect thing for him or her to do. If your PFY can't figure out how to restore from backups, then you don't have your act together. Fix your docs.
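One way to "prove you can do it" is to compare the restored files against the originals by checksum. What follows is only a minimal sketch, not a prescribed method: the function names (file_digest, tree_digests, compare_trees) are made up for illustration, and it assumes both trees are reachable as ordinary directories on the scratch monkey.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root):
    """Map each regular file under root (by relative path) to its digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_trees(original, restored):
    """Return (missing, extra, changed) relative paths in the restored tree."""
    a, b = tree_digests(original), tree_digests(restored)
    missing = sorted(set(a) - set(b))   # in the original, not restored
    extra = sorted(set(b) - set(a))     # restored, but not in the original
    changed = sorted(k for k in set(a) & set(b) if a[k] != b[k])
    return missing, extra, changed
```

Three empty lists mean the restore matches. Files that legitimately changed after the backup ran will show up as "changed," so compare against a snapshot taken at backup time if you can. Write the result in the record book either way.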
Keep an archive system in addition to your backup system. By archive system, I mean periodic full backups of all your files which are never rotated out. They are stored for eternity.
At a company which shall remain unnamed, quite a few revisions of firmware for a core product were lost because the version control system archive files became corrupted. Nobody noticed, because the old versions were never used, and the VCS used reverse-delta storage, so the error was unnoticeable until someone finally tried to retrieve an old version. Oopsie. If an ancient copy of the VCS files had existed, we could have merged the old data into a newer file and the problem would have been solved. Unfortunately, since the backup tapes were rotated out over the course of six months, and there were no older archive tapes, there was no way to go back and do this. The older versions were lost forever.
A daily backup of a corrupted file is worthless. Keep archive copies of anything that matters.
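Corruption like this can sometimes be caught early by recording checksums when you cut an archive and auditing against them later. The sketch below is an illustration under stated assumptions, not what the unnamed company did: the function names and the JSON manifest format are invented here, and the rule "content changed but modification time did not" is only a heuristic for flagging files worth a closer look.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Record digest and modification time for every file under root."""
    root = Path(root)
    manifest = {}
    for p in sorted(root.rglob("*")):
        if p.is_file():
            manifest[p.relative_to(root).as_posix()] = {
                "sha256": sha256_of(p),
                "mtime_ns": p.stat().st_mtime_ns,
            }
    return manifest


def save_manifest(manifest, path):
    """Write the manifest as JSON; keep it outside the audited tree."""
    with open(path, "w") as f:
        json.dump(manifest, f, indent=2, sort_keys=True)


def load_manifest(path):
    """Read a manifest written by save_manifest."""
    with open(path) as f:
        return json.load(f)


def audit(root, old_manifest):
    """Compare the tree against an earlier manifest.

    Returns a dict with three sorted lists of relative paths:
      "missing": in the manifest but no longer on disk
      "edited":  content and mtime both changed (probably a normal edit)
      "suspect": content changed but mtime did not (heuristic: possible
                 silent corruption)
    """
    current = build_manifest(root)
    report = {"missing": [], "edited": [], "suspect": []}
    for name, old in sorted(old_manifest.items()):
        new = current.get(name)
        if new is None:
            report["missing"].append(name)
        elif new["sha256"] != old["sha256"]:
            same_mtime = new["mtime_ns"] == old["mtime_ns"]
            report["suspect" if same_mtime else "edited"].append(name)
    return report
```

Store the manifest with the archive copy, not on the disk it describes. Anything that lands in "suspect" is a reason to pull the archive copy before the rotating backups overwrite the last good version.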
Preparing for a "paper crash"
All of this goodness will be lost if you have a fire or some disaster that destroys all your notes when you need them most. Type up your notes and passwords and store all that in a safe deposit box offsite. If you keep up on this (PFYs are wonderful for this tedious task), you have the advantage of searchable notes. But the paper is still the primary source. Invariably, the system that just crashed is the one you keep your notes on. If your building burns down, you'll probably have time to go fish in the safe deposit box.
Procedures
Good system administration requires having a keen understanding of your systems. It also requires good documentation and procedures: reducing as much as possible to routine, even boring, work. If you don't do this, you'll never again get to take a vacation without a pager. Build your procedures, test them out on your scratch monkeys, then deploy in "real life."
Never trust your memory when you have a procedure handy. Pilots have checklists for a reason; so should you. This is especially true for remote systems… forget one step and it's road trip time. You won't enjoy the trip one bit.
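A procedure that has already been proven on a scratch monkey can also be written down as a checklist that runs itself. This is just a rough sketch of the idea: run_checklist and its (description, action) step format are hypothetical, and the actions are whatever Python callables you supply.

```python
from datetime import datetime, timezone


def run_checklist(steps, log):
    """Run (description, action) pairs in order; stop at the first failure.

    Each action is a zero-argument callable returning True on success.
    Appends one timestamped line per attempted step to log (a list) and
    returns True only if every step succeeded.
    """
    for number, (description, action) in enumerate(steps, start=1):
        try:
            ok = bool(action())
        except Exception as exc:
            # Treat an exception as a failed step and note it in the log.
            ok = False
            description = f"{description} ({exc!r})"
        stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%SZ")
        status = "ok" if ok else "FAILED"
        log.append(f"{stamp} step {number}: {description}: {status}")
        if not ok:
            return False
    return True
```

Stopping at the first failed step is deliberate: on a remote box, charging ahead after a step goes wrong is how you earn the road trip. The log lines belong in the record book along with everything else.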
BOFH is a joke, y'all
A favorite piece of sysadmin amusement is The Bastard Operator From Hell, which describes the work days of an evil sysadmin who makes the life of his lusers a living hell. Go ahead, laugh. It's funny (some installments more so than others).
But, in the end, a sysadmin has a job not because the company cares about having sysadmins, and not because they care about having computers, but because they care about making money. You, the sysadmin, are overhead by definition.
Yes, it would be easier to run the hotel without all those guests… but that ain't gonna happen, or you'll be laid off. Deal with it. Be nice to your lusers. Direct them to the helpdesk if there is one. Chances are, if they're asking you, you are the helpdesk. Think about it: are you really evil enough to wish upon a fellow human being that they come to possess all the computer lore that you do? Remember what it took you to get where you are today?
Conclusion
A sysadmin is very much like a commercial pilot. Most of the time, the job is hours of tedium interspersed with moments of terror. As a sysadmin, it's your job to make the ride as uneventful as possible. The best way to do that is to keep your systems as simple and well-maintained as possible, with good operational procedures that will minimize the moments of terror.
It's true that a pilot has a better view out his window than most sysadmins, who usually don't have windows at all, unless they're Microsoft. And the flight attendants may be better looking than the other people in your server room. On the other hand, "crash" means something more serious to a pilot than it does to a sysadmin.
Ron Oliver, Manager
Openesque LLC
July 19, 2006 (freshened links; original July 26, 2002)