Sysadmin Theory
February 5th, 2008
Proper system administration, to me, involves 5 main areas:
- Documentation
- Repeatability
- Redundancy
- Monitoring
- Backups
Whenever I’m starting a new project or modifying an existing project, I try to consider each of these 5 general topics. Most importantly though, don’t stop learning. New tools are always coming out, new technologies, new ideas, better ideas. So please, help me improve my own theory. I love new technology and almost drool over it. I love improving my skills and practices, so please contribute! I crave improvement.
Documentation
Documenting everything you do is very important. Even if you’re the only sysadmin on the job, documenting is essential. I’m tempted to say that documentation is the most important step in any system administration task, and I’m pretty sure it’s the step most often overlooked. Documenting brings lots of benefits:
- You’re not the only one who can do it. (Yes, this is a good thing. Think “summer vacation”… with no cell phone.)
- Four months after designing a QA cluster for an app, when you have to recreate the exact same cluster in production, documentation is good.
- Years later when you’re wondering “why in the world is this part designed like this?” it can help answer such why questions.
- Should you move on to another job, hopefully the previous sysadmin there felt the same way about documentation.
Use a wiki for documentation. Wikis are great:
- Central spot to go for all documentation.
- Quick and easy mark up.
- Built in search.
Repeatability
It is important that what you do can be easily repeated. While documenting is great, it does not mean it will be followed. I think it’s just human nature. We all use documentation the first 5, 10 or 15 times. But then we’re experts. Problem is that it’s kind of like that “telephone” game. Little bits of information just fall out or get distorted and after you repeat the same task the 100th time, you’re doing it wrong.
To avoid such issues almost completely, the solution is scripting. Scripting doesn’t mean no documentation. If anything, it means more documentation. Document the task and then document the script that executes the task. I strongly believe that good system administration involves a good amount of programming.
Other automation tools can come in handy too. I especially like customized Kickstart installs for consistently deploying servers for a specific task.
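One habit that helps scripts stay repeatable is writing each step so that running it twice gives the same result as running it once. Here is a minimal sketch of that idea; the ensure_line helper and the sample setting in the usage note are made up for illustration, not taken from any particular tool.

```shell
#!/bin/sh
# ensure_line FILE LINE  (hypothetical helper)
# Append LINE to FILE only if an identical line is not already there,
# so repeated runs leave the file in the same state.
ensure_line() {
    file=$1
    line=$2
    # -q: quiet, -x: match the whole line, -F: treat LINE as a fixed string
    if ! grep -qxF -e "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}
```

Calling ensure_line /etc/sysctl.conf "vm.swappiness = 10" a hundred times would still leave exactly one copy of that line in the file.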
Redundancy
Kind of an obvious one here. Single points of failure must be killed, for your uptime and for your sanity. Many things can go wrong. When things do go wrong, it’s so much easier to get the wrongs fixed if, to everyone else outside of your department, there are no signs of the wrongs. So make a list of your current single points of failure and then start adding in redundancy.
We once had almost half of our servers go off due to a power failure and the only symptom was our cell phones being REALLY annoying for a while. This was around 20 servers that went down, all without notice outside of IT.
Monitoring
Proper monitoring is all about managing your signal-to-noise ratio. There are great things out there that send a barrage of noise your way: things like tripwire, logwatch, Big Brother, Nagios and cron. What you use for your monitoring greatly depends on your environment. If you’re running just 5 or so servers, sifting through the noise to find the signal isn’t that hard. If you’re managing 100+ servers, you have to crack down on the noise or you’ll miss some important signal.
Cron can be evil. A lot of developers give you noisy noisy cron scripts to run every hour or so. Send them back to the developers and explain that the script must either:
- Output nothing to stdout or stderr and log properly to a rotated log file, clearly marking lines that need attention with ERROR. (You then monitor this log file.)
- Output nothing that needs attention to stdout; if it needs attention, output it to stderr. (You then redirect stdout to a rotated log file; anything on stderr will be emailed to the MAILTO at the top of the cron file… oh yeah, use cron.d.)
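As a rough sketch of the second option: routine messages go to stdout and only problems go to stderr. The job itself (check_dir), the script path, the log path and the MAILTO address are all hypothetical names picked for the example.

```shell
#!/bin/sh
# Hypothetical cron job: stdout carries routine messages (redirected to a
# log by the cron.d entry), stderr carries anything that needs attention
# (mailed by cron to MAILTO).

log_info() {
    printf '%s INFO %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*"
}

log_error() {
    printf '%s ERROR %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >&2
}

# Example task: report whether a directory this job depends on exists.
check_dir() {
    if [ -d "$1" ]; then
        log_info "found $1"
    else
        log_error "missing $1"
        return 1
    fi
}

# Matching entry in a hypothetical /etc/cron.d/check-dir file:
#   MAILTO=sysadmin@example.com
#   0 * * * * root /usr/local/bin/check_dir.sh >> /var/log/check_dir.log
```

Because the cron.d entry redirects only stdout, a quiet run produces no mail at all, and the only mail you get is the ERROR lines.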
Out of the box most monitoring tools (like Nagios) will check the basic things for you:
- Ping checks
- Disk space
- CPU
- /var/log/messages
In addition to those basic checks, getting in as much other monitoring as possible without adding too much noise is always a good thing. Application logs, website checks, hardware, custom scripts, etc.
Does your server have hardware RAID? How do you know when a drive dies? Is the process clamd important for your server to perform its task? Monitor it and make sure that process is always running. Does the server host a web site? Do an HTTP check with string validation. Are there two power supplies in the server? How do you know when one dies?
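Two of those checks are easy to sketch in shell. This is only an illustration, assuming pgrep and curl are installed; the function names are made up, and the return codes follow the Nagios plugin convention (0 for OK, 2 for CRITICAL) so the functions could be wrapped in a custom check.

```shell
#!/bin/sh
# check_process NAME: is a process with exactly this name running?
check_process() {
    if pgrep -x "$1" >/dev/null 2>&1; then
        echo "OK: $1 is running"
        return 0
    fi
    echo "CRITICAL: $1 is not running"
    return 2
}

# check_body EXPECTED: read a page body on stdin and look for EXPECTED.
check_body() {
    if grep -qF -e "$1"; then
        echo "OK: found '$1'"
        return 0
    fi
    echo "CRITICAL: '$1' not found"
    return 2
}

# check_http URL EXPECTED: fetch URL and validate the returned content.
# curl -f fails on HTTP errors, -s is silent, --max-time bounds the wait.
check_http() {
    curl -fs --max-time 10 "$1" | check_body "$2"
}
```

For example, check_process clamd covers the process question, and check_http with your site's URL and a string you expect on the page covers the web site. A failed fetch produces an empty body, so it is reported as CRITICAL too.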
The goal of all this monitoring is to be preventative and not reactionary. Here’s an easy example of reactionary thinking vs preventative thinking:
Reactionary: You get paged once a week because /opt gets to 95% full. So once a week you get on that server and search for files to prune to get it down to 90%.
Preventative: You get paged two or three weeks in a row because /opt gets to 95% full. So you figure out how to better manage the files being written. Most likely there’s a log directory that can be configured to be rotated with logrotate, or a cron entry you can put in to auto-prune archives.
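The auto-prune idea could look something like the sketch below. The function names, the archive directory and the 30 day cutoff are assumptions for the example; pick whatever retention fits your data.

```shell
#!/bin/sh
# disk_usage_pct MOUNTPOINT: print the used percentage as a bare number.
disk_usage_pct() {
    # df -P prints one POSIX-format line per filesystem; field 5 is "NN%".
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# prune_older_than DIR DAYS: delete regular files under DIR that were
# last modified more than DAYS days ago.
prune_older_than() {
    find "$1" -type f -mtime +"$2" -exec rm -f {} +
}

# Possible use from cron (hypothetical directory):
#   prune_older_than /opt/app/archive 30
```

Run the find without the -exec part first to see what it would delete before trusting it in cron.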
Another example:
Reactionary: You get paged that a machine is failing ping checks. You check it out and notice that two hard drives in the RAID array have died and the machine is toast.
Preventative: You get paged that a hard drive failed in your main logical drive. You call in and get a new hard drive and rebuild it before the second one goes.
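Hardware RAID controllers each come with their own vendor tool, so there is no single command to show here. As a stand-in, this sketch assumes Linux software RAID (md) instead, where /proc/mdstat shows a status map such as [UU] for a healthy mirror and [U_] when a member is missing. The function name is made up.

```shell
#!/bin/sh
# degraded_arrays [FILE]: print the name of each md array whose status
# map shows a missing member. FILE defaults to /proc/mdstat.
# Returns 2 if any array is degraded and 0 if none are; a missing FILE
# also gives a non-zero status because awk cannot read it.
degraded_arrays() {
    awk '
        /^md[^ ]* :/ { name = $1 }                  # remember the current array
        /\[[U_]*_[U_]*\]/ { print name; bad = 1 }   # "_" marks a failed slot
        END { exit (bad ? 2 : 0) }
    ' "${1:-/proc/mdstat}"
}
```

Hooked into your monitoring, this pages you on the first failed drive instead of the second.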
Backups
First, the obvious: backups are life savers. Backups that rotate automatically, need little maintenance, and hook into your monitoring system are life savers several times over. The one step that is often overlooked is… restores. Test your restores. Find an important file or directory and attempt to restore it. It really sucks to find out your backup plan wasn’t backing up something important when the CFO asks you to restore an Excel spreadsheet from the shared drive. Test your backups. Monitor for errors.
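A restore test can be as simple as pulling one file back out and comparing it with the live copy. The sketch below assumes the backup is a plain tar archive, which is only a stand-in for whatever backup tool you actually use; the function name and paths are hypothetical.

```shell
#!/bin/sh
# verify_restore ARCHIVE PATH ORIGINAL
# Extract PATH from a tar ARCHIVE into a scratch directory and compare it
# byte for byte with ORIGINAL. Returns 0 if they match, 1 if the file is
# missing from the archive or differs.
verify_restore() {
    archive=$1
    path=$2
    original=$3
    scratch=$(mktemp -d) || return 1
    status=0
    tar -xf "$archive" -C "$scratch" "$path" 2>/dev/null || status=1
    if [ "$status" -eq 0 ] && ! cmp -s "$scratch/$path" "$original"; then
        status=1
    fi
    rm -rf "$scratch"
    return "$status"
}
```

A mismatch is not always a broken backup, since the live file may simply have changed after the backup ran, but a file that is missing from the archive is exactly the surprise you want to find before the CFO does.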
Other than that, I love RCS. I like to keep important configuration files in RCS so that, along with the usual backups, I have a quicker restore option with a history of “why” a change was made: things like MySQL’s my.cnf, DNS zone files, FTP configuration files, etc. To help enforce the use of RCS (help; it by no means forces everyone to use it) I set up scripts like so:
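# cat edit_my.cnf.sh
#!/bin/sh
# Edit my.cnf under RCS so every change is logged.
rcs -l my.cnf      # lock the latest revision
chmod +w my.cnf    # make the working file writable
vim my.cnf
ci my.cnf          # check in; asks for a log message if the file changed
co my.cnf          # check out a fresh read-only working copy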
This leaves the file (“my.cnf” in this case) read-only after every edit. Root can still override that, but if someone comes through and just vim’s it, they will be yelled at by vim that it’s read-only, and maybe by a helpful comment at the top: “#### DO NOT EDIT DIRECTLY, USE ./edit_my.cnf.sh INSTEAD ####”.
So after the user runs ./edit_my.cnf.sh, if they make changes they will be forced to input a comment on what they did and why. This has saved me lots of headaches when we had 4-5 people who commonly modified the same file.

2 Responses to “Sysadmin Theory”
By Rich B. on Feb 6, 2008
Some tips from an old-timer:
a. As admin, rather than delete a file immediately consider renaming it and submitting an ‘at’ job to delete it some time in the future. This way you’ll know if it’s missed, and can restore it quickly.
b. When deleting large numbers of files in a directory (common with email), use the ‘find’ command in conjunction with ‘xargs -n’ to make the delete run about 50 times faster than any other approach.
c. Use find with ‘cpio -pdum’ for doing quick directory-to-directory selective backups.
d. Use ‘find -newer’, ‘touch -r’ and ‘cpio’ to do quick incremental backups.
e. On systems with multiple users and databases, don’t hesitate to keep some larger old files around on some file systems as ‘ballast’ in case you need to restore some capacity quickly in the future. (Corollary: *never* give a system to dba’s for database configuration without first putting some ballast files in place).
By A.C. on Feb 6, 2008
Another good introduction to the principles of System Administration can be found in chapter 1 of the Red Hat Linux System Administration Primer:
https://www.redhat.com/docs/manuals/linux/RHL-9-Manual/
It approaches the same principles and practices from another direction, but arrives in basically the same place, philosophically, as your article.
The rest of the Primer tends to be more Linux and Red Hat specific, but, with a little effort, you can find equivalent tools to do the same tasks described in the Primer.