Another Law of System Administration
Let me propose this law, to see if it grabs anyone:
There is no task, software package, or piece of equipment so simple that it cannot be made to require a dedicated admin.
There are supercomputing centers with so many RAID arrays that ordinary failure statistics guarantee several disk failures per day, so they have a person whose job it is to go around and replace the disks that died overnight.
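For a sense of scale, here is a quick back-of-the-envelope sketch; the fleet size and annualized failure rate are invented for illustration, not figures from any particular center:

    # Expected disk failures per day in a large fleet.
    # Both inputs are made-up illustrative numbers.
    drives = 50_000   # disks spinning in the machine room
    afr = 0.03        # annualized failure rate per drive (3%)

    failures_per_day = drives * afr / 365
    print(f"~{failures_per_day:.1f} expected disk failures per day")
    # prints "~4.1 expected disk failures per day": enough to keep
    # someone busy swapping drives every morning

At that scale, several dead disks a day is just the mean of the process, not bad luck.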
And a friend just found a job opening for a Subversion admin.
I can vouch for that. I used to work on the DOE’s ASC program, and it’s amazing how many “rare” failures you see when you have thousands and thousands of nodes running 24×7. I wasn’t one of the people who kept all the mechanical stuff going, but I knew people who did, and it was a big task.
Another interesting issue is power grid maintenance. Occasionally, the whole thing has to go down for work. As you probably know, it takes a lot less to keep an almost-dead hard drive spinning than it does to get one spinning again from 0 RPM. Imagine spinning that many hard drives all the way down and then starting them up again, crossing your fingers that you don’t have too many “dead on their feet” drives that won’t come back. What are the odds that you’re going to get hit with a catastrophic failure? The sysadmins hated that.
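The odds are easy to sketch if you’re willing to assume a per-drive restart failure probability (the numbers below are invented, not measured):

    # Risk sketch for a full power-down / power-up cycle.
    # p_fail is an assumed chance that a given drive won't spin back up.
    drives = 10_000
    p_fail = 0.001    # 0.1% per drive (assumed)

    expected_dead = drives * p_fail
    p_all_survive = (1 - p_fail) ** drives
    print(f"expected dead drives: {expected_dead:.0f}")           # 10
    print(f"chance every drive comes back: {p_all_survive:.1e}")  # about 4.5e-05

With ten thousand spindles, the chance that every single one comes back is effectively zero; the only question is how many you lose.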
Heh. BlueGene and similar facilities were what I had in mind when I was writing about disks, above.
Oy, do I know. From painful personal experience. One of our buildings (the one with the biggest machine room, which also happens to be situated on a flood plain) has very flaky power, so we’ve had more than our share of power outages over the years. So I’m grateful for RAID, which means that losing a disk isn’t as big a deal as it used to be; and for cheap disks, which make it possible to have a decent number of spares.
Yep, RAID is definitely a life saver. What I hadn’t thought about until I started working with the storage guys is, what’s the probability of a second drive in a RAID-5 array dying before the rebuild onto the hot spare finishes? I don’t know what BGL’s storage is (I worked on the storage for its predecessors and had some hand in testing for BGL before I moved on; I believe that the supercomputer I worked on is now essentially BGL’s disk controller), but I know that we were working on file systems on the order of 80TB, and I think that the drives that made up the array were 36GB SCSI drives. Add to that the fact that the arrays were some weirdo tiered combination of RAID-5 and RAID-0, and you have a lot of disks per partition. There was a steady stream of crates of drives moving in and out of the machine room.
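For the curious, here’s the rough arithmetic; the group size, effective MTBF, and rebuild window below are assumptions for illustration, not the actual array’s specs (only the 80TB and 36GB figures come from the comment above):

    # Back-of-the-envelope odds of losing a RAID-5 group: a second drive
    # in the same group failing before the rebuild onto the hot spare
    # completes. All inputs except the capacity figures are assumed.
    drives = 80 * 1000 // 36    # ~2,222 spindles for 80TB of 36GB disks
    group = 8                   # drives per RAID-5 group (assumed)
    mtbf_hours = 200_000        # effective per-drive MTBF (assumed)
    rebuild_hours = 12          # rebuild window onto the hot spare (assumed)

    # While a group rebuilds, its surviving drives are exposed.
    p_double = (group - 1) * rebuild_hours / mtbf_hours
    rebuilds_per_year = drives * 8760 / mtbf_hours   # expected first failures
    doubles_per_year = rebuilds_per_year * p_double

    print(f"{drives} drives, ~{rebuilds_per_year:.0f} rebuilds per year")
    print(f"second-failure chance per rebuild: {p_double:.4%}")          # 0.0420%
    print(f"expected double failures per year: {doubles_per_year:.3f}")  # 0.041

Each individual rebuild looks safe, but multiply by thousands of spindles, years of uptime, and real-world failure rates that are usually worse than the spec sheet, and you can see why the storage guys worry about it.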
Oh, and the blinkenlights. So beautiful. Don’t stare for too long or you might die.
I visited the high-performance computing facility at Oklahoma U. last year, and our host there talked about how visiting VIPs were always more interested in the less important parts of the machine room, like the disk arrays, than in the banks of processors, where all the real work happens. The reason is simple: each drive has a status light, while a rack of processors is generally just a black cabinet.
He did some research, and found that there are companies that sell panels of blinkenlights that fit a standard rack and can thus make your machine room more attractive to People With Money.
Also, years ago, I spoke to someone working at some big center out west. They were visited by some Hollywood folks who were going to make a movie involving computers, and wanted to see what a real data center looks like. He was a bit disappointed that they showed no interest in taking pictures of the Connection Machine 2 that he had. When he asked why, they said that movie audiences were more sophisticated than they used to be, and knew that computers don’t have huge numbers of blinky lights, so even though it was a real machine, it wouldn’t look realistic.