Timing is Everything
Part of the new job duties is that I’m occasionally scheduled to work shift duty on Saturdays. Working this shift means one gets the next Monday off, so I still get 2 days off as a pseudo-weekend and the previous 6-day week is balanced by the promise of a 4-day week to follow. Of course I hadn’t been copied on the schedule and, being the new guy, didn’t even know that there was one (or that I’d been scheduled). Luckily one of my coworkers noticed I was scheduled for this past Saturday and let me know Friday morning.
I don’t really care about working on Saturday, but I spent a good portion of Friday trying to figure out what kinds of alerts and problems I was likely to run into during the day, since I’d be manning the NOC alone. My work so far has been focused mainly on handling projects, so I haven’t been immersed in the daily goings-on in the NOC with the shift guys as much as I should have been. But what the hell, a trial by fire never hurt anyone…except those that get severely burned.
In order to make a good impression, and to make sure that the guy who’s been working third shift doesn’t have to stick around any longer than necessary after what I’m sure has been a long night, I plan to show up 15 minutes early. I go to bed at a very reasonable hour Friday night and get up early enough to ensure I can make the 45-minute drive to the office and arrive ahead of my scheduled start, as per my plan. I get out to my car on time, climb in the driver’s side, slide the key home, depress the clutch and turn…
grrrrrrrrrriiiiiiiiiiiiiiiiiiiiiiiiiiinnnnnnnnnnnnndddd.
Oh what the fuck.
grrrrrrrrrriiiiiiiiiiiiiiiiiiiiiiiiiiinnnnnnnnnnnnndddd.
You have got to be kidding me.
grrrrrrrrrriiiiiiiiiiiiiiiiiiiiiiiiiiinnnnnnnnnnnnndddd.
Luckily, I own a motorcycle as well as a car (“own” being a cute term for “the bank owns, and I pay them”). Even though the fog was thick this morning, I figured it would be better to get wet and cold than to miss my first shift because my car crapped out for the first time in 55,000 miles of service. I may not have windshield wipers to deal with the water that builds up on my face shield, but if you turn your head at 75 MPH, the wind does a pretty good job for you.
As I’m riding in I’m thinking to myself: “Does this bode well for my first day on shift?” Then I recalled that everyone at work told me that I’d be as bored as a cannibal at a fruit stand. That just made me more nervous. The more that people tell me how easy or boring something work-related is, the more I imagine it’s going to be painful.
So I get into work and immediately hit a locked door. Apparently one needs a key to get in these outside doors, separate from the keycard that gets us access everywhere else. Luckily, someone had blocked open another side door, so I (to use the Massachusetts colloquialism) booked it inside, up the elevator and into the NOC only a minute or two late. The third-shift guy tells me he hopes I have a good online game to play, or I’m going to be bored all day. I laugh and grit my teeth and prepare for the inevitable shit-storm.
By 9:00am I’ve been busy. I’ve been recycling app pools, restarting dead services and watching the environment. By 10:00am I’m still busy. I’ve been recycling still more app pools, replacing dead drives in SCSI-attached storage arrays and replying to email alerts. At 11:00am the power switches off and cuts over to gen. The lights go out and the air conditioners switch off for a few seconds, before coming back to life. I nearly have a heart attack. Luckily, I’ve had lots of experience with power cutovers and recognize that the rest of the building is fine and that we’re safely on generator in the data center. So I leave a message with my boss to let him know what happened. He calls me back and sheepishly explains that they do a weekly generator exercise and power cutover test every Saturday. I laugh it off. At 11:30, the power switches back to street. The lights switch off and the ACs rumble to a halt before coming back on a few seconds later.
That’s when I notice the sound. The AC right next to the door from the NOC into the data center, one of three, doesn’t sound right. It sounds a bit more labored, and the tone is higher than it should be, as though it’s not really doing its job. I get a bit nervous, but decide to let it run for a moment. I go back to the NOC and then, after a few more minutes of checking emails, back into the data center. Is it getting warmer in here? And then:
BEEP BEEP BEEP BEEP BEEP BEEP
The alarm console on the AC starts beeping to let me, or anyone nearby, know that there are three new alerts that need looking at before it’ll stop complaining. High head pressure 1, high head pressure 2 and Secondary GC pump on standby. Oh shit. The AC system has well and truly fucked off and died. I check the temperature (up to 80 from its usual 75) and head back into the NOC to call the boss. He doesn’t pick up, and I leave a message letting him know that I think the AC just shit itself. Instead of waiting for a call-back, I dig up the notebook with the AC service contact information and call the emergency number, getting a page sent to the on-call AC tech. I then call the second in command for my group, which begins the process of rallying the troops.
Suffice it to say that after many attempts at getting the AC running, resetting the pumps, resetting the compressors and generally praying, we eventually got the bastards to start cooling again. By the time they started doing their job again, it was already too late. We had lost a couple of RAM SANs to thermal-event shutdowns and we had to start bringing down as many systems as possible. It was over 115 in the room. The Itaniums and the SANs would start eating their own faces soon. So the site came down until we got the room back down to 85 degrees. The NOC was now full of managers and sysadmins and SAN guys all working together to bring the site back to life. The AC tech had blessed the AC system and scheduled some time during the week to come out and give it a full workover. I’d been fielding phone calls, responding to emails, running around the data center restarting servers and generally doing whatever I could to help get everything back.
And after all of that, I only spent an hour more in the office than I would have normally. Plus, I got some props for handling the situation with calmness and humor (I’ve dealt with AC/power failures before, and the one thing you learn about these big problems is that there are only so many things you can do in a situation where downtime is inevitable, so just do them and then grin).
Needless to say, I didn’t spend a lot of time looking for games to play during my first day on shift. Amusingly enough, my next scheduled shift is Christmas Eve…I wonder if I’ll get rescheduled.


November 14th, 2005 14:06
I can’t fucking believe your company does _live_ tests of its generator.
So if the tests fail, everyone dies.
That’s… umm… different than the way most places I’ve been to do it. I won’t even get into the non-redundant cooling :)
You’re bad juju, helly.
November 15th, 2005 13:28
Bad juju indeed…my next job is going to be in YOUR data center! MUUUUAAAHAHAHAHAHAHAHAHAhahahaahahahahaaaa*cough* *cough* *cough*
As far as live tests go: in my opinion, one has to do a full power cutover test from street to generator on a regular basis (although not once a week for sure) to be sure that the data center responds properly. If you never test that your ATS switches between power sources properly, that your UPSes can handle their load and that the generator powers on properly and stays up under load, you’re just asking to get blown out of the water in case of a real emergency. That being said, these live tests are best done with staff on hand capable of troubleshooting any problems that arise, and at a time when load on the systems is at its lowest so as to cause the least amount of downtime should a failure occur. Ideally, you should also be ready for a cutover to the DR site.
As far as the redundant cooling…tell me about it. Kinda crazy not to have that.