1999… One day I worked for a company I liked, and the next day I was part of a major corporate conglomerate. In the days before the deal, everyone was told from the highest levels that things would not change much, at least not for a while. Reassurances were made that certain people would keep their jobs and that those jobs would be crucial to the new merged company. Mostly we were told that this was a merger and not a buyout. All that changed the day the deal was completed.
The HR department ended up spending the entire night processing the list of people who would be laid off the next morning. Then there was the paperwork that needed to be generated for the people who would be allowed to stay. No longer would you be an employee of Netscape, but an AOL drone. The offers to stay were boilerplate, and everyone was given a gift of 100 stock options. The options, though, were completely worthless at a strike price of $108.00 on an already sinking stock. Many people left after the deal even though they had not been laid off. Many left before the deal was even complete, on sheer principle. Others, like me, stayed on, still believing that there really wouldn’t be that much change and that things could not get that bad.
Initially, after the shock of the layoffs wore off, things stabilized. Life was nearly normal. For a few months nothing really changed except some additional responsibilities and a new project. The drink machine remained free and the price of snacks didn’t change, so overall, everything was fine.
Work continued, and the new project lent itself to discovering more and more disturbing things about the “merger.” Since the computer systems that would run the new service we were about to bring up would be housed in AOL datacenter space, I ended up having to deal with more AOL personnel than would normally be the case. I discovered their internal disorganization and their great inertia against change. There was one way to operate, and that was the AOL way. Even when that way was old and outdated and served no function, it was the only way to operate. Groups within AOL were highly compartmentalized, and there was little to no communication between them. Attempting to get the AOL NOC to reach the people responsible for a machine was a nightmare of fruitless hunting. They didn’t have the answers, and it took them hours to find anyone who did. Then, typically, the person in question would be completely unreachable except from their desk.
The event that broke the camel’s back, though, happened one lovely Thursday morning when a PDU (Power Distribution Unit) failed back at AOL’s main datacenter. Even though all my machines had dual power and had been specified as being powered through redundant PDUs, they were down. The first call to the AOL NOC went something like this:
“Hi, this is Andy from Netscape Operations, and the servers in section Z, row 24, cabinets 4 through 9 are unreachable. What’s going on?”
The NOC drone responds after a pause, “We had a PDU fail, sir.”
“Those machines are dual-power they should be on a redundant PDU.”
“Uh, I don’t know. I’ll look into it. What’s your number so I can call you back in a half hour?”
A half hour, I would come to find out, was the minimum turnaround time for any NOC request.
“Wait,” I asked, “What’s the ETA on the PDU getting fixed?”
“Not sure, sir,” the drone responded, “I’ll get back to you on that.”
For thirty minutes, I ping and retest to see if the site has returned. At the fifteen-minute mark, per Netscape protocol, I escalate up the chain of command to let folks know what’s down and why. The lack of an ETA doesn’t make anyone happy. A few minutes after the thirty pass, I get my call back.
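That ping-and-retest routine is the kind of thing ops folks script in a few minutes. A minimal sketch in Python, assuming a plain TCP connect stands in for the real checks (the hosts, ports, and the throwaway local listener below are illustrative stand-ins, not the actual rack addresses):

```python
import socket

def check_host(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def down_hosts(targets, timeout=2.0):
    """One sweep over (host, port) pairs; returns the ones still unreachable.

    In practice you'd run this sweep on an interval and page the chain of
    command once the outage crosses the escalation threshold (15 minutes,
    per the protocol above).
    """
    return [(h, p) for (h, p) in targets if not check_host(h, p, timeout)]

# Demo: a throwaway local listener stands in for a rack host that came back up.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]

reachable = check_host("127.0.0.1", port)  # listener is up, so this succeeds
listener.close()
```

Once the listener is closed, the same port shows up in `down_hosts` again, which is all a reachability sweep really tells you: up, or not up yet.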
“What’s the story,” I ask without allowing for a greeting.
“The PDU will take another 4 to 5 hours to replace. Then everything should come up.”
“WHAT?” I shout down the line.
“These are production systems. They shouldn’t have been allowed to be down this long. Why aren’t the secondary power plugs in a different PDU?”
“I don’t know. The ticket on the order for power was closed as completed.”
“Well, can we get someone to change the plugs now,” I ask, attempting to calm down.
“I could create you another ticket. But since all the power folks are working on the PDU, I doubt they’ll be able to look at this until Monday,” the drone responds without emotion.
“Make the ticket and escalate it sev 1,” I demand.
The ticket gets made and I’m told I’ll get a call back in another half hour on the status of that and the original PDU. Through the regular half-hour calls, I badger my management chain to bug their AOL peers. The lack of availability by desk phone, mobile phone, email, or pager strains credulity. While contacts are made and home phone numbers exchanged, no real progress is made. In slightly more than four hours, the replacement PDU comes online and my servers start lighting up. But not all of them.
I call the NOC. “Has power been completely restored? Some of my servers are not coming up. Can you check if they have power?”
“Sure, let me get back to you in a half hour,” comes the typical, and not unexpected, response.
At least at this point I have more work I can do to continue restoring the system. By the time I get the call back to tell me that the machines have power and show as on, I have a theory as to the problem.
“Can you get someone on the console and have them type boot on the following machines,” I ask the NOC drone.
“Sorry, I can’t do that.”
“That’s fine. Just find someone who can.”
“No, sir, I can’t have anyone do that until after the change freeze.”
A change freeze is a period during which nothing is modified within a system, in order to ensure stability. It’s fairly standard practice, but break-fix work is usually exempt.
“What do you mean there’s a change freeze? This is a production system which is down. It should’ve been back up hours ago. When can someone get in there to fix it?”
Without a sense of urgency, the drone responds, “The freeze is in effect from 4 p.m. Thursday through 9 a.m. Monday.”
In a fit of apoplexy, I hang up my cell phone and then start slamming the receiver of my desk phone. The latter is so much more satisfying. Running to find my VP, I explain the inanity of the situation with far too many expletives to be politically correct. A few calls and 45 minutes later, some monkey is in the datacenter typing boot on about a dozen servers and validating that they come up. After I give the all clear, my boss and VP come over to tell me to go home.
“You can write up the post-mortem later in the week. Get some rest.” “Yeah, you look like you might kill someone.”
I apologize for my behavior and slouch in my chair.
“It’s fine,” explains the VP, “They need to learn from us. It is completely unacceptable how much of their infrastructure is down at any time. And the lack of response is really bad.”
I shrug as they leave and pack up my gear and go home.
As much talk as there was about AOL learning from us, the fiefdoms and bureaucracy were too well entrenched to really change. Netscape was just another conquered bit of territory. Pieces of it were shut down or given away to partners. Over time almost everything was moved from California to headquarters in Dulles, Virginia. Before that happened, though, I left for a startup. It was a better choice than waiting to get laid off some other week. It was a better choice than walking the sad, empty halls of Netscape.
A startup. That was the ticket…
Alcohol content: None (They killed Beer Friday)