1999… One day I worked for a company I liked, and the next day I was
part of a major corporate conglomerate. In the days before the deal,
everyone was told from the highest levels that things would not change
much, at least not for a while. Reassurances were made that certain
people would keep their jobs and that those jobs would be crucial to
the new merged company. Mostly we were told that this was a merger and
not a buyout. All that changed the day the deal was completed.
The HR department ended up spending the entire night processing the
list of people who would get laid off the next morning. Then there was
the paperwork that needed to be generated for the people who would be
allowed to stay. No longer would you be an employee of Netscape, but
an AOL drone. The offers to stay were boilerplate, and everyone was
given a gift of 100 stock options. The options, though, were completely
worthless at a strike price of $108.00 on an already sinking stock. Many
people left after the deal even though they had not been laid
off. Many more left before the deal was even complete, on sheer
principle. Others, like me, stayed on, still believing that
there really wouldn’t be that much change and that things could not
get that bad.
Initially, after the shock of the layoffs wore off, things
stabilized and were fine. Life was nearly normal. For a few months
nothing really changed except some additional responsibilities and the
addition of a new project. The drink machine remained free and the
price of snacks didn’t change, so overall, everything was fine.
Work continued, and the new project led to discovering more and
more disturbing things about the “merger.” Since the computer systems
that would run the new service I was about to bring up would be
housed in AOL datacenter space, I ended up having to deal with more
AOL personnel than would normally be the case. I discovered their
internal disorganization and their great inertia against change. There was
one way to operate, and that was the AOL way. Even when that way was
old and outdated and served no purpose, it was the only way to
operate. Groups within AOL were highly compartmentalized, with
little to no communication between them. Attempting to get the AOL
NOC in touch with the person responsible for a machine was a
nightmare of fruitless hunting. They didn’t have the answers, and it
took them hours to find anyone who did. Then, typically, the person in
question would be completely unreachable except at their desk.
The event that broke the camel’s back, though, happened one lovely
Thursday morning when a PDU (Power Distribution Unit) failed back at
AOL’s main datacenter. Even though all my machines had dual power
and had been specified to be powered through redundant PDUs, they were
down. The first call to the AOL NOC went something like this:
“Hi, this is Andy from Netscape Operations, and the servers in section
Z, row 24, cabinets 4 through 9 are unreachable. What’s going on?”
The NOC drone responds after a pause, “We had a PDU fail, sir.”
“Those machines are dual-power; they should be on a redundant PDU.”
“Uh, I don’t know. I’ll look into it. What’s your number so I can
call you back in a half hour?”
A half hour turned out to be the minimum turnaround time for any NOC
request, I would come to find out.
“Wait,” I asked, “What’s the ETA on the PDU getting fixed?”
“Not sure, sir,” the drone responded, “I’ll get back to you on
that.”
For thirty minutes, I ping and retest to see if the site has returned.
At the fifteen-minute mark, per Netscape protocol, I escalate up the
chain of command to let folks know what’s down and why. The lack of
an ETA doesn’t make anyone happy. A few minutes after the thirty pass,
I get my call back.
“What’s the story?” I ask without allowing for a greeting.
“The PDU will take another 4 to 5 hours to replace. Then everything
should come up.”
“WHAT?” I shout down the line.
“These are production systems. They shouldn’t have been allowed to be
down this long. Why aren’t the secondary power plugs in a different
PDU?”
“I don’t know. The ticket on the order for power was closed as
completed.”
“Well, can we get someone to change the plugs now?” I ask, attempting
to calm down.
“I could create you another ticket. But since all the power folks are
working on the PDU, I doubt they’ll be able to look at this until
Monday,” the drone responds without emotion.
“Make the ticket and escalate it sev 1,” I demand.
The ticket gets made, and I’m told I’ll get a call back in another half
hour on the status of that and the original PDU. Through the regular
half-hour calls, I badger my management chain to bug their AOL peers.
The lack of availability by desk phone, mobile phone, email, or pager
strains credulity. While contacts are made and home phone numbers
exchanged, no real progress is made. In slightly more than four
hours, the replacement PDU comes online and my servers start lighting
up. But not all of them.
I call the NOC, “Has power been completely restored?”
“Yes, sir.”
“Some of my servers are not coming up. Can you check if they have
power?”
“Sure, let me get back to you in a half hour,” comes the typical and
not unexpected response.
At least at this point I have more work I can do to continue restoring
the system. By the time I get the call back to tell me that the
machines have power and show as on, I have a theory as to the
problem.
“Can you get someone on the console and have them type boot on the
following machines?” I ask the NOC drone.
“Sorry, I can’t do that.”
“That’s fine. Just find someone who can.”
“No, sir, I can’t have anyone do that until after the change freeze.”
A change freeze is a period during which nothing is modified within a
system, in order to ensure stability. It’s fairly standard practice,
but break-fix work is usually exempt.
“What do you mean there’s a change freeze? This is a production
system which is down. It should’ve been back up hours ago. When can
someone get in there to fix it?”
Without a sense of urgency, the drone responds, “The freeze is in
effect from 4 p.m. Thursday through 9 a.m. Monday.”
In a fit of apoplexy, I hang up my cell phone and then start slamming
the receiver of my desk phone, the latter being so much more
satisfying. Running to find my VP, I explain the inanity of the
situation with far too many expletives to be politically correct. A
few calls and 45 minutes later, some monkey is in the datacenter typing
boot on about a dozen servers and validating that they come up. After
I give the all clear, my boss and the VP come over to tell me to go
home.
“You can write up the post-mortem later in the week. Get some rest.”
“Yeah, you look like you might kill someone.”
I apologize for my behavior and slouch in my chair.
“It’s fine,” explains the VP, “They need to learn from us. It is
completely unacceptable how much of their infrastructure is down at
any time. And the lack of response is really bad.”
I shrug as they leave, pack up my gear, and go home.
As much talk as there was about AOL learning from us, the fiefdoms and
bureaucracy were too well entrenched to really change. Netscape was
just another conquered bit of territory. Bits of it were shut down or
given away to partners. Over time, almost everything was moved from
California to headquarters in Dulles, Virginia. Before that happened,
though, I left for a startup. It was a better choice than waiting to
get laid off some other week. It was a better choice than walking
around the empty, sad halls of Netscape.
A startup. That was the ticket…
Alcohol content: None (They killed Beer Friday)
…