Gather ’round, children, and hear my tale of years yonder, when music was consumed on physical media.  The primary form factor from the late 1990s through the early aughts was the Compact Disc, or CD.  These fragile plastic discs came in a protective shell of even more fragile plastic, which seemed designed to shatter while you attempted to remove the protective plastic film that wrapped the shell.  So the device seen above was invented to slice open that plastic film in the hope that you could retrieve your CD without anything breaking.

The particular example above came from my short stint at Napster after they had been shut down by the courts, but before they filed for bankruptcy and sold everything to Roxio.

I’ll have to check whether these work on Blu-ray cases before those go obsolete as well.

Alcohol content: Pliny (beer)


I’m fairly ready for this year to be over.  I was actually fairly ready before January ended, but decided to wait and see.  February is drawing to its short close at this point and with a +/- margin of 5%, I’m 70% certain I could be done for the year.

Either it’s karmic retribution for some truly bad choices I’ve made over the last couple of years, or my warranty has expired.  Things have gone from bad to worse, with a few moments of greatness and a lot of sadness.  The background noise of another person of note dying this year just adds to the general mood.  I’m certainly done with the shouting match that the election season has become.

It would seem that a recurring upper respiratory infection has taken hold of me, each cycle lingering longer, with more convulsive coughing fits and lung butter.  Cops have taken to staking out my domicile as a potential meth lab; my pharmacist must have narced on me over my frequent and regular pseudoephedrine purchases this year.

Whatever I’ve got appears to be a viral infection.  Chest x-rays and other tests always come back negative for anything bacterial.  Some ingested and inhaled steroids are the order of the day to kill it off for good.  And this time, the doctors mean it.

If that were my only problem, then I’d tell myself to shut up and stop whinging (actually, I do that anyway, but I keep typing nonetheless).  Because for added fun, I’ve gone through four 12-lead EKGs, a chest x-ray, and a CT scan in February alone.  Three different blood draws were done in the last 7 days.  On the blood panels, everything looks damned near normal.

So at this point, none of the doctors seem to understand what’s going on with the Tin Man.

Monday is a new day, with more doctors.  Maybe some answers will come.

Alcohol content: other (cough syrup; codeine contraindicated)


It’s been a bit more than a year since the last major outage with AWS.  But in 2014, there were a few major moments.  Making sure you have an architecture that can survive failures is important, but sometimes the chaos monkeys run a bit more out of control than normal.

When Amazon has a Xen patch that requires a reboot, it can affect a whole lot of us.  They do a really good job of trying to avoid rebooting the underlying hosts that run the VMs, but sometimes it’s unavoidable.

In September 2014, AWS, by their own account, had to reboot 10% of their systems.  That’s potentially a whole world of hurt if that percentage overlaps significantly with your production environment.

Whether you use Amazon, Rackspace, or some other Infrastructure as a Service provider, you go into those environments expecting a certain level of impermanence. With the transient nature of virtual machines and the promise of being able to quickly spin up clones, you build out your services with a level of redundancy to handle failure.

Amazon recommends that you build your services to be completely stateless in order to work around the chaos. If you are doing simple web services with all of your state stored in RDS, then this is a trivial problem. But if you are deploying legacy applications or some other clustered data storage on top of EC2, life becomes more complicated.

Netflix has been the great champion of building reliable systems on top of Amazon AWS while expecting failure. They have created tools like Chaos Monkey and the rest of the Simian Army to help themselves survive the typical failures experienced in AWS and other cloud providers, and they have generously shared these tools with the greater community. But what happens when you get an email saying that over the next four days, 90% of your key infrastructure will be rebooted in 30% chunks? If you are Netflix, you probably already have a large enough infrastructure to manage a scenario of this magnitude. If your service has just a couple of instances with no state information, you can easily move it around the reboots. The real fun starts when you have a reasonably sized infrastructure but aren’t quite big enough to have a full active-active disaster recovery setup that enables zero downtime.

How Our Team Managed the Great AWS Reboot of 2014

One Thursday afternoon in September, we received a pleasant email from AWS letting us know that some of our instances might be affected by an upcoming maintenance. The fact that the email did not explicitly list the instances or when the maintenance would occur should have been the first sign of concern. When we visited the Events tab in our EC2 console, we saw a long list of what appeared to be every instance we had in that region. Then we checked the other regions we utilize; it was the same story in each case. In 13 hours, a large swathe of our production infrastructure would get rebooted.
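The same scheduled-event information shown in the console is also exposed through the EC2 API, which makes it easier to keep an eye on a long list of instances. Here is a minimal sketch, assuming boto3 (the AWS SDK for Python) and an illustrative region name, that lists instances with pending maintenance events:

    # Minimal sketch: list instances that have scheduled maintenance events.
    # The region name is illustrative; large fleets should also paginate.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # IncludeAllInstances reports stopped instances as well as running ones.
    resp = ec2.describe_instance_status(IncludeAllInstances=True)

    for status in resp["InstanceStatuses"]:
        for event in status.get("Events", []):
            print(
                status["InstanceId"],
                status["AvailabilityZone"],
                event["Code"],           # e.g. "system-reboot" or "instance-reboot"
                event.get("NotBefore"),  # start of the maintenance window
                event.get("NotAfter"),   # end of the maintenance window
            )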

By this time, the rumor mill that is the Internet was already spewing forth a lot of speculation, but Amazon was not releasing any details. How could they, given the nature of the security hole they were patching? We got on the phone with AWS support, multiple times, and received differing answers about the reasons and differing levels of assurance as to the recovery of our instances. The only logical conclusion we could draw was to work entirely around the reboots. We went about reviewing all of our instances on the list, determining which reboot window each was placed in, and creating replacement instances.

The main concern wasn’t so much the application servers within EC2.  Those were “stateless” enough that they would survive instances falling in and out of the group.  The NoSQL cluster that served as the primary data store, on the other hand, would have serious issues if a third of the cluster went away.  This is what needed to be examined closely to see how we could live through the reboots.

As is typical good practice, within any region we attempt to spread our load across all available availability zones (AZs).  The maintenance affected one AZ per region per reboot window.  Attempting to maintain this AZ balance resulted in a game of whack-a-mole: spin up new instances, then wait an hour to see whether those, too, would appear to require a reboot.  Eventually, we had enough new instances to expand the size of our cluster to counter the effects of the soon-to-be-lost nodes.  Once the NoSQL cluster was expanded, we deallocated the affected nodes so that their data would be moved off those instances.
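A quick way to sanity-check that AZ balance during the whack-a-mole phase is to count running instances per availability zone. A minimal sketch, again assuming boto3 and an illustrative region:

    # Minimal sketch: count running instances per availability zone.
    from collections import Counter

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    counts = Counter()

    paginator = ec2.get_paginator("describe_instances")
    for page in paginator.paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    ):
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                counts[instance["Placement"]["AvailabilityZone"]] += 1

    for az, count in sorted(counts.items()):
        print(az, count)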

With the data relocation storm proceeding well, we could focus on the application servers that we use in the care and feeding of our service.  Many of these applications would have survived without additional resources, but we decided that there should be no degradation of our service.  With additional resources spun up and affected instances taken out of rotation, we waited for the first wave of reboots…

And nothing happened.

Results

Our NoSQL cluster remained up and in good health for the entirety of the maintenance reboots. Writes of new records never blipped, slowed down, or otherwise became unavailable. We continued to be able to process records and make them available through our search UI and API as if nothing were happening within Amazon. In spite of a few late nights and early mornings, the effects of the AWS reboot events on our customers were nil. And in the process, we’ve made our operational processes even more resilient.

Lessons

  • Life happens: Always be prepared to act quickly when receiving a maintenance notice.

  • Visibility is critical: When dealing with a service provider’s support team during an emergency, there are no ‘little details’. Consult all your resources within the organization to get the clearest, most granular picture you can.

  • Prepare for (massive) failure from the start: It doesn’t matter who owns it; hardware fails, VMs fail, software fails. Your response plan can’t. Our plans to survive failure allowed us to deal with these reboots systematically and with confidence.

  • Documentation and training matter: Keep your documentation complete and up to date, and even your newest employees can jump in and help in a crisis, because they have a good reference upon which to base their knowledge.

  • The real enemy is silence: “No plan survives contact with the enemy.”  Make sure your disaster recovery plans are flexible and communicated. You will never account for all possible failure modes, but you can outline strategic paths.

  • Seconds count: Package and configuration management systems enable you to recover faster and more consistently.  Always look to automate, automate, automate (see the sketch after this list).

  • Localized war room: Sometimes getting everyone in one room to hash out a plan to solve a problem is the fastest way to a resolution, and the unity ensures all angles are covered.
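To illustrate the automation point in the “Seconds count” item above, here is a minimal sketch of launching a replacement instance whose user data bootstraps it from configuration management on first boot. The AMI ID, instance type, availability zone, and bootstrap commands are placeholders, not details of the actual setup described in this post:

    # Minimal sketch: launch a replacement instance that configures itself on boot.
    # All identifiers below are placeholders.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    user_data = """#!/bin/bash
    # Placeholder bootstrap: pull configuration from your config management
    # system so the new instance converges to the same state as the node
    # it replaces.
    yum install -y puppet-agent
    /opt/puppetlabs/bin/puppet agent --onetime --no-daemonize
    """

    resp = ec2.run_instances(
        ImageId="ami-12345678",     # placeholder AMI baked with your base packages
        InstanceType="m3.large",    # placeholder instance type
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": "us-east-1a"},  # AZ you are rebalancing into
        UserData=user_data,
    )
    print(resp["Instances"][0]["InstanceId"])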

This won’t be the last time it rains in the cloud… the network never sleeps, and sometimes neither do we.  The team I was working with was impressive.  They stopped everything, and everyone (CEO, Ops, Engineering, Support, Sales) came together and made a small sacrifice to ensure it was business as usual for our clients.


I was part of an unexpected layoff a while back. It was a quiet affair: myself and a couple of others let go with no announcement. As I packed up my things, a couple of my now-former coworkers handed me a small paper bag.  As a parting gift they had gotten me a bottle of Ron Zacapa rum: a very nice gesture and a really nice beverage. It wasn’t until much later, at home, when I was about to recycle the bag the bottle came in, that I noticed the card.


Simple, to the point, and awesome.  Thanks, guys!  Hope to work with you reprobates again someday.


Woe is Rosberg and the Mercedes on race day.  This is the same kind of performance we saw time and again last year: one of the fastest cars on the track for one lap and some possibly good qualifying speed, but then slowly losing position throughout the race.  Is this a tire issue?  That might explain why Hamilton was able to gain position.  He is very good at dealing with a squirrelly car with low traction.  And if Rosberg’s rear tires were going off prematurely, that would explain why he was so easy to pass.  Not being able to lay down the power out of the corners would make him a sitting duck.

The opening laps were amazing.  Heck, the first corner was crazy.  Rosberg defended his P1 position from Vettel so aggressively that he nearly opened the door for Alonso to pass them both from 3rd.  But Vettel eventually not only regained 2nd from Alonso but then passed Rosberg to take the race lead and hold it, basically, for the rest of the race.  Because, really, once Vettel took the lead he just walked away from the rest of the crowd.  At one point he had a 24-second lead over di Resta and was able to pit and return in the same position.

The Force India of di Resta did a great job and nearly made the podium.  It was still a fine result for the Scotsman, and he showed a great deal of car control and maturity as he let the Lotus of Grosjean past with five laps to go.  On his two-stop strategy, di Resta was no match for Grosjean’s fresh tires.

About the only people to have a worse day than Rosberg would be Alonso and Massa.  Alonso’s broken DRS really inhibited his progress, but even with two extra pit stops, he made a valiant effort.  Massa’s broken front wing led to all sorts of problems with his tires: first, the wear rate was far higher than expected, and then multiple punctures on his right rear didn’t help either.


Yeah, that’ll affect handling.

What I’m a bit confused about this season is the passing, or, more specifically, the defending.  As was beaten into the collective mind of the sport over the last few seasons: one move, and you must leave a car width on track if you are the defender.  On top of that, if you have two cars side by side with even a marginal overlap, the car in front is supposed to allow space for the other car on track.  We saw a lot of cars being pushed off track, the top three culprits being Rosberg, Perez, and Webber.


F1 meets WRC

Last year we would’ve seen notices of a stewards’ investigation of an incident.  And like 99% of all investigations, they would determine something after the race, which is annoyingly pointless, but that’s another article.

Still, we had a lot of action.  And again, action between teammates, this time Perez and Button.  Sadly, Button just sounded like a whiny bitch on the radio.  But it was a bit ridiculous to get tapped in the rear tire by your own teammate.


Team players.

So, overall, a really entertaining race.  It would’ve been interesting to see where the Ferraris ended up had they not had the myriad mechanical issues.  But still a fun race, even if we got a repeated podium.  Well, except for this:

This is a race EE.

Which gets to my last and only real gripe.  To Leigh Diffey, Steve Matchett, and David Hobbs: the constructor’s representative on the podium is an engineer who happens to be a woman, not a girl, a woman.  Holy crap, guys, it’s like you’d never seen a woman who worked in motor racing other than holding a grid sign!  I realize that you had no idea who this person was until Vettel mentioned her name and function in the interview with David Coulthard, but you could have attempted not to be blithering idiots for the few minutes up to that point.  Maybe you could’ve noticed that only 1 of the 3 governmental representatives actually shook her hand, because, well, women are scary.  But, fuck, just because some folks have ridiculous misogynistic institutional fears doesn’t mean you have to stoop to their level of cluelessness.

Gill Jones is responsible for all the electronics in the car and garage.  That’s right: all the fricken wiring, telemetry, radios, computers, fiddly buttons on steering wheels, ECU, KERS, etc.  She’s been at this game for a good number of years, along with a good number of other folks who just happen to be female.  You might want to take note: she is part of a large and growing segment of the population.  Try not to act like a sexist neanderthal.

Alcohol content: buzzin’ (dark and stormy ++)
