The Planet Downtime
Ah, the joys of being a dedicated server provider. Not only do you get to deal with whiny customers who pay nearly nothing and expect the world, but once in a while Mother Nature steps in and decides to mix things up a bit: Explosion At ThePlanet Datacenter Drops 9,000 Servers. Normally I would just chuckle and move on, except my server happens to be hosted in that data center. Blah.
The issue started Saturday evening around 5pm when connectivity to my server glitched for about 5 minutes. Didn’t seem like a big deal, until about 15 minutes later when it died completely. I really don’t do much with my server, so I let it slide another 15 minutes. By this time the traceroutes were looking strange so I submitted a ticket, then checked the forums and found out that other people were having problems too. The bandwidth graphs showed a huge drop in traffic across 4 of their providers, so it was something major.
Around 7:30 someone from TP staff posted on their forums and said a transformer caught on fire. Been there, done that .. shit happens, eh? Then around 9:30 it got updated to “the electrical room exploded, knocking down 3 walls”. Whoa, have not seen that one happen before. Here is the forum thread where they have been updating customers and a web site for live updates:
http://forums.theplanet.com/index.php?showtopic=90185
http://service-update.theplanet.com/
Having worked for a couple of hosting providers before, I feel like I have a unique perspective on this. As a customer, I am fairly annoyed and very glad that I do not use my server for revenue generation. However, as someone who has been on the provider side, I have mixed feelings about the issue:
- MINUS: How does such a massive explosion take place out of nowhere? It is rare to see power infrastructure catastrophically fail without outside interference. This could be a natural disaster, human stupidity (pick-up truck anyone?), or negligence. The DC engineer I worked with performed thermal inspections of all power mains every 1-3 months to make sure nothing was overheating, since overheating is a sure sign that something needs to be fixed. Something major like this should give off warning signs before it literally blows up.
- PLUS: Their response to the disaster has been very impressive. They mobilized a lot of people very quickly, assessed the situation, and got a working solution in place within 48 hours. 48 hours seems like a long time, but for losing an entire electrical room that is pretty good.
- MINUS: Contingency plans. Yeah, this is pretty much a freak accident and the fire department would not allow them to use their own generators, but there needs to be a backup plan. I seem to recall another Texas provider who was able to get several rolling generators on-site within 2 hours of a similar power incident. Anyone remember the ISP that managed to stay running through Hurricane Katrina and the debacle that followed? That is one hell of a contingency plan.
- PLUS: They have been proactive about letting customers know how they are going to handle service credits, and have even offered free server moves to another data center due to ongoing power issues. I don’t foresee them giving customers a hard time when it comes to living up to their SLA guidelines.
- MINUS: Mission critical infrastructure in one place! They had all of their legacy DNS servers and customer management portal in the same data center. The loss of DNS meant that servers that were not directly affected were still useless to their owners because no one could resolve their domains.
As for the customers who complain about the excessive downtime and how much it is costing them: if downtime costs so much, why are you not hosting in a high availability environment? Why do you not have servers in multiple data centers? It is kind of hypocritical to gripe at the hosting provider for not being fully redundant when your server infrastructure is not redundant either. The Planet has 5 or 6 data centers that customers can choose from, and there are many other providers with data centers across the country. Worst case, you have a few hours of downtime if you’re using Lazy Man’s Failover (DNS).
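Here is a rough Python sketch of the idea behind DNS failover. Treat all of it as an assumption rather than a recipe: the function names are made up, the addresses come from the reserved documentation ranges, and the TTL and check interval are example values. The step that actually updates the DNS record is left out on purpose, since it depends entirely on your DNS provider.

```python
import socket
from typing import Callable


def tcp_check(host: str, port: int = 80, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, timed out, or unresolvable: treat as down.
        return False


def pick_address(
    primary: str,
    backup: str,
    is_up: Callable[[str], bool] = tcp_check,
) -> str:
    """Return the address the A record should point at right now."""
    return primary if is_up(primary) else backup


def worst_case_downtime(ttl_seconds: int, check_interval_seconds: int) -> int:
    """Rough upper bound on downtime, in seconds.

    The failure can go unnoticed for one full check interval, and a resolver
    that cached the old record just before the switch keeps it for one TTL.
    Time spent pushing the update to the DNS provider is ignored.
    """
    return check_interval_seconds + ttl_seconds


# Hypothetical usage, run from cron on a machine outside both data centers:
#   target = pick_address("192.0.2.10", "198.51.100.20")
#   ...then point the A record at `target` using your DNS provider's tools.
```

With a one hour TTL and a check every five minutes, the rough bound works out to 65 minutes. Some resolvers may hold on to records longer than the TTL says, which is why I call it a few hours in the worst case.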