We try really hard to do the right thing at Xero, but occasionally things happen that we can’t control. Today was one of those days … In keeping with our goal of being transparent, we thought we’d share the challenges we faced during this morning’s outage and how we managed them.
Around 8:15am NZT, while doing some routine checks on our servers, both Craig, our Chief Technology Officer, and Paul, our Infrastructure Manager, lost their connection to our hosting environment. Almost immediately, our monitoring alerts indicated that a number of our production systems were unavailable. The initial indication was that this was due to a network issue, but shortly afterwards we got our first update from Rackspace advising that the problem was more widespread: a power outage at their Dallas data center, where Xero is hosted.
Fortunately our blog was still up, so at around 8:50am NZT we posted our first notification to customers explaining that Xero was unavailable due to this power outage. Shortly after that first blog post, Rackspace started to restore power to the data center, and the full system was back live around 9am NZT. The total outage was approximately 45 minutes.
Throughout the morning we continued to use both our blog and Twitter to keep everyone up to date. In fact, the volume of blog traffic caused a short outage on the blog itself! This was quickly rectified by extending the hosting capacity specifically for the blog.
Power is obviously critical to our service, as it is to any online service. What happened today is a very rare occurrence, especially for a provider such as Rackspace that prides itself on high availability. Over more than 2 years, our system availability (our service level) has been 99.99%. We are standing by for a full debrief from Rackspace, and from this we will consider what further improvements we can jointly make to minimize the risk of a similar outage.
We are dedicated to providing a world-class service to our customers and we apologize to anyone who was affected by this downtime, but we stress that at no time was there any risk to your data. We trust that the open and frequent updates kept everyone abreast of what was happening.