As you may have seen through the Xero in-app notifications, we’ve scheduled an outage at 5am Monday 4 July 2011 NZT (click here to find out what time it is for you) to undertake some hosting infrastructure maintenance.
We wanted to share a little more about this particular outage and some further infrastructure upgrades we’re planning in the coming weeks as we continue to expand our platform to accommodate our rapid customer growth.
In recent weeks, Xero had two short unscheduled outages. Our investigations showed that both incidents were caused by a form of ‘race condition’ or deadlock in our Microsoft SQL Server database layer, which caused requests to the database to lock up. We escalated the issue to Microsoft Premier Support, through our hosting partner Rackspace, which identified a software patch for this particular deadlock issue. We’ve successfully tested this fix in our dev and staging environments and we’re now ready to roll it out to our production database environment.
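For readers curious about what a deadlock looks like in general, here is a minimal Python sketch. It is a generic illustration only, not the SQL Server issue described above: the names `simulate_lock_cycle`, `session_1` and `session_2` are hypothetical, and two in-process locks stand in for database resources. Each worker takes one lock and then asks for the other, so neither can proceed. A timeout is used so the demonstration reports the stall rather than hanging.

```python
import threading


def simulate_lock_cycle(timeout_s=0.2):
    """Run two workers that take two locks in opposite order.

    Returns a dict mapping each (hypothetical) session name to whether it
    managed to acquire its second lock before the timeout expired.
    """
    lock_a = threading.Lock()
    lock_b = threading.Lock()
    barrier = threading.Barrier(2)  # keeps the two workers in step
    results = {}

    def worker(name, first, second):
        with first:
            # Wait until both workers hold their first lock.
            barrier.wait()
            # Each worker now needs the lock the other one is holding.
            acquired = second.acquire(timeout=timeout_s)
            results[name] = acquired
            if acquired:
                second.release()
            # Keep holding the first lock until both attempts have finished,
            # so neither worker can succeed just because the other gave up.
            barrier.wait()

    t1 = threading.Thread(target=worker, args=("session_1", lock_a, lock_b))
    t2 = threading.Thread(target=worker, args=("session_2", lock_b, lock_a))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return results


if __name__ == "__main__":
    print(simulate_lock_cycle())
```

Without the timeout, both workers would wait on each other indefinitely, which is the "lock up" behaviour a deadlock produces. Real database engines have their own detection and resolution mechanisms, which this sketch does not attempt to model.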
As we use redundant database clusters in the production environment, it may have been possible to make these changes without an outage. However, due to the nature of the fix, we’ve chosen to take the more cautious approach and schedule an outage at a low usage time to ensure the fix is applied without any problems or risk to customer data.
We’re pleased to have identified the cause of these two recent issues. It’s disappointing to have any unscheduled outages, but to put these in context, we’ve maintained 99.99% service availability since we launched Xero more than four years ago.
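To give a sense of scale for an availability percentage, the arithmetic can be sketched as follows. This is an illustrative calculation only: the helper `allowed_downtime_minutes` is a hypothetical name, the 365-day year is an assumption, and the post does not say how availability is measured or whether scheduled outages are counted.

```python
def allowed_downtime_minutes(availability_pct, period_days):
    """Minutes of downtime that fit within an availability percentage."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    # Assumption: a 365-day year.
    per_year = allowed_downtime_minutes(99.99, 365)
    print(f"99.99% over one year allows about {per_year:.1f} minutes of downtime")
```

Under these assumptions, 99.99% corresponds to roughly 52.6 minutes of downtime per year.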
Looking ahead, we have some other big platform changes happening through July. We’re replacing our current database server hardware with new servers, increasing redundancy with additional active cluster nodes, and migrating all our production storage to dedicated SAN storage. The new hardware has twice the capacity of our current database platform and is also an important step towards our horizontal scale-out strategy.
While we can make much of this move without any scheduled outages, we’re going to need to take Xero Personal offline for a couple of hours next weekend and then take all of the Xero apps offline in late July, again for around two hours. Of course, you’ll get an app notification with more details on these scheduled outages several days before they happen.
These changes to the hosting platform will step us up another level in scalability so we can continue to accommodate the accelerating growth we are seeing in customers and transactions on the Xero platform.
Update: The database upgrade was completed in 55 minutes with a further 10 minutes of testing. Thanks for your patience.