A couple of days ago there was an operations problem with one of our competitors that not only resulted in a long unexpected outage, but also some data loss for their customers. To be honest I felt awful for them – we’ve had some unexpected outages ourselves and as part of the operations team here at Xero I can tell you it’s absolutely the worst feeling in the world and a stress that I can’t begin to describe.
In light of these problems, and of some discussion happening elsewhere on the topic, I thought I would highlight some of what we do at Xero to minimize the possibility of an outage and, if the unforeseeable does occur, to keep your data as safe as possible.
Like other small companies, when we launched the first beta of Xero we had a single server. It was hosted with an excellent service provider, and we did all the requisite backup best practices, but it was still a stressful experience to know we had a single point of failure. We knew we needed to be better than that, but it takes a lot of resources to be able to deliver an operations strategy to the levels of service that customers expect. This is one of the big costs of building software-as-a-service applications and this is one of the key reasons we did the IPO – to give us the resources to execute that strategy.
Immediately after the IPO we went into full implementation mode to build out our production infrastructure to enterprise grade.
This is not an insignificant investment. Let’s start with some numbers:
- 7 Rackspace hosted production web servers behind dual redundant F5 Big-IP load balancers
- 4 production database servers, 2 active at any one time
- 4 production VMs running service delivery and ancillary apps like the API
- 3 production VMs within a secure zone for bank feeds
- Almost 1TB of RAID10 SAN for the production data to store the 4+ million accounts receivable invoices and the 30+ million journals generated by customers so far
- Over 50,000 Akamai servers enabling a global content delivery network
- 20+ VMs for monitoring, intrusion detection, pre-production, stage, dev and other back office systems
- 3 dedicated Xero IT Operations staff working with extensive service delivery teams at both Rackspace (production and pre-production systems) and Revera (development and back office systems) hosting partners
Those are just some of the numbers from today (and they’ll no doubt change by tomorrow as we continue to grow!) but they only tell a part of the story. These are some of the things we did to get to this point:
Choose a hosting provider
We’ve talked about Rackspace a lot in the past, but we can’t begin to highlight how important it was for us to choose a managed hosting provider that could cope with the service levels we required. Not only do they offer 100% network availability through state-of-the-art data centers, but they have the expertise on staff to back the technology up. With Rackspace we have a dedicated team of people to deal with any issues that may arise. Rackspace are more than just a partner – they’re actually an integrated part of our operations team.
If you have one, buy another one. “n” should never equal one in your data center. Every layer should be redundant. Multiple network providers, multiple power sources (with multiple backup generators), redundant firewalls, load balancers and highly available clustered servers – everything should be redundant (another benefit of going with a hosting provider like Rackspace is that they completely understand this and build their data centers to this model). To be honest, hardware failures are very rare – but simultaneous failures on two or more devices are rarer still, so it’s better to be safe than sorry. As far as your data is concerned, it’s housed in a Fibre Channel RAID 10 SAN – lots and lots of redundant disk, with multi-path adapters through to an active/passive SQL Server cluster: 4 database servers, 2 running at any one time, and the others ready to go if a primary fails. The standby servers are also handy for patching – we switch database servers at every release to make sure both sets are fully operational and completely up-to-date.
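To put rough numbers on why “n” should never equal one, here is a minimal sketch of the arithmetic. It assumes device failures are independent (real failures often are not, which is part of why power and network paths are kept separate), and the 99% availability figure is made up for illustration rather than taken from our environment. The function names are hypothetical.

```python
from math import prod


def all_fail_probability(p_single: float, n: int) -> float:
    """Chance that all n devices in a layer are down at once.

    Assumes each device fails independently with probability p_single.
    """
    return p_single ** n


def system_availability(layers: list[tuple[float, int]]) -> float:
    """Availability of a stack of layers (firewall, load balancer, ...).

    Each layer is (availability_of_one_device, number_of_redundant_devices).
    A layer is up if at least one of its devices is up; the system is up
    only if every layer is up.
    """
    return prod(1 - all_fail_probability(1 - a, n) for a, n in layers)


# Illustrative only: two layers built from 99%-available devices.
single = system_availability([(0.99, 1), (0.99, 1)])   # no redundancy
paired = system_availability([(0.99, 2), (0.99, 2)])   # a spare in each layer
print(f"single: {single:.4%}  paired: {paired:.4%}")
```

Under these assumptions, doubling up each layer moves the combined figure from roughly 98% to better than 99.9%, which is the intuition behind making every layer redundant.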
Obviously you have to back up. The first stage of any business continuity plan is to back up. But it’s actually the restore that’s important. Have you ever run a fire drill on your backups? Do it right now – take your most recent backup and restore it to another server. Did it work? Is it corrupted? How long does it take? How much data did you lose? We run weekly full backups, nightly incrementals and transaction log backups of our database every 10 minutes. We run all of these straight to disk (to a different RAID 10 SAN device than the transactional data) so that we can recover your data very quickly if we need to. We also run the backups to tape for storage. And we restore at least once a month to a server running in a separate zone at Rackspace.
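A fire drill is easier to reason about once you know which backups a restore actually needs. The sketch below is illustrative Python, not our tooling: the `Backup` type and function names are hypothetical, and it assumes each nightly incremental builds on the one before it, so every incremental since the last full backup has to be replayed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Backup:
    kind: str            # "full", "incremental" or "log"
    finished: datetime   # when the backup completed


def restore_chain(backups: list[Backup], target: datetime) -> list[Backup]:
    """Return the backups to restore, in order, to get as close to target as possible."""
    usable = sorted((b for b in backups if b.finished <= target),
                    key=lambda b: b.finished)
    fulls = [b for b in usable if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup exists before the target time")
    base = fulls[-1]
    # Every incremental taken after the most recent full backup.
    incrementals = [b for b in usable
                    if b.kind == "incremental" and b.finished > base.finished]
    last = incrementals[-1] if incrementals else base
    # Transaction log backups taken after the last full or incremental.
    logs = [b for b in usable if b.kind == "log" and b.finished > last.finished]
    return [base, *incrementals, *logs]


def data_loss(chain: list[Backup], target: datetime) -> timedelta:
    """Work done between the last restorable backup and the target time."""
    return target - chain[-1].finished
```

With log backups every 10 minutes, the gap reported by `data_loss` should stay under 10 minutes as long as every backup in the chain restores cleanly, which is exactly what the monthly restore is there to prove.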
We run a mixture of virtualized and dedicated hardware. Our virtualization is essentially what’s known as a private cloud – our own set of virtualized servers running on our own dedicated hardware, with SAN stored VMs. This allows us to better manage our capacity planning and utilize the full resources of the hardware available to us. It also allows us to do cool stuff like move VMs around depending on utilization, back up entire VMs for easy recovery, provision new “servers” very quickly by adding pre-configured VMs, add more overall processing power with an additional hypervisor, and rescale the resources allocated to each VM. Having said that, our database servers and our main app servers still run on dedicated hardware – we take a horses for courses approach to virtualization – it’s important to remain pragmatic and react appropriately.
Are we done yet? No. Within the next month we’re provisioning an additional set of servers running in an entirely different data center (a project that’s been underway for the last few months). One of the tasks of the new data center is to act as a live offsite backup. We’ll be using log shipping over a secure site-to-site VPN to give us a live, replicated offsite backup of our databases. One of the problems with offsite backup is that it usually goes to tape, or to a simple filestore, so recovery can be very slow. Our approach not only provides us with a near real-time offsite backup of your data, but also greatly improves our ability to recover.
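For readers who haven’t met log shipping before, the idea is simply to copy each new transaction log backup to the second site, in order, so a standby database can replay them. The sketch below shows only the copy step in plain Python. It is not SQL Server’s log shipping feature or our implementation: the function name is hypothetical, and it assumes log backups land in a directory as `.trn` files whose names sort in the order they were taken.

```python
import shutil
from pathlib import Path


def ship_new_logs(primary_dir: str, standby_dir: str, suffix: str = ".trn") -> list[str]:
    """Copy log backups that the standby does not have yet, oldest first.

    Returns the names of the files shipped on this run.
    """
    primary = Path(primary_dir)
    standby = Path(standby_dir)
    standby.mkdir(parents=True, exist_ok=True)

    already_shipped = {p.name for p in standby.glob(f"*{suffix}")}
    shipped = []
    # Sorting by name keeps the logs in sequence (assumes sortable file names).
    for log in sorted(primary.glob(f"*{suffix}"), key=lambda p: p.name):
        if log.name not in already_shipped:
            shutil.copy2(log, standby / log.name)
            shipped.append(log.name)
    return shipped
```

Run on a schedule that matches the log backup interval, a loop like this keeps the standby only a few minutes behind the primary, which is where the “near real-time” part comes from. A real setup would also need to restore the logs on the standby and alert when shipping falls behind.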
And finally: hire good people! We have an amazing dedicated operations team that has intimate knowledge of our applications, databases and our needs as a business, working with the team at Rackspace to provide 24/7 service, support and strategy. They’re constantly monitoring our production environment – understanding the performance of every part of the system, capacity planning new areas of the system, maintaining the security of the system and keeping it up as much as possible. The culture in the team is not one of avoiding failure – it’s about being the best, and continuing to make Xero better to deliver the best service possible to our customers.
7 September 2010 #
8 September 2010 #