
Xero Operations

A couple of days ago one of our competitors had an operations problem that resulted not only in a long unexpected outage but also in some data loss for their customers. To be honest, I felt awful for them – we’ve had some unexpected outages ourselves, and as part of the operations team here at Xero I can tell you it’s absolutely the worst feeling in the world and a stress that I can’t begin to describe.

In light of these problems, and some discussion happening elsewhere on the topic, I thought I would highlight some of what we do at Xero to minimize the possibility of an outage and to make sure that, if the unforeseeable does occur, your data is safe.

Like other small companies, when we launched the first beta of Xero we had a single server. It was hosted with an excellent service provider, and we did all the requisite backup best practices, but it was still a stressful experience to know we had a single point of failure. We knew we needed to be better than that, but it takes a lot of resources to be able to deliver an operations strategy to the levels of service that customers expect. This is one of the big costs of building software-as-a-service applications and this is one of the key reasons we did the IPO – to give us the resources to execute that strategy.

Immediately after the IPO we went into full implementation mode to build out our production infrastructure to enterprise grade.

This is not an insignificant investment.  Let’s start with some numbers:

  • 7 Rackspace hosted production web servers behind dual redundant F5 Big-IP load balancers
  • 4 production database servers, 2 active at any one time
  • 4 production VMs running service delivery and ancillary apps like the API
  • 3 production VMs within a secure zone for bank feeds
  • Almost 1TB of RAID10 SAN for the production data to store the 4+ million accounts receivable invoices and the 30+ million journals generated by customers so far
  • Over 50,000 Akamai servers enabling a global content delivery network
  • 20+ VMs for monitoring, intrusion detection, pre-production, stage, dev and other back office systems
  • 3 dedicated Xero IT Operations staff working with extensive service delivery teams at both Rackspace (production and pre-production systems) and Revera (development and back office systems) hosting partners

Those are just some of the numbers from today (and they’ll no doubt change by tomorrow as we continue to grow!) but they only tell a part of the story. These are some of the things we did to get to this point:

Choose a hosting provider

We’ve talked about Rackspace a lot in the past, but we can’t begin to highlight how important it was for us to choose a managed hosting provider that could cope with the service levels we required. Not only do they offer 100% network availability through state-of-the-art data centers, but they have the expertise on staff to back the technology up. With Rackspace we have a dedicated team of people to deal with any issues that may arise. Rackspace are more than just a partner – they’re actually an integrated part of our operations team.

Multi-path everything

If you have one, buy another one. “n” should never equal one in your data center. Every layer should be redundant: multiple network providers, multiple power sources (with multiple backup generators), redundant firewalls, load balancers and highly available clustered servers. Another benefit of going with a hosting provider like Rackspace is that they completely understand this and build their data centers to this model. To be honest, hardware failures are very rare, but simultaneous hardware failures on two or more devices are even rarer, so it’s better to be safe than sorry. As far as your data is concerned, it’s housed in a Fibre Channel RAID 10 SAN – lots and lots of redundant disk – with multi-path adapters through to an active/passive SQL Server cluster: 4 database servers, 2 running at any one time, and the others ready to go if the primary one fails. This is also handy for patching: we switch database servers at every release to make sure both servers are fully operational and completely up to date.
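The “n should never equal one” rule can be put in rough numbers. The following is only a minimal sketch: it assumes that device failures are independent (real outages often are not, for example when devices share power or a bad configuration), and the function name and the 99% figure are illustrative, not Xero or Rackspace measurements.

```python
def redundant_availability(single: float, copies: int) -> float:
    """Availability of a layer that stays up while at least one copy is up.

    Assumes each copy fails independently, which is an idealisation.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be a probability between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The layer is down only when every copy is down at the same time.
    return 1.0 - (1.0 - single) ** copies


# Illustrative only: one device at 99% versus a redundant pair.
print(redundant_availability(0.99, 1))
print(redundant_availability(0.99, 2))
```

Under these assumptions a second device turns a 1-in-100 chance of being down into roughly 1-in-10,000, which is why the second unit is usually worth buying.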

Backup/restore

Obviously you have to back up. The first stage of any business continuity plan is to back up. But it’s actually the restore that’s important. Have you ever run a fire drill on your backups? Do it right now – take your most recent backup and restore it to another server. Did it work? Is it corrupted? How long did it take? How much data did you lose? We run weekly full backups, nightly incrementals and transaction log backups of our database every 10 minutes. We run all these straight to disk (to a different RAID 10 SAN device than the transactional data) so that we can very quickly recover your data if we need to. We also run the backups to tape for storage. And we restore at least once a month to a server running in a separate zone at Rackspace.
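To make the schedule concrete, here is a rough sketch of how a restore chain could be picked from weekly fulls, nightly incrementals and 10-minute log backups, and how much data would be lost at a given failure time. It is not our actual tooling: the `Backup` type and both functions are hypothetical, and it assumes each incremental builds on the previous one.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Backup(NamedTuple):
    kind: str           # "full", "incremental" or "log"
    taken_at: datetime


def restore_chain(backups: list[Backup], target: datetime) -> list[Backup]:
    """Return the backups to restore, in order, to get as close to target as possible."""
    usable = sorted((b for b in backups if b.taken_at <= target),
                    key=lambda b: b.taken_at)
    fulls = [b for b in usable if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup taken before the target time")
    full = fulls[-1]
    # Every incremental taken after the chosen full backup is needed.
    incrementals = [b for b in usable
                    if b.kind == "incremental" and b.taken_at > full.taken_at]
    last_base = incrementals[-1] if incrementals else full
    # Log backups are replayed on top of the last full or incremental.
    logs = [b for b in usable
            if b.kind == "log" and b.taken_at > last_base.taken_at]
    return [full, *incrementals, *logs]


def data_loss_window(chain: list[Backup], target: datetime) -> timedelta:
    """Time between the last restorable backup and the failure."""
    return target - chain[-1].taken_at
```

With log backups every 10 minutes, the loss window in this model is always under 10 minutes, provided the whole chain restores cleanly, which is exactly what the monthly fire drill checks.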

Virtualize

We run a mixture of virtualized and dedicated hardware. Our virtualization is essentially what’s known as a private cloud – our own set of virtualized servers running on our own dedicated hardware, with SAN stored VMs. This allows us to better manage our capacity planning and utilize the full resources of the hardware available to us. It also allows us to do cool stuff like move VMs around depending on utilization, back up entire VMs for easy recovery, provision new “servers” very quickly by adding pre-configured VMs, add more overall processing power with an additional hypervisor, and rescale each VM’s resources as needed. Having said that, our database servers and our main app servers still run on dedicated hardware. We take a horses-for-courses approach to virtualization: it’s important to remain pragmatic and react appropriately.

Are we done yet? No. Within the next month we’re provisioning an additional set of servers running in an entirely different data center (a project that’s been underway for the last few months). One of the tasks of the new data center is to act as a live offsite backup. We’ll be using log shipping over a secure site-to-site VPN to give us a live, replicated offsite backup of our databases. One of the problems with offsite backup is that it’s usually to tape, or to a simple filestore, and recovery can be very slow. Our approach not only provides us with a near real-time offsite backup of your data, but also greatly improves our ability to recover.
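The core rule of log shipping is that the standby replays transaction logs strictly in sequence, so a log that arrives out of order has to wait for the ones before it. A toy sketch of that rule, using a hypothetical function and made-up sequence numbers rather than anything from SQL Server itself:

```python
def apply_shipped_logs(last_applied: int, shipped: set[int]) -> int:
    """Replay shipped logs in strict sequence order, stopping at the first gap.

    Returns the sequence number of the last log the standby could apply.
    """
    next_seq = last_applied + 1
    while next_seq in shipped:
        next_seq += 1
    return next_seq - 1


# Logs 42 and 43 have arrived, 44 is still in transit, so 45 must wait.
print(apply_shipped_logs(41, {42, 43, 45}))
```

How far the standby trails the primary then depends on how often logs are cut and how quickly they cross the VPN.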

And finally: hire good people! We have an amazing dedicated operations team that has intimate knowledge of our applications, databases and our needs as a business, working with the team at Rackspace to provide 24/7 service, support and strategy. They’re constantly monitoring our production environment – understanding the performance of every part of the system, capacity planning new areas of the system, maintaining the security of the system and keeping it up as much as possible. The culture in the team is not one of avoiding failure – it’s about being the best, and continuing to make Xero better to deliver the best service possible to our customers.

 


 

6 comments

Andrew Haynes
3 September 2010 #

Reassuring and something that most of the competitors cannot provide because they don’t have the cash behind them.

Blackbox Deals
6 September 2010 #

A really interesting look at the inside of how Xero works. It’s definitely a reassuring read!

James
7 September 2010 #

Out of interest, who was the competitor that had the outage and data loss?

Richard Frances-Moore
8 September 2010 #

Thanks for the info, good to know how seriously you guys are taking this aspect of your product.

From the very beginning Xero have been a safe pair of hands with my data, and with them being a local company I’ve not worried too much. However, I would still like to be able to download a backup of my accounts, and I imagine international customers would feel this even more so.

Recently I have been experimenting with cloud based team management solutions and would love to implement Manymoon’s social productivity app as we already use google apps and it integrates well with those. However they only allow downloading of data on their enterprise level (expensive) accounts with a 24hr turnaround and their support has been awful (many days to respond, even on a paid for account). So they have a great product but not one I can trust when they are across the world and it’s hard to know how reliable they are. Unlike accounting we cannot afford to be without access to our management systems for even a day so even if Manymoon are indestructible our internet connection is not.

Might want to think about how someone across the world views the trustworthiness of a company that is going to hold their critical financial data but offers no way of letting it off their servers…

-RichardFM

Craig Walker
8 September 2010 #

@James I don’t want to get into a throwing stones situation – I’m sure you could Google it.

@Richard We’ve had this request quite a bit actually. You can currently export quite a lot of your data from Xero to back up how you see fit (see http://help.xero.com/#ImportExport for further details on what you can currently get).

However if you wanted it in one package (and including absolutely EVERYTHING) then the problem becomes format. We run a multi-tenanted data store – that means that we store everyone’s data in (conceptually) one database (with lots of controls in place to segment the data by customer to completely secure it). So we can’t give you a database backup as we would back it up. We would have to give you some kind of export file specific to your organisation. And then the problem becomes how useful that file is if it only imports into Xero? (And we would have to build that import in the first place)
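As a rough illustration of what an organisation-specific export from a multi-tenanted store could look like, here is a minimal sketch. The `organisation_id` column, the row layout and the function name are all assumptions made for the example, not Xero’s actual schema or export code.

```python
import csv
import io


def export_organisation(rows: list[dict], organisation_id: str,
                        fields: list[str]) -> str:
    """Return CSV text holding only the rows that belong to one organisation."""
    buffer = io.StringIO()
    # extrasaction="ignore" drops columns (such as the tenant key) not in fields.
    writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        if row["organisation_id"] == organisation_id:
            writer.writerow(row)
    return buffer.getvalue()


# Hypothetical shared table with two tenants in it.
shared_table = [
    {"organisation_id": "org-a", "invoice": "INV-1", "total": "100.00"},
    {"organisation_id": "org-b", "invoice": "INV-7", "total": "250.00"},
]
print(export_organisation(shared_table, "org-a", ["invoice", "total"]))
```

The filtering is the easy part. The harder question remains the one above: what a customer can usefully do with the resulting file.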

Those problems aside, it is something we’re adding to over time. Hopefully one day soon we’ll have enough CSV-based exports available that you can get everything to back up yourself. One idea we were toying with at one stage was to do these backups on your behalf to an independent cloud provider (maybe using Azure Storage or something). Even though we would give you access to these backups, it means that you’d need to trust more than just one cloud provider.

Craig

Trusting the Cloud
16 September 2010 #

[...] the outage, a number of vendors have written posts detailing their particular operational procedures with regards redundancy and backups, but this may [...]
