
Unexpected Outage (resolved)

Posted 6 years ago in Xero news by Duncan Ritchie

We are currently offline due to an issue within our application hosting infrastructure. We have identified the cause of the issue and are working to resolve this.

We will provide any updates, including expected time for the service to be restored, as soon as we have more information.

UPDATE: (Wednesday 14th Nov, 9:57pm GMT) – We have the service back online and we will provide an update once our internal debrief is complete.

UPDATE: (Wednesday 14th Nov, 10:38pm GMT) – A quick update with some further details on this outage. The initial cause was an issue within our application database server cluster: one of the active servers failed but was still showing as operating normally, which prevented failover to the redundant node. We will continue to investigate why the resiliency did not work as expected.

UPDATE: (Thursday 15th Nov, 9:30am GMT) – We had further instability as a result of a recurrence of this issue. The system is stable and operating as expected again. Our operations team are closely monitoring the situation.

We apologise for the inconvenience caused to any customers by this outage. We continue to make a very significant investment in our hosting platform and resulting service availability level, which we have maintained at 99.99% since we launched Xero in 2007. It frustrates us if we have any unscheduled downtime, but we work hard to learn whatever we can from these incidents and use this to further improve and strengthen our platform going forward.

UPDATE: (Thursday 15th Nov, 10:14pm GMT) – Yesterday we had issues with the stability of our platform, resulting in four customer-impacting outages totalling 47 minutes.
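To put the 47 minutes in context against the 99.99% availability figure quoted earlier, the arithmetic can be sketched as follows. This is an illustration only: it assumes availability is measured over a 365-day calendar year, which the post does not state, and the function names are hypothetical.

```python
# Assumption: a 365-day measurement window (the post does not say which window is used).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_budget_minutes(availability_pct: float,
                            period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Minutes of downtime allowed in the period at the given availability."""
    return period_minutes * (1 - availability_pct / 100)


def availability_pct(downtime_minutes: float,
                     period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Availability percentage implied by the given downtime in the period."""
    return 100 * (1 - downtime_minutes / period_minutes)


if __name__ == "__main__":
    budget = downtime_budget_minutes(99.99)
    print(f"99.99% over a year allows about {budget:.2f} minutes of downtime")
    print(f"47 minutes uses about {47 / budget:.0%} of that allowance")
```

Under that assumption, 99.99% allows about 52.56 minutes of downtime per year, so 47 minutes on its own would use most of a year's allowance.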

All of the outages related to our database layer, which is critical for the operation of the application. We have a resilient database environment with hot standby servers and automatic failover; however, yesterday this failover did not work as it has in the past or during regular testing.

Of the four events, two were the result of a server becoming unresponsive without the expected failover, and two were the result of excessive database load that did not require a failover.

Our investigations have focused on both why the database server became unresponsive and the reason that the resilience did not provide seamless failover.

  1. The database server became unresponsive due to unusual batch process load, which caused contention between the memory demands of the database and those of the operating system. We have identified the processes at fault and have made a first change to reduce the likelihood of this recurring. We have additional work underway to implement a permanent fix that prevents this issue from occurring.
  2. Our database layer relies on Windows Clustering for resilience; however, this did not automatically fail over as it should have. We have identified the reason this did not work as expected and verified the issue with the supplier.
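
The failure mode described above, where a server stops doing useful work while still reporting itself as healthy, can be illustrated with a small sketch. This is not how Windows Clustering or our platform works; it is a hypothetical monitor (the `NodeMonitor` class, its `probe` callback and the `max_failures` threshold are all assumptions) showing why a failover decision might rely on an active probe, such as a real query with a timeout, instead of trusting the node's own status alone.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class NodeMonitor:
    """Hypothetical monitor that decides failover from real probe results."""

    # probe() should attempt real work (for example a trivial query with a
    # timeout) and return True only if it completed successfully in time.
    probe: Callable[[], bool]
    max_failures: int = 3  # consecutive failed probes tolerated before failover
    failures: int = 0

    def should_fail_over(self, self_reported_healthy: bool) -> bool:
        """Return True when traffic should move to the standby node."""
        if self.probe():
            self.failures = 0
        else:
            self.failures += 1
        # A node that admits it is unhealthy triggers failover immediately.
        # A node that claims to be healthy is still failed over once the
        # active probe has failed max_failures times in a row.
        return (not self_reported_healthy) or self.failures >= self.max_failures


if __name__ == "__main__":
    # Simulated probe results: one success, then the node stops responding.
    results = iter([True, False, False, False])
    monitor = NodeMonitor(probe=lambda: next(results))
    for _ in range(4):
        print(monitor.should_fail_over(self_reported_healthy=True))
```

Requiring several consecutive failed probes is a common way to avoid failing over on a single slow response, at the cost of a slightly longer detection time.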

We are in the later stages of a project to migrate away from Windows Clustering as part of a wider project to improve the resilience of our platform.  We expect to make this change early next year.

Our operations team continuously analyse the platform and look for ways to improve reliability and performance. We treat any issues seriously and the team are very aware of the impact that system issues have on our customers.  While we maintain a very high uptime we will continue to work to eliminate risk wherever possible.


Robbie Dellow
November 15, 2012 at 11.29 am

Will be interested to hear more about why failover to the mirrored server(s) didn’t occur, when you are made aware of this. And if this is a redundancy issue across ‘the board.’
I assume your sys-admins have mobile/paging notifications when problems, such as this, occur?

Donna Richardson
November 15, 2012 at 12.44 pm

I love Xero but this morning and right now (1.30pm) I can not use it – when will it be back online please

Norman Vincent
November 15, 2012 at 12.58 pm

Still seems very slow – particularly retrieving published reports

Duncan Ritchie Xero
November 15, 2012 at 1.25 pm

@Robbie: The hosting environment is highly available and has automated failover in the event of a host failure. In this case the failed host was still reporting that it was fully operational which resulted in our operations team forcing a manual failover.
We continually look at options to improve the resilience of the platform and have a robust process to analyse why problems have occurred and how we can remove the risk.
Our operations team were aware of the issue within seconds of it occurring and they have the tools to be advised of issues 24×7.

Duncan Ritchie Xero
November 15, 2012 at 1.27 pm

@Donna and @Norman: The slowness you observed was resolved within a few minutes. It was unrelated to the outage, but we are investigating the cause.

robbie dellow
November 15, 2012 at 6.55 pm

Hi Duncan. Thanks for the reply, but even though your ops team were made aware of the issue within seconds, failover did not occur. Where I come from, working in investment banking, seconds of downtime on a critical trading server can spell hundreds of thousands of dollars lost.
Bottom line, it seems automatic redundancy didn’t kick in, and even if it did, the admins would still get advised, of course, so they can then fix the primary server. I hope you are still drilling down as to why this happened, because until then you may have to wonder about the automated redundancy setup on other servers. imho

Christian Holm
November 15, 2012 at 9.40 pm

99.99% availability? I assume that is not counting maintenance windows? Unavailable is unavailable for whatever reason from a user’s point of view.

Phil McKell
November 15, 2012 at 9.44 pm

Hi, I cannot logon to Xero – suggests the service is not back up and running as stated in your Company news blog, can you confirm if it is available to folk in Scotland?


Duncan Ritchie Xero
November 16, 2012 at 10.22 am

@Christian: We measure our availability based on unplanned downtime, as is consistent with the rest of the software industry. While we agree that this doesn’t fully represent the total time you cannot access the application, it is intended to cover only periods of unavailability that you are not notified about and therefore cannot plan around.

We are working on a number of initiatives to reduce the reasons we need to have planned downtime and have already made significant inroads. At this point less than 10% of our releases require an outage and in most cases we choose to take an outage to reduce the risk of unexpected customer impact.
