Unexpected outage for some users. (resolved)

We are currently experiencing an issue within our application hosting infrastructure that is affecting a proportion of Xero customers. Xero’s operations team is working to identify and resolve the cause of the issue as soon as possible.

We will provide any updates, including expected time for the service to be restored, as soon as we have more information.

UPDATE: (Wednesday 17th July, 8:51pm GMT) – Operations have identified the cause of the partial outage and are working to resolve it as soon as possible.

UPDATE: (Wednesday 17th July, 8:59pm GMT) – The Operations team has resolved the cause of this partial outage and is closely monitoring the situation for any recurrence.

UPDATE (Friday 19th July, 2:35am GMT):

The outage was related to a self-contained component of Xero’s database stack. While this component is critical for the operation of Xero, the impact was limited to 20% of our customers. We have a resilient database environment with hot standby servers and automatic failover. In this particular instance, failover did not occur due to a highly unusual chain of events.

Our investigations have focused on three questions: why the database server became unresponsive, why the resilience measures did not provide seamless failover when one node became unresponsive, and why resolution took as long as it did.

  1. App server access to the database server became unresponsive due to a momentary interruption to the TCP stack of the database server, during which app server connections became queued and eventually overwhelmed the database server’s TCP connectivity. There was no impact on the stability of the database engine itself, and no customer data was lost. We have identified the processes at fault and are undertaking process changes to reduce the likelihood of this recurring.
  2. The database itself did not trip an automated failover to the standby database hosts because the availability monitoring determined that the database engine was still operating correctly, albeit inaccessible from our app servers. Xero Operations is continuing to investigate how we can better manage automated failover in a similar error state.
  3. Our monitoring systems failed to detect the failed state of app server connections on the active database node and did not alert the Operations team automatically. We have identified the reason this particular failure state was not detected as expected and are adapting our monitoring systems to promptly alert our Operations team.
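
The gap described in points 2 and 3 is that the health check looked at the database engine rather than at what the app servers could actually reach. As an illustration only, and not a description of Xero’s monitoring, a client-side probe could be sketched in Python as follows. The names `can_connect` and `ClientSideProbe`, the timeout, and the failure threshold are all assumptions made for this example.

```python
import socket
from dataclasses import dataclass


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port completes within timeout.

    Run this from an app server so it measures what the app server sees,
    not what the database host reports about itself.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections and timeouts
        return False


@dataclass
class ClientSideProbe:
    """Tracks consecutive failed probes (hypothetical helper for this sketch)."""

    failure_threshold: int = 3  # assumed value, tune to taste
    consecutive_failures: int = 0

    def record(self, reachable: bool) -> bool:
        """Record one probe result; return True when an alert should fire."""
        if reachable:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold
```

A scheduler would call `probe.record(can_connect(db_host, db_port))` at a fixed interval and page the Operations team, or consider failover, when it returns `True`. A bare TCP connect can still succeed while connections sit queued, so a fuller probe would probably also run a trivial query with a short deadline.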

Our operations team continuously analyse the platform and look for ways to improve reliability and performance. We treat any issues seriously and the team are very aware of the impact that system issues have on our customers. While we maintain a very high uptime, we will continue to work to eliminate risk wherever possible.

 


8 comments

Jeremy
18 July 2013 #

When can we expect this to be resolved? I have a client who needs to be in their Xero to invoice this morning

Robbie Dellow
18 July 2013 #

What was the infrastructure outage caused by? I come from an infra background so would be curious as to why no alerts were raised and no redundancy kicked in.

Paul Rushworth
18 July 2013 #

@Jeremy – The issue has been resolved. If you are still experiencing issues logging in to Xero, please contact our support team.

@Robbie Dellow – Operations are still working on identifying the root cause of the partial outage and will update this post with further clarification upon our finding.

Kelvin Hartnall
20 July 2013 #

Thanks for the update and analysis. Any system will have some level of outage, but really appreciate the openness and transparency. And sounds like you have performed some good analysis and have found things to improve in the system.

Gerry Scullion
21 July 2013 #

Looks like the problem has reared its head again. Really disappointed as I’m logging in specifically to try and resolve a problem that was caused by your system duplicating entries!

This is the last FY that I’ll be using Xero.

Paul Rushworth
22 July 2013 #

@Gerry Scullion – The issue has been resolved and has not recurred. If you are still experiencing issues logging in to Xero, please contact our support team.

Robbie Dellow
22 July 2013 #

Hi Paul – assume you are still going to fill me in on the root cause of the ‘partial’ outage.

Paul Rushworth
22 July 2013 #

@Robbie Dellow – Hi, please see the body of the blog post; we updated it with our analysis on Friday afternoon NZT.
