We are currently experiencing an issue within our application hosting infrastructure that is affecting a proportion of Xero customers. Xero’s operations team is working to identify and resolve the cause of the issue as soon as possible.
We will provide any updates, including expected time for the service to be restored, as soon as we have more information.
UPDATE: (Wednesday 17th July, 8:51pm GMT) – Operations have identified the cause of the partial outage and are working to resolve it as soon as possible.
UPDATE: (Wednesday 17th July, 8:59pm GMT) – The Operations team has resolved the cause of this partial outage and is closely monitoring the situation for any recurrence.
UPDATE (Friday 19th July, 2:35am GMT):
The outage was related to a self-contained component of Xero’s database stack. Although this component is critical to the operation of Xero, the impact was limited to 20% of our customers. We have a resilient database environment with hot standby servers and automatic failover. In this particular instance, failover did not occur due to a highly unusual chain of events.
Our investigations have focused on three areas: why the database server became unresponsive, why our resilience measures did not provide seamless failover when one node became unresponsive, and the time to resolution.
- App server access to the database server became unresponsive due to a momentary interruption to the TCP stack of the database server, during which app server connections were queued and eventually overwhelmed the database server’s TCP connectivity. There was no impact on the stability of the database engine itself, and no customer data was lost. We have identified the processes at fault and are making process changes to prevent this from recurring.
- The database itself did not trip an automated failover to the standby database hosts because the availability monitoring determined that the database engine was still operating correctly, albeit inaccessible from our app servers. Xero Operations is continuing to investigate how we can better manage automated failover in a similar error state.
- Our monitoring systems did not detect the failed state of app server connections on the active database node, and so did not alert the Operations team automatically. We have identified why this particular failure state was not detected as expected and are adapting our monitoring systems so that they promptly alert our Operations team.
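To illustrate the gap described in the points above, here is a minimal sketch in Python of a health check that considers client-side reachability as well as engine health. It is not Xero's actual monitoring or failover logic. The names `NodeHealth`, `probe_tcp` and `should_fail_over`, and the threshold of three consecutive failed probes, are assumptions made purely for illustration.

```python
import socket
from dataclasses import dataclass
from typing import Sequence


@dataclass
class NodeHealth:
    """One monitoring sample for a database node (hypothetical structure)."""
    engine_ok: bool         # engine-level check passed, e.g. a local query succeeded
    client_reachable: bool  # a TCP connection from the app-server side succeeded


def probe_tcp(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def should_fail_over(recent: Sequence[NodeHealth], max_unreachable: int = 3) -> bool:
    """Decide whether to fail over, given samples ordered oldest to newest.

    Fail over if the latest engine check failed, or if the last
    `max_unreachable` samples were all unreachable from the client side
    even though the engine itself looked healthy.
    """
    if not recent:
        return False
    if not recent[-1].engine_ok:
        return True
    tail = recent[-max_unreachable:]
    return len(tail) == max_unreachable and not any(h.client_reachable for h in tail)
```

In this sketch, a node whose engine reports healthy but which refuses three consecutive client-side probes is treated as failed, which is the state that the engine-only availability check described above did not catch. Requiring several consecutive failures is one way to avoid failing over on a single transient network blip; the right threshold would depend on the environment.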
Our operations team continuously analyses the platform and looks for ways to improve reliability and performance. We treat any issues seriously, and the team is very aware of the impact that system issues have on our customers. While we maintain very high uptime, we will continue to work to eliminate risk wherever possible.