Last month, software tools vendor Atlassian suffered a major network outage that lasted two weeks and affected more than 400 of its over 200,000 customers. The outage took down several of its products, including Jira, Confluence, Atlassian Access, Opsgenie and Statuspage.
Although only a few customers were affected for the full two weeks, the outage was significant for the depth of the problems the company’s engineers uncovered and the lengths they had to go to in order to find and fix them.
The outage was the result of a series of unfortunate internal errors by Atlassian’s own staff, not of a cyberattack or malware. In the end, no customer lost more than a few minutes of data transactions, and most customers did not see any downtime.
What is interesting about the Atlassian outage is how badly the company handled its initial communication of the incident to customers, and how it eventually published a long blog post that goes into great detail about the circumstances.
It is rare for a vendor hit by such a massive and public outage to make the effort to carefully piece together what happened and why, and to provide a roadmap from which others can learn.
In the post, Atlassian describes its existing IT infrastructure in detail, points out the deficiencies in its disaster recovery program, explains how it will fix those shortcomings to prevent future outages, and lays out the timelines, workflows and ways in which it intends to improve its processes.
The document is candid, factual and full of important findings, and should be required reading for any engineering or network manager. Any business that depends on software can use it as a template for finding and fixing similar mistakes of its own, and as a discussion framework for honestly evaluating its own disaster recovery playbooks.
Lessons learned from the incident
The problems began when the company decided to delete a legacy application that was being made redundant by the purchase of functionally similar software. However, it made the mistake of assigning the job to two different teams with separate but related responsibilities. One team requested that the redundant application be deleted, but another was tasked with figuring out how to actually do it. That should have raised some red flags immediately.
The two teams did not use the same language and parameters, and as a result had immediate communication problems. For example, one team used the application ID to identify the software to be deleted, but the other team thought it was talking about the ID of the entire cloud instance where the applications were located.
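One way to guard against this kind of mix-up is to give each kind of identifier its own type, so that a deletion request cannot silently accept the wrong one. The following is a minimal Python sketch of that idea, not a description of Atlassian's systems; the names AppId, SiteId and request_app_deletion are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AppId:
    """Identifies a single application (hypothetical type)."""
    value: str


@dataclass(frozen=True)
class SiteId:
    """Identifies an entire cloud site hosting many applications (hypothetical type)."""
    value: str


def request_app_deletion(target: AppId) -> str:
    """Accept only application IDs; refuse anything else, including site IDs."""
    if not isinstance(target, AppId):
        raise TypeError(f"expected AppId, got {type(target).__name__}")
    return f"queued deletion of app {target.value}"
```

With this in place, passing a SiteId where an AppId is expected fails loudly at the point of the request instead of being interpreted as a different kind of ID further down the line.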
Lesson 1: Improve internal and external communication
The team that requests network changes and the team that actually implements them should be one and the same. If they are not, you need to put solid communication tools in place to make sure they are in sync, use the same language and follow precise procedures. Because of the miscommunication, Atlassian engineers did not realize the extent of their mistake for several days.
But communication between the teams was only part of the problem. When Atlassian analyzed its communications among its various managers and its customers, it found that it had posted details of the outage within a day on its own monitoring systems, but could not reach some of its customers directly because contact information was lost when the legacy sites were deleted, and other contact information was badly out of date.
In addition, the deleted data contained information that customers needed to complete a valid support request ticket. Getting around this problem required a group of developers to build and deploy a new support ticketing process. The company also admits that it should have reached out to customers earlier in the outage and not waited until it had a full picture of the scope of the recovery process.
This would have allowed customers to better plan around the incident, even without specific time frames. “We should have acknowledged our uncertainty in providing a site restoration date sooner and made ourselves available earlier for in-person discussions so that our customers could make plans accordingly. We should have been transparent about what we did know about the outage and what we didn’t know.”
Lesson 2: Protect customer data
Take care of your customer data: make sure it is current, accurate and backed up in multiple, separate places. Make sure your customer data can survive a disaster, and include specific checks for it in every playbook.
This raises another point about disaster recovery. During the April outage, Atlassian missed its recovery time objectives (obviously, given the weeks it took to restore the systems), but managed to meet its recovery point objectives, as it was able to restore data to just a few minutes before the actual outage. It also had no way of selecting a set of customer sites and restoring all their interconnected products from backups to a previous point in time in any automated way.
“Our site-level deletions that happened in April did not have runbooks that could be quickly automated for the scale of this event,” they wrote in their analysis. “We had the ability to recover a single site, but we had not built capabilities and processes for recovering a large batch of sites.”
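The gap between restoring one site and restoring a large batch can be illustrated with a small wrapper that applies a single-site restore to many sites and records failures instead of stopping at the first one. This is a minimal Python sketch under stated assumptions: restore_batch and the restore_site callable are hypothetical names, and a real runbook would also need ordering, concurrency and verification steps that are not shown here.

```python
from typing import Callable, Iterable


def restore_batch(site_ids: Iterable[str], restore_site: Callable[[str], None]) -> dict:
    """Apply a single-site restore to many sites, recording failures instead of stopping."""
    restored = []
    failed = {}
    for site_id in site_ids:
        try:
            restore_site(site_id)
            restored.append(site_id)
        except Exception as exc:  # keep going so one bad site does not block the rest
            failed[site_id] = str(exc)
    return {"restored": restored, "failed": failed}
```

The returned summary gives operators a list of sites to retry or investigate, which is the kind of information a batch runbook needs at every step.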
In the confessional blog post, they chart their previous large-scale incident management process. You can see that it has a lot of moving parts and was not up to the task of handling “the depth, expansiveness and duration of the April incident.”
Lesson 3: Test complex disaster recovery scenarios
Check and re-check your disaster recovery programs, playbooks and procedures to make sure they meet your various objectives. Make sure you test scenarios across all sizes of customer infrastructure. This means specifically accounting for and anticipating larger-scale incident response, and understanding the complex relationships of customers that use multiple products or depend on an interlocking series and sequence of your applications.
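Drill results are easier to compare across scenario sizes when each one is checked against the same recovery time objective (RTO) and recovery point objective (RPO). The following is a minimal Python sketch of such a check; DrillResult and missed_objectives are hypothetical names, and the target values in the usage note are illustrative assumptions rather than Atlassian's published figures.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class DrillResult:
    """Outcome of one disaster recovery drill (hypothetical structure)."""
    scenario: str
    time_to_restore: timedelta   # how long the recovery took
    data_loss_window: timedelta  # gap between the restored state and the failure


def missed_objectives(result: DrillResult, rto: timedelta, rpo: timedelta) -> list:
    """Return the names of the objectives this drill failed to meet."""
    missed = []
    if result.time_to_restore > rto:
        missed.append("RTO")
    if result.data_loss_window > rpo:
        missed.append("RPO")
    return missed
```

For example, a drill that took 14 days to restore but lost only 5 minutes of data, checked against an assumed RTO of 6 hours and RPO of 1 hour, would report a missed RTO and a met RPO, which mirrors the pattern described for the April outage.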
If you’re using automation, make sure your APIs are working properly and sending appropriate warning signals when they are not. This was one of the problems Atlassian had to debug on the fly as the outage dragged on for days.
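A simple form of this is a periodic check that probes each API and raises an alert whenever a probe fails or throws. The following is a minimal Python sketch, assuming each probe is a callable that returns True when healthy; check_endpoints, the probe names and the alert callback are all hypothetical.

```python
from typing import Callable, Dict, List


def check_endpoints(checks: Dict[str, Callable[[], bool]],
                    alert: Callable[[str], None]) -> List[str]:
    """Run each health probe and send an alert for every failure or exception."""
    unhealthy = []
    for name, probe in checks.items():
        try:
            ok = probe()
        except Exception as exc:  # a crashing probe counts as unhealthy
            ok = False
            detail = f"raised {exc!r}"
        else:
            detail = "returned unhealthy"
        if not ok:
            unhealthy.append(name)
            alert(f"{name}: {detail}")
    return unhealthy
```

Treating an exception in the probe as a failure matters here: automation that breaks silently is exactly the case where a warning is most needed.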
Lesson 4: Protect configuration data
Finally, there is the issue of how the data was deleted, which is what started the entire outage. Atlassian now realizes that immediately deleting data, especially for an entire site, should not be allowed. The company is moving to what it calls a “soft delete,” which does not dispose of data until the deletion has been vetted with defined system rollbacks and has passed through a number of safeguards.
Atlassian is establishing a “universal soft delete” policy across all its systems and creating a series of standards and internal reviews. Soft delete should be more than just an option. Do not delete any configuration data until you have tested the deletion throughout your infrastructure.
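The general idea of a soft delete can be sketched as a store that moves deleted items into a holding area, from which they can be restored, and only purges them after a retention period has passed. This is a minimal Python illustration and not Atlassian's implementation; SoftDeleteStore and its methods are hypothetical, and the time-based retention window is an assumed stand-in for the rollback checks and safeguards the article describes.

```python
from datetime import datetime, timedelta, timezone


class SoftDeleteStore:
    """In-memory store where delete is reversible until a retention period expires."""

    def __init__(self, retention: timedelta):
        self._live = {}
        self._trash = {}  # key -> (value, time of soft deletion)
        self._retention = retention

    def put(self, key, value):
        self._live[key] = value

    def get(self, key):
        return self._live.get(key)

    def delete(self, key, now=None):
        """Soft delete: hide the item but keep its data for possible rollback."""
        now = now or datetime.now(timezone.utc)
        self._trash[key] = (self._live.pop(key), now)

    def restore(self, key):
        """Roll back a soft delete."""
        value, _ = self._trash.pop(key)
        self._live[key] = value

    def purge_expired(self, now=None):
        """Permanently remove items whose retention period has passed."""
        now = now or datetime.now(timezone.utc)
        expired = [k for k, (_, t) in self._trash.items() if now - t >= self._retention]
        for key in expired:
            del self._trash[key]
        return expired
```

Because the permanent purge is a separate step from the delete, a mistaken deletion can be rolled back at any time during the retention window.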
Copyright © 2022 IDG Communications, Inc.