Last month, software tool vendor Atlassian suffered a major network outage that lasted two weeks and affected more than 400 of its more than 200,000 customers. The outage took down several of its products, including Jira, Confluence, Atlassian Access, Opsgenie and Statuspage.

Although only a few customers were affected for the full two weeks, the outage was significant in terms of the depth of the problems uncovered by the company’s engineers and the lengths they had to go to in order to find and fix them.

The outage was the result of a series of nasty internal errors by Atlassian’s own staff, not a cyberattack or malware. In the end, no customer lost more than a few minutes of data transactions, and most customers did not notice any downtime.

What is interesting about the whole Atlassian outage is how badly the company handled informing its customers about the incident, and then how it eventually published a long blog post that goes into great detail about the circumstances.

It is rare for a vendor hit by such a massive and public outage to make the effort to carefully piece together what happened and why, and to provide a roadmap from which others can learn.

In the post, the company carefully describes its existing IT infrastructure, points out the shortcomings in its disaster recovery program, explains how it will correct them to prevent future outages, and lays out the timelines, workflows and ways in which it intends to improve its processes.

Copyright © 2022 IDG Communications, Inc.
