In April, we experienced three distinct incidents that resulted in significant impact and degraded availability for Codespaces and GitHub Packages.

April 1 7:07 AM UTC (duration 5 hours and 32 minutes)

Our monitors detected an increase in failures both to create new codespaces and to resume existing stopped codespaces in the US West region. We immediately updated the GitHub status page and began investigating.

On further investigation, we found that some of the secrets used by the Codespaces service had expired. Codespaces maintains warm pools of resources to shield our users from intermittent failures in our dependent services. In the US West region, however, these pools were depleted because of the expired secret. In this case, we did not have early enough warning that the pools were reaching their low thresholds, so we had no time to react before we ran out of capacity. While we worked to mitigate the incident, pools in other regions also drained because of the expired secret, and those regions began to see failures as well.
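As a rough illustration of the kind of low-watermark check that could have flagged pool depletion earlier, the sketch below classifies per-region warm-pool capacity into warning and paging alerts. The function name, regions, and thresholds are all hypothetical, not GitHub's actual monitoring configuration.

```python
# Hypothetical sketch of a warm-pool low-watermark check.
# Thresholds and region names are illustrative assumptions.

def check_pool_capacity(pool_sizes, low_watermark=10, critical_watermark=3):
    """Return per-region alerts when warm pools run low on resources."""
    alerts = {}
    for region, available in pool_sizes.items():
        if available <= critical_watermark:
            alerts[region] = "critical"   # page on-call: near depletion
        elif available <= low_watermark:
            alerts[region] = "warning"    # early signal: time to react
    return alerts

# Example: one region nearly depleted, another draining, a third healthy.
print(check_pool_capacity({"us-west": 2, "us-east": 8, "eu-west": 50}))
# {'us-west': 'critical', 'us-east': 'warning'}
```

The two-tier threshold is the point: the warning level buys reaction time well before the critical level, which is exactly the early warning that was missing during the incident.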

Only a limited number of GitHub engineers had access to rotate the secret, and communication gaps delayed the start of the rotation process. The expired secret was eventually rotated and deployed to all regions, and the service returned to full operation.

To prevent this failure pattern in the future, we now check for expiring secrets and have monitors that warn well in advance if they have not been rotated. We have also added monitors to alert us earlier as we approach resource-depletion thresholds. In addition, we have begun migrating the service to a mechanism that does not rely on secrets or require credential rotation.
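The secret-expiry check described above can be sketched as a simple scan over secrets with known expiration dates, alerting with a generous lead time so rotation can start long before anything fails. The secret names, dates, and 30-day lead window below are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

# Illustrative sketch: surface secrets that expire within a lead window,
# so rotation can start well before expiry. All data here is made up.

def expiring_secrets(secrets, now, lead=timedelta(days=30)):
    """Return names of secrets that expire within the lead window."""
    return [name for name, expires_at in secrets.items()
            if expires_at - now <= lead]

now = datetime(2022, 4, 1, tzinfo=timezone.utc)
secrets = {
    "codespaces-pool-cred": datetime(2022, 4, 10, tzinfo=timezone.utc),
    "registry-token": datetime(2023, 1, 1, tzinfo=timezone.utc),
}
print(expiring_secrets(secrets, now))  # ['codespaces-pool-cred']
```

Run periodically, a check like this turns a silent expiry into a routine, scheduled rotation.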

April 14 8:35 PM UTC (duration 4 hours and 53 minutes)

We are still investigating the contributing factors and will provide a more detailed update in May's availability report, which will be published the first Wednesday of June. We will also share more about our efforts to minimize the impact of similar incidents in the future.

April 25 8:59 AM UTC (duration 5 hours and 8 minutes)

During this incident, our alerting systems detected increased CPU usage on one of the GitHub Packages registry databases roughly one hour before customer impact began. The threshold for this alert was relatively low in severity and did not page anyone, so we did not investigate it immediately. CPU usage on the database continued to climb, causing the Packages registry to respond to requests with internal server errors, which ultimately resulted in customer impact. The increased load was traced to a large volume of the Create Manifest command being used in an unexpected way.
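The alerting gap described above is the difference between a dashboard-only warning and a paging alert. A minimal sketch of tiered severity, with hypothetical thresholds (the actual values are not in the report):

```python
# Sketch of tiered alert severity: a low non-paging threshold plus a
# paging threshold that fires before errors reach clients.
# The percentages are illustrative assumptions, not GitHub's config.

def classify_cpu_alert(cpu_percent, warn_at=60, page_at=80):
    """Map a CPU utilization reading to an alert action."""
    if cpu_percent >= page_at:
        return "page"    # wake the on-call before customer impact
    if cpu_percent >= warn_at:
        return "warn"    # visible in dashboards, no page
    return "ok"

print(classify_cpu_alert(55))  # ok
print(classify_cpu_alert(70))  # warn
print(classify_cpu_alert(85))  # page
```

The fix described later in this report amounts to adjusting both the thresholds and which tier actually pages, so the hour of early warning is not lost.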

The throttling configured at the database level was not sufficient to restrict this command, which caused an outage for anyone using the GitHub Packages registry. Users could not push or pull packages, nor could they access the packages user interface or the repository landing page.

After investigation, we found a performance issue triggered by the large volume of Create Manifest commands. To limit the impact and restore normal operation, we blocked the activity causing the problem. We are actively following up by improving package rate limits and fixing the performance issue we identified. We have also adjusted the alert thresholds and severities for the database so that we are paged on unexpected problems sooner (rather than after customer impact).
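Application-level rate limiting of an expensive command, as opposed to relying on database-level throttling, is commonly implemented with a token bucket. The sketch below is a generic illustration of that technique; the class, rate, and burst size are assumptions, not the actual Packages implementation.

```python
import time

# Generic token-bucket rate limiter: requests consume tokens, tokens
# refill at a fixed rate, bursts are capped at the bucket capacity.
# Rate and capacity values here are illustrative.

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=5, capacity=10)
results = [bucket.allow() for _ in range(12)]
# Roughly the first 10 rapid requests pass; the rest are throttled.
print(results.count(True), results.count(False))
```

Throttling at this layer lets the service reject an abusive burst of one command type while continuing to serve everything else, instead of letting the database saturate for all callers.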

During this incident, we also found that the repository homepage depends heavily on the Packages infrastructure. When the Packages registry is down, repository homepages that list packages also fail to load. We decoupled the package list from the repository homepage, but this required manual intervention during the outage. We are working on a fix that loosely couples the package list, so that a failure to load it does not take down the repository homepages that list packages.
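The loose coupling described above is a graceful-degradation pattern: treat the package list as optional page content and render the rest of the homepage even when that backend is unavailable. The function names below are hypothetical, and the forced timeout simulates the outage.

```python
# Sketch of graceful degradation: if the optional package list fails,
# render the page without it rather than failing the whole request.
# Function names are hypothetical; the exception simulates an outage.

def fetch_package_list(repo):
    raise TimeoutError("packages backend unavailable")  # simulated outage

def render_homepage(repo):
    page = {"repo": repo, "readme": "..."}
    try:
        page["packages"] = fetch_package_list(repo)
    except Exception:
        page["packages"] = None   # degrade: page still loads without packages
    return page

print(render_homepage("octocat/hello-world")["packages"])  # None
```

The key design choice is that the fallback is automatic, so no manual intervention is needed during a Packages outage.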

In summary

We will continue to keep you updated on the progress and investments we are making to ensure the reliability of our services. Please follow our status page for real-time updates. To learn more about what we're working on, check out the GitHub Engineering blog.

GitHub Availability Report: April 2022
