GitHub Actions became generally available on GitHub Enterprise Server (GHES) with the 3.0 release about two years ago. Since then, we’ve made many performance improvements to the product that have reduced GitHub Actions’ CPU consumption on the server and allowed us to run more GitHub Actions jobs concurrently. By the numbers, on 96-core machines, the maximum number of concurrent jobs went from 2,200 on GHES 3.2 to 7,000 on GHES 3.6 (the current version), a 3x performance improvement.
Here are some of the more interesting improvements we’ve made to reach that goal and the lessons we’ve learned along the way.
Fix 1: Just make sure the cache is working
One fine day we realized that our hottest code path, the one we use to access workflow secrets and callback URLs, was not using a cache, even though we assumed it was. This came as a surprise to our team: we have extensive monitors for every new cache we add to the product, but this particular cache was something we thought we had enabled years ago. Problems like this are hard to catch. We only caught this one by analyzing profile traces collected during load testing and in production. After a simple change to enable the cache, CPU usage dropped sharply, resulting in faster workflow execution and increased throughput. As it turns out, sometimes you don’t have to dig too deep to find a big performance win.
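One way to make a silently disabled cache visible is to count hits and misses on the cache itself. The following is a minimal sketch, not the actual GitHub Actions code: `InstrumentedCache`, `load_callback_url`, and the `enabled` flag are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class InstrumentedCache:
    """Read-through cache that counts hits and misses so a bypass shows up."""
    loader: Callable[[str], str]   # hypothetical slow lookup, e.g. a database read
    enabled: bool = True           # the switch that is easy to assume is on
    hits: int = 0
    misses: int = 0
    _store: Dict[str, str] = field(default_factory=dict)

    def get(self, key: str) -> str:
        if self.enabled and key in self._store:
            self.hits += 1
            return self._store[key]
        # Either the cache is disabled or the key has not been loaded yet.
        self.misses += 1
        value = self.loader(key)
        if self.enabled:
            self._store[key] = value
        return value

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def load_callback_url(key: str) -> str:
    # Stand-in for the expensive lookup; not a real GitHub API.
    return f"https://example.invalid/callback/{key}"


def simulate(enabled: bool, requests: int = 1000) -> float:
    """Request ten distinct keys over and over and report the hit rate."""
    cache = InstrumentedCache(loader=load_callback_url, enabled=enabled)
    for i in range(requests):
        cache.get(f"job-{i % 10}")
    return cache.hit_rate()
```

With the cache disabled, `simulate(False)` reports a hit rate of zero no matter how often the same keys are requested. That is the kind of signal a monitor or load test can alert on.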
Fix 2: Orchestration framework improvement
How it worked before
“Orchestration” is what GitHub Actions uses to execute workflows. At a high level, this is a durable state machine that makes the workflow resilient to machine shutdowns and intermittent failures. To achieve durability, each time the orchestrator wakes up, it replays the execution from the beginning to rebuild the local state until either the code terminates or new work is encountered.
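The replay model can be sketched in a few lines. This is an illustration under stated assumptions, not the orchestration framework itself: the three-step `workflow`, the `replay` helper, and the `(activity, result)` history format are all hypothetical.

```python
from typing import Generator, List, Optional, Tuple


def workflow() -> Generator[str, str, str]:
    # Hypothetical workflow: each yield requests one activity and receives
    # that activity's result when the orchestrator is resumed.
    checkout = yield "checkout"
    build = yield "build"
    test = yield "test"
    return f"{checkout},{build},{test}"


def replay(history: List[Tuple[str, str]]) -> Tuple[Optional[str], Optional[str]]:
    """Replay from the beginning using recorded results.

    Returns (pending_activity, final_result); exactly one is not None.
    """
    run = workflow()
    try:
        request = next(run)  # first activity the code asks for
        for recorded_name, recorded_result in history:
            if recorded_name != request:
                raise RuntimeError("history does not match the workflow code")
            # Feed the stored result back in; the activity is not re-executed.
            request = run.send(recorded_result)
        return request, None       # new work encountered
    except StopIteration as done:
        return None, done.value    # the code terminated


def run_to_completion() -> str:
    history: List[Tuple[str, str]] = []
    while True:
        pending, result = replay(history)  # every wake-up replays from the start
        if pending is None:
            return result
        # Pretend the activity ran and persist its result as a new event.
        history.append((pending, f"{pending}-done"))
```

Because the history is the only durable state, the process can stop between any two wake-ups and resume later. The cost is that each wake-up reprocesses the full history, which is why how that history is stored and loaded matters so much.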
We store the orchestration state in a database table. The problem we had is that we were writing state to a single column in the database as one big blob of events.
CREATE TABLE tbl_OrchestrationSession (
    SessionId BIGINT NOT NULL,
    CreatedOn DATETIME NOT NULL DEFAULT GETUTCDATE(),
    LastUpdatedOn DATETIME NOT NULL DEFAULT GETUTCDATE(),
    ...
    State VARBINARY(MAX) NULL, -- this is where we store execution state
)
When we updated the state for a running orchestration, we read the entire blob into memory, appended the new events to the end, and wrote it back to the database. This added unnecessary overhead: every save deleted a growing blob and then committed a slightly larger (but almost identical) value, over and over again. We also had to read and deserialize the whole blob every time we replayed the state.
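To see why rewriting the whole blob gets expensive, compare the total bytes written over the life of one orchestration. The sketch below assumes, purely for illustration, that one fixed-size event is saved per update. The function names and the event size are hypothetical, not measurements.

```python
def bytes_written_blob(event_count: int, event_size: int) -> int:
    """Total bytes written when every save rewrites the whole blob."""
    total = 0
    blob = 0
    for _ in range(event_count):
        blob += event_size   # the blob grows by one event...
        total += blob        # ...and the entire blob is written back
    return total


def bytes_written_incremental(event_count: int, event_size: int) -> int:
    """Total bytes written when every save writes only the new event."""
    return event_count * event_size


def write_amplification(event_count: int, event_size: int) -> float:
    """How many times more bytes the blob approach writes."""
    return bytes_written_blob(event_count, event_size) / bytes_written_incremental(
        event_count, event_size
    )
```

Under these assumptions the blob approach writes `event_size * n * (n + 1) / 2` bytes for `n` events, so the cost grows quadratically with the length of the orchestration. Writing only the new events grows linearly.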
What we did instead
We adopted a new version of orchestration that supports both incremental reads and incremental writes to the database. The state history now lives in its own table instead of an inline binary blob. When we update the orchestration state, only the new events are written. It also allows us to do interesting things like caching, where we can skip fetching all historical events and fetch only the pending events from the database. This avoids the overhead of replaying the full history, so long-running orchestrations of workflows with many steps are less of a problem.
-- new table to store execution state
CREATE TABLE tbl_OrchestrationSessionEvent (
    ...
    SessionId BIGINT NOT NULL,
    EventId BINARY(20) NOT NULL,
    EventData VARBINARY(MAX) NOT NULL
)
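Here is a minimal sketch of incremental writes plus cached incremental reads, using Python's built-in `sqlite3` module in place of the real database. The `session_event` table, the `seq` ordering column, and the `EventStore` class are assumptions made for illustration. The schema above elides some of its columns, and this sketch does not try to guess them.

```python
import sqlite3
from typing import Dict, List, Tuple


class EventStore:
    def __init__(self) -> None:
        self.db = sqlite3.connect(":memory:")
        # seq is a hypothetical per-session ordering column.
        self.db.execute(
            "CREATE TABLE session_event ("
            " session_id INTEGER NOT NULL,"
            " seq INTEGER NOT NULL,"
            " event_data BLOB NOT NULL,"
            " PRIMARY KEY (session_id, seq))"
        )
        # In-memory cache of events already fetched, per session.
        self._cache: Dict[int, List[bytes]] = {}

    def append(self, session_id: int, events: List[bytes]) -> None:
        """Write only the new events; existing rows are never rewritten."""
        (count,) = self.db.execute(
            "SELECT COUNT(*) FROM session_event WHERE session_id = ?",
            (session_id,),
        ).fetchone()
        self.db.executemany(
            "INSERT INTO session_event (session_id, seq, event_data)"
            " VALUES (?, ?, ?)",
            [(session_id, count + i, data) for i, data in enumerate(events)],
        )

    def load(self, session_id: int) -> Tuple[List[bytes], int]:
        """Return the full history and the number of rows actually fetched."""
        cached = self._cache.setdefault(session_id, [])
        rows = self.db.execute(
            "SELECT event_data FROM session_event"
            " WHERE session_id = ? AND seq >= ? ORDER BY seq",
            (session_id, len(cached)),
        ).fetchall()
        cached.extend(row[0] for row in rows)
        return list(cached), len(rows)
```

After the first `load`, later loads for the same session fetch only the rows appended since then. This is the effect the caching described above relies on.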
On GitHub.com, we’ve seen CPU consumption for running orchestrations drop by an average of 50%, with longer-running orchestrations seeing more benefit. Before this change, we hadn’t invested much in the orchestration platform we depended on. The result demonstrated the importance of constantly reevaluating our approaches and core platforms as we grow and evolve.
Fix 3: Reduce load on postbacks
What is a postback?
As a workflow run progresses, you can see the updates as checks in the UI and API. Execution state is maintained by the GitHub Actions backend service (in orchestration), but the checks themselves are stored in the Rails monolith. Simply put, a “postback” is the service-to-service call that pushes the latest execution state to checks. Postbacks are generated while the orchestrator executes a workflow run. The backend maintains an internal queue of postbacks to send to the frontend, using a separate orchestration per workflow run so that delivery is reliable and not interrupted by a service outage.
During our load testing, we found that delivering a postback was one of the slower activities, averaging around 250-300ms to complete. We also found that the backend sends one postback for each update of a check step and three postbacks with almost exactly the same payload when the check completes. A large number of slow postbacks consumes a lot of system resources and can stall other activities, resulting in overall slowness. This was especially worrisome for large matrix workflows.
What we did instead
We evaluated the usefulness of each postback sent by the backend. We found that the statuses of check steps were only displayed in one particular UI. We decided to stop sending them while the workflow is executing and to publish step data only after the execution is complete. Not having step data available for in-progress runs meant that the initial navigation to an in-progress run could be slower than to a completed run because of client-side rendering overhead, but this was a trade-off we were willing to make. Of course, we also removed the duplicate postbacks sent when a check completes. Both of these changes shipped with GHES 3.3, which allowed GitHub Actions to run nearly 2x more jobs concurrently than GHES 3.2.
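The two reductions can be sketched as a filter over a queue of pending postbacks. This is an illustration only: the `Postback` shape, the `kind` values, and deduplication by check ID are assumptions, not the real message format.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Postback:
    check_id: int
    kind: str      # hypothetical values: "step_update" or "check_completed"
    payload: str


def reduce_postbacks(queue: List[Postback]) -> List[Postback]:
    """Drop in-progress step updates and duplicate completion postbacks."""
    completed: Set[int] = set()
    kept: List[Postback] = []
    for postback in queue:
        if postback.kind == "step_update":
            # Step data is published once, after the execution is complete.
            continue
        if postback.kind == "check_completed":
            if postback.check_id in completed:
                continue  # near-identical duplicate for the same check
            completed.add(postback.check_id)
        kept.append(postback)
    return kept


def example_queue() -> List[Postback]:
    """One check with five step updates and three completion postbacks."""
    steps = [Postback(1, "step_update", f"step {i}") for i in range(5)]
    done = [Postback(1, "check_completed", "success") for _ in range(3)]
    return steps + done
```

For the example queue, eight pending postbacks shrink to one. At roughly 250-300ms per delivery, that difference adds up quickly across a large matrix workflow.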
As for the slowness of each individual postback call, each one is slow because it was being sent via HTTP calls to four different services, with each service manually handling retries, timeouts, and so on. We are actively working on switching our postback delivery to a faster and simpler system using a message queue. The goal is to roll out the change over the next few months, hopefully improving performance even further.
Of course, there are other improvements we made that didn’t make it into this post (it’s already long enough). Ultimately, GitHub Actions can now run three times as many jobs concurrently while using fewer system resources. And while this kind of work satisfies us as engineers (3x is a big improvement!), we also know it has a real impact on our customers who rely on GitHub Actions to get their work done and deliver code to production. We’ve learned that it’s always worth revisiting the basics of long-standing projects, especially as they scale. Going forward, we aim to keep improving our load testing automation so that we catch problems like the ones described above before they reach customers, and to continue optimizing performance across the GitHub platform.
How we tripled max concurrent jobs to boost performance of GitHub Actions