Although it is less than three years old, SARS-CoV-2, the virus that causes COVID, is perhaps the most studied and most genetically sequenced pathogen in history. Disease surveillance teams around the world have uploaded millions of viral sequences to public databases, allowing researchers to track how the virus is spreading.
A new computational model has mined this unprecedented amount of data – more than 6.4 million SARS-CoV-2 sequences – to find patterns among the mutations that help a new viral strain spread around the world. The model, called PyR0, analyzed how different viral lineages emerged and spread between December 2019 and January 2022. From these data, it learned to identify combinations of mutations and to estimate how long variants such as Delta or Omicron take to become predominant. The model, described by a team of researchers in Science in May, could give public health programs advance notice of which lineages are potentially dangerous and allow officials to plan ahead.
PyR0 used data up to mid-December 2021 to correctly predict that Omicron’s BA.2 subvariant, which was rare in much of the world at the time, would soon spread rapidly. By March 2022 BA.2 had become the dominant strain worldwide. Had the model been running in November 2020, it would also have correctly predicted that the Alpha variant would soon become dominant; the World Health Organization designated Alpha a variant of concern by December of that year.
Most COVID vaccines target the virus’s spike protein, which it uses to enter cells. Mutations in this protein appear to allow certain variants to escape the immune response that the body mounts after vaccination or previous infection. The PyR0 model found that the mere presence of multiple mutations in the spike protein does not necessarily make a strain more evolutionarily fit. But several specific spike mutations helped the Omicron BA.1 and BA.2 subvariants evade the immune system in late 2021.
PyR0 also found that a set of non-spike mutations in the BA.2 genome that affect how the virus replicates may contribute to its rapid spread. The model’s ability to quickly analyze entire genomes, the researchers say, could help scientists decide which regions of the virus’s genome to study when developing future therapies.
Scientific American spoke with study co-author Jacob Lemieux, an infectious disease researcher at the Massachusetts Institute of Technology and Harvard University and a physician at Massachusetts General Hospital in Boston, about how algorithms that “learn” from large datasets can predict the future of the pandemic.
[An edited transcript of the interview follows.]
What can PyR0 tell us about the next dominant variants?
We cannot necessarily say what will happen next in terms of mutations. What we can say is which lineages are most likely to increase in frequency.
To put it another way, if one car is traveling at 70 miles per hour and another at 35 miles per hour, we can predict that, over a certain period of time, the 70-mile-per-hour car will catch up with and overtake the other car. But these predictions are only good for the near term, because the way the pandemic works is that suddenly a car going 210 miles per hour comes out of nowhere and completely changes the dynamics.
The amazing thing is that this happens again and again. First it was the D614G variant, then Alpha, then Delta, then Omicron; now it is Omicron BA.2 and its close cousins BA.4 and BA.5. So this kind of dynamic seems to be a recurring feature of the pandemic.
But the things that allow the cars to go fast – the properties that confer a fitness advantage – seem to have changed over time. In particular, Omicron appears to be highly immune evasive, especially in how it avoids the human antibody response. This property is becoming increasingly important for the virus, which makes sense because so many people have had COVID, been vaccinated, or both.
It seems that this growing immune evasion has been evolving steadily throughout the pandemic and has now really reached its full expression. This is not the first study to show that, but it demonstrates it systematically. And it seems likely that such immune escape will continue to be part of what makes a lineage grow. Within the context of this study, we cannot predict which mutations will arise in the future and confer additional immune escape.
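The car analogy in this answer corresponds to a simple two-lineage competition: if one lineage grows faster than another by a fixed amount per day, its share of sequences follows a logistic curve. The Python sketch below illustrates only that arithmetic. It is not the PyR0 model, it assumes plain exponential growth for both lineages, and the starting share and growth advantage are made-up numbers rather than values from the study.

```python
import math

def share_after(initial_share, growth_advantage_per_day, days):
    """Share of the faster lineage after `days`, competing with one slower lineage.

    Assumes both lineages grow exponentially, so the log-odds of the faster
    lineage's share rise linearly at `growth_advantage_per_day`.
    """
    log_odds = math.log(initial_share / (1.0 - initial_share))
    log_odds += growth_advantage_per_day * days
    return 1.0 / (1.0 + math.exp(-log_odds))

def days_until_majority(initial_share, growth_advantage_per_day):
    """Days until the faster lineage passes a 50% share (inf if it never does)."""
    if initial_share >= 0.5:
        return 0.0
    if growth_advantage_per_day <= 0:
        return math.inf
    log_odds = math.log(initial_share / (1.0 - initial_share))
    return -log_odds / growth_advantage_per_day

# Illustrative numbers only (hypothetical, not from the study):
start_share = 0.01   # the faster lineage starts at 1% of sequences
advantage = 0.10     # difference in exponential growth rate, per day
crossover = days_until_majority(start_share, advantage)
print(f"Majority after about {crossover:.0f} days")
```

With these made-up inputs the faster lineage passes 50% after roughly 46 days. As the answer notes, such an extrapolation holds only until a new, even faster lineage appears.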
How does your model help predict and track new variants?
What we model is how different combinations of mutations in different lineages affect the growth rate of individual viral variants in a population. [Editor’s note: A lineage is a group of variants with a common ancestor.] Since each new lineage has a constellation of mutations – some of which we have seen before in other lineages – we can begin to ask, “Which mutations are driving this?”
We model this question in many different regions of the world and then essentially pool the information in one model. We can do this because people around the world sequence the virus and label the sequences with the date and region of collection. So we know, in different regions, which lineages are increasing in frequency relative to the others. This information is incredibly valuable – we could not have built our model without it.
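The idea of tying a lineage's growth rate to the mutations it carries can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: it assumes that a lineage's growth rate is the sum of per-mutation effects and that lineage shares within one region follow a softmax (multinomial logistic) curve over time. All mutation names, lineage names, and numbers are hypothetical, and the sketch omits both the fitting step and the pooling of information across regions.

```python
import math

# Hypothetical per-mutation effects on growth rate (per day). In a real model
# these would be estimated from data; here they are made-up values.
MUTATION_EFFECT = {"mut_A": 0.03, "mut_B": 0.05, "mut_C": -0.01}

# Hypothetical lineages, each defined by the mutations it carries.
LINEAGE_MUTATIONS = {
    "lineage_1": [],
    "lineage_2": ["mut_A"],
    "lineage_3": ["mut_A", "mut_B", "mut_C"],
}

def lineage_growth_rate(lineage):
    """Growth rate of a lineage, taken as the sum of its mutations' effects."""
    return sum(MUTATION_EFFECT[m] for m in LINEAGE_MUTATIONS[lineage])

def predicted_shares(initial_shares, days):
    """Softmax of log(initial share) + growth rate * days, over all lineages."""
    scores = {
        name: math.log(share) + lineage_growth_rate(name) * days
        for name, share in initial_shares.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    weights = {name: math.exp(score - top) for name, score in scores.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}

# Made-up starting shares for a single region.
region_start = {"lineage_1": 0.90, "lineage_2": 0.09, "lineage_3": 0.01}
for day in (0, 30, 60, 90):
    shares = predicted_shares(region_start, day)
    print(day, {name: round(share, 3) for name, share in shares.items()})
```

Because a mutation's effect is shared by every lineage that carries it, evidence about that mutation gathered from older lineages also informs the estimate for a newly observed lineage.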
The real computational challenge is actually fitting this model to the data. The study’s lead author, Fritz Obermeyer, came to the Broad Institute from Uber AI, where researchers had developed a programming language and software framework that uses machine learning to model probabilities and apply them to large datasets. It was really amazing to be able to apply these methods to data on a scale that we have never had before.
We are trying to improve the model, and we have a new version. In essence, we believe that successful lineages are driven by a small number of mutations, and the others are simply along for the ride. A related challenge is trying to study the genetic, or statistical, interactions between mutations. Maybe Mutation 1 makes the virus more fit; maybe Mutation 2 makes it more fit. But maybe the combination of 1 and 2 together actually makes it less fit. These kinds of interactions are really difficult to deal with because the number of possible combinations grows so fast.
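The interaction problem described in this answer can be made concrete with a small Python sketch. It shows two things under stated assumptions: a toy fitness function in which two individually helpful mutations are harmful together, and a count of how many pairs and triples of mutations would need their own interaction term. The effect sizes and mutation counts are hypothetical and are not taken from the study.

```python
import math

def interaction_terms(num_mutations, order):
    """Number of distinct groups of `order` mutations among `num_mutations`."""
    return math.comb(num_mutations, order)

# Hypothetical per-mutation effects and one pairwise interaction (made-up values).
effect = {"mutation_1": 0.04, "mutation_2": 0.03}
pair_interaction = {("mutation_1", "mutation_2"): -0.10}

def fitness(mutations):
    """Additive effects plus any pairwise interaction terms that apply."""
    present = sorted(mutations)
    total = sum(effect[m] for m in present)
    for i, first in enumerate(present):
        for second in present[i + 1:]:
            total += pair_interaction.get((first, second), 0.0)
    return total

print(fitness({"mutation_1"}), fitness({"mutation_2"}))  # each helps alone
print(fitness({"mutation_1", "mutation_2"}))             # together they hurt

# How quickly the number of candidate interactions grows (illustrative sizes).
for n in (10, 100, 1000):
    print(n, interaction_terms(n, 2), interaction_terms(n, 3))
```

With 1,000 candidate mutations there are already 499,500 possible pairs and more than 166 million possible triples, which is why interaction terms are so much harder to estimate than single-mutation effects.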
How can this model help us plan our response to the pandemic?
One of the things we are learning is that genome sequencing of emerging viruses is part of the epidemic response. We are seeing a lot of genome sequencing, for example, in the monkeypox outbreak that is happening right now.
There is so much data that we can’t just look at it by eye. We need systematic, statistical machine-learning programs that help people discover new variants. As a tool to aid disease surveillance, this type of approach can be really useful. We’re trying to automate this model so we can run it regularly and see if it can flag things we need to worry about.
We found that by modeling mutations instead of just lineages, the model is smarter and learns faster. And the sooner you learn about the properties of a lineage, the better you know how concerned you should be.
I do not think that this model is a substitute for well-structured disease surveillance programs, such as those run by governments and international organizations. It is a helper tool for such programs that allows them to systematically check and rank emerging lineages. I think this kind of approach will become feasible for influenza and other viruses as data on them accumulate.
https://www.scientificamerican.com/article/this-ai-tool-could-predict-the-next-coronavirus-variant/