Researchers have used artificial intelligence (AI) to uncover 70,500 viruses previously unknown to science [1], many of them strange and nothing like any known species. The RNA viruses were identified using metagenomics, in which scientists sample all the genomes present in an environment without having to culture individual viruses. The method shows the potential of AI to explore the “dark matter” of the RNA virus universe.
Viruses are ubiquitous microorganisms that infect animals, plants and even bacteria, yet only a small fraction have been identified and described. There is “essentially a bottomless pit” of viruses to discover, says Artem Babaian, a computational virologist at the University of Toronto in Canada. Some of these viruses could cause disease in people, which means that characterizing them could help to explain mysterious illnesses, he says.
Previous research has used machine learning to find new viruses in sequencing data. A new study, published this week in Cell, takes that work a step further by applying machine learning to predicted protein structures.
The AI model incorporates a protein-structure-prediction tool called ESMFold, developed by researchers at Meta (formerly Facebook, headquartered in Menlo Park, California). A similar AI system, AlphaFold, was developed by researchers at Google DeepMind in London, who won the Nobel Prize in Chemistry this week.
Missed viruses
In 2022, Babaian and his colleagues searched 5.7 million genomic samples archived in publicly available databases and identified approximately 132,000 new RNA viruses [2]. Other groups have led similar efforts [3].
However, because RNA viruses evolve rapidly, existing methods for identifying them in genomic sequence data are likely to miss many. A common method is to look for sections of the genome that code for a key protein used in RNA replication, called RNA-dependent RNA polymerase (RdRp). But if the sequence that encodes this protein in a virus differs significantly from every known sequence, researchers won’t recognize it.
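The limitation described above can be illustrated with a toy sketch. It is not any group's actual method: real searches rely on alignment and profile-based tools, whereas this example swaps in a much cruder measure (the fraction of shared three-letter fragments, or k-mers) purely to show why a highly diverged sequence slips under a similarity cut-off. The sequences, the function names and the 0.3 threshold are all invented for illustration.

```python
def kmers(seq, k=3):
    """Return the set of all overlapping k-mers in a protein sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_similarity(a, b, k=3):
    """Jaccard similarity between the k-mer sets of two sequences."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def looks_like_known_rdrp(candidate, references, threshold=0.3, k=3):
    """True if the candidate resembles at least one reference sequence.

    The threshold is an arbitrary, assumed value chosen for this toy example.
    """
    return any(kmer_similarity(candidate, ref, k) >= threshold
               for ref in references)


# Made-up sequences for illustration only; these are not real RdRp proteins.
KNOWN = ["MSTGDDLVVICESAGTQEDAAALRAF"]
CLOSE = "MSTGDDLVVICESAGTQEDAASLRAF"     # one substitution from KNOWN[0]
DIVERGED = "MTSGNDMIAVCDTSGSNEEGAKLKGW"  # many substitutions

if __name__ == "__main__":
    print(looks_like_known_rdrp(CLOSE, KNOWN))     # close relative is detected
    print(looks_like_known_rdrp(DIVERGED, KNOWN))  # diverged sequence is missed
```

A sequence that has drifted far enough shares no fragments with the reference at all, so a search based on resemblance alone has nothing to latch on to.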
Shi Mang, a co-author of the Cell study and an evolutionary biologist at Sun Yat-sen University in Shenzhen, China, and his colleagues set out to look for previously unrecognized viruses in publicly available genomic samples.
They developed a model called LucaProt, using the “transformer” architecture that underpins ChatGPT, and fed it sequencing data and ESMFold protein-prediction data. They then trained the model to recognize viral RdRps and used it to find sequences encoding these enzymes, which is evidence that the sequences belong to a virus, in the large volumes of genomic data. Using this method, they identified about 160,000 RNA viruses, including some that are exceptionally long and found in extreme environments such as hot springs, salt lakes and the air. Just under half of them had never been described before. They found small pockets of RNA virus biodiversity in far-flung regions of evolutionary space, Babaian says.
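The article does not describe the pipeline's internals, so the following is only a rough sketch of what scanning genomic data for candidate enzyme-coding stretches can involve in general: translate each assembled sequence in all six reading frames, split the result at stop codons, and pass each protein stretch to a scoring function. The `scan` and `six_frame_orfs` helpers are hypothetical, and `toy_score` is a deliberately trivial placeholder for a trained model (it only checks for the short "GDD" motif); it is not LucaProt.

```python
from itertools import product

# Standard genetic code, built in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate DNA codon by codon; unrecognized codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_orfs(dna, min_len=5):
    """Yield stop-free protein stretches from all six reading frames."""
    dna = dna.upper()
    reverse_complement = dna.translate(COMPLEMENT)[::-1]
    for strand in (dna, reverse_complement):
        for offset in range(3):
            for piece in translate(strand[offset:]).split("*"):
                if len(piece) >= min_len:
                    yield piece


def toy_score(protein):
    """Hypothetical stand-in for a trained classifier (not LucaProt)."""
    return 1.0 if "GDD" in protein else 0.0


def scan(contigs, score, cutoff=0.5):
    """Return (contig_id, protein) pairs whose score meets the cutoff."""
    return [(cid, protein)
            for cid, seq in contigs.items()
            for protein in six_frame_orfs(seq)
            if score(protein) >= cutoff]


if __name__ == "__main__":
    # Invented example sequence, written in DNA letters.
    print(scan({"contig_1": "ATGGGTGATGATCTGTAA"}, toy_score))
```

In a real setting, the scoring function is where the difficulty lies; the point of a learned model is to replace a brittle rule such as the motif check with something that can recognize distant relatives.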
“This is a very promising approach to expanding the virosphere,” says Jackie Mahar, an evolutionary virologist at the CSIRO Australian Centre for Disease Preparedness in Geelong. Characterizing viruses will help researchers to understand the microbes’ origins and how they evolved in different hosts, she says.
And by expanding the pool of known viruses, Babaian says, it becomes easier to find more similar viruses. “Suddenly you can see things that you couldn’t see before.”
The team was unable to determine the hosts of the viruses it identified, which should be investigated further, says Mahar. Researchers are particularly interested in knowing whether any of the new viruses infect archaea, an entire branch of the tree of life that RNA viruses have not been clearly shown to infect.
Shi is currently developing models to predict the hosts of these newly identified RNA viruses. He hopes this will help researchers understand the role of viruses in environmental niches.