A stylized image of CLASSIX clustering results overlaid on an illustration of the coronavirus. Credit: CDC: phil.cdc.gov/Details.aspx?pid=23312
Scientists at the University of Manchester and the University of Oxford have developed an AI framework that can identify and track coronavirus variants of concern, and could potentially be useful for other infectious diseases in the future.
The framework combines dimensionality reduction techniques with a new explainable clustering algorithm called CLASSIX, developed by mathematicians at the University of Manchester. This makes it possible to quickly identify, from vast amounts of data, groups of viral genomes that may pose a future risk.
The research, published in the journal PNAS, could support traditional methods of tracking virus evolution, such as phylogenetic analysis, which currently requires extensive manual curation.
Roberto Cahuantzi, a researcher at the University of Manchester and lead author of the paper, said: “Since the emergence of COVID-19, we have seen multiple waves of new variants, with increased transmissibility, evasion of immune responses and increased severity of illness.
“Scientists are now ramping up efforts to identify these alarming new variants, such as Alpha, Delta and Omicron, at the earliest stages of their emergence. If we can find a way to do this quickly and efficiently, we can be more proactive in our response, for example by developing tailored vaccines, and may even be able to eliminate variants before they become established.”
Like many other RNA viruses, SARS-CoV-2, the virus that causes COVID-19, evolves very rapidly because of its high mutation rate and short generation time. This means that identifying new strains likely to become a problem in the future takes considerable effort.
Currently, approximately 16 million sequences are available in the GISAID database (the Global Initiative on Sharing All Influenza Data), which provides access to genomic data for influenza viruses.
Diagram showing the steps of the proposed method to identify emerging coronavirus disease (COVID-19) variants. Credit: University of Manchester
Mapping the evolution and history of all COVID-19 genomes from this data currently takes vast amounts of computer and human time.
The method described allows such tasks to be automated. The researchers processed 5.7 million high-coverage sequences in just one to two days using standard modern laptops. This would not be possible with existing methods. Because it needs fewer resources, the approach puts the identification of concerning pathogen strains in the hands of more researchers.
Thomas House, Professor of Mathematical Sciences at the University of Manchester, said: “An unprecedented amount of genetic data has been generated during the pandemic, and we need improved methods to analyze it thoroughly. The data continues to grow rapidly, but unless the benefit of curating it is demonstrated, it is at risk of being removed or deleted.
“We know that human expert time is limited, so our approach should not replace the work of humans altogether, but rather work alongside them so that the job can be done much faster and our experts are freed up for other important developments.”
The proposed method works by breaking down the genetic sequences of the coronavirus into small “words” (called 3-mers), which are represented as numbers by counting how often each one occurs. It then uses machine learning techniques to group similar sequences based on their word patterns.
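To make the "word counting" step concrete, here is a minimal sketch in Python of how a genome string might be turned into a vector of 3-mer counts. It is an illustration only, not the authors' code: the function names `kmer_counts` and `kmer_frequencies`, the use of overlapping windows, the skipping of windows containing ambiguous bases, and the normalization to frequencies are all assumptions made for this example.

```python
from itertools import product

ALPHABET = "ACGT"
K = 3  # word length; the article describes 3-mers

# All 4**3 = 64 possible 3-mers in a fixed lexicographic order,
# so every sequence maps to a vector with the same layout.
KMERS = ["".join(p) for p in product(ALPHABET, repeat=K)]
KMER_INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def kmer_counts(sequence):
    """Count overlapping 3-mers (assumption: overlapping windows).

    Windows containing characters other than A, C, G, T (for example
    the ambiguity code N) are skipped; this is also an assumption.
    """
    counts = [0] * len(KMERS)
    seq = sequence.upper()
    for i in range(len(seq) - K + 1):
        idx = KMER_INDEX.get(seq[i:i + K])
        if idx is not None:
            counts[idx] += 1
    return counts


def kmer_frequencies(sequence):
    """Normalize counts so that sequences of different length are comparable."""
    counts = kmer_counts(sequence)
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


if __name__ == "__main__":
    vec = kmer_counts("ACGTACGT")
    print(len(vec), vec[KMER_INDEX["ACG"]])
```

Each genome becomes a point in a 64-dimensional space, and it is points of this kind that a dimensionality reduction and clustering step can then group.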
Stefan Güttel, Professor of Applied Mathematics at the University of Manchester, said: “The clustering algorithm we developed, CLASSIX, is much less computationally demanding than traditional methods and is fully explainable, meaning that it provides textual and visual explanations of the computed clusters.”
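For intuition about how grouping by distance can be kept cheap, the following Python sketch groups numeric vectors with a simple greedy rule. It is a generic stand-in written for this illustration and is not the CLASSIX algorithm, whose details the article does not give. The function name `greedy_groups`, the `radius` parameter and the choice to sort by the first coordinate are all hypothetical.

```python
import math


def greedy_groups(vectors, radius):
    """Greedy distance-based grouping (illustrative stand-in, not CLASSIX).

    Points are visited in order of their first coordinate. Each point that
    is still unassigned starts a new group, and later points within
    `radius` (Euclidean distance) of that starting point join it.
    Because a difference in one coordinate never exceeds the Euclidean
    distance, the inner scan can stop as soon as that difference is
    larger than `radius`.
    """
    order = sorted(range(len(vectors)), key=lambda i: vectors[i][0])
    labels = [-1] * len(vectors)
    starts = []  # index of the starting point of each group
    for pos, i in enumerate(order):
        if labels[i] != -1:
            continue  # already absorbed into an earlier group
        labels[i] = len(starts)
        starts.append(i)
        for q in range(pos + 1, len(order)):
            j = order[q]
            if vectors[j][0] - vectors[i][0] > radius:
                break  # every later point is farther away than radius
            if labels[j] == -1 and math.dist(vectors[i], vectors[j]) <= radius:
                labels[j] = labels[i]
    return labels, starts


if __name__ == "__main__":
    points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
    labels, starts = greedy_groups(points, radius=0.5)
    for j, g in enumerate(labels):
        print(f"point {j} is in group {g}: within 0.5 of starting point {starts[g]}")
```

The printed lines hint at what a textual explanation could look like in such a scheme: every assignment can be traced back to one starting point and one distance threshold.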
Roberto Cahuantzi added: “Our analysis serves as a proof of concept, demonstrating the potential of machine learning methods as an alert tool for the early detection of emerging major variants without relying on phylogenetics.
“While phylogenetics remains the ‘gold standard’ for understanding viral ancestry, these machine learning methods can process orders of magnitude more sequences than current phylogenetic methods, and at lower computational cost.”
For more information:
Roberto Cahuantzi et al, Unsupervised identification of significant lineages of SARS-CoV-2 through scalable machine learning methods, Proceedings of the National Academy of Sciences (2024). DOI: 10.1073/pnas.2317284121. doi.org/10.1073/pnas.2317284121
Journal information:
Proceedings of the National Academy of Sciences