CASE STUDY

AutoML Optimizes and Accelerates COVID-19 Predictive Modeling

Automated Machine Learning Optimizes and Accelerates COVID-19 Predictive Modeling

Standard machine learning analyses of proteomic and metabolomic data from COVID-19 patients have produced biosignatures containing large numbers of predictors, which hampers their clinical application. Moreover, their performance often drops significantly when validated in independent groups, which is expected given that sample sizes are often unavoidably small. By applying Automated Machine Learning (AutoML), we aim to improve modeling and deliver models and signatures that can be readily translated into diagnostic assays to aid the fight against the pandemic.

Georgios Papoutsoglou, Computer Science Department, University of Crete
Makrina Karaglani, Laboratory of Pharmacology, Medical School, Democritus University of Thrace
Vincenzo Lagani, Institute of Chemical Biology, Ilia State University
Naomi Thomson, JADBio
Dimitri Oluf Røe, Department of Clinical and Molecular Medicine, Norwegian University of Science and Technology & Clinical Cancer Research Center, Department of Clinical Medicine, Aalborg University Hospital
Ioannis Tsamardinos, JADBio & Institute of Applied and Computational Mathematics, Foundation for Research and Technology–Hellas
Chatzaki Ekaterini, Laboratory of Pharmacology, Medical School, Democritus University of Thrace & Institute of Agri-Food and Life Sciences, Hellenic Mediterranean University Research Centre

www.nature.com/articles/s41598-021-94501-0

Abstract

The rapid outbreak of COVID-19 has put intense pressure on healthcare systems, creating an urgent demand for effective diagnostic, prognostic and therapeutic procedures. Despite the global scientific effort, there is a lack of efficient predictive models for patient stratification and successful management of the disease. Here, we employed Automated Machine Learning (AutoML) to analyze three publicly available COVID-19 datasets, including serum proteomic, metabolomic and transcriptomic measurements. Pathway analysis of the selected features was also performed. Analysis of a combined proteomic and metabolomic dataset produced ten equivalent signatures of two features each, with AUC 0.840 (CI 0.723 – 0.941) in discriminating severe from non-severe COVID-19 patients. A transcriptomic dataset led to two equivalent signatures of eight features each, with AUC 0.914 (CI 0.865 – 0.955) in distinguishing COVID-19 patients from those with a different acute respiratory illness. A second transcriptomic dataset led to two equivalent signatures of nine features each, with AUC 0.967 (CI 0.899 – 0.996) in distinguishing COVID-19 patients from virus-free individuals. Multiple new features emerged that are implicated in a wide range of pathways, including viral mRNA translation, interferon gamma signaling and the innate immune system. In conclusion, by applying AutoML, multiple biosignatures were built in a fast, automated way, with a reduced number of features and high predictive performance that remained high upon validation. These favorable characteristics make them well suited to further development into cost-effective clinical assays that contribute to better disease management. Our results also highlight the importance of revisiting precious and well-built datasets to extract the maximum number of conclusions from a given experimental observation.

Funding Statement: No funding was received for this research.

The AutoML approach has several advantages.

A. It significantly enhances productivity, which is particularly important in emergency situations such as COVID-19, where data need to be analyzed immediately to inform public policy. The execution time of the analyses presented ranges from 8 to 73 minutes, while the human effort required amounts to a few clicks.

B. It democratizes analysis for life scientists, as it can be performed by non-expert analysts through a graphical user interface. A medical doctor therefore has immediate access to the results without relying on analysts for modeling and interpretation.

C. It guarantees correctness, in the sense that the produced performance estimates follow best practices in the field and are not overestimated, avoiding common methodological pitfalls that are frequently encountered in ML analyses of omics data. For example, a common issue that leads to inflated performance estimates is pre-filtering features using label information on the complete dataset (e.g., by differential expression) and then cross-validating only the modeling algorithm on the same data (see Methods). Because the held-out samples have already influenced which features were kept, this approach can significantly overestimate the performance of the final model.

D. It guarantees optimization, as the returned models are competitive, in terms of predictive performance, with models built by human experts. In addition, the returned signatures (selected feature subsets) are often smaller than the ones returned by hand-crafted code scripts. This is of particular significance if the returned models are to be translated into benchtop assays for clinical use. Of course, the human data scientist is still required to collect and prepare the data and, most importantly, to formulate a scientific problem or working hypothesis as a machine learning problem.

E. It supports data and model provenance, making results replicable and reproducible. This addresses the problem of being unable to reproduce the ML results of a published paper because the code has changed over time. In AutoML, functional result links (like those presented here) record the code version used, which can be restored to reproduce an old analysis.
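The pitfall described under point C can be illustrated with a small, self-contained sketch. It is not the pipeline used in the study: the data are synthetic pure noise, the classifier is a simple nearest-centroid rule, and all function names are hypothetical. Because the features carry no information about the labels, an honest estimate should sit near 0.5, whereas selecting features on the complete dataset before cross-validation typically yields a much higher, misleading figure.

```python
import random

def make_noise_data(n_samples=40, n_features=2000, seed=0):
    """Synthetic pure-noise data: no feature is related to the labels."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
         for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]  # balanced binary labels
    return X, y

def select_top_features(X, y, k):
    """Rank features by absolute difference in class means (uses labels)."""
    n1 = sum(y)
    n0 = len(y) - n1
    scores = []
    for j in range(len(X[0])):
        m1 = sum(row[j] for row, lab in zip(X, y) if lab == 1) / n1
        m0 = sum(row[j] for row, lab in zip(X, y) if lab == 0) / n0
        scores.append((abs(m1 - m0), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]

def nearest_centroid_accuracy(X_tr, y_tr, X_te, y_te, features):
    """Fit class centroids on the training rows, score the test rows."""
    centroids = {}
    for cls in (0, 1):
        rows = [r for r, lab in zip(X_tr, y_tr) if lab == cls]
        centroids[cls] = [sum(r[j] for r in rows) / len(rows) for j in features]
    correct = 0
    for row, lab in zip(X_te, y_te):
        dist = {c: sum((row[j] - m) ** 2 for j, m in zip(features, centroids[c]))
                for c in (0, 1)}
        correct += min(dist, key=dist.get) == lab
    return correct / len(y_te)

def cross_validate(X, y, k_features=10, n_folds=5, leaky=False):
    """k-fold CV; leaky=True selects features once on ALL samples (the pitfall)."""
    idx = list(range(len(y)))
    folds = [idx[f::n_folds] for f in range(n_folds)]
    global_feats = select_top_features(X, y, k_features) if leaky else None
    accs = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        X_tr, y_tr = [X[i] for i in train_idx], [y[i] for i in train_idx]
        X_te, y_te = [X[i] for i in test_idx], [y[i] for i in test_idx]
        # Honest variant: selection is repeated inside each training fold.
        feats = global_feats if leaky else select_top_features(X_tr, y_tr, k_features)
        accs.append(nearest_centroid_accuracy(X_tr, y_tr, X_te, y_te, feats))
    return sum(accs) / len(accs)

X, y = make_noise_data()
leaky_acc = cross_validate(X, y, leaky=True)
honest_acc = cross_validate(X, y, leaky=False)
print(f"selection on full data: {leaky_acc:.2f}")
print(f"selection inside folds: {honest_acc:.2f}")
```

The only difference between the two runs is where the feature selection happens. Keeping every label-dependent step inside the training portion of each fold is what prevents the optimistic bias.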

The limitations of the present study are the small number of datasets and of subjects included. As more -omics readings become available from COVID-19 related study groups, the same approach might deliver better classifying models. In addition, the partitioning of the available datasets into training and validation sets was performed only once. Ideally, one would repeat the process several times and report the average behavior. However, this approach would make the exposition of the biological results less clear. Finally, a major limitation is the lack of external validation sets for the identified signatures and models.
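The idea of repeating the training/validation partitioning and reporting the average behavior can be sketched as follows. This is an illustrative outline only, not the procedure used in the study: the data are synthetic, the scoring model is a simple projection onto the difference of class means, and every function name is hypothetical.

```python
import random
import statistics

def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney)."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def stratified_split(labels, test_fraction, rng):
    """Random split that keeps the class proportions in both parts."""
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        members = [i for i, lab in enumerate(labels) if lab == cls]
        rng.shuffle(members)
        n_test = max(1, round(test_fraction * len(members)))
        test_idx += members[:n_test]
        train_idx += members[n_test:]
    return train_idx, test_idx

def fit_mean_difference(X_tr, y_tr):
    """Toy model: score a sample by projecting it on (mean_1 - mean_0)."""
    def centroid(cls):
        rows = [r for r, lab in zip(X_tr, y_tr) if lab == cls]
        return [sum(col) / len(rows) for col in zip(*rows)]
    m0, m1 = centroid(0), centroid(1)
    direction = [a - b for a, b in zip(m1, m0)]
    return lambda x: sum(d * v for d, v in zip(direction, x))

def repeated_holdout(X, y, fit, n_repeats=20, test_fraction=0.3, seed=0):
    """Repeat the train/validation partition and summarize validation AUC."""
    rng = random.Random(seed)
    aucs = []
    for _ in range(n_repeats):
        tr, te = stratified_split(y, test_fraction, rng)
        model = fit([X[i] for i in tr], [y[i] for i in tr])
        aucs.append(auc([y[i] for i in te], [model(X[i]) for i in te]))
    return statistics.mean(aucs), statistics.stdev(aucs)

# Synthetic example: class 1 is shifted by one unit on the first feature.
rng = random.Random(1)
y = [i % 2 for i in range(60)]
X = [[rng.gauss(float(lab), 1.0), rng.gauss(0.0, 1.0)] for lab in y]
mean_auc, sd_auc = repeated_holdout(X, y, fit_mean_difference)
print(f"validation AUC over 20 splits: {mean_auc:.3f} (SD {sd_auc:.3f})")
```

Reporting the spread alongside the mean shows how much a single partition could have deviated from the typical result, which is the concern raised above.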

#COVID19, #automatedMachineLearning, #SARS-CoV-2, #modeling, #predictivemodels, #validation
