COVID-19 Research

While the battle against SARS-CoV-2 continues, we’re contributing to the research community by putting JADBio AutoML at the service of biologists, virologists and anyone who needs to discover knowledge fast. If you’re also working on SARS-CoV-2 research, let us know and we’ll include you in our Research Licensing Program today!


CASE STUDY

Automated Machine Learning in healthcare and medical diagnosis: COVID-19


A tremendous scientific effort is under way to bring the rapidly evolving coronavirus pandemic under control. Although large volumes of data are being collected every day, there is still considerable debate about the optimal predictive models for patient stratification, the host-virus molecular footprints for drug treatment, and the successful management of the disease. Here, we investigated the performance of automated machine learning by using Just Add Data Bio (JADBio), a tool specifically designed for low-sample, high-dimensional biomedical datasets, to analyze publicly available COVID-19 datasets. Quickly, automatically and with minimal human effort, we built multiple new biosignatures with a reduced number of features and high predictive performance that remained high upon validation. The resulting models are readily available for translation into clinically relevant, cost-effective COVID-19 assays.

Authors: Georgios Papoutsoglou, Makrina Karaglani, Vincenzo Lagani, Naomi Thomson, Oluf Dimitri Røe, Ioannis Tsamardinos, Ekaterini Chatzaki

Fig. 1  Graphical summary of a COVID-19 dataset re-analyzed in this study (Mick et al., 2020). The task was to compare COVID-19 patients to those with another or no acute respiratory infection (ARI). To validate the predictive performance of AutoML, we performed stratified subsampling on the original data: 70% of the samples were assigned for model training and 30% for validation.
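
The 70/30 stratified subsampling described in the caption can be illustrated with a minimal sketch. The function name `stratified_split`, the seed and the synthetic labels below are assumptions made for illustration only; this is not the code used in the study.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_idx, valid_idx) so each class keeps roughly the same
    proportion in both parts. Hypothetical helper, for illustration."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, valid_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Split each class separately so class ratios are preserved.
        n_train = round(train_fraction * len(members))
        train_idx.extend(members[:n_train])
        valid_idx.extend(members[n_train:])
    return sorted(train_idx), sorted(valid_idx)


# Synthetic labels (assumption): 1 = COVID-19, 0 = other or no ARI.
labels = [1] * 40 + [0] * 60
train_idx, valid_idx = stratified_split(labels)
print(len(train_idx), "training samples,", len(valid_idx), "validation samples")
```

Splitting within each class, rather than over the pooled samples, keeps the proportion of COVID-19 cases the same in the training and validation parts, which matters when classes are imbalanced.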

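| Analysis | Methodology | FS option (JADBio) | FS algorithm | Modeling algorithm | # conf. tried | # models trained | Exec. time | Training estimate | Validation estimate | # features selected | Link to results on the JADBio platform |
|---|---|---|---|---|---|---|---|---|---|---|---|
| subsampled | Mick et al. | | lasso | random forest | 5 | 25 | | 0.957 (0.9 - 1) | 0.944 | 26 | |
| subsampled | JADBio | non-aggressive | lasso | random forest | 3017 | 60340 | 26min | 0.937 [0.883 - 0.979] | 0.943 | 24 | link |
| subsampled | JADBio | aggressive | SES | random forest | 1393 | 41790 | 24min | 0.918 [0.863 - 0.959] | 0.923 | 25 | link |
| original | Mick et al. | | lasso | random forest | 5 | 25 | | 0.98 (0.951 - 1) | - | 26 | |
| original | JADBio | non-aggressive | lasso | random forest | 3017 | 60340 | 41min | 0.948 [0.908 - 0.979] | - | 49 | link |
| original | JADBio | aggressive | SES | random forest | 1393 | 27860 | 22min | 0.914 [0.865 - 0.955] | - | 8 (2 equiv.) | link |
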
Predictive Performance Estimates COVID-19 - JADBio AutoML

Fig. 2 Predictive performance estimates, in terms of AUC, as reported in the original dataset publication, by JADBio on the full set of available data (i.e. no samples are lost to estimation) and by JADBio on the training and validation sets (subsampled). Numbers in parentheses denote the range of the estimate, while numbers in brackets denote the 95% confidence intervals. The equivalences denote the number of equivalent signatures found by JADBio, e.g., “8 (2 equiv.)” means that JADBio discovered 2 equivalent signatures, each containing 8 biomarkers. Each link to the JADBio platform leads to a report with the complete list of AutoML results. JADBio does not overestimate performance when no samples are held out for estimation, confirms the predictive performance reported in the original publication, and discovers novel signatures.

DISCUSSION

Can AutoML improve the diagnostic/predictive models for COVID-19?

In this study, we applied AutoML to obtain accurate diagnostic/predictive models for COVID-19 using available archived datasets. We asked: Can we improve on the predictive power of the models? Can we reduce the number of measurements required, without sacrificing performance, to develop a cost-effective laboratory test? Can we obtain more accurate training estimates that better reflect the performance anticipated in a real-life setting? Most importantly, can AutoML improve on these aspects in a fully automated mode?
Using AutoML, we answered all of these research questions affirmatively. That is, our approach performed on par with or better than the published results and the results obtained by running the previously used code and methodology on our training sets. Importantly, the respective predictive performance estimates accurately reflect the performance obtained on the validation sets, so we argue that there is no need to lose samples to estimation. JADBio handles estimation internally, so the user does not have to worry about it: it performs cross-validation; repeats the cross-validation with different fold partitions for low sample sizes to reduce the variance of estimation; stratifies the partitioning into folds to reduce the variance of estimation and handle imbalanced data; corrects the performance estimate for the “winner’s curse” of trying multiple algorithms, using bootstrap bias correction for CV; and includes all steps of the analysis (e.g., feature selection) within the cross-validation, since performing them outside it leads to overestimation.
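
To make the “winner’s curse” correction more concrete, here is a minimal sketch of the general idea behind bootstrap bias correction for CV. It is an illustration under stated assumptions, not JADBio’s implementation: the function names `auc` and `bbc_cv`, the number of bootstrap rounds and the synthetic noise predictions are all hypothetical. The input is a table of pooled out-of-sample predictions, one row per configuration. In each bootstrap round the winning configuration is chosen on the resampled rows only and then scored on the samples left out of that round.

```python
import random


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly;
    ties count as 0.5. Returns None if one class is absent."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bbc_cv(pooled_preds, labels, n_boot=100, seed=0):
    """pooled_preds[c][i] is the out-of-sample score that configuration c
    gave to sample i during cross-validation (hypothetical layout)."""
    rng = random.Random(seed)
    n = len(labels)
    estimates = []
    for _ in range(n_boot):
        in_bag = [rng.randrange(n) for _ in range(n)]
        in_set = set(in_bag)
        out_bag = [i for i in range(n) if i not in in_set]

        # Pick the winner using the bootstrap sample only.
        in_labels = [labels[i] for i in in_bag]
        in_aucs = [auc([p[i] for i in in_bag], in_labels) for p in pooled_preds]
        if any(a is None for a in in_aucs):
            continue
        best = max(range(len(pooled_preds)), key=lambda c: in_aucs[c])

        # Score the winner on samples that played no part in selecting it.
        out_auc = auc([pooled_preds[best][i] for i in out_bag],
                      [labels[i] for i in out_bag])
        if out_auc is not None:
            estimates.append(out_auc)
    return sum(estimates) / len(estimates)


# Synthetic demo (assumption): 30 configurations that predict pure noise.
rng = random.Random(42)
labels = [i % 2 for i in range(60)]
pooled_preds = [[rng.random() for _ in labels] for _ in range(30)]

naive_best = max(auc(p, labels) for p in pooled_preds)
bbc_estimate = bbc_cv(pooled_preds, labels)
print(f"naive best AUC: {naive_best:.3f}  bias-corrected: {bbc_estimate:.3f}")
```

Because every configuration here is noise, the true AUC of any of them is 0.5. Reporting the best cross-validated AUC among many configurations tends to be optimistic, whereas the out-of-bag score of the selected configuration is not inflated by the selection step.
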
Thus, we advocate the use of all data for training with JADBio. Of course, this claim comes with an important disclaimer: JADBio’s theoretical guarantees on the out-of-sample performance estimate hold only when the model is applied to data from the same distribution. If the models are applied in a clinical setting where measurements have batch effects, the population characteristics are different, or there are other systematic differences in the data, an external validation set from that operational environment is clearly required. An additional advantage of this AutoML approach is that it can work in two modes: with and without aggressive feature selection. The aggressive mode may give up some predictive power to produce models with multiple (in the case of biological redundancy) equivalent biosignatures of few, selective predictive features, giving designers of diagnostic assays a choice. Accordingly, we were able to deliver several highly diagnostic/prognostic biosignatures of minimal feature size from different types of COVID-19 data.

REFERENCES

Mick, E., Kamm, J., Pisco, A.O. et al. Upper airway gene expression reveals suppressed immune responses to SARS-CoV-2 compared with other respiratory viruses. Nat Commun 11, 5854 (2020). https://doi.org/10.1038/s41467-020-19587-y


Who is JADBio AutoML for?

JADBio stands for Just Add Data and aims to make machine learning accessible to all, regardless of expertise or programming skills. Whether you’re a bioinformatician, a data scientist, or a non-expert in data science interested in getting the most out of your data, JADBio’s robust AutoML automates the machine learning process, making it easy and affordable to discover knowledge while reducing time and effort. Focus on what matters: your data insights.
