What is AutoML?
If you look up Wikipedia, you’ll read: “Automated machine learning (AutoML) is the process of automating the process of applying machine learning to real-world problems. AutoML covers the complete pipeline from the raw dataset to the deployable machine learning model. AutoML was proposed as an artificial intelligence-based solution to the ever-growing challenge of applying machine learning. The high degree of automation in AutoML allows non-experts to make use of machine learning models and techniques without requiring them to become experts in the field first.”
Basically, it hits the nail on the head, but if we were Wikipedia editors we would make a slight change where it says “to the ever-growing challenge of applying machine learning”. The actual challenge in today’s world is the ever-growing amount of data. We’ve read that biomedical data doubles every year, and although we don’t have credible statistics on the actual overall size or growth rate, we do know the amount of collected data is huge. Machine learning has the tools to tackle data of all sorts and sizes, but there aren’t enough data scientists or analysts to go around, or enough hours in the day for them to work through the huge amounts of data produced daily. This may sound less significant when we’re analyzing demographics to predict election outcomes (the US 2020 election polls proved it was anything but mundane), but it’s far more crucial when it comes to health data and unlocking valuable information that can lead to cures for serious diseases. So how do we go about scaling and streamlining machine learning?
Until a few years ago the field of Machine Learning was limited to academic research and to a few select data scientists who knew how to program. This has changed significantly for three main reasons:
- The cost to store, access, and analyze Big Data has fallen significantly, opening up its potential for widespread use
- Cutting-edge technological advances, including Artificial Intelligence applications and Cloud Computing, have given many organisations broader access to these capabilities
- Several industries have embraced technologies like sensors (IoT) and other smart solutions, almost eliminating asset downtime. This means more data and an increased need for actionable insights. Meanwhile, while technology has reduced costs, it has also created a hunger for real-time insights.
Basically, it’s a combination of technological innovations and economic factors that has shifted machine learning from an apparatus in the academic’s lab to a power tool on the industry floor.
Why AutoML?
Machine learning for a data scientist involves preparing data, training models, manually tuning hyperparameters and compressing models, and iterating through thousands of candidate models, which takes days or weeks to do by hand. AutoML aims to fully automate the machine learning process end-to-end, democratizing machine learning for non-experts and drastically increasing the productivity of expert analysts. Its goals are to completely automate the application of machine learning, statistical modeling, data mining, pattern recognition, and all advanced data analytics techniques. So, beyond providing an efficient productivity tool, it strives to reach credible results, shielding against statistical and methodological errors, and even to surpass the performance of manual expert analysis, for example by using meta-level learning.
AutoML focuses on three targets:
- Accelerate human productivity while cutting costs
- Democratize machine learning for all irrespective of the level of expertise
- Improve the replicability of analyses, ease the sharing of results, and facilitate collaborative analyses
Let’s Clarify What Makes a Machine Learning Solution Truly Automated
The minimal requirement of an AutoML platform is the ability, given a data source (e.g., a 2-dimensional matrix of tabular data), to return (a) a predictive model that can be applied to new data, and (b) an estimate of the predictive performance of that model. Thus, DIY tools that allow you to graphically construct the analysis pipeline (e.g., Microsoft’s Azure ML) are not considered AutoML platforms. Open-source libraries and services like scikit-learn, Weka, and Keras require coding knowledge from the user and thus are not AutoML in the sense suggested here. AutoML services usually include a user interface, and while they strive to make machine learning friendly to coders, they also want anybody with a computer to be able to use them, typically offering a much wider range of functionalities.
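To make that minimal contract concrete, here is a small sketch in Python. It uses scikit-learn purely as an illustration; the function name automl_fit is our own invention, not any vendor’s API, and a real platform would search over many models rather than fixing one.

```python
# A minimal sketch of the AutoML "contract": data in, fitted model plus
# performance estimate out. `automl_fit` is a hypothetical name.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def automl_fit(X, y):
    """Given tabular data, return (model, estimated out-of-sample performance)."""
    model = RandomForestClassifier(random_state=0)  # stand-in for a real model search
    # Estimate out-of-sample performance by cross-validation, so the estimate
    # is not inflated by evaluating on the training samples themselves.
    score = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    model.fit(X, y)  # the final model is trained on all available data
    return model, score
```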
Algorithmically, AutoML encompasses techniques for hyper-parameter optimization (HPO), combined algorithm selection and hyper-parameter optimization (CASH), automatic synthesis of analysis pipelines, performance estimation, and meta-level learning, to name a few. In addition, an AutoML system could automate not only the modeling process, but also the steps that come before and after it. Pre-analysis steps include data integration, data preprocessing, data cleaning, and data engineering (feature construction). Post-analysis steps include interpretation, explanation, and visualization of the analysis process and the output model, putting the model into production, model monitoring, and model updating. The ideal AutoML system should only require the human to specify the data source(s), their semantics, and the goal of the analysis in order to create and maintain a model in production indefinitely.
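As a rough illustration of CASH, the sketch below scores a handful of (algorithm, hyper-parameter) configurations by cross-validation and keeps the winner. This is a toy version under obvious simplifications; real systems search vastly larger spaces with smarter strategies such as Bayesian optimization or successive halving.

```python
# A toy CASH loop: jointly choose the algorithm and its hyper-parameters.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

SEARCH_SPACE = [
    (LogisticRegression, {"C": 0.1, "max_iter": 1000}),
    (LogisticRegression, {"C": 1.0, "max_iter": 1000}),
    (RandomForestClassifier, {"n_estimators": 100, "random_state": 0}),
    (RandomForestClassifier, {"n_estimators": 500, "random_state": 0}),
]

def cash_search(X, y):
    """Return the best (fitted estimator, CV score) over the search space."""
    best_config, best_score = None, float("-inf")
    for algo, params in SEARCH_SPACE:
        score = cross_val_score(algo(**params), X, y, cv=5).mean()
        if score > best_score:
            best_config, best_score = (algo, params), score
    algo, params = best_config
    model = algo(**params).fit(X, y)  # refit the winner on all the data
    return model, best_score
```

Note that reporting the same cross-validation score that guided the selection is optimistically biased; this is exactly the kind of methodological pitfall discussed below, and careful systems correct for it with protocols such as nested cross-validation.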
One of the first AutoML solutions was the Gene Expression Model Selector (GEMS), introduced as early as 2004, and since then several academic and commercial solutions have appeared. Comparatively evaluating them is not easy: you need lots of datasets with different characteristics, extensive computational time, and, in the end, ways to challenge them fairly. Researchers Iordanis Xanthopoulos, Ioannis Tsamardinos, Vassilis Christophides, and SAP’s Eric Simon and Alejandro Salinger argue, in Putting the Human Back in the AutoML Loop, that while “AutoML strives to take the human expert out of the ML loop, unfortunately, it seems the majority of AutoML surveys and evaluations also take the human user out of the loop, focusing solely on predictive performance and ignoring the user experience for the most part”. They undertook a qualitative evaluation of different AutoML platforms including Auger.ai, BigML, Darwin, H2O’s Driverless AI, RapidMiner, IBM’s Watson, and JADBio, which was launched in November 2019 by Professor I. Tsamardinos and focuses on the analysis of molecular biological data (small-sample, high-dimensional), with an emphasis on feature selection.
As the authors of the above paper acknowledge, “…statistical estimations are particularly challenging with low samples; even more so with high dimensional data. Is performance overestimated, standard deviations underestimated, probabilities of individual predictions uncalibrated, feature importances accurate, or multiple feature subsets returned not statistically equivalent? Which AutoML services return reliable results one can trust, and which ones are actually misleading the user and potentially harmful? In the case of medical applications, overestimating performance or confidence in a prediction (uncalibrated predicted probabilities) is dangerous and could impact human health, while in business applications it may have significant monetary costs”. While arguing that significant experimentation is needed to test all the different services, they foresee a future for AutoML and predict that “…within a few years, most of data analysis will involve the use of an AutoML service or library; scripting as a means to manual ML analysis will gradually become obsolete or pass to the next level, where it is customizing and invoking AutoML functionalities”.
While prediction reliability is crucial, users also care about the interpretability of results. They need to understand the patterns in their data, and to visualize and interpret them. They also need to be able to examine the analysis process and verify its correctness or optimality. As Dr. Ioannis Tsamardinos, CEO of JADBio, says: “AutoML should automate, not obfuscate. The analysis process should be transparent, verifiable, and customizable by the user”.
Can AutoML replace Data Scientists?
Did Python reduce the need for code developers? No. On the contrary, there is an ever-growing need for people who can code in Python. Once a task is automated, humans move on to other tasks that have yet to be automated. And there never seems to be a shortage of interesting things to do. AutoML frees data scientists to focus on formulating the analysis problem in different ways, exploring more options, interpreting results, and applying ML to data there was previously no time for.
As Tommy Blanchard, data science manager at Klaviyo, says on Towards Data Science: “I’ve had people ask me if I’m worried about my job security as a data scientist. No, I am not. I can’t wait until these tools are there and open source so I can just type “import machinelearn” and just have it do the stupid hyperparameter optimization and I can get on with the hard part of the job”. He understandably argues there’s more to be expected from these automation tools. We wouldn’t expect less.
AutoML is here to expedite data scientists’ work, let them try more things, and free them to focus on what matters: better insights.
A Few Words About JADBio AutoML
JADBio is the culmination of 20 years of machine learning research and know-how in biomedical data analysis, several years of development, and hundreds of thousands of lines of computer code. JADBio accepts a 2D data matrix where the rows correspond to samples (e.g., molecular profiles) and the columns to features (variables, quantities, attributes). One of the features is selected as the outcome of interest to model and learn to predict. The outcome can be a binary quantity, a discrete quantity, a continuous quantity, or a time to an event of interest, leading to binary classification, multi-class classification, regression, and time-to-event analysis, respectively.

JADBio automatically searches the space of combinations of algorithms for all steps of the analysis, namely preprocessing, imputation of missing values, feature selection, and modeling, and their corresponding hyper-parameter values (tuning parameters). It thus tries thousands of analysis pipelines, called configurations, to identify the best one (i.e., it performs HPO) and produce the final model. It returns (a) a final model, (b) estimates of its out-of-sample (i.e., on new samples) predictive performance, and (c) the selected features after removing the irrelevant and redundant ones. JADBio performs what is called multiple feature selection, returning multiple selected feature subsets that lead to equally predictive models.

It also returns numerous other useful pieces of information, such as each feature’s importance (i.e., its added value to the final performance), an explanation of each feature’s role in the model (e.g., to allow one to understand whether it is a risk factor or a protective factor), an indication of the samples that could be mislabeled, results for each configuration and algorithm tried, and the best approximation achieved with a humanly interpretable model (e.g., a linear one). The final model can be downloaded in executable form, applied to an external validation set, or run manually by feeding in the observed values of the selected features.
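To make the notion of a configuration concrete, here is a generic sketch of such a pipeline search in scikit-learn. This is purely illustrative: the imputation strategies, feature counts, and candidate models below are arbitrary assumptions, not JADBio’s actual algorithm set or search strategy.

```python
# A generic sketch of analysis-pipeline "configurations": imputation +
# scaling + feature selection + model, each scored by cross-validation.
from itertools import product
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def best_configuration(X, y):
    """Try every configuration; return the winning fitted pipeline and its score."""
    configs = product(
        ["mean", "median"],                  # imputation strategies
        [10, 50],                            # features to keep (assumes >= 50 features)
        [LogisticRegression(max_iter=1000),  # candidate models
         RandomForestClassifier(random_state=0)],
    )
    best_pipe, best_score = None, float("-inf")
    for strategy, k, model in configs:
        pipe = Pipeline([
            ("impute", SimpleImputer(strategy=strategy)),
            ("scale", StandardScaler()),
            ("select", SelectKBest(k=k)),
            ("model", model),
        ])
        score = cross_val_score(pipe, X, y, cv=5).mean()
        if score > best_score:
            best_pipe, best_score = pipe, score
    return best_pipe.fit(X, y), best_score
```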
See It In Action
The Thymoma case study was the result of a collaboration with the AWG on cancer subtype classification. Primary tumor biopsies from 117 thymoma patients were profiled by The Cancer Genome Atlas (TCGA) for copy number variation, gene expression, methylation levels, microRNA, and genomic mutations. Four subclasses were defined in a previous study based on all 27796 molecular measurements combined, using a multi-omics clustering approach. The 2-dimensional 117×27797 data matrix with the measurements (rows corresponding to samples from individual patients, columns corresponding to the features and the outcome) was uploaded to JADBio (Figure 1, left) as a CSV file. Subsequently, the column with the outcome (thymoma subclass) was selected (Figure 1, right) and the analysis began. That was all a user was required to do. Once the analysis began, JADBio’s AI system searched the space of possible models to identify the optimal one and estimate its performance. For this analysis, JADBio trained 20890 models within 41 minutes using 16 CPU cores. The winning model turned out to be a Random Forest ensemble of 100 Decision Trees after feature selection with the LASSO algorithm.
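For readers who think in code, the shape of that winning pipeline (LASSO-style feature selection feeding a 100-tree Random Forest) would look roughly like the scikit-learn sketch below. This illustrates the pipeline’s structure only; JADBio’s actual algorithms, hyper-parameter tuning, and performance-estimation protocol differ.

```python
# Illustrative sketch of the winning pipeline's shape: L1-penalized
# (LASSO-style) feature selection followed by a 100-tree Random Forest.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("scale", StandardScaler()),
    # An L1-penalized model used purely as a feature filter; C is an
    # arbitrary assumption here, not a tuned value.
    ("lasso_select", SelectFromModel(
        LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=5000))),
    ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
])
# pipeline.fit(X, y) would be called on the 117 x 27796 feature matrix
# with the 4-class thymoma subtype as the outcome y.
```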
JADBio presented multiple visuals, graphs, and reports; a few examples are included in Figure 2. Figure 2(a) shows the estimated out-of-sample ROC curve along with confidence intervals. The AUC averaged over all classes for the best predictive model is 0.976 (C.I. 0.931 – 1.000). Feature selection for the winning model returned 25 features. However, JADBio also reported the best approximate model that is humanly interpretable. The best interpretable model can distinguish the four thymoma subclasses based on just two molecular features, with a slightly lower AUC of 0.946. Multiple signatures are reported, i.e., combinations of two features that lead to an equally predictive model (multiple feature selection); some are shown in Figure 2(b). The first one reported combines the expression values of the gene CD3E and the miRNA miR-498. miR-498 is the marker most associated (pairwise) with the outcome. However, CD3E ranks only 190th in terms of pairwise association with the outcome! In other words, if one performs standard differential expression analysis, they will have to select 189 other markers before reaching CD3E. In contrast, JADBio’s feature selection algorithms recognize and filter out the redundant features. Figure 2(c) shows the individual importance of each of the two features of the reference signature, measured as the drop in relative performance when that single feature is removed from the model.
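That importance measure, the drop in performance when one feature is left out, can be sketched generically as a “drop-column” computation. The snippet below illustrates the general idea only, not JADBio’s exact protocol, and assumes X is a pandas DataFrame.

```python
# Generic "drop-column" importance: retrain without each feature and
# measure the drop in cross-validated performance.
from sklearn.base import clone
from sklearn.model_selection import cross_val_score

def drop_column_importance(model, X, y, cv=5):
    """Map each column of DataFrame X to its performance drop when removed."""
    baseline = cross_val_score(model, X, y, cv=cv).mean()
    importances = {}
    for col in X.columns:
        reduced = X.drop(columns=[col])
        score = cross_val_score(clone(model), reduced, y, cv=cv).mean()
        importances[col] = baseline - score  # larger drop = more important
    return importances
```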
JADBio makes it easy and affordable for health-data analysts and life science professionals to use data science to discover knowledge while reducing time and effort by combining a robust end-to-end machine learning platform with a wealth of capabilities.
Grab a FREE Basic plan here – What are you waiting for?
JADBio’s easy-to-use interface allows biologists, bioinformaticians, clinicians, and non-expert analysts to perform sophisticated analyses with the click of a button. It produces multiple visuals, graphs, and reports in order to provide intuition and understanding and to support decision making. Novel statistical methods avoid overfitting and overestimation of performance even for low sample sizes. It performs feature selection (biosignature identification) by removing not only irrelevant but also redundant features (markers) for prediction. It has been validated on hundreds of public datasets, producing novel scientific results.
Further reading:
Putting the Human Back in the AutoML Loop, EDBT/ICDT Workshops 2020
Just Add Data: Automated Predictive Modeling and BioSignature Discovery, bioRxiv 2020
Automated Mortality Prediction in Critically-ill Patients with Thrombosis using Machine Learning, BIBE 2020
Accurate Blood-Based Diagnostic Biosignatures for Alzheimer’s Disease via Automated Machine Learning, PMC 2020