AutoML vs HPO vs CASH: what is the difference?

First Published in Towards Data Science


Automated Machine Learning, or AutoML for short, is on the rise. More and more commercial products, academic tools, and public, open-source AutoML libraries appear on the market. As with every technology that is new, unclear, and nebulously defined, AutoML is misunderstood. On one end, there are grandiose claims that it will send data analysts home; on the other, extreme statements that it automates only the trivial part of the analysis. This is because different people give different definitions to AutoML. Let us examine them.

The vision of (predictive) AutoML: What exactly is AutoML? Wikipedia* defines it as the automation of the machine learning process end-to-end. But isn’t machine learning automated already? I have a dataset, I run a learning algorithm (e.g., Random Forest, SVM) with the default hyper-parameter values, and boom, I have a model instance. That was automatic, wasn’t it? So, what is the fuss all about?

The goal of AutoML is to gradually replace more and more of the functionalities that a human expert analyst would provide. These go way beyond just delivering a predictive model. A good data scientist would provide you with graphs, visuals, interpretations, insight, and suggestions. They would determine not only the appropriate classification algorithms to try, but also specialized preprocessing for your data types; they would try various representations of the data or the ML task and apply feature extraction methods. And it does not stop there. After the model is in production, the data analyst needs to keep monitoring the sanity of the predictions in case the statistical distribution of the data drifts; if it does, an alarm needs to be raised and the model retrained. Most importantly, a good analyst will help you make optimal decisions based on your model, such as what the optimal classification threshold is when you apply the model, and which factors (features, variables) may affect your subscribers’ behavior. That is what AutoML should be all about. AutoML is the end goal, but we are not fully there just yet.

Hyper-Parameter Optimization (HPO): One of the sub-problems of an AutoML platform is to deliver the best possible model instance out of all available algorithms at its disposal. Each of these algorithms is “tunable” using hyper-parameters. Hyper-parameters are inputs to the algorithm that modify its behavior. Running the same algorithm with different hyper-parameter values can lead to quite different predictive model instances. So, the question that arises is how to tune their values and obtain the best possible model instance. The interpretation of the hyper-parameters often relates to how sensitively the algorithm detects patterns and fits the data. For example, the lower the K in K-nearest neighbors or the lower the weight penalty lambda in ridge logistic regression, the more complex the model will be, the better it will fit the training data, and the higher the probability of overfitting. Some algorithms take more than one hyper-parameter; e.g., the XGBoost algorithm accepts about a dozen hyper-parameters related to tuning learning performance! Tuning the hyper-parameters of an algorithm can make a world of difference in predictive performance. In our personal experience, tuning “a few good algorithms” is arguably more important than using numerous algorithms with default settings. Hyper-Parameter Optimization (HPO) is the problem of automating the tuning of the hyper-parameters of a given algorithm. Several basic and advanced algorithms have appeared in the literature for HPO. HPO is particularly challenging for algorithms with many hyper-parameters to tune.
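To make HPO concrete, here is a minimal sketch in Python with scikit-learn: a cross-validated grid search over the K of a K-nearest-neighbors classifier. The feature matrix X and label vector y are placeholders for any dataset, and the grid and scoring metric are purely illustrative.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# HPO for a single algorithm: try several values of K and keep the one
# with the best cross-validated score.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 11, 21]},  # smaller K = more complex model
    scoring="roc_auc",
    cv=5,
)
search.fit(X, y)  # X, y: any feature matrix and binary label vector
print(search.best_params_, search.best_score_)
```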

Combined Algorithm and Hyper-parameter Selection (CASH): HPO is often used in the context of optimizing the hyper-parameters of a single algorithm. However, the choice of the predictive modeling algorithm (e.g., SVM vs. Random Forests) can be encoded as a numerical hyper-parameter as well: 1 denotes SVMs, 2 denotes RFs, and so on. In addition, one may apply a series of analysis steps prior to modeling, such as preprocessing, imputation of missing values, and feature selection. Accordingly, algorithm selection and hyper-parameter tuning need to be considered for each step of the analysis. When algorithm selection takes place in addition to hyper-parameter tuning, the more specific term used is Combined Algorithm and Hyper-parameter Selection, or CASH.
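The CASH formulation can be sketched with standard scikit-learn tooling: a pipeline whose preprocessing, feature-selection, and modeling steps are all searchable, with the modeling algorithm itself treated as one more categorical choice. This is only an illustration of the idea, not how any particular AutoML platform implements it; X and y are again placeholders.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# A pipeline with three tunable steps; the "model" step is itself a choice.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest()),
    ("model", SVC()),  # placeholder; replaced by the search space below
])

# Each dict is one branch of the configuration space: which modeling algorithm
# to use, plus its hyper-parameters and those of the other steps.
search_space = [
    {"model": [SVC()], "model__C": [0.1, 1, 10], "select__k": [5, 20, 50]},
    {"model": [RandomForestClassifier()], "model__n_estimators": [100, 500],
     "select__k": [5, 20, 50]},
]

cash = GridSearchCV(pipe, search_space, cv=5, scoring="roc_auc")
cash.fit(X, y)
print(cash.best_params_)  # the winning configuration: algorithm + hyper-parameters
```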

Difference between AutoML, CASH, and HPO: Several exceedingly popular and quite successful open-source libraries, like auto-sklearn, TPOT, and Gamma, are in our opinion better described as CASH or HPO libraries rather than AutoML. So, here are the differences:


AutoML aims to automate numerous functionalities of a human analyst, including the production of a predictive model, estimation of its out-of-sample (i.e., on new, unseen samples) predictive performance, visualizations, interpretations, alerts about possible problems in the analysis procedure, decision support for applying the model, monitoring of the model in operation, automatic model retraining, and others. HPO and CASH automate only the search for an optimal predictive model.


Let us illustrate the difference with an example. We consider the data from [1], measuring 503 miRNA expressions in the blood of 48 Alzheimer’s patients (cases) and 22 healthy subjects (controls) matched for age. The task is to learn a classification model of Alzheimer’s disease status (binary classification) and estimate its out-of-sample predictive performance. In addition, the task is to identify the features (blood biomarkers, in this case) that are required for optimal predictions and to understand their role and importance. We apply our AutoML platform, called JADBio, to this problem. You can access the actual results here.

First, JADBio solves a CASH problem to select the algorithms for the preprocessing, feature selection, and modeling steps, and their corresponding hyper-parameter values, that lead to an optimally predictive model. 3017 different combinations of algorithms and hyper-parameter values (called configurations) are tried to identify the winning one. Each configuration is cross-validated to estimate its performance and select the final winner, leading to 90510 trained model instances (30 trainings per configuration, on average). The winning configuration and the analysis report are shown in the figure below:

Figure 1: JADBio’s analysis report on the Leidinger et al. 2013 [1] miRNA data of 48 Alzheimer’s and 22 control subjects. The top row of analysis steps shows the winning ML pipeline (configuration) that leads to the optimal model instance. 90510 model instances were trained (fit) across 3017 configurations within 34 minutes.

Up to now, you could get a similar type of output from any other CASH algorithm such as auto-sklearn. Here is where the differences start. The figure below shows an estimate of the out-of-sample (i.e., on new, unseen samples) performance of the model. In contrast, a CASH library like auto-sklearn does not automate performance estimation for you; the authors suggest that you leave out a hold-out set to estimate performance. JADBio emphasizes accurate performance estimation, even for small sample sizes. It performs all necessary cross-validations internally and automatically; it also adjusts the estimate of performance for the “winner’s curse”, i.e., for the fact that numerous configurations have been tried to select the best. The “unadjusted estimate” is the cross-validated estimate without this adjustment and is an overestimation (see [2] for more information). JADBio also outputs the ROC curve. The circles on the curve indicate different operating points of the model for different probability classification thresholds. The green circle corresponds to the threshold 0.46: classifying as having Alzheimer’s any subject with a model probability higher than 0.46 makes the model operate at that point on the ROC, where it achieves a False Positive Rate of 0.15 and a True Positive Rate of 0.93. The user can click on any circle and get the corresponding trade-off between the model’s False Positive Rate and True Positive Rate. This is one way to support the optimization of the classification threshold to use in the model’s operating environment.
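A minimal sketch of the bootstrap-based idea behind this winner’s-curse adjustment (the BBC-CV method of [2]) is given below. The function name and arguments are ours for illustration only; it assumes you have pooled the out-of-sample predictions of every configuration during cross-validation.

```python
import numpy as np

def bbc_cv_estimate(oos_preds, y, metric, n_boot=1000, seed=0):
    """Illustrative bootstrap bias correction (BBC-CV), in the spirit of [2].

    oos_preds : array (n_samples, n_configs) of pooled out-of-sample predictions,
                one column per configuration tried during cross-validation.
    y         : true labels, shape (n_samples,).
    metric    : callable(y_true, y_pred) -> score, higher is better
                (e.g., sklearn.metrics.roc_auc_score).
    """
    rng = np.random.default_rng(seed)
    n, n_configs = oos_preds.shape
    scores = []
    for _ in range(n_boot):
        boot = rng.integers(0, n, size=n)        # bootstrap sample of rows
        oob = np.setdiff1d(np.arange(n), boot)   # rows left out of the bootstrap
        if oob.size == 0:
            continue
        # Select the "winning" configuration on the bootstrap sample ...
        winner = max(range(n_configs),
                     key=lambda j: metric(y[boot], oos_preds[boot, j]))
        # ... and score that winner on the untouched (out-of-bag) rows.
        scores.append(metric(y[oob], oos_preds[oob, winner]))
    # Averaging over bootstraps gives an estimate that accounts for the fact
    # that many configurations competed for the win.
    return float(np.mean(scores))
```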

Figure 2: JADBio reports the out-of-sample Receiver Operating Characteristic (ROC) curve. It facilitates the selection of the optimal classification threshold by providing estimates for different choices of the threshold. Each circle on the curve corresponds to a different threshold (bottom x-axis) and a different trade-off between False Positive Rate (x-axis) and True Positive Rate (y-axis). For example, classifying any new sample with a probability higher than 0.46 (skyscraper point on the right) as having Alzheimer’s disease leads to the selected green circle, with an FPR of 0.15 and a TPR of 0.93. The confidence intervals on each axis are shown as the green cross.
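To see in code what picking an operating point on the ROC means, here is a small sketch using scikit-learn’s roc_curve function. Here y and probs stand for held-out labels and the model’s predicted probabilities of the Alzheimer’s class, and the 0.15 FPR budget simply mirrors the green circle above.

```python
import numpy as np
from sklearn.metrics import roc_curve

# y:     true binary labels of held-out samples (1 = Alzheimer's, 0 = control)
# probs: the model's predicted probability of the Alzheimer's class for those samples
fpr, tpr, thresholds = roc_curve(y, probs)

# Every threshold defines one operating point (FPR, TPR) on the ROC curve.
# Pick, say, the threshold with the highest TPR among those keeping FPR <= 0.15.
ok = np.where(fpr <= 0.15)[0]
best = ok[np.argmax(tpr[ok])]
print(f"threshold={thresholds[best]:.2f}  FPR={fpr[best]:.2f}  TPR={tpr[best]:.2f}")
```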

JADBio’s configurations also include feature selection. As a result, the user is informed about the minimal set of features that need to be measured to obtain optimal performance. The figure below (panel A) shows the 3 out of the 503 features selected for inclusion in the model. Most CASH libraries currently do not perform simultaneous feature selection and modeling, although this may change. Regardless, JADBio aims not only to select the features but also to explain their role and importance automatically. ICE curves [3] are shown for each feature, trying to explain their role in the predictions. The ICE curves show, on average, the model’s predicted probability of being an Alzheimer’s subject as a function of the feature’s values. In the example, we magnify the ICE plot for the first feature (panel B). We deduce that this feature is a risk factor for the disease: the higher its value, the higher, on average, the probability the model outputs for Alzheimer’s. Notice that the exact prediction depends on all 3 selected features, and it varies depending on the values of the other two. This variance is indicated in the ICE plot by the grey area. To facilitate the interpretation of their importance, the added value of each feature is shown in the Feature Importance panel (C). It shows the expected performance drop when a feature, and only that feature, is removed from the model. Altogether, AutoML provides the practitioner with a set of functionalities to decide which features to measure and how to interpret their role.

Figure 3: A) Signature (selected features) of the final model, B) ICE plot for feature hsa-miR-30d-5p and C) feature importance based on relative performance drop if a feature is excluded from the signature.
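ICE plots like the one in panel (B) can also be produced with off-the-shelf tools; a rough sketch with scikit-learn’s PartialDependenceDisplay is shown below. This is not JADBio’s implementation, and X, y, and the column name are placeholders for the expression matrix, the labels, and a selected miRNA feature.

```python
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import PartialDependenceDisplay

# X: feature matrix (e.g., a pandas DataFrame of miRNA expressions), y: disease labels
model = RandomForestClassifier(random_state=0).fit(X, y)

# kind="both" draws one ICE curve per sample plus their average, showing how the
# predicted probability changes as the chosen feature varies.
PartialDependenceDisplay.from_estimator(
    model, X, features=["hsa-miR-30d-5p"], kind="both"
)
plt.show()
```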

There are other visuals and outputs that facilitate interpretation and decision making. These include automatically identifying possibly mislabeled and “hard-to-predict” samples (currently available), possibly corrupted samples (“dirty data”), explanations of individual predictions (not yet available), and many more. AutoML tackles the problems of interpretation and explanation head-on. There is a shift in perspective: HPO and CASH stop when a model instance is produced. For AutoML, this is just the starting point.

Authors: Ioannis Tsamardinos (University of Crete and JADBio autoML), Zacharias Papadovasilakis (JADBio), Giorgos Papoutsoglou (University of Crete), and Vassilis Christophides (ENSEA, France)

References

[1] P. Leidinger, C. Backes, S. Deutscher, et al., “A blood based 12-miRNA signature of Alzheimer disease patients,” Genome Biology, vol. 14, no. 7, p. R78, 2013.
[2] I. Tsamardinos, E. Greasidou and G. Borboudakis, “Bootstrapping the out-of-sample predictions for efficient and accurate cross-validation,” Machine Learning, vol. 107, no. 12, pp. 1895-1922, 2018. 
[3] C. Molnar, “5.2 Individual Conditional Expectation (ICE),” in Interpretable Machine Learning: A Guide for Making Black Box Models Explainable, 2021.

* https://en.wikipedia.org/wiki/Automated_machine_learning