Common Problems in Hyperparameter Optimization
Originally given as a talk at MLConf NYC 2017. Slides.
SigOpt’s Bayesian Optimization service tunes hyperparameters in machine learning pipelines at Fortune 500 companies and top research labs around the world. In this blog post, we are going to show solutions to some of the most common problems we’ve seen people run into when implementing hyperparameter optimization.
Hyperparameters are your model’s magic numbers — values you set on your model before you train with any data. Examples include the number of trees in a random forest or the number of hidden layers in a deep neural net. Tweaking the values of your hyperparameters by just a small amount can have a huge impact on the performance of your model.
The process by which we search for “the best” hyperparameters is commonly known as hyperparameter optimization. Hyperparameter optimization is a powerful tool for unlocking the maximum potential of your model, but only when it is correctly implemented. Here, we are going to share seven common problems we’ve seen while executing hyperparameter optimization.
#1 Trusting the Defaults
The biggest mistake in hyperparameter optimization is not performing hyperparameter optimization at all. When you don’t explicitly set the hyperparameters on your model, you are implicitly relying on the model developer’s default hyperparameters, and these values may be completely inappropriate for your problem. In an example from the SigOpt blog of building and tuning a TensorFlow ConvNet to classify house-number digits from Google Street View images in the SVHN dataset, we saw a 315% improvement over baseline default hyperparameters that had been hand-optimized for a similar task on the MNIST dataset.
#2 Using the Wrong Metric
In the early days of Bing, Microsoft researchers used the total number of searches as a measurement of the quality of their algorithmic search engine results. After optimizing their search engine for that metric, they found that searches had indeed gone up; however, on manual inspection, the search results had become worse after the optimization. Why? The underlying assumption that if the number of searches increased, then the algorithm was performing better was false — the number of searches had gone up because users needed to search more to find what they were looking for!
Stories like this illustrate a large issue when performing hyperparameter optimization: it is designed to amplify the evaluation criterion that you, the practitioner, have chosen. If your underlying assumptions about your metric are incorrect, hyperparameter optimization has the potential to amplify those incorrect assumptions.
In your model you can balance multiple, competing metrics in evaluation, such as revenue and quality. You may want to consider building a scalar-valued composite of competing metrics, or exploring the space of all possible solutions through a multi-metric approach.
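As a rough illustration of the scalar-valued composite idea, the sketch below blends two competing metrics into one objective. The function name, the weighting scheme, and the example numbers are all hypothetical, and it assumes both metrics have already been scaled to the range [0, 1].

```python
def composite_score(revenue, quality, quality_weight=0.5):
    """Blend two competing metrics into one scalar objective.

    Both inputs are assumed to be pre-scaled to [0, 1] so that neither
    dominates purely because of its units.
    """
    if not 0.0 <= quality_weight <= 1.0:
        raise ValueError("quality_weight must be between 0 and 1")
    return (1.0 - quality_weight) * revenue + quality_weight * quality


# Two hypothetical configurations: A favors revenue, B favors quality.
config_a = {"revenue": 0.9, "quality": 0.4}
config_b = {"revenue": 0.6, "quality": 0.8}

for weight in (0.2, 0.8):
    score_a = composite_score(config_a["revenue"], config_a["quality"], weight)
    score_b = composite_score(config_b["revenue"], config_b["quality"], weight)
    winner = "A" if score_a > score_b else "B"
    print(f"quality_weight={weight}: A={score_a:.2f}, B={score_b:.2f}, winner={winner}")
```

Which configuration wins depends entirely on the weight, which is why the choice of composite deserves as much scrutiny as the choice of each individual metric.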
#3 Overfitting
Overfitting is the issue in which our model performs extremely well during training and optimization, and very poorly out of sample. You can avoid overfitting by using techniques such as cross validation, backtesting, or regularization. When using a technique like k-fold cross validation, your model evaluation for hyperparameter optimization would be the average of k model evaluations from each of the k folds of your data. Techniques like this will help ensure that the metric you optimize generalizes well to unseen data.
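To make the k-fold averaging concrete, here is a minimal sketch using only the Python standard library. The toy model (a one-dimensional ridge regression through the origin), the synthetic data, and all function names are assumptions chosen for illustration; in practice you would swap in your own model and metric.

```python
import random


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous, nearly equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_ridge_slope(xs, ys, penalty):
    """Closed-form 1-D ridge regression through the origin."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + penalty)


def cross_validated_mse(xs, ys, penalty, k=5):
    """Average held-out mean squared error over k folds."""
    fold_errors = []
    for held_out in k_fold_indices(len(xs), k):
        held_out_set = set(held_out)
        train_x = [x for i, x in enumerate(xs) if i not in held_out_set]
        train_y = [y for i, y in enumerate(ys) if i not in held_out_set]
        slope = fit_ridge_slope(train_x, train_y, penalty)
        errors = [(ys[i] - slope * xs[i]) ** 2 for i in held_out]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / k


# Toy data: y = 2x plus noise (seeded so the run is repeatable).
rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(50)]
ys = [2.0 * x + rng.gauss(0.0, 0.3) for x in xs]

# The value handed to the optimizer is the k-fold average, not a training score.
for penalty in (0.0, 1.0, 100.0):
    print(f"penalty={penalty}: cv_mse={cross_validated_mse(xs, ys, penalty):.3f}")
```

The important design point is that `cross_validated_mse` is the only number the optimizer ever sees, so a hyperparameter setting can only look good by doing well on data it was not fit to.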
#4 Too Few Hyperparameters
When building a machine learning pipeline, from raw data to feature extraction to model building, feature extraction will often involve tunable parameters like transformations or learned feature representations. It is important to remember to optimize your feature parameters as well to get maximum performance. In this example from the SigOpt blog we built an xgboost classifier for SVHN digits and showed compelling results for tuning your feature parameters at the same time as your model hyperparameters. It is recommended that you optimize all hyperparameters of your model, including architecture parameters and model parameters, at the same time.
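One way to keep feature parameters and model hyperparameters in the same optimization is to declare them in a single search space and route each value to the stage that uses it. The sketch below shows that idea; the parameter names, ranges, and the `stage.name` naming convention are hypothetical and are not taken from the xgboost example mentioned above.

```python
import random

# Hypothetical joint search space: feature-extraction parameters and model
# hyperparameters sit side by side so they are tuned together.
SEARCH_SPACE = {
    "feature.num_components": ("int", 8, 128),
    "feature.normalize": ("choice", [True, False]),
    "model.max_depth": ("int", 2, 12),
    "model.learning_rate": ("float", 0.01, 0.5),
}


def sample_configuration(space, rng):
    """Draw one value per parameter, uniformly within its declared domain."""
    config = {}
    for name, spec in space.items():
        kind = spec[0]
        if kind == "int":
            config[name] = rng.randint(spec[1], spec[2])
        elif kind == "float":
            config[name] = rng.uniform(spec[1], spec[2])
        elif kind == "choice":
            config[name] = rng.choice(spec[1])
        else:
            raise ValueError(f"unknown parameter kind: {kind}")
    return config


def split_by_stage(config):
    """Route each 'stage.name' parameter to the pipeline stage that uses it."""
    stages = {}
    for full_name, value in config.items():
        stage, name = full_name.split(".", 1)
        stages.setdefault(stage, {})[name] = value
    return stages


rng = random.Random(0)
config = sample_configuration(SEARCH_SPACE, rng)
stages = split_by_stage(config)
print(stages["feature"])  # parameters for feature extraction
print(stages["model"])    # hyperparameters for the classifier
```

Because every suggested configuration covers both stages, the optimizer can discover interactions, such as a feature representation that only pays off with a deeper model.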
#5 Hand Tuning
An optimization method is the strategy by which the next set of hyperparameters is suggested during hyperparameter optimization. There are many different optimization methods to choose from, and they will all have different setup steps, time requirements, and performance outcomes.
When you manually tweak the values of your hyperparameters, you are the optimization method. And you are most likely an inefficient optimization strategy. At the end of the day, humans are usually poor at performing high dimensional, non-convex optimization in their heads. In this example from the SigOpt blog we show that algorithmic optimization can beat out hand tuning for a deep neural net in a number of hours, only requiring knowledge of a bounding box for the hyperparameters representing algorithmic weights or the structure of a neural net. Choosing an algorithmic optimization method will save you time and help you achieve better performance. Below, we go over the differences between some of the most popular methods.
#6 Grid Search
Grid search is a very common and often advocated approach where you lay down a grid over the space of possible hyperparameters and evaluate at each point on the grid; the hyperparameters from the grid that had the best objective value are then used in production. At SigOpt, we are not fans of grid search. The most prominent reason is that grid search suffers from the curse of dimensionality: the number of times you are required to evaluate your model during hyperparameter optimization grows exponentially in the number of parameters. Additionally, grid search is not even guaranteed to find the best solution: the best configuration often falls between grid points and is skipped entirely (aliasing).
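The exponential growth is easy to see with a little arithmetic. The sketch below counts grid evaluations as the number of hyperparameters increases and then enumerates a small grid explicitly; the hyperparameter names and candidate values are made up for illustration.

```python
from itertools import product


def grid_size(points_per_axis, num_hyperparameters):
    """Number of model evaluations a full grid search requires."""
    return points_per_axis ** num_hyperparameters


for dims in (1, 2, 4, 8):
    print(f"{dims} hyperparameters x 10 values each -> {grid_size(10, dims):,} evaluations")

# Enumerating a small, hypothetical grid explicitly.
grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "num_layers": [1, 2, 3],
    "dropout": [0.0, 0.25, 0.5],
}
configurations = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configurations))  # 27 = 3 ** 3
```

Note that even with 27 evaluations, this grid only ever tries three distinct learning rates, so a good value lying between them can never be found.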
#7 Random Search
Random search is as easy to understand and implement as grid search and, in some cases, theoretically more effective. It is performed by evaluating n uniformly random points in the hyperparameter space and selecting the one that produces the best performance.
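A minimal sketch of random search as described above might look like the following. The objective here is a toy stand-in for a real model evaluation, and the function names and bounds are assumptions made for the example.

```python
import random


def random_search(objective, bounds, num_evaluations, seed=None):
    """Evaluate uniformly random points and keep the best one (maximization)."""
    rng = random.Random(seed)
    best_params, best_value = None, float("-inf")
    for _ in range(num_evaluations):
        params = {name: rng.uniform(low, high) for name, (low, high) in bounds.items()}
        value = objective(params)
        if value > best_value:
            best_params, best_value = params, value
    return best_params, best_value


def toy_objective(params):
    """Stand-in for a model evaluation; its maximum (0.0) is at x=0.3, y=0.7."""
    return -((params["x"] - 0.3) ** 2 + (params["y"] - 0.7) ** 2)


bounds = {"x": (0.0, 1.0), "y": (0.0, 1.0)}
best_params, best_value = random_search(toy_objective, bounds, num_evaluations=200, seed=0)
print(best_params, best_value)
```

Unlike a grid, every evaluation here tries a new value of every hyperparameter, and the evaluation budget can be set independently of the number of dimensions.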
The drawback of random search is unnecessarily high variance. The method is, after all, entirely random, and uses no intelligence in selecting which points to try. You are relying on luck to get good results. Intelligent methods like simulated annealing, Bayesian optimization (used by SigOpt), genetic algorithms, convex optimizers, and swarm intelligence methods can produce better performance for your model compared to grid search and random search. Furthermore, these methods require fewer evaluations of your model and have lower variance because they intelligently search the parameter space. Best of all, in the last few years, simple interfaces have been published for many of these methods, meaning that you don’t need a PhD in mathematics to bring the most sophisticated techniques from research into your model-building pipeline.
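The variance of random search can be seen by repeating the same small search with different random seeds and comparing the outcomes. The sketch below does this on a toy two-dimensional objective; the objective, budget, and number of repeats are arbitrary choices for illustration, and it does not compare against any of the intelligent methods listed above.

```python
import random
import statistics


def best_of_random_search(num_evaluations, seed):
    """Best value found by uniform random search on a toy 2-D objective."""
    rng = random.Random(seed)
    best = float("-inf")
    for _ in range(num_evaluations):
        x, y = rng.random(), rng.random()
        # Toy objective with its maximum (0.0) at x=0.3, y=0.7.
        best = max(best, -((x - 0.3) ** 2 + (y - 0.7) ** 2))
    return best


# Repeat the same 20-evaluation search with 30 different seeds.
results = [best_of_random_search(num_evaluations=20, seed=s) for s in range(30)]
print(f"best run:  {max(results):.4f}")
print(f"worst run: {min(results):.4f}")
print(f"std dev:   {statistics.stdev(results):.4f}")
```

The gap between the best and worst runs is the luck the paragraph above refers to: with a fixed budget, two practitioners running the same search can end up with noticeably different models.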
Hyperparameter optimization is an important part of any modern machine learning pipeline. To achieve performance gains, though, it must be implemented correctly. We hope that after reading this blog post you will avoid some of these common mistakes in hyperparameter optimization. Thanks for reading, and happy optimizing!
SigOpt provides a simple API for hyperparameter optimization. With three easy API calls you can integrate a sophisticated ensemble of Bayesian Optimization techniques into your machine learning pipeline. Learn more at sigopt.com/getstarted.