Uncertainty 2: Bayesian Optimization with Uncertainty
This is the second of three blog posts during which we explore the concept of uncertainty - or noise - and its implications for Bayesian optimization. This is part of our series of blog content on research that informs our product and methodologies.
- Uncertainty 1: Modeling with Uncertainty
- Uncertainty 2: Bayesian Optimization with Uncertainty
- Uncertainty 3: Balancing Multiple Metrics with Uncertainty
Intro
In our previous blog post, we discussed the importance of properly accounting for uncertainty when constructing machine learning models for regression/classification. Here we focus on how uncertainty plays an important role in Bayesian optimization. We will show how a naive strategy for optimization (random search) falls short in problems with uncertainty in the function values. Then, we will review the fundamentals of Bayesian optimization, highlighting acquisition functions that could be used in this setting. Finally, we illustrate the differences between these approaches, as well as the impact of noisy observations, for a simple optimization problem. This is our second blog post in the uncertainty series; our third post (released next week) discusses the interpretation of efficient configurations for multiple metrics which are noisy.
Black-box optimization refresher
Black-box optimization has been a critical component of many complex systems in science and engineering; one notable application is hyperparameter tuning, a key component of machine learning (ML) systems. Here, we introduce ideas fundamental to constructing a black-box optimizer which is robust to noise.
Let us begin describing a typical black-box optimization setting:
- We want to sequentially select locations ^{(1)} of the input domain, $x \in \mX$, to find the global optimum of an objective function $f : \mX \to \RR$ of interest, \[ x^\ast = \argmax_x f(x). \]
- We have no structural knowledge of $f$ beyond querying its function values at a given input location $x$.
- The objective function is expensive-to-compute and we want to find $x^\ast$ as quick as possible, i.e., few function evaluations.
When tuning hyperparameters, for example, the objective function $f$ is typically the performance of the ML algorithm, e.g., the validation accuracy of the model. The solution $x^\ast$ is the hyperparameter configuration that results in the best (validation) performance. For a neural network, typical hyperparameters that can be optimized using this technique include the learning rate, the SGD momentum and the number of nodes per layer.
As an illustrative example, let us consider the function depicted in Figure 1. Our goal is to maximize this function in the black-box setting. For ease of exposition, we will say that an optimization was successful if the optimizer returns as answer or final recommendation any location in the region of interest^{(2)}. As shown in the figure, the region of interest is a subset of the input domain corresponding to the highest function values.
Figure 1: An objective function and the set of input locations that correspond to the highest function values, the here-called region of interest.
Black-box optimization with noisy observations
In practice, our ability to measure and/or record observations from real-world problems is limited. In ML, we typically estimate the generalization performance of a model using a development/validation set. In science, the outcome of experiments are often subject to measurement error. In everyday life we typically encounter uncertainty!
To cope with these limitations, we will assume that the observed values of $f$ are corrupted by an unknown observation uncertainty, which we more concisely call noise. Specifically, we denote by $y_i$^{(3)} the observed value associated with the input location $x_i$—this is the value that was actually observed in the presence of uncertainty.
One common strategy when modeling this noisy data is to decompose the observed value as \[ y_i = f(x_i) + \varepsilon. \] We have assumed that the observed value $y_i$ has two additive components: one that represents the true/exact function value $f(x_i)$, and another one that corresponds to the imprecision of the recorded observation. This $\varepsilon$ represents a stochastic noise term—typically, centered at zero, with known noise variance $\sigma^2$ so that $\varepsilon \sim N(0, \sigma^2)$. It is possible to consider different assumptions about the noise but this one is certainly the most used due to implications of the central limit theorem.
Figure 2 shows an example of hundreds of observations sampled from the observation model described above. Notice that the observed values can be significantly different from the true function values. Identifying the region of interest in Figure 2, especially with limited observations, becomes very challenging. High function values can be seen in locations that are not part of the region of interest, which can be misleading for the optimizer. This is an important point to remember: we should not fully trust the observed values.
Figure 2: An objective function and many noisy observations sampled from an additive Gaussian noise model.
Next, we analyze potential strategies for performing black-box optimization in the presence of noise. Our function from Figure 1 will be used in two ways:
- noiseless, all observations are exact $y_i = f(x_i)$, as in Figure (1),
- with noise, where the noise standard deviation $\sigma$ is fixed to 10% of the function range, as in Figure 2.
Random search
A simple optimization strategy is grid search^{(4)}
to randomly select locations in the input space;
the one associated with the highest function value is then returned as the answer.
Why might this very simple strategy work? As long as the region of interest has a nonzero probability of
being sampled, random search (RS) will find the right answer. The question is: how long will it take?
In Figure 3, we show the probability of RS finding the region of interest in our running example for both settings^{(5)} . The difference between the noiseless and the noise case is considerable: after 30 iterations, the probability of finding the region of interest values drops from 53% to 44%; to achieve a 90% success rate, we need at least 91 function evaluations for the noiseless case and more than 1000 for the noise case! Note that this is just a simple one-dimensional problem, in higher dimensions RS will perform much worse in both settings.
Figure 3: Probability of random search finding the optimum of the illustrative function for the noiseless and 10% noise settings.
Random search is simple and easy-to-implement method but it is a less reliable method when the function values are noisy. In the next section, we describe a more sophisticated approach.
Bayesian optimization
Bayesian optimization is an active research topic devoted to efficiently solving black-box optimization (Jones et al., 1998; Brochu et al, 2010; Snoek et al., 2012; Shahriari et al., 2016 Frazier, 2018). The main idea is twofold: first, we create a surrogate model (e.g., Gaussian process GP) for the expensive-to-evaluate objective function; then, we design an acquisition function to intelligently select where to evaluate next.
Both surrogate model and acquisition function have an important role during the optimization. Here, we focus on the latter, but an interested reader might find Rasmussen and Williams (2006) particular useful for a technical discussion about arguably the most common class of surrogate models: GPs. Chapter 2 introduces the model and explains how to do predictions using noisy observations.
Acquisition functions have been extensively studied in the research community (Močkus et al., 1978; Jones, 2001; Srinivas et al., 2010; Hennig and Schuler, 2012; Hernández-Lobato et al., 2014; Wang and Jegelka, 2017). Below we will briefly mention a couple of acquisition functions that could be used for the noisy setting. For a complete review and more in-depth discussion about other acquisition functions, please check Snoek et al. 2012, Shahriari et al. 2016, or Frazier 2018.
Expected improvement (EI): Commonly used in Bayesian optimization, EI is literally the expectation of the improvement, \[ EI(x; \mD) = \mathbb{E} \big[ \max( f(x) - y_{max}, 0) \big] \] where $y_{max}$ is $\max_i y_i$. This well known formula, however, is only valid when we have exact observations of the function values $y_i = f(x_i)$. Thus, despite being widely used in practice, EI is not appropriate for noisy problems because it follows the same WYSIWYG logic of RS. With noise, the best observation $y_{max}$ may not correspond to the best true function values at the observed locations $max_i\,f(x_i)$. This may require changes to the standard EI strategy.
Expected improvement with plug-in^{(6)}: One common approximation is to use EI with a different improvement target. Instead of (the likely noisy) $y_{max}$, one can use the highest expected value of the observed point as the reference for the improvement. This approach works well when the noise level is small but it becomes inaccurate otherwise (the predictions after observing a noisy value might change, even for the already observed locations).
Noisy expected improvement (NEI): Recently, Letham et al. (2018) and Frazier (2018) (Section 5) describe how to properly derive and compute expected improvement for noisy observations. In the former paper the authors refer to this acquisition function as noisy expected improvement. Computing NEI is more involved and computationally more expensive than regular EI but it has the benefit of avoiding repeated sampling at the same locations. This acquisition function can then be safely used in the presence of noisy observations, and it reduces to regular EI when the observations are exact.
Knowledge Gradient (KG): As discussed in our previous blog post, Expected improvement vs knowledge gradient, KG (Frazier et al., 2006) is another popular acquisition function used in Bayesian optimization. The reason why expected EI and KG result in two different strategies is related to what we wish to return as a final solution. EI is more conservative and restricts the final solution to previously evaluated points; KG permits the final solution to be a location $x$ that has not been observed but is believed to be the best performing. The by-product of these assumptions is two distinct strategies for Bayesian optimization.
In contrast to standard EI, knowledge gradient has been known to perform well in the presence of noise for several years now (Frazier et al., 2009). KG properly accounts for the model differences induced by the prospect of observing a noisy observation. It also suggests the highest expected value as the final recommendation, as opposed to the highest observed value as in the noise-free expected improvement setting.
Augmented expected improvement (AEI): We also would like to mention AEI (Huang et al., 2006) as an interesting improvement-based acquisition function. AEI is easier to implement and less computationally intensive than KG or NEI. In Pichenyet al. (2013) the authors have shown that AEI was the most competitive strategy among other heuristics for noisy bayesian optimization.
Entropy search methods: Finally, we conclude this list with a family of methods known by the name entropy search. In all these strategies, the goal is to minimize the uncertainty with respect to either the optimum location $x^{\ast}$ or the highest function value $f(x^{\ast})$. Entropy search (ES) (Hennig and Schuler, 2012) and predictive entropy search (PES) (Hernández-Lobato et al., 2014) are probably the two most famous examples. And, more recently, output predictive entropy search (OPES) (Hoffman and Ghahramani, 2015) and max-value entropy search (MES) (Wang and Jegelka, 2017) were also proposed.
Numerical demonstration
By revisiting our earlier example, we can consider the performance of Bayesian optimization (using augmented expected improvement^{(7)}) in the presence of noise. In the figure below, we show the probability of finding the region of interest as more iterations transpire. In the noiseless setting, the answer is found in less than 20 iterations. In the noise setting, there is more uncertainty about the location of the global optimum due to the noise. However, after 30 iterations, the estimated probability of finding the region of interest is 97%; after 50 it always finds the right answer.
Figure 4: Probability (estimated over 1000 trials) of finding the region of interest for the running example against the number of iterations for Bayesian optimization (BO) and random search (RS) with and without noise in the observations.
Instead of focusing on the region of interest, we can also check the distribution of answers for both settings and algorithms. In the animation below, we show the probability mass for random search (top) and Bayesian optimization (bottom); the graphs on the left correspond to the noiseless setting; on the right, to the noise. The height of the orange bar matches the probabilities in the Figure 4. Bayesian optimization in both cases quickly finds the region of interest.
Figure 5: Probability mass of the final recommendations for Bayesian optimization and random search with exact observations (left) and with noisy observations (right). The orange highlighted bar indicates the location corresponding to the highest values (the region of interest).
Thus far, we have fixed the noive level to 10% of the function range. Now, we analyze the impact of different levels of noise. The figure below shows the performance of both optimizers after 30 iterations for different noise levels. As expected, the performance deteriorates as we increase the noise.
Figure 6: Probability (estimated over 1000 trials) of finding the region of interest after 30 iterations for different noise levels. The noise level on the $x$-axis has been normalized as a percentage of the function range.
Conclusion
In this blog post we presented the importance of accounting for measurement uncertainty during optimization. We discussed how we typically model the imprecisions associated with measurements of the objective function, and how those imprecisions can be deceptive for a naive black-box optimizer. Then, we briefly described the fundamentals of using Bayesian optimization in the presence of noisy observations.
In the following blog post (released next week) we will extend this discussion to multimetric optimization, especially focusing on how can we analyze the outcome of a multimetric optimizer when there is uncertainty on the observed measurements.
Special Thanks
I would like to thank Roman Garnett for our many great conversations about Bayesian decision theory for optimization. His interpretation of the role of noise in Bayesian optimization, and enthusiastic evangelization of the topic, has helped myself and others develop better strategies for sequential decision making.
^{(1)} We can sequentially select multiple locations at a time, if we are able to perform parallel function evaluations.
^{(2)} For interpreting the results, we discretized the input space. This allows us to use a probability mass function over this domain rather than a density function. This choice was mainly for visualization and convenience.
^{(3)} $i$ is a data index, for example, the time associated with the particular pair of input location and observation $(x_i, y_i)$.
^{(4)} Generally speaking, you should never use grid-search.
^{(5)} For the noiseless setting, that is straightforward to compute since the optimum has 2.5% of being sampled. For the noise case, we empirically estimate these probability running 5000 simulations.
^{(6)} The name plug-in EI appears in Picheny et al. 2012 but this strategy is fairly common and it was also mentioned in Brochu et al. 2010.
^{(7)} We ran experiments with other EI variants as well as KG and ES. The results were relatively similar, so for easier interpretability of the figure we chose to only plot AEI, the preferred method in Picheny et al. 2012.