Expected Improvement vs. Knowledge Gradient
In previous blog posts, we discussed the intuition behind using Gaussian processes and compared the difference between using marginal maximum likelihood and kriging variance for estimating the hyperparameters of the covariance kernels. These insights contribute to building better surrogate models of the objective function to guide our search for an optimal solution. In this blog post, we discuss another ingredient of Bayesian optimization: the sampling policy (or acquisition function). The combination of these tools empowers SigOpt to help customers efficiently optimize their critical metrics, including the accuracy of machine learning models, the performance of a reinforcement learning algorithm and the output of complex simulations.
Bayesian optimization in the Markov decision process framework
A simple illustration
Consider a problem where our decision space consists of five options, A through E, each with an outcome. Our goal is to discover the best option given a limited budget to sequentially sample the options. Let us further assume that observation from the samples are noisy.
We have said that state variable encapsulates all the information required to compute the decision. If our state variable only contains the posterior mean of the options, as seen in Figure 1, where do we want to sample next if we want to discover the best option?
In this simple example, we would want to sample option A since it has the highest estimated outcome. But what if we expand our state variable to include the posterior variance of the options, as shown in Figure 2
With the added information, this becomes a more interesting decision. If your answer is still option A (because it has the highest estimated value), then you are exploiting. If you changed to option D (because it has the highest standard deviation), then you are exploring. A good sampling policy needs to carefully balance between exploitation and exploration.
One class of policies involves defining an improvement criterion using the state variable for the next decision. The idea of using an improvement-based approach to help guide the sampling procedure can be traced back to Kushner, 1964, namely the probability of improvement policy.
One of the most well known optimization criteria is the expected improvement (EI), first introduced in Mockus et al., 1978. This idea was combined with the Gaussian processes (GP) model in the efficient global optimization [Jones et al., 1998] and sequential kriging optimization [Huang et al., 2006]. These two algorithms form the foundation for most Bayesian optimization algorithms.
The EI policy has become the de facto policy for Bayesian optimization because the EI function can be easily computed (with an analytical solution) under the Gaussian assumptions.
A close variant of the EI algorithm is the knowledge gradient (KG) policy, which was introduced in Frazier et al., 2006 for finite discrete decision space with normally distributed alternatives. Scott et al., 2010 later extends the KG policy to work with GP models.
The KG function is defined as
The exact computation of the KG function is more costly than EI, due to the maximization within the expected value.
A numerical comparison
Let us also assume the variance of the observation is 0.5 across all options and that the best incumbent outcome is 9.5. Computing the KG and EI values using these numbers, we can compare the results in Figure 3.
It is apparent from this table that the EI policy will select option A as the next sample; the KG policy, option D. The EI policy relies on the best incumbent solution as a guideline for the next sampling point, while the KG policy relies on the posterior distribution of the model for guideline. In a sense, the KG policy trusts the model and believes that option A has a lower marginal value (since it is already the optimal choice so far). The EI policy does not solely rely on the posterior distribution but also on the best incumbent solution. If our best incumbent result is 11.5 instead, the EI policy now suggests that we should be sampling option D next.
This difference in the policies also exists when we use a GP to model the objective function. In Figure 5, we can see that given the same GP model, EI policy wants to direct the next sample near the true optimal region of the function (also that of the posterior mean). On the other hand, KG policy wants to direct the sampling decision to the region where the posterior variance is high. The KG policy is confident in the model in the optimal region, and wants to sample in the region that is more likely to improve upon the current best estimate.