Expected Improvement vs. Knowledge Gradient

In previous blog posts, we discussed the intuition behind using Gaussian processes and compared the difference between using marginal maximum likelihood and kriging variance for estimating the hyperparameters of the covariance kernels. These insights contribute to building better surrogate models of the objective function to guide our search for an optimal solution. In this blog post, we discuss another ingredient of Bayesian optimization: the sampling policy (or acquisition function). The combination of these tools empowers SigOpt to help customers efficiently optimize their critical metrics, including the accuracy of machine learning modelsthe performance of a reinforcement learning algorithm and the output of complex simulations.

Bayesian optimization in the Markov decision process framework

Before we dive into the different types of sampling policies, we would like to familiarize the reader with the notations used in post. We want to look at the optimization loop in the context of a sequential decision making problem or, more precisely, a Markov decision process (MDP). In this setting, the decision maker alternates between choosing decisions ( \(x^n\) ) and observing outcomes from a stochastic process \(\{Y^n\}\), as shown below:

$$ \text{choose } x^n \rightarrow \text{observe } Y^{n+1} \rightarrow \text{choose } x^{n+1} \rightarrow \text{observe } Y^{n+2} \cdots $$

We define the state of the world \(S^n\) as all the relevant information sufficient to make a decision regarding \(x^{n}\); in the standard definition, it is common to restrict the state to only include necessary information, so as to have minimal dimension. For this post, our state will always include the current estimate of the objective function, and certain sampling policies requiring additional information (such as the best incumbent observation).

For Bayesian optimization, the decisions ( \(x^n\) ) are the suggestions (or the query points). For each decision, we receive a random/noisy observation \(Y_{x^n}^{n+1}\) from the customer(1); this is the exogenous stochastic process. Using our knowledge of the state, the decision, and the observation, we can evolve to another state via a transition function S n + 1 = g ( S n , x n , Y n + 1 ) S^{n+1} = g( S^n, x^n, Y^{n+1} ) .

Because of the random nature of the observations, the idea of the optimal decision does not apply since decisions are influenced by the random outcomes. Instead, the goal is to search for a policy that is optimal in expectation (or over some risk measure)(2). A policy is a decision rule consisting of a function mapping a state to a decision for each iteration n.

A simple illustration

Consider a problem where our decision space consists of five options, A through E, each with an outcome. Our goal is to discover the best option given a limited budget to sequentially sample the options. Let us further assume that observation from the samples are noisy.

We have said that state variable encapsulates all the information required to compute the decision. If our state variable only contains the posterior mean of the options, as seen in Figure 1, where do we want to sample next if we want to discover the best option?

fig 1
Figure 1. The posterior means of the five options.

In this simple example, we would want to sample option A since it has the highest estimated outcome. But what if we expand our state variable to include the posterior variance of the options, as shown in Figure 2

fig 2
Figure 2. The full posterior distributions of the five options.

With the added information, this becomes a more interesting decision. If your answer is still option A (because it has the highest estimated value), then you are exploiting. If you changed to option D (because it has the highest standard deviation), then you are exploring. A good sampling policy needs to carefully balance between exploitation and exploration.

One class of policies involves defining an improvement criterion using the state variable for the next decision. The idea of using an improvement-based approach to help guide the sampling procedure can be traced back to Kushner, 1964, namely the probability of improvement policy.

Expected improvement

One of the most well known optimization criteria is the expected improvement (EI), first introduced in Mockus et al., 1978. This idea was combined with the Gaussian processes (GP) model in the efficient global optimization [Jones et al., 1998] and sequential kriging optimization [Huang et al., 2006]. These two algorithms form the foundation for most Bayesian optimization algorithms.

Let \(f^n(x)\) be the estimated mean of the objective function evaluate at \(x\) after \(n\) samples and \(X\) be the decision space of the problem. The \(EI\) function is defined as

E I n ( x ) = E [ max ( f n + 1 ( x ) y best , 0 ) | S n , x n = x ] , EI^n(x) = \textbf E \Big[ \max (f^{n+1} (x) − y_{\text{best}}, 0 ) | S^n, x^n = x \Big],

where the expectation is taken with respect to the random observation from decision \(x^n\). The expected improvement policy is to select the maximal argument:

X E I , n = arg max x X E I n ( x ) . X^{EI,n} = \arg \max_{x \in \mathcal X} EI^n(x).

The EI policy has become the de facto policy for Bayesian optimization because the EI function can be easily computed (with an analytical solution) under the Gaussian assumptions.

Knowledge gradient

A close variant of the EI algorithm is the knowledge gradient (KG) policy, which was introduced in Frazier et al., 2006 for finite discrete decision space with normally distributed alternatives. Scott et al., 2010 later extends the KG policy to work with GP models.

The KG function is defined as

K G n ( x ) = E [ max x X f n + 1 ( x ) max x X f n ( x ) | S n , x n = x ] , KG^n (x) = \textbf E \Big[ \max_{x’ \in \mathcal X} f^{n+1} (x’) − \max_{x’ \in \mathcal X} f^n (x’) | S^n, x^n = x \Big],

where max x X f ( x ) \max_{x \in \mathcal X} f(x) is the optimal value of the posterior mean. Analogously, the KG policy is then

X K G , n = arg max x X K G n ( x ) . X^{KG,n} = \arg \max_{x \in \mathcal X} KG^n(x).

The exact computation of the KG function is more costly than EI, due to the maximization within the expected value.

A numerical comparison

The EI and KG functions appear to be almost identical at a glance. In fact, these two are equivalent when the observations are deterministic(3). In the presence of noisy observations, these two policies behave very differently, but in what way? Let us revisit the example in Figure 2; we associate some numbers with the five options below in Table 1.

Table 1. Posterior means and variances of the five options

Let us also assume the variance of the observation is 0.5 across all options and that the best incumbent outcome is 9.5. Computing the KG and EI values using these numbers, we can compare the results in Figure 3.

fig 3
Figure 3. The KG and EI values for the five options (in log scale) when our best incumbent outcome is 9.5. The KG policy prefers option D and the EI policy prefers option A.

It is apparent from this table that the EI policy will select option A as the next sample; the KG policy, option D. The EI policy relies on the best incumbent solution as a guideline for the next sampling point, while the KG policy relies on the posterior distribution of the model for guideline. In a sense, the KG policy trusts the model and believes that option A has a lower marginal value (since it is already the optimal choice so far). The EI policy does not solely rely on the posterior distribution but also on the best incumbent solution. If our best incumbent result is 11.5 instead, the EI policy now suggests that we should be sampling option D next.

fig 4
Figure 4. The KG and EI values for the five options (in log scale) when our best incumbent outcome is 11.5. Both KG and EI policies prefers option D now.

This difference in the policies also exists when we use a GP to model the objective function. In Figure 5, we can see that given the same GP model, EI policy wants to direct the next sample near the true optimal region of the function (also that of the posterior mean). On the other hand, KG policy wants to direct the sampling decision to the region where the posterior variance is high. The KG policy is confident in the model in the optimal region, and wants to sample in the region that is more likely to improve upon the current best estimate.

fig 5
Figure 5. Comparing KG and EI policies for an unknown continuous function (in dotted orange) using GP as the surrogate model. The EI policy prefers to sample around region around 0.5 and the KG policy prefers to sample around the region around 1.5.

Conclusion

We outlined two improvement-based policies and examined the difference in their behaviors. While the EI policy is the de facto sampling policy in most Bayesian optimization algorithms, users who are confident in their statistical model might consider the KG policy for a greedier approach(4).

(1) The reason that we index the random outcome \(Y\) by the superscript \(n+1\) here is to create a consistent way of denoting whether variables are deterministic or random at iteration \(n\). More formally, the sigma algebra \(\mathcal F^n\) is generated by everything up to iteration \(n\), i.e. \(\mathcal F^n = \sigma (S^m, Y^m)_{m \leq n}\) We can define the filtration \(\{ \mathcal F^n \}_{n \geq 0}\), such that \(\mathcal F^0 \subset \mathcal F^1 \subset \cdots \subset \mathcal F\).

(2) We have used the terms sampling policy and acquisition function interchangeably thus far; however, it should be noted that they are two different concepts. The EI policy should really be called the max EI policy, since it takes the maximal argument of the EI function.

(3) The curious reader can find the derivation of this statement in Scott et al., 2010.

(4)KG is greedier in the sense that it believes in the posterior model and thus spends less time validating the posterior maximal region if the variance is low.

Harvey is interested in stochastic optimization and machine learning. He obtained his Ph.D. in electrical engineering from Princeton University, where his doctoral studies focused on approximate dynamic programming, stochastic optimization, and optimal learning; with applications in managing grid-level battery storage. He also holds a B.S. in electrical engineering from University of Texas at Austin.
Harvey Cheng, PhD