# Clustering Applied to Acquisition Functions

Bayesian optimization of some function $f$ is powered by two key components:

- a probabilistic surrogate model which encapsulates the understanding of $f$ given the data that has been observed, and
- an acquisition function, which assigns a sense of benefit to sampling at a future location and which is maximized to identify subsequent suggested evaluation points.

Many different surrogate models exist
(including Gaussian processes,
random forests,
kernel density estimation,
neural networks,
among others)
and many different acquisition functions exist
(including probability of improvement,
expected improvement,
knowledge gradient,
entropy search,
upper confidence bound,
among others).

This blog post discusses recent advances in interpreting the upper confidence bound acquisition function
(abbreviated below as UCB) from a geometric perspective. In particular, this post is an abbreviated
version of an article I will be
presenting
on Tuesday, April 17 at
ICASSP 2018 in Calgary. I hope that anyone attending this conference
who is interested will join the Probabilistic Models session (MLSP-P2) at 1:30pm.

### Upper Confidence Bound

The Upper Confidence Bound^{(1)} (UCB) acquisition function balances
exploration of the function $f$ and exploitation of the current state of the surrogate model.
In the simplest case, we assume that the surrogate mdoel is a Gaussian process, meaning that the UCB
can be written as
\[
a(\xx) = -(\mu(\xx) - \kappa\sigma(\xx))
\]
where $\mu(\xx)$ and $\sigma^2(\xx)$ are the mean and variance of the surrogate model. The value
$\kappa$ is a free parameter which can be chosen so as to enforce a specific balance between
exploration (high $\sigma$) and exploitation (low $\mu$). Assuming the current data $\cD_n$ is
used to generate the surrogate model and, consequently, the UCB function $a(\xx;\cD_n)$,
the $n+1$th point to evaluate would be determined by
\[
\xx_{n+1} = \argmax_{\xx} a(\xx;\cD_n).
\]

At this point, it helps to consider a specific example. Suppose that we are given \[ f(\xx) = 1 - \sqrt{x_1 x_2} \sin (x_1) \sin (x_2) \] where $\xx = \begin{pmatrix}x_1\\x_2\end{pmatrix}$ is two-dimensional vector. A contour plot is presented in the figure below.

**Figure 1**:
Contour plot of a moderately interesting function.

After 40 observed results (red dots) are present in $\cD_{40}$, the surrogate model mean $\mu$, standard deviation $\sigma$, and UCB $a$ are computed and displayed in the figure below. The surrogate model and choice of $\kappa=1$ were designed to produce easily interpreted pictures; the details of surrogate development are beyond the scope of this article and likely involve likelihood computation.

### Maximizing the Upper Confidence Bound

As seen in the UCB graph in Figure 2, the acquisition function is
multimodal
which will require some form of
nonconvex optimization to
maximize. Such a process is expensive and potentially complicated,
often without simple guarantees of being able to easily
identify the best UCB point (the gray
star).^{(2)}

Of particular note in this acquisition function maximization scenario is the impact of the "initial guess",
a commonly required component of iterative algorithms.^{(3)} This
issue has always existed,
and appears here when convex optimization tools are used to try to identify the maximum of $a$. The figure below
shows the solutions that two standard optimizers,
Powell's method and a
modification
of Newton's method,
will find, as a function of the
initial guess (for plotting cleanliness, solutions owned by less than 2% of the domain were not provided their own
regions).

**Figure 3**:
*(left)* Contour plot of the acquisition function $a$, as appearing in Figure 2, with red stars representing
solutions found by initial guesses distributed throughout the domain.
*(center)* The 8 Powell optimizer solutions (stars) to which initial guesses in the domain converge and
corresponding regions containing those initial guesses.
*(right)* Matching results for the Newton optimizer.

For both of these methods which are dependent on an initial guess, there is only roughly a 20% chance that a randomly chosen initial guess in the domain will converge to the true solution (in gray). In higher dimensions, this will likely be even worse. The goal of our article is to attempt to improve the prospects of this optimization process by providing better initial guesses.

### Clustering to Guide Initial Guesses

In our article, we consider a geometric interpretation of UCB, by studying the interaction of prospective points in a $\mu$-$\sigma$ 2D plane. To understand the motivation for this, it can help to reorganize the UCB definition into the following form \[ \sigma(\xx) = \frac{1}{\kappa}\mu(\xx) + \frac{a(\xx)}{\kappa}. \] This should bring to mind the classic "slope-intercept" form of a line, $y=mx+b$, which is apparent in the figure below where a UCB contour plot is provided as a function of $\mu$ and $\sigma$.

**Figure 4**:
A $\mu$-$\sigma$ study of UCB. The contour plot generates lines as level sets because of the form of
UCB. Additionally, the surrogate model is sampled at 10,000 uniformly distributed random points in the domain
and the resulting $\mu$-$\sigma$ values are marked with a black dot to help indicate the region of
feasible outcomes for the given surrogate.

Obviously, applying a Powell or BFGS solver on 10000 initial guesses is quite costly and possibly more than what is required (especially for the 2D problem we are studying). Our proposed strategy is to use the physical representation shown in Figure 4 to select a subset of points in the physical domain to be used at initial guesses in the optimization tool. In particular, we identify clusters of points in $\mu$-$\sigma$ space to effectively sample regions of the physical domain which are diverse in $\mu$-$\sigma$ space. These clusters are identified through Gaussian mixture modeling. The figure below shows the outcome from clustering on the 10000 points.

**Figure 5**:
*(left)* 4 clusters have been identified from the data sampled in Figure 4. The Gaussian cluster mean
is identified with a square, and the max $a$ value for that cluster is identified with a circle.
*(right)* The physical domain is divided according to association with each of the 4 colors
(with matching color).
The input data point nearest to each of the means of the clusters is plotted with
a circle of the corresponding color. The result of a BFGS optimization at each circle is represented
with a star of the same color.

As suggested by Figure 5, even if the clusters in $\mu$-$\sigma$ space are contiguous, the associated $\xx$ locations in physical space need not be contiguous. This yields a degree of freedom regarding how to use the information in the clusters to effectively initialize the optimization. In the figure below we present two possible strategies which are discussed in greater detail in our article.

**Figure 6**:
Different initial guesses derived from the clustering present in Figure 5.
*(left)* The input data point with $\mu$-$\sigma$ value
nearest to each of the means of the clusters is represented with a square.
*(right)* The max $a$ point in each cluster is represented with a circle.

### Conclusion

We have revisited one of the traditional acquisition functions of Bayesian optimization, the upper confidence bound, and looked to a slightly different interpretation to better inform optimization of it. In our article we present a number of examples of this clustering-guided UCB method being applied to various optimization problems. If you are interested in this topic and happen to be attending ICASSP 2018, please stop by our poster session (Tuesday, April 17).

^{(1)} The term "Upper Confidence Bound" was developed in the context of multi-armed bandits, with the goal of maximizing a given function $f$. In Bayesian optimization, the stated goal is often the *minimization* of $f$, and the analogous name for such an acquisition function would be "Lower Confidence Bound". For consistency, the name UCB is preserved, but the equations have been manipulated to minimize a function.

^{(2)} This is in contrast to convex optimization, where strong theory regarding convergence rate has existed for quite some time. The humor of requiring nonconvex optimization (to maximize $a$) so as to empower nonconvex optimization (to optimize $f$) is not lost on Bayesian optimization researchers. An important distinction is that $a$ is not black-box; rather, it has a known form as well as (often) known gradients and a low evaluation cost (relative to $f$), allowing for more powerful tools to be used to maximize $a$.

^{(3)} A great deal of theory exists on the convergence of algorithms such as stochastic gradient descent for optimizing nonconvex objectives. Such results can show asymptotic convergence (from any initial location) to the global optimum for certain classes of functions. For our demonstrative purposes regarding the complexity of optimizing multimodal functions, we only use convex optimization tools for maximizing $a$ in this post.