Covariance Kernels for Avoiding Boundaries

Here at SigOpt, Gaussian processes and reproducing kernel Hilbert spaces (RKHS) are important components of our Bayesian optimization methodology. Our research into this topic often exists at the intersection of approximation theory and spatial statistics. We recently attended the SIAM Computational Science and Engineering 2017 meeting in Atlanta to continue learning about outstanding accomplishments in these and other exciting topics in computation.

In addition to participating in the career fair, we also organized a minisymposium to bring together experts in Gaussian processes and RKHS and discuss recent advances that are relevant to Bayesian optimization. First, we would like to extend our thanks to Jian WuJie Chen and Greg Fasshauer for contributing talks to the session; their presentations provided a diverse set of insights into new strategies for improving the viability and performance of RKHS theory in Bayesian optimization. This post serves to discuss our ongoing collaboration with Dr. Fasshauer and explore the topic of his talk at a level suitable for a broad audience. For a more thorough discussion of the interplay between Green’s functions and covariance kernels, we will be releasing an In Depth blog post soon.

Covariance Kernels

A fundamental component of a Gaussian process is its covariance kernel, the mechanism by which information at different locations is believed to jointly vary. This defines the degree to which knowledge at one location provides knowledge at another; it also provides the key tool by which existing function observations can be used to predict function values at unobserved locations.

As such, the choice of covariance kernel can play a significant role in the quality of the kernel-based approximation to the function and, in turn, the quality of any Bayesian optimization in which that approximation is involved. We have previously discussed the topic of optimally choosing covariance kernels from a parametrized family using maximum likelihood estimation as well as the value of alternate parametrization metrics. In this post we explore the potential for a new family of kernels, with specially designed properties, to yield benefits during Bayesian optimization.

It should be noted that the role of positive definite (covariance) kernels extends beyond our application here in spatial statistics. Their role in support vector machines, studying text data, high dimensional integration, density estimation and other fields demonstrates their broad value. In this discussion, however, we will be focusing only on their role in scattered data approximation.

Isotropic and Anisotropic Radial Covariance Kernels

Before we talk about new kernels, it would be prudent to review kernels already common in the literature, especially as this affords us an opportunity to look at some pictures. By far, the most popular covariance kernels across disciplines are radial (thus the term radial basis function), and among those the most popular is the Gaussian kernel (also referred to, in a persistent bit of misnomer, as simply the RBF kernel). The images below show examples of these radial Gaussian (a.k.a. squared exponential) kernels with different shape parameters.

image
Figure 1: Isotropic radial Gaussian covariance kernels with different shape parameters. (left) A smaller shape parameter yields a larger length scale and interaction between data at longer distances. (right) A larger shape parameter yields a shorter length scale and and thus more localized behavior.

These are examples of radial kernels (also called isotropic), but not all kernels are radial. Radial kernels can be converted from their isotropic form (where all dimensions have the same bandwidth/length scale) to their anisotropic form (where dimensions can act differently, also know more generally as stationary or, in some communities, automatic relevance determination) by redefining the sense of distance. Below are two anisotropic variants on the Gaussian kernels above with a skewed sense of distance.

image
Figure 2: Anisotropic Gaussian covariance kernels. (left) The x dimension has a wider length scale than the y dimension. (right) A skewed sense of distance yields a longer length scale in the x + y direction than in the x - y direction.

Nonstationary Iterated Brownian Bridge Covariance Kernels

All of the kernels described above, and actually most common covariance kernels, have the property that the distance between observations defines the covariance (thus the term radial). The reasons for their popularity are both practical (computational benefits exist for radial kernels) and historical (this article reviews much of this history regarding completely monotone functions and interpretations in the context of the radial Fourier transform).

Nonradial kernels have, in the spatial statistics literature, enjoyed less popularity. Part of this is for theoretical reasons: kernels which are a function of only a distance quantity can then be studied as functions of one variable. In the recent past, however, there has been a renewed emphasis on the potential benefits of using nonradial kernels, partly driven by the numerical/functional analysis community (this text provides some context).

The potential benefits of using nonradial kernels include the opportunity to model functions which vary in a nonstationary way; because these kernels are not translation invariant they have the ability to recognize variations in the function on different scales in different parts of the domain. See, e.g., this article for more discussion on the topic. One point of growth has been in the use of kernels defined through Green’s functions solutions to differential operators, a topic normally restricted to the solution of differential equations but which serves as a previously untapped source of positive definite kernels. Early papers on this topic include these two articles. An upcoming blog post will discuss this topic more thoroughly.

To put a tangible face to this topic, we can look at examples of one dimensional nonradial kernels belonging to the iterated Brownian bridge (IBB) family of kernels (introduced first in this article, and later in this text).

image
Figure 3: Several iterated Brownian bridge kernels, centered throughout the domain, all of which are zero on the boundaries. (left) Certain parameters can yield nonsmooth kernels. (right) Other parameters can yield kernels with greater smoothness and locality.Type image caption here (optional)

Those readers that are familiar with the Matérn class of positive definite kernels may see some similarities to the kernels in the figure above. This is not a coincidence: the full space Matérn kernels share the same differential operator, but the kernels above are only defined on a compact domain with specific boundary conditions. Originally, in fact, these IBB kernels were given the title “Compact Matérn” for this reason. A visual comparison between the full space Matérn and IBB kernels is shown in the figure below.

image
Figure 4: IBB kernels(0) in blue and standard Matérn kernels in dashed red; they behave similarly near the middle of the domain. (left) Nonsmooth kernels.  (right) Kernels with continuous derivatives.

In the setting as I have described it so far, these kernels only exist in 1D; this can be extended to higher dimensions by considering product forms of these kernels. The figure below demonstrates this in 2 dimensions (beyond 2D it is rather difficult to visualize). In a future post we will more thoroughly study the performance of these tensor product kernels in higher dimensional spaces, where the pain of searching around the boundary is amplified.

image
Figure 5: Tensor products of these IBB kernels can be used to define covariances in more than one dimension. (left) A nonsmooth kernel. (right) A kernel with continuous derivatives.Type image caption here (optional)

IBB Kernels in Bayesian Optimization

These kernels were originally invented with the goal of using them in a collocation solution of boundary value problems: by automatically enforcing boundary conditions, these kernels are an attractive basis for which to approximate a solution. Chapter 20 of this book provides an example of this strategy.

Recently, we decided that these IBB kernels could potentially be adapted for use in Bayesian optimization. In particular, one common complaint about some Bayesian optimization methods is their propensity to sample near the boundary; this problem was addressed recently in this article which describes using a quadratic mean to dissuade samples from the boundary. Our strategy here is to use these IBB kernels to enforce behavior near the boundary, and thus decrease the desire to sample around the boundary until the interior has been sufficiently explored.

In the figures below, we demonstrate the impact of this choice in comparison to a standard Matérn kernel with the same smoothness and shape parameter; a deterministic linear mean is assumed in these kriging models. Eight initialization points, randomly selected from the domain, are used to seed the optimization. These examples involve fixed covariances over the course of the optimization.(1)

image
Figure 6a: The evolution of the optimization using a radial Matérn kernel may result in an inefficient sampling in the corners and around the boundary.
image
Figure 6b: By using the iterated Brownian bridge kernel with the same smoothness and locality properties, there is less desire to sample on the boundary.

In the example above, the function actually satisfies the boundary conditions (zero values on the boundary), so one could (legitimately) argue that this superior behavior for the iterated Brownian bridge kernel is to be expected.(2) Below we show a function with arbitrary boundary behavior and multiple local optima and again compare the behavior.

image
Figure 7a: The radial kernel samples somewhat inefficiently for the multimodal nature of this function.
image
Figure 7b: The iterated Brownian bridge kernel is better able to sample near the optimum by avoiding sampling near the boundary.

Conclusion

A common complaint of Bayesian optimization methods using Gaussian process models is their propensity to sample around the boundary and in the corners of the solution domain. Here we introduced a strategy using special positive definite kernels to sample away from the boundary by enforcing specific behavior at the boundary. In a future blog post we will delve into the derivation of these kernels and explore more complicated optimization problems.

Acknowledgements

Special thanks, again, to Greg Fasshauer for his continued collaboration on this project. Also, thanks to both Jian Wu and Jie Chen for their valued contributions to our SIAM minisymposium.

(0) The IBB kernels in this picture have been rescaled to match the scale of the Matérn kernels.

(1) This strategy is in contrast to the standard strategy of refitting the covariances using the current data but will help to visualize the described behavior. Also, the EI values can drop too low to be resolved in the plot, although the optimization is still conducted as is standard.

(2) There is an argument to be made here, which is often made in the approximation theory community, that using the correct interpolating basis is the appropriate behavior. However, in a truly black box setting, such knowledge is impractical.

Mike studies mathematical and statistical tools for interpolation and prediction. Prior to joining SigOpt, he spent time in the math and computer science division at Argonne National Laboratory and was a visiting assistant professor at the University of Colorado-Denver where he co-wrote a text on kernel-based approximation. Mike holds a PhD and MS in Applied Mathematics from Cornell and a BS in Applied Mathematics from Illinois Institute of Technology.
Mike McCourt, PhD