Breaking Free of the Grid

SigOpt is a software-as-a-service optimization platform that accelerates modeling and industrial processes. Scientists across industries face challenging optimization problems: data scientists building predictive models to classify images, financial analysts building trading models, and industrial engineers designing complex systems like airplane wings all need to optimize the performance of a model or process that has numerous input parameters and is expensive to test. In this post, we review the history of some of the tools implemented within SigOpt, and then discuss the original solution to this black box optimization problem, known as full factorial experimentation or grid search. Finally, we compare both naive and intelligent grid-based strategies to the latest advances in Bayesian optimization, delivered via the SigOpt platform.

The Mathematical and Statistical Foundations of SigOpt

The recent advances that power SigOpt build upon generations of prior research: most substantially, reproducing kernel Hilbert space theory began more than 100 years ago, although one could argue that the theory was not fully developed until roughly 60 years ago (see also this classic text, starting at chapter III.4).  The separate development of the topic within spatial statistics can be dated back precisely to 1951, though the formal theory involving Gaussian processes was not developed until at least the mid-1960s; the connection between the two fields was later realized in 1970.  Other parts of our blog have discussed interpreting Gaussian processes (here, here and here) and, for the truly interested reader, chapter 1 of my book provides a more comprehensive retrospective on the development of this theory.

Given this long mathematical and statistical history, it may be somewhat surprising that the earliest well-known instances of the use of Gaussian processes in a Bayesian optimization framework were documented within the USSR by this early 1970s conference report.  In fact, the real adoption of Gaussian process-backed optimization (also called kriging metamodel optimization) seems to have begun only after the 1998 development of expected improvement (discussed more simply in slides I picked up from grad school).  Since then, there has been an explosion of research (including outside Gaussian processes), which SigOpt continually surveys to update our own software and provide the best possible experience for our customers in an easy-to-use API.  Our posters at ICML 2016 workshops discuss our hierarchical comparison methodology for simultaneously analyzing multiple optimization techniques within our quality assurance framework.

Grid search as an optimization tool

Of course, prior to these new developments in efficient optimization, researchers still had black box functions that needed to be optimized---planes needed to fly before computers even existed.  This raises the interesting question of “How were these metrics maximized before Bayesian optimization existed?”  In a word, the answer is slowly.  Historically, experts would spend significant time analyzing their models so as to manually identify, for example, the financial investment strategy predicted to maximize their portfolio’s value.  Each time the strategy changed, the expert was forced to reinvest this time, which was a waste of that expertise.  This 2004 article detailed the cost of hand-tuning in the context of robotics, and this 2014 retrospective recalls the earlier belief that hand-tuning was required for good numerical linear algebra computational performance.

More formulaic methods for global black box optimization were developed prior to sequential kriging optimization to help remove this burden from the experts.  The simplest and most logical among these methods is grid search: try small changes in each of the parameters of the model as defined using a regular grid.  For a financial investment scenario, one could imagine a very simple portfolio where 100 units of asset A are purchased and x units of asset B are sold to appropriately hedge the position.  Using grid search to determine the value of x which maximized the worst case value of the portfolio would consist of defining a domain for x, a desired number of points T at which to compute the expected value, and then computing the expected value at T equally spaced points within that domain, as shown in the figure below.

Figure 1: (left) The median and inner 90th percentile range for the value of the portfolio.  (right) Grid search using T=10 observations to maximize the worst case behavior of the portfolio.
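The 1D grid search just described can be sketched in a few lines of Python. The `worst_case_value` function here is a hypothetical stand-in for the portfolio metric (in practice, each evaluation would come from an expert or a model, which is what makes it expensive); its peak at x = 37 is chosen arbitrarily for illustration:

```python
def grid_search_1d(f, lower, upper, T=10):
    """Evaluate f at T equally spaced points and return the best (x, f(x)) pair."""
    xs = [lower + i * (upper - lower) / (T - 1) for i in range(T)]
    return max(((x, f(x)) for x in xs), key=lambda pair: pair[1])

# Hypothetical stand-in for the worst case portfolio value as a function of
# x, the number of units of asset B sold (peak placed arbitrarily at x = 37).
def worst_case_value(x):
    return -(x - 37.0) ** 2

x_best, v_best = grid_search_1d(worst_case_value, 0.0, 100.0, T=10)
```

With T=10 points on [0, 100], the grid point nearest the true maximizer is x ≈ 33.3, which illustrates both the appeal and the resolution limit of a fixed grid.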

Grid search does an acceptable job identifying the maximum of the red curve using T=10 investment predictions (each of which costs some amount of time for an expert or model to generate).  But, unfortunately, as more dimensions (or parameters, or degrees of freedom) are introduced to model a more complex process, the resulting metric becomes too complicated to feasibly search with a grid.  Sampling with a grid using T points in each of d dimensions requires T^d total points, which quickly becomes impractical, as suggested in the figure below.

Figure 2: (from left to right) For 1, 2 and 3 dimensions, a grid search with T=10 points per dimension would require 10, 100 and 1000 total metric evaluations.  Quickly, this becomes too costly for all but the simplest of underlying models.
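This exponential growth is easy to verify directly: `itertools.product` builds the full tensor product grid, and its size is exactly T**d:

```python
from itertools import product

T = 10
sizes = {}
for d in (1, 2, 3):
    # Full tensor product grid: every combination of T tick indices per dimension.
    grid = list(product(range(T), repeat=d))
    sizes[d] = len(grid)

print(sizes)  # {1: 10, 2: 100, 3: 1000}
```

At d=10 dimensions this would already require ten billion evaluations, which is why the modifications below matter.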

Modifications to grid search

We have identified grid search as a classical strategy for maximizing black box functions, but also suggested that for larger problems it is unreasonably costly.  Below, we discuss ways in which grid search can be changed to allow it to work in higher dimensional settings.

Random subset grid search

The simplest way to simulate a 10000 point grid search with only 100 samples would be to create the full 10000 point grid and sample only a subset of those points.  To do so, we randomly sample, without replacement, 100 of the 10000 points and call the result a random subset grid search.  This is similar to a standard random search(1), except that rather than sampling from the continuous uniform distribution, we sample from the discrete distribution defined by the full grid.  The figure below shows the benefit of randomly sampling from a dense grid when the dense grid contains too many points to be feasibly computed.

Figure 3: (left) The full grid search with T=10 fills very little of the space in the first 100 points. (right) 100 points are randomly chosen from the grid, which provides a better spread of values throughout the domain.
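A minimal sketch of random subset grid search, here on the unit hypercube with T=10 ticks per dimension in d=4 dimensions (matching the 10000 point grids used in the experiments below); the function and parameter names are our own for illustration:

```python
import random
from itertools import product

def random_subset_grid(T, d, n, seed=0):
    """Sample n points, without replacement, from the full T**d grid on [0, 1]^d."""
    ticks = [i / (T - 1) for i in range(T)]     # T equally spaced values per dimension
    full_grid = list(product(ticks, repeat=d))  # all T**d grid points
    return random.Random(seed).sample(full_grid, n)

points = random_subset_grid(T=10, d=4, n=100)
```

Note that this sketch materializes the full grid before sampling, which is fine at 10000 points; for much larger grids one would instead sample grid indices directly.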

Iterative 1D grid slices

This concept fixes all dimensions except one, and then conducts a 1D grid search in that dimension.  The best result for that dimension is then fixed and a 1D grid search is conducted over a different dimension.  We continue to cycle through 1D grid searches one dimension at a time until 100 function evaluations have been exhausted.  The figure below shows one possible version of this process.

Figure 4: (left) For this example function, the maximum in dark blue is not near the first 100 points of the full grid search. (right) 1D slices through the domain provide a faster route to the maximum of this sample function.(2)

There are some design decisions which can have an important (but unpredictable) impact on the outcome of this marginal optimization strategy.  The first decision that should be made is in what order the dimensions should be marginally optimized, or even if they should be optimized in a fixed (rather than adaptive, or even random) order.  The next decision involves the T value to be used for each dimension’s 1D grid search, and whether that should be constant across all dimensions.  Our experiments below involve T=10 in all dimensions and the initial values for all dimensions are chosen randomly from the 10000 point grid.  The order through the dimensions is also chosen randomly.
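Under the design choices just described (T=10 per dimension, a random starting point on the grid, and a random order through the dimensions), the iterative 1D slice strategy can be sketched as below; the separable test function is made up for illustration, and per footnote 2 we do not reuse previously computed values:

```python
import random

def iterative_1d_slices(f, d, T=10, budget=100, seed=0):
    """Cycle 1D grid searches through the dimensions until budget evaluations are used."""
    rng = random.Random(seed)
    ticks = [i / (T - 1) for i in range(T)]
    x = [rng.choice(ticks) for _ in range(d)]  # random starting point on the grid
    order = rng.sample(range(d), d)            # random order through the dimensions
    best_val, evals, sweep = f(x), 1, 0
    while evals < budget:
        dim = order[sweep % d]                 # dimension for this 1D slice
        for t in ticks:
            if evals >= budget:
                break
            cand = x[:dim] + [t] + x[dim + 1:]
            val = f(cand)
            evals += 1
            if val > best_val:                 # fix the best result in this dimension
                best_val, x = val, cand
        sweep += 1
    return x, best_val

# Made-up separable function whose maximum (value 0) sits on the grid at x_i = 4/9.
f = lambda x: -sum((xi - 4 / 9) ** 2 for xi in x)
x_best, v_best = iterative_1d_slices(f, d=4)
```

For a separable function like this one, a single pass through all four dimensions already lands on the maximizer; the adaptivity is what lets this strategy outrun the static subsets in Figures 6 and 7, though on non-separable functions it can also stall in a suboptimal corner.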

Low discrepancy grid search

The computational limitations of uniform grids (or, more generally, tensor product grids) have long been a thorn in the side of the numerical integration community.  Quasi-Monte Carlo methods were developed to facilitate integration in higher dimensions with faster convergence, closer to standard quadrature methods than true Monte Carlo methods (which resemble random search).  These quasi-Monte Carlo methods require the development of low discrepancy points (such as the Sobol and Halton sequences, Latin hypercube sampling and lattice rules) which are very effective at sampling a high dimensional space, as shown in the figure below.

Figure 5: (left) The standard uniform grid search with T=5 has difficulty recognizing certain relationships between dimensions as evidenced by the inconsistent gaps between points when viewed at certain angles. (right) In contrast, the (2, 3, 5) Halton sequence(3) with the same 125 points is able to keep a more consistent density at all angles, which suggests a better understanding of the relationship between dimensions.
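The Halton construction is simple enough to sketch directly: each dimension applies the van der Corput radical inverse in a distinct prime base, and (as in footnote 3) we skip the initial point at the origin. This is a bare-bones illustration; production code would typically use a vetted quasi-Monte Carlo library instead:

```python
def van_der_corput(n, base):
    """n-th term of the base-b van der Corput sequence (radical inverse of n)."""
    q, denom = 0.0, 1.0
    while n:
        n, rem = divmod(n, base)
        denom *= base
        q += rem / denom
    return q

def halton(n_points, bases=(2, 3, 5), skip_origin=True):
    """First n_points of the Halton sequence, one prime base per dimension."""
    start = 1 if skip_origin else 0  # index 0 is the origin point skipped in Figure 5
    return [tuple(van_der_corput(i, b) for b in bases)
            for i in range(start, start + n_points)]

points = halton(125)  # same 125 points as the T=5 grid in Figure 5
```

The first point of the (2, 3, 5) sequence is (1/2, 1/3, 1/5), and successive points fill the cube without ever aligning into the regular rows that cause the gaps visible in the left panel of Figure 5.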

Numerical experimentation

Here we try maximizing a number of 4D functions which are available in the SigOpt evalset(4).  We compare the performance on a T=10 full grid (with a total of 10000 function evaluations) to each of the modified versions described above using only 100 function evaluations; we also provide a comparison to SigOpt’s optimization service.  The figures below contain the best seen traces (described in an earlier blog post) which depict the best function value observed after a specific number of function evaluations.
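A best seen trace is just the running maximum of the observed metric values; assuming a list of raw observations in evaluation order, it can be computed in one line:

```python
from itertools import accumulate

def best_seen_trace(values):
    """Best function value observed after each evaluation (running maximum)."""
    return list(accumulate(values, max))

print(best_seen_trace([1.2, 0.7, 1.9, 1.5, 2.4]))  # [1.2, 1.2, 1.9, 1.9, 2.4]
```

Plotting these traces against the evaluation count is what produces the monotone curves in Figures 6 and 7.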

Figure 6: For these two functions, we see roughly similar behavior for the random subsample and Halton low discrepancy strategies.  The iterative 1D slice strategy performs better than them, including potentially reaching the full grid performance, as seen on the left. (left) The Hartmann4 function. (right) The Shekel (m=7) function.
Figure 7: All three modifications have significantly different behavior for these two functions. On the left, the random subsampling is distinctly better than Halton, whereas on the right the Halton method can surpass even the full 10000 point grid search solution.  (left) The negative log10 of the Gear function. (right) The Michalewicz function.

Conclusion

In higher dimensional problems, conducting a full grid search is infeasible; we have proposed three modifications to that simple approach to parameter tuning which allow for some approximation to grid search using fewer model evaluations.  None of these three strategies is always the best, although the iterative 1D slice strategy seems most likely to reproduce the quality of the full grid using (in the experiments above) only 1% of the full grid evaluations.  Take caution when generalizing these results, since these are only 4 examples out of the entire space of possible functions, but the graphs above suggest these modifications could allow for viable results using many fewer evaluations than the full grid.

Another important point here is the superior performance of Bayesian optimization through SigOpt on each of these functions.  Using that same limit of 100 function evaluations, these advanced mathematical methods are able to outperform not only all of the modified strategies but even the full grid search requiring 10000 function evaluations.  For our finance customers, this improvement could correspond to finding a more profitable or less risky trading/investment strategy faster.  For our machine learning customers, this could correspond to models with higher predictive capacity or more robustness with less compute cost.  For any clients who need the best possible performance from their models with as little training time as possible, SigOpt provides a RESTful API, accessible through our Python, Java or R clients, which conducts this optimization using an ensemble of Bayesian methods, so that domain experts can focus their time and energy on utilizing their expertise rather than tuning models.

(1) We reference our blog post about random search because we believe that to be the more common definition of random search within the model tuning literature.  The Wikipedia entry hews more closely to a randomized trust region method, which can be expanded to a simulated annealing strategy to deal with nonconvex functions that are cheap to evaluate, or to conduct MCMC sampling from a multimodal function.

(2) For simplicity in generating this plot, we ignore the fact that during each 1D slice, some number of desired function values may already be present from previous slices.  Correctly storing and managing those values would reduce the cost of this method.

(3) Some literature uses Halton sequences starting at the origin, but this example skipped that first point.  The (2, 3, 5) Halton sequence refers to the prime numbers used to generate the sequence in each of the dimensions.

(4) https://www.github.com/sigopt/evalset is a Python implementation of real-valued functions that SigOpt uses to conduct regression testing as we try to deploy new optimization strategies.  Some of these functions were used for testing in our ICML AutoML submission.  The functions provided in the evalset are meant to be minimized, so these experiments instead maximize the inverse of the functions.

Mike studies mathematical and statistical tools for interpolation and prediction. Prior to joining SigOpt, he spent time in the math and computer science division at Argonne National Laboratory and was a visiting assistant professor at the University of Colorado-Denver where he co-wrote a text on kernel-based approximation. Mike holds a PhD and MS in Applied Mathematics from Cornell and a BS in Applied Mathematics from Illinois Institute of Technology.
Mike McCourt, PhD