Building a Better Mousetrap via Multicriteria Bayesian Optimization
In an earlier post, we introduced the topic of multicriteria optimization (also called multiobjective optimization) and some mechanisms for solving such problems using the SigOpt optimization tool; SigOpt provides a RESTful API for solving scalar black-box optimization problems for customers in fields such as industrial design and image processing. That introduction discussed a relatively simple problem of choosing the driving speed on a trip while balancing two concerns: minimizing time in transit and minimizing the cost of fuel. In this post we analyze a more complex situation in which the parameters of a given model produce a random output, and our multicriteria problem involves maximizing the mean of that random variable while minimizing its variance.
The standard example in the literature of simultaneously considering the mean behavior of a model/process and its variance comes from financial portfolio management. Generally, investors want to maximize expected gain while minimizing risk (in the form of variance), which motivated the development of the Nobel Prize-winning theory of mean-variance analysis. The original work and a 2014 retrospective by the same author (as well as modern texts such as here and here) serve as helpful introductions to the theory of defining (even implicitly) a utility function and, potentially, an associated quadratic (mean-variance) approximation. Under some conditions (discussed in those references), an expert’s selection of a portfolio from the mean-variance Pareto frontier should approximately optimize the utility function.
Variations on a theme
The motivating goal of mean-variance analysis involves both wanting good performance (high portfolio return) and wanting to avoid bad performance (high risk and potential for loss). These are certainly related, but are not necessarily the same thing.
Maze solver description
Our maze solver uses the very simple wall follower strategy, which will, admittedly, only work on simply connected mazes (no loops, no levels). In exchange, it is quite easy to implement and has predictable (non-random) behavior, which pushes the entirety of the randomness into the maze generation algorithm.
For those of you interested, you can see the implementation of the maze construction and solution in our examples Github repo.
The strategy has the mouse continually follow the wall on its right all the way through the maze until reaching the end. If the mouse can turn right, it does so; otherwise it moves forward; if both of those paths are blocked, it attempts to turn left. If all three directions are blocked, the mouse has reached a dead end and reverses course, continuing to hug the right wall to guide it forward. There is an outstanding explanation and graphic of this wall follower strategy available here which I will not try to supersede. The figure below, instead, shows why the wall follower strategy succeeds for simply connected mazes but may fail for mazes whose exit lies inside a loop.
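The turn-priority logic described above can be sketched in a few lines of Python. To be clear, this is an illustrative reimplementation, not the code from our examples repo, and it assumes a maze representation (a dict mapping each cell to the set of neighbors it connects to) chosen for brevity here.

```python
# Headings as (d_row, d_col); turning right rotates the heading clockwise.
RIGHT_TURN = {(-1, 0): (0, 1), (0, 1): (1, 0), (1, 0): (0, -1), (0, -1): (-1, 0)}

def wall_follower(maze, start, end, heading=(1, 0)):
    """Follow the right-hand wall from start to end; return the visited path.

    `maze` maps each cell (row, col) to the set of neighboring cells it is
    open to (i.e., no wall between them) -- an illustrative representation.
    Only terminates on simply connected mazes, per the discussion above.
    """
    position, path = start, [start]
    while position != end:
        right = RIGHT_TURN[heading]
        left = RIGHT_TURN[RIGHT_TURN[right]]
        back = RIGHT_TURN[right]
        # Try right, straight, left, then back (a dead end), in that order.
        for candidate in (right, heading, left, back):
            step = (position[0] + candidate[0], position[1] + candidate[1])
            if step in maze.get(position, set()):
                heading, position = candidate, step
                path.append(position)
                break
    return path
```

Because the turn order is fixed, the solver is fully deterministic, matching the claim above that all randomness lives in the maze generator.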
Maze generator description
It is quite tempting for me, as a research-oriented person, to see the problem of designing a maze as an opportunity to spend several days devising diabolical labyrinths under the cover of “developing content marketing”. I restrained myself, however, if only because so many great resources are already available on the internet. In particular, Wikipedia already provides a perfectly acceptable starting point involving a random depth-first search. Furthermore, the resulting maze structure guarantees that the failure to solve observed in Figure 1 cannot occur.
The only shortcoming of the code already available is that it has no tunable parameters. We needed a mechanism for making the mazes harder or easier for the mouse to solve; we introduced one by varying the relative probability of the depth-first search choosing each of the possible directions in which it could move, after eliminating those directions it has already traversed. Essentially, at each step of the maze generation, there is a relative probability of punching through each wall. The figure below shows one such randomly generated maze, generated with no preference for any direction.
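The biased depth-first construction can be sketched as follows. This is a hedged reimplementation for illustration; the parameter names, maze representation, and weighting convention here are our own and may differ from the examples repo.

```python
import random

# Direction vectors as (d_row, d_col); the `down` weight plays the role of
# the fixed reference probability discussed in footnote 4.
DIRS = {'left': (0, -1), 'up': (-1, 0), 'right': (0, 1), 'down': (1, 0)}

def generate_maze(rows, cols, weights=None, seed=None):
    """Randomized DFS maze generation with direction-biased wall punching.

    Returns a dict mapping each cell to the set of cells it is open to.
    The result is a spanning tree of the grid, so it is simply connected.
    """
    weights = weights or {'left': 1, 'up': 1, 'right': 1, 'down': 1}
    rng = random.Random(seed)
    maze = {(r, c): set() for r in range(rows) for c in range(cols)}
    visited, stack = {(0, 0)}, [(0, 0)]
    while stack:
        cell = stack[-1]
        # Unvisited in-bounds neighbors, paired with the weight of the move.
        options = []
        for name, (dr, dc) in DIRS.items():
            nb = (cell[0] + dr, cell[1] + dc)
            if nb in maze and nb not in visited:
                options.append((weights[name], nb))
        if not options:
            stack.pop()  # dead end: backtrack
            continue
        # Choose a direction in proportion to its relative probability.
        nb = rng.choices([n for _, n in options], weights=[w for w, _ in options])[0]
        maze[cell].add(nb)  # punch through the wall in both directions
        maze[nb].add(cell)
        visited.add(nb)
        stack.append(nb)
    return maze
```

Because the output is a spanning tree, it has exactly `rows * cols - 1` passages and no loops, which is why the wall follower failure of Figure 1 cannot occur.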
As we can see, there is a significant, and seemingly unpredictable, impact from changing the parameters. Maximizing the mean (or conceivably the median) of this distribution of the mouse’s solve speed while minimizing its variance (or perhaps its interquartile range) will be non-intuitive. This black-box style of problem is exactly what SigOpt is designed to handle … except for the fact that SigOpt does not (as of August 2016) natively support multicriteria optimization (though, thanks to the support of our investors, we hopefully will soon).
Devising a solution strategy
So, how can we use scalar Bayesian optimization to provide a solution to this problem? An earlier post introducing the topic of multicriteria optimization provided possible strategies, which we implement here. Depending on the situation, there may be multiple ways to interpret “solution” for a multicriteria optimization problem:
- Some scalar function of the multicriteria components could be formed, which is then optimized to define the “solution”. This would require only a single scalar optimization, but would require prior understanding as to how the scalarization balances the mean and variance.
- An approximation to the Pareto frontier can be formed by solving several scalarized versions of the problem. A selection from these scalarized solutions can then be made by a domain expert to determine which is most appropriate given the level of risk aversion. This strategy is the focus of our work today.
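As a concrete illustration of the scalarization both bullets rely on, a weighted-sum version of our mean/spread objectives might look like the sketch below. The function name and sign convention are our own assumptions (we want a high mean solve time and a low standard deviation; see footnote 5 for why we use standard deviation rather than variance).

```python
import statistics

def scalarized_objective(solve_times, w):
    """Weighted-sum scalarization: larger is better.

    w in [0, 1] encodes how strongly we value a high mean solve time
    relative to a low spread; w = 1 ignores the spread entirely.
    """
    assert 0.0 <= w <= 1.0
    mean = statistics.mean(solve_times)
    std = statistics.stdev(solve_times)
    return w * mean - (1.0 - w) * std
```

Sweeping `w` over a range of values, and solving each resulting scalar problem, is exactly the frontier-approximation strategy of the second bullet.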
As a baseline, the figure below was generated using a 1000-point low-discrepancy grid sampling of the 3D parameter space (varying the left, up and right construction probabilities while holding down fixed): it shows the spread of mean/variance mouse solve times estimated over 100 randomly generated 30x30 mazes.
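For readers curious how such low-discrepancy points can be produced, a Halton sequence is one standard construction; the pure-Python version below is an illustrative stand-in for whatever sampler was actually used to build the baseline.

```python
def halton(index, base):
    """The index-th element of the van der Corput sequence in the given base."""
    result, f = 0.0, 1.0 / base
    while index > 0:
        result += f * (index % base)
        index //= base
        f /= base
    return result

def halton_points(n, bases=(2, 3, 5)):
    """n quasi-random points in the unit cube, one coprime base per dimension."""
    return [tuple(halton(i, b) for b in bases) for i in range(1, n + 1)]
```

Unlike uniform random sampling, consecutive points of such a sequence fill the parameter space evenly, which is what makes a 1000-point baseline like the one above informative.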
There are a lot of things that could be said about this figure, and a lot of interesting research questions (e.g., why is there an empty region in the bottom left?), but we will refrain from a thorough investigation in the interest of following a practicable workflow; the amount of work required to create this picture is significantly more than we hope to invest in finding a solution. The remainder of this section details our attempts to do this efficiently.
The shape of the feasible set indicates that the metrics of interest may not be convex; this presents limitations to the use of weighted sum scalarization in trying to define the Pareto frontier. Of course, we likely would not know this at the start of the optimization process, so we will proceed as though we believe the component functions to be convex and see how the results turn out.
Our first strategy is to use the weighted sum method anyway (even though it may run into trouble) to approximate the Pareto frontier; recall that, if the component functions are convex, then any weighted sum (with weights summing to 1) has an optimum on the Pareto frontier. We consider 15 weighted sum experiments with weights evenly spaced in [.05, .95]. Mean and variance values for suggested left, up and right turn parameters are estimated over 100 maze constructions. The figure below shows the result of such optimizations, as conducted using SigOpt.
To work around any potential lack of convexity in the component functions, we can instead phrase the problem as an epsilon-constrained scalar optimization, similar to what was discussed in our earlier multicriteria post. Each of these scalar optimization problems minimizes only the standard deviation, subject to the mean being greater than a pre-chosen constraint. The image below was created with 10 constraint-based experiments, using mean constraints evenly spaced in [.95, 1.2].
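One simple way to hand such a constrained problem to an unconstrained scalar optimizer is to fold the constraint into the objective as a penalty; the sketch below is illustrative, and the names and penalty scheme are assumptions rather than the exact mechanism we used.

```python
import statistics

def constrained_objective(solve_times, mean_floor, penalty=1e6):
    """Epsilon-constraint scalarization: smaller is better.

    Returns the standard deviation when the mean constraint holds, and a
    large penalty (growing with the violation, to guide the optimizer back
    toward feasibility) otherwise.
    """
    mean = statistics.mean(solve_times)
    std = statistics.stdev(solve_times)
    if mean < mean_floor:
        return penalty + (mean_floor - mean)
    return std
```

Sweeping `mean_floor` over the constraint values (here, 10 values evenly spaced in [.95, 1.2]) traces out frontier points even where the weighted sum method struggles with non-convexity.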
Choosing a solution
So, after all of this, it is important to remember that simply forming the Pareto frontier does not solve the practical issue of designing the product: an expert must choose the “best” solution from the approximate frontier. Because we are not maze experts, we simply present some of the possible choices in the figure below and allow the reader to make their own judgment. In a more practical/industrial setting, the balance imposed by the expert may take the name utility or risk aversion.
In many settings there is a need to balance performance against robustness: Toyota may want to increase fuel efficiency but it cannot produce vehicles which gain 20% more efficiency at the cost of failing to start 1 out of every 5 times they are used. This tradeoff between improving average behavior and introducing a greater potential for bad behavior is the crux of this post, which we have addressed in the context of making a better mouse-stumping maze.
We have used SigOpt to identify multiple approximately Pareto efficient maze configurations from which an expert (a maze designer, not an expert mouse) could select the maze which best balances difficulty and consistency. For these multicriteria problems, the need to efficiently conduct black-box optimization is amplified by the need to generate multiple solutions on the Pareto frontier. Using Bayesian optimization has a multiplicative benefit because each point on the Pareto frontier is generated at less experimental cost than with, e.g., grid search.
Again, any interested readers can feel free to experiment on their own using our implementation of the maze construction and solution available in our examples Github repo.
Again, I would like to extend my thanks to Devon Sigler, an expert in multicriteria optimization who will shortly be finishing his Ph.D. He has helped introduce me to the wide range of literature that discusses multicriteria optimization and has provided valuable insights regarding the theoretical implications of popular solution strategies.
(1) I have been told that my posts are too serious and informative and should be more fun. I hope you find this post to be more fun.
(2) SigOpt does not advocate for gambling. If you choose to gamble, please gamble responsibly. Gambling at a game where you automatically lose 10% of your bet with no potential to win is an example of irresponsible gambling.
(3) Normally, this is where I would interject with my own discussion of the competing interests of theoretical statistics/mathematics and practical applications, given the compelling history of Genichi Taguchi, but Wikipedia has already beaten me to it. Kudos, again, to the outstanding contributions on Wikipedia and those of you out there doing the contributing.
(4) In our experiments, we allow ourselves 3 tunable parameters: the relative probabilities of moving left, up and right. We use the term “relative probability” to mean relative to the probability of moving down, which is always fixed at 1; the requirement that the probabilities sum to 1 is why there are only 3 tunable parameters. It is entirely possible (and more natural) to phrase this situation as having 4 free parameters, but doing so introduces repeated behavior throughout the 4D parameter space: the random mazes generated with (1, 2, 3, 4) would look the same as those generated with (2, 4, 6, 8) or (.1, .2, .3, .4). To avoid this degeneracy (where a 4D problem actually has only 3 dimensions of complexity) we fix the final element in the vector to 1. This is not the only way to resolve the issue (see, e.g., mixture designs) but it is, in our opinion, the simplest.
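The degeneracy described in this footnote is easy to verify numerically; `normalize` and `full_weights` below are illustrative helpers, not functions from our repo.

```python
def normalize(weights):
    """Turn a vector of relative weights into probabilities summing to 1."""
    total = sum(weights)
    return tuple(w / total for w in weights)

def full_weights(left, up, right):
    """Recover the 4-vector of probabilities from the 3 tunable relative
    probabilities, with the `down` weight fixed at 1 as described above."""
    return normalize((left, up, right, 1.0))
```

Proportional weight vectors such as (1, 2, 3, 4) and (2, 4, 6, 8) normalize to the same probabilities, which is precisely the redundancy that fixing the last component removes.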
(5) It is more common to read about minimizing the variance than the standard deviation (as the name mean-variance analysis suggests) but here we deal with the standard deviation because it has the same units as mean. This helps to put both components of the multicriteria problem on the same scale, which can be helpful when identifying a solution.
(6) Because the mouse follows the wall follower strategy, it can never visit a location more than twice (once coming and once going, if that path led to a dead end); that is why the maximum is 2. The minimum would conceivably be 0 if the exit and start were the same location, but is functionally greater than 0 because we place the exit at the far side of the maze.
(7) Since later experiments (in the sequence of weighted sum problems) have the benefit of earlier data, they should produce better solutions to the scalarized optimization problem. Although we did not do so here, it might be wise to run the earlier experiments with a few extra observations to give the optimization algorithm more opportunity to converge.