This is the third of three blog posts in which we explore the concept of uncertainty, or noise, and its implications for Bayesian optimization. It is part of our series of blog content on research that informs our product and methodologies.
In previous posts we have discussed the problems of having uncertainty when building models and the impact of uncertainty on the efficiency of Bayesian optimization. In this post, we follow up with the topic of uncertainty in the context of multimetric optimization; specifically, we discuss how to interpret the outcome of such problems when working with uncertainty.
Balancing Competing Metrics
Optimization with multiple metrics, also known as multimetric, multiobjective or multicriteria optimization, requires searching the configuration space to balance two or more metrics. When doing so, except in trivial circumstances, improvements in one metric are likely to yield losses in another. Analyses which involve such competing metrics include statistical modeling with a balance between inference precision and required runtime, deep learning with a balance between number of layers and neural network bias, and manufacturing with a balance between cost and CO2 emissions.
It is likely that no single solution exists(1); instead, we are interested in the balance between the metrics: a set of configurations known as the efficient Pareto frontier. For example, as discussed in a helpful post, let us say that we want to buy a new car while balancing the competing objectives of affordability and fuel efficiency.
While looking for possible options, we come across a big SUV and a small hybrid car, so we compare both: the hybrid car is cheaper and more fuel efficient, so ... what is the point of buying an SUV then? Based on our preferences, the SUV is clearly dominated by the hybrid. Only efficient options, those not dominated by others, appear on the efficient frontier.
Let us consider a more complicated problem inspired by a previous SigOpt blog post: minimize the fuel consumption and the time to arrive for the daily commute. Alongside these metrics, we need to define the possible configurations, a set of decisions that will affect the output values of the metrics. In this example, the configuration space consists of the speed of the cruise control, the time at which we leave for work and the route we take.
In general, there will be multiple, and perhaps infinitely many, efficient configurations. The solution to a multimetric optimization(2) problem is, theoretically, all of these configurations. In practical situations, we consider an approximation to be any efficient configurations observed during the optimization process.
Figure 2: Explanation of the elements of a multimetric outcome. The figure represents the values space, where each axis is the value of one of the metrics. From all the observed values (black), we identify the efficient Pareto frontier (red) and the dominated values (blue).
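To make the notion of dominance concrete, here is a minimal sketch of how the efficient and dominated values could be separated once a set of observations is available. It assumes both metrics are minimized, as in the commute example; the function names and the (fuel, time) numbers are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b on every metric and strictly better on at least one.

    Assumes every metric is minimized.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_indices(values: Sequence[Sequence[float]]) -> List[int]:
    """Indices of the values that are not dominated by any other value."""
    return [
        i
        for i, v in enumerate(values)
        if not any(dominates(w, v) for j, w in enumerate(values) if j != i)
    ]


# Hypothetical commute observations: (fuel in liters, time to arrive in minutes).
observed = [
    (3.0, 40.0),
    (3.5, 32.0),
    (4.2, 30.0),
    (4.0, 35.0),  # dominated by (3.5, 32.0)
    (5.0, 29.0),
    (3.2, 45.0),  # dominated by (3.0, 40.0)
]

efficient = pareto_indices(observed)
print(efficient)  # → [0, 1, 2, 4]
```

This pairwise check is quadratic in the number of observations, which is fine for the handful of points in a typical optimization run; faster algorithms exist for larger sets.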
The quality of an approximate frontier might be judged by how effectively it represents the true frontier. It also may be judged by whether configurations deemed as efficient actually do dominate (in expectation) other configurations considered during the optimization. In the following sections we will analyze the quality of the approximate frontier in the presence of uncertainty.
In the previous section, we have assumed that we can access the exact calculations for both metrics. In reality, there is some uncertainty associated with the observed metric values (discussed more in the earlier post). For the ease of explanation, we assume that the uncertainty comes from the imprecision associated with the measurements of these metrics; we may call this noise.
The amount of noise determines how much of the information present in the noiseless values we lose. The next figure shows this loss, both overall and, in particular, in our understanding of the efficient Pareto frontier.
Figure 3: Increasing levels of noise affect the efficient Pareto frontier. The different routes are displayed with different colors. The overall efficient Pareto frontier is highlighted in black.
Without noise, we can see how the routes are nicely ordered over the efficient Pareto frontier, depending on how efficient they are in time and fuel consumption. That understanding is obscured as the noise increases and values become increasingly misleading(3).
Understanding the Impact of Noise
The first impact of noise is felt during the multimetric optimization itself. Noise obscures the true value of a function, complicating a correct exploration of the efficient Pareto frontier during the optimization. As mentioned in the previous post, it is beneficial to select a suitable optimization strategy in the presence of noise.
The second impact comes from interpreting the outcome of a multimetric problem. Because of the observation noise, we should not trust the observed values to calculate the efficient Pareto frontier. Instead we should take advantage of the models that we build during the optimization to also compute the efficient Pareto frontier from the results of the optimization.
Ideally, we want to identify the same efficient configurations as if we had access to the noiseless observations. To understand the impact of noise on the quality of the outcome, we can compare the noiseless outcome against one possible set of noisy observations.
Figure 4: In this figure we study the effect of increasingly large levels of noise in contrast to the noiseless frontier. All red symbols are part of the original noiseless Pareto frontier. As we introduce more noise, some of the original Pareto points become dominated (×) and some of the originally dominated points join the Pareto frontier (●).
The noisiest outcome is only able to correctly identify a small subset of observed results as being efficient in the absence of noise. The Pareto frontier relies on the relative values of the different points to decide whether a point is efficient or dominated. Introducing noise into the observations means that these relationships between values are no longer preserved and, as a result, the quality of the resulting Pareto frontier deteriorates.
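The comparison described above can be sketched in a few lines. The snippet below assumes independent zero-mean Gaussian noise with the same standard deviation on both metrics, which is a simplification; the data and the helper names (`add_noise`, `compare_frontiers`) are hypothetical. It reports which noiseless frontier points are kept, which are lost, and which dominated points are wrongly gained.

```python
import random
from typing import List, Sequence, Set, Tuple


def pareto_set(values: Sequence[Sequence[float]]) -> Set[int]:
    """Indices of points not dominated by any other point (all metrics minimized)."""
    def dominates(a, b):
        return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

    return {
        i
        for i, v in enumerate(values)
        if not any(dominates(w, v) for j, w in enumerate(values) if j != i)
    }


def add_noise(values, sigma: float, rng: random.Random) -> List[Tuple[float, ...]]:
    """Add independent Gaussian noise (standard deviation sigma) to every metric value."""
    return [tuple(x + rng.gauss(0.0, sigma) for x in v) for v in values]


def compare_frontiers(true_values, noisy_values):
    """Return (kept, lost, gained) index sets relative to the noiseless frontier."""
    truth, noisy = pareto_set(true_values), pareto_set(noisy_values)
    return truth & noisy, truth - noisy, noisy - truth


# Hypothetical noiseless (fuel, time) values for eight configurations.
true_values = [
    (3.0, 40.0), (3.5, 32.0), (4.2, 30.0), (5.0, 29.0),
    (4.0, 35.0), (3.2, 45.0), (4.5, 33.0), (5.5, 31.0),
]

rng = random.Random(0)  # fixed seed so that the run is repeatable
for sigma in (0.0, 0.5, 2.0):
    noisy_values = add_noise(true_values, sigma, rng)
    kept, lost, gained = compare_frontiers(true_values, noisy_values)
    print(f"sigma={sigma}: kept={sorted(kept)} lost={sorted(lost)} gained={sorted(gained)}")
```

With `sigma=0.0` nothing is lost or gained; as `sigma` grows, the `lost` and `gained` sets correspond to the two kinds of misclassified points in the figure.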
The figure below shows that by studying the expected results from our models we can better identify configurations as efficient or dominated even when results were observed in the presence of noise.
Figure 5: This figure displays models ranging from those that are extremely active, which may overfit and closely recreate the observed noisy results, to those that are less subject to noise.
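As a rough illustration of computing the frontier from model estimates rather than raw observations, the sketch below swaps in a simple Gaussian kernel smoother over a one-dimensional configuration (cruise speed). This is a stand-in chosen for brevity and is not the model used during the optimization; the data are hypothetical. The `bandwidth` parameter loosely mimics the range in the figure: a tiny bandwidth reproduces the noisy observations, while a large one smooths them heavily.

```python
import math
from typing import List, Sequence


def kernel_smooth(xs: Sequence[float], ys: Sequence[float], bandwidth: float) -> List[float]:
    """Kernel-weighted average of ys at each configuration in xs (Nadaraya-Watson)."""
    preds = []
    for xi in xs:
        weights = [math.exp(-0.5 * ((xi - xj) / bandwidth) ** 2) for xj in xs]
        preds.append(sum(w * y for w, y in zip(weights, ys)) / sum(weights))
    return preds


def pareto_indices(values: Sequence[Sequence[float]]) -> List[int]:
    """Indices of points not dominated by any other point (all metrics minimized)."""
    def dominates(a, b):
        return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

    return [
        i
        for i, v in enumerate(values)
        if not any(dominates(w, v) for j, w in enumerate(values) if j != i)
    ]


# Hypothetical noisy observations at six cruise speeds (km/h).
speeds = [50.0, 60.0, 70.0, 80.0, 90.0, 100.0]
noisy_fuel = [3.1, 3.0, 3.9, 3.6, 4.6, 4.4]        # liters
noisy_time = [47.0, 41.5, 36.0, 37.5, 30.0, 29.5]  # minutes

# Frontier computed directly from the noisy observations.
raw_frontier = pareto_indices(list(zip(noisy_fuel, noisy_time)))

# Frontier computed from smoothed estimates, one smoother per metric.
bandwidth = 15.0  # assumed value; in practice it would be tuned or replaced by a fitted model
est_fuel = kernel_smooth(speeds, noisy_fuel, bandwidth)
est_time = kernel_smooth(speeds, noisy_time, bandwidth)
model_frontier = pareto_indices(list(zip(est_fuel, est_time)))

print("raw frontier:  ", raw_frontier)
print("model frontier:", model_frontier)
```

Fitting one estimator per metric follows the convention described in footnote (4); a joint model over all metrics would be an equally valid choice.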
Probabilistic Pareto Frontier
We have seen how accounting for noise by building models(4) allows us to manage, to an extent, the obscuring effect of noise. In this section we will try to improve and enhance the quality of the outcome by taking a deeper look at the model estimations.
The first thing to note is that, in an attempt to provide generalization, model predictions are necessarily imprecise. Such predictions usually contain a variance around the expected value; this variance will be a function of the amount of information available as well as how it is interpreted.
After acknowledging the existence of estimate variance, we want to account for it in the multimetric outcome. The problem is that, in doing so, the deterministic Pareto frontier no longer makes sense. The variance of the estimated values yields a lack of certainty as to whether values are efficient. The logical response is to create a probabilistic interpretation of the Pareto frontier and, as a result, an estimate of the probability of a point being in the Pareto frontier.
This probability can be approximated by drawing joint samples(5) from the posterior distribution at all the observed points.
Finally, by repeatedly drawing a sample at all points and computing its efficient Pareto frontier, we can keep track of the number of times a point was in the efficient Pareto frontier to approximate the probability. This allows us to incorporate the probability into the multimetric outcome.
Figure 6: Probabilistic outcome of the multimetric optimization. We take into account the variance of the estimate to show, using different colors, the probability of a point being on the Pareto frontier.
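The sampling procedure can be sketched as follows. We assume, for illustration only, that each metric has a Gaussian posterior at the observed points described by a mean vector and a covariance matrix, and that the metrics are modeled separately; the numbers and the function names are hypothetical. Joint samples are drawn through a Cholesky factor of each covariance matrix, which is assumed to be positive definite.

```python
import math
import random
from typing import List, Sequence

Matrix = List[List[float]]


def cholesky(cov: Matrix) -> Matrix:
    """Lower-triangular factor L with L @ L.T == cov (cov must be positive definite)."""
    n = len(cov)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = cov[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def joint_sample(mean: Sequence[float], chol: Matrix, rng: random.Random) -> List[float]:
    """One draw from N(mean, cov) over all observed points at once."""
    z = [rng.gauss(0.0, 1.0) for _ in mean]
    return [m + sum(chol[i][k] * z[k] for k in range(i + 1)) for i, m in enumerate(mean)]


def pareto_indices(values) -> List[int]:
    """Indices of points not dominated by any other point (all metrics minimized)."""
    def dominates(a, b):
        return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

    return [
        i
        for i, v in enumerate(values)
        if not any(dominates(w, v) for j, w in enumerate(values) if j != i)
    ]


def pareto_probabilities(means, covs, n_draws: int = 2000, seed: int = 0) -> List[float]:
    """Fraction of joint posterior draws in which each point lies on the frontier.

    means[m] and covs[m] describe the posterior of metric m at the observed points.
    """
    rng = random.Random(seed)
    chols = [cholesky(c) for c in covs]
    counts = [0] * len(means[0])
    for _ in range(n_draws):
        draws = [joint_sample(mu, ch, rng) for mu, ch in zip(means, chols)]
        for i in pareto_indices(list(zip(*draws))):  # one tuple of metrics per point
            counts[i] += 1
    return [c / n_draws for c in counts]


# Hypothetical posterior at four observed points; points 1 and 2 are close and correlated.
cov = [
    [0.01, 0.0, 0.0, 0.0],
    [0.0, 0.01, 0.005, 0.0],
    [0.0, 0.005, 0.01, 0.0],
    [0.0, 0.0, 0.0, 0.01],
]
means = [[3.0, 4.0, 4.1, 5.0], [40.0, 30.0, 30.1, 45.0]]  # fuel, time
probs = pareto_probabilities(means, [cov, cov])
print([round(p, 2) for p in probs])
```

In this toy example the first point should be efficient in nearly every draw and the last in nearly none, while the two close, correlated points receive intermediate probabilities. Sampling each point independently would ignore the off-diagonal covariance terms, which is the concern raised in footnote (5).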
Comparing this outcome to the noiseless outcome is harder than in the previous sections, as we are comparing a probabilistic interpretation against a deterministic interpretation. Nonetheless, we can show that the assigned probabilities make sense relative to the noiseless outcome.
Figure 7: Visual representation of how higher probabilities are likely given to points that appear on the noiseless Pareto frontier and lower probabilities are likely assigned to points that, in the noiseless setting, were dominated. The x axis is an indexing of the points in descending probability of efficiency.
Another way of comparing the two is to transform the problem back to a deterministic setting by using a threshold: observations with probabilities over the threshold are classified as being on the efficient Pareto frontier and those with probabilities under the threshold are classified as dominated.
Figure 8: Precision of the noisy observations, the estimated observations and the different thresholds of the probabilistic interpretation. In this problem, precision is the proportion of points identified as efficient that are efficient in the noiseless setting.
Here, the probability threshold is a free parameter that an expert could use to help tune the precision of the efficient/dominated classification process. A very conservative choice of a high threshold will yield very few points being incorrectly identified as efficient (but some which actually are efficient being identified as dominated). A lower threshold will correctly identify more points as efficient, but at the cost of likely misclassifying some dominated points as efficient.
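The trade-off controlled by the threshold can be illustrated with a small sketch. The probabilities and the set of truly efficient points below are hypothetical; `precision` follows the definition given in the figure caption, and `found_fraction` (our own name) measures how many of the truly efficient points were identified.

```python
from typing import Sequence, Set


def classify(probabilities: Sequence[float], threshold: float) -> Set[int]:
    """Indices of points whose probability of being efficient is over the threshold."""
    return {i for i, p in enumerate(probabilities) if p > threshold}


def precision(predicted: Set[int], truly_efficient: Set[int]) -> float:
    """Proportion of points identified as efficient that are efficient without noise."""
    if not predicted:
        return float("nan")  # undefined when nothing is classified as efficient
    return len(predicted & truly_efficient) / len(predicted)


def found_fraction(predicted: Set[int], truly_efficient: Set[int]) -> float:
    """Proportion of the truly efficient points that were identified as efficient."""
    return len(predicted & truly_efficient) / len(truly_efficient)


# Hypothetical probabilities of efficiency and hypothetical noiseless frontier.
probabilities = [0.95, 0.80, 0.55, 0.30, 0.05]
truly_efficient = {0, 1, 3}

for threshold in (0.9, 0.5, 0.2):
    predicted = classify(probabilities, threshold)
    print(
        f"threshold={threshold}: predicted={sorted(predicted)} "
        f"precision={precision(predicted, truly_efficient):.2f} "
        f"found={found_fraction(predicted, truly_efficient):.2f}"
    )
```

In this toy data, the threshold of 0.9 yields a single prediction, which is correct, but misses two efficient points, whereas the threshold of 0.2 finds all three efficient points and also admits one dominated point.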
We have discussed how, in the presence of uncertainty, observed noisy values can produce a misleading understanding of the possible efficient configurations. This prevents a complete identification of the efficient frontier, which, in turn, hinders any potential decision making which relies on an understanding of efficient options.
To combat this complexity, we have investigated the use of models to better inform our estimate of the efficient frontier (and subsequent identification of efficient configurations). In our demonstrations above, we have used the same models that powered our optimization to estimate expected outcomes and help predict efficient configurations; in practice, other decisions could also be made using domain knowledge for a specific application.
Additionally, we have shown a probabilistic Pareto frontier interpretation provides a more nuanced understanding of efficient configurations. This is not only a more natural way of analyzing outcomes under uncertainty, but it also helps to make effective decisions by providing more than just binary information about the efficiency of observed configurations.
(1) A single optimum for the multimetric optimization would only exist if both metrics had the same configuration as their optimum.
(2) The actual process of conducting the optimization efficiently is still an open research problem. See, e.g., Khan et al. 2002, Laumanns and Ocenasek 2002, Kollat et al. 2008 or Hernández-Lobato et al. 2016, among others.
(3) The values are misleading in the sense that we can observe values much higher or lower than realistically possible. This is due to the measurement variance being high. This not only affects the multimetric outcome but also the perception of the expert who has to analyze the outcome.
(4) In multimetric settings, we usually have a different model for each metric (or a joint model over all metrics). This allows us to account for differences in the scale of the values and the amount of noise.
(5) It is important to note that, as the efficient Pareto frontier relies on the values of all points, it is less accurate to draw samples from the posterior distribution independently at each observed point. Instead, we should draw samples from the posterior distribution jointly over all the observed points.