# Intro to Multicriteria Optimization

In this post we will discuss the topic of multicriteria optimization, when you need to optimize a model for more than a single metric, and how to use SigOpt to solve these problems.

\[ \def\a{\alpha} \def\e{\varepsilon} \def\s{\sigma} \def\RR{\mathbb{R}} \def\mC{\mathsf{C}} \def\mK{\mathsf{K}} \def\mI{\mathsf{I}} \def\ggamma{\boldsymbol{\gamma}} \def\kk{\boldsymbol{k}} \def\uu{\boldsymbol{u}} \def\vv{\boldsymbol{v}} \def\ww{\boldsymbol{w}} \def\xx{\boldsymbol{x}} \def\xopt{\xx_{\text{opt}}} \def\yy{\boldsymbol{y}} \def\zz{\boldsymbol{z}} \def\cD{\mathcal{D}} \def\cX{\mathcal{X}} \def\lmle{L_{\text{MLE}}} \def\lmple{L_{\text{MPLE}}} \def\lkv{L_{\text{KV}}} \def\lclv{L_{\text{CLV}}} \def\ppa{\frac{\partial}{\partial\a}} \DeclareMathOperator*{\argmin}{argmin} \]Here at SigOpt, we help clients to more quickly tune financial models and design industrial processes by providing access to an ensemble of Bayesian Optimization strategies through our REST API or web interface and associated Python, Java and R clients. Our fundamental goal is to find the maximum of a black box function (or metric)---one which takes as input several values (real numbers, integers or categories) and outputs 1 real number to be maximized, with no information about the underlying method for which the inputs interact to result in the output (which is why it is called a “black box”). Tasks as diverse as classifying images and betting on sports require optimization of a suitable metric black box function (which serves as a proxy for quality) to achieve the best performance; failure to do so can yield subpar results for even powerful tools such as deep neural networks.

Fortunately, many situations neatly present such a black box function: financial trading problems may leverage
backtesting, machine learning problems
may use cross-validation, industrial
projects may study product quality subject to tolerance. Some
situations, however, do not fit as neatly into the black box function setting described above, where a single metric is being optimized. Consider this
relatively standard situation of driving on the freeway^{(1)}: an increase in speed will cause a decrease in the time
until I reach my destination, but may also cause an increase in the cost by using more fuel. How, then, can
I choose a speed so as to simultaneously minimize time to the destination and minimize cost of the trip?

Such
a problem is a multicriteria optimization and
does not fit into the standard framework for black box optimization; applications including portfolio management, cancer treatment and network theory have already benefited from the use of multicriteria optimization. This
post will discuss the complexities of this problem and some strategies for manipulating it into a single criterion problem.

**Vector-valued functions**

Single criterion optimization problems are phrased in the following language:
\begin{align*}
\xopt = \argmin_{\xx\in\Omega} f(\xx), \qquad f:\Omega\to\RR,
\end{align*}
where \(\Omega\) is the set of possible inputs (think real numbers, integers and categories), \(\xx\) is an object
(just a vector if \(\Omega\) consists of only real numbers) describing the inputs and \(f\) is the function we hope
to minimize. The standard problem is
phrased as minimization, though \(f\) could instead be maximized by minimizing \(-f\). This optimization problem
can be efficiently solved using the Bayesian optimization tools in SigOpt.

The situation changes when \(f:\Omega\to\RR\) becomes the *vector-valued function* \(g:\Omega\to\RR^k\), that is,
the function we hope to optimize now maps inputs \(\xx\in\Omega\) to vectors \(g(\xx)\in\RR^k\). The reason
this is a more complicated situation is that an ordering of
vectors in \(\RR^k\) does not exist. Any two numbers \(\alpha,\beta\in\RR\) must satisfy exactly one
of \(\alpha<\beta\), \(\alpha>\beta\) or \(\alpha=\beta\), but such a requirement is not true for vectors \(\uu,\vv\in\RR^k\); how
would one order the vectors
\begin{align*}
\uu=\begin{pmatrix}1\\\\2\\\\3\end{pmatrix}, \qquad
\vv=\begin{pmatrix}2\\\\1\\\\3\end{pmatrix}, \qquad
\ww=\begin{pmatrix}3\\\\2\\\\1\end{pmatrix}?
\end{align*}
The inability to order these vectors produces ambiguity in any attempt to minimize \(g\): if \(\uu<\vv\) is ill-defined,
how could one ever declare that \(\uu=g(\xx)\) is the minimum value? To read in depth about problems of this
variety and their solution, see this text or
this text.

**Multicriteria Example**

We can return now to the problem posed earlier about simultaneously trying to minimize transit time and cost while driving on a freeway. In that setting, there is some function \(g\) of the form \begin{align*} g(\text{speed}) = \begin{pmatrix}\text{time to destination} \\ \text{cost of trip}\end{pmatrix} \qquad\longleftrightarrow\qquad g(\xx) = \uu = \begin{pmatrix}u_1 \\ u_2\end{pmatrix}. \end{align*} What does \(g\) look like? We do not know---that is the nature of a black box problem. We likely know certain attributes of \(g\), such as a decrease in speed probably causes an increase in time to destination, but the exact relationship may be difficult to write down. Experimentation has been conducted on this problem to empirically estimate \(g\) (see here or here). We use the 1997 data on "Fuel Economy by Speed" hosted at Oak Ridge to generate the figure below.

**Figure 1**:

*(left)*Mileage data from this report and a model used to make predictions.

*(right)*Relationship between the cost and time required to make a trip as determined below. Using the data from that report, we can create an approximation with which we can predict mileage at speeds that have not yet been observed.

While looking at this red curve, it is apparent that there are some results which could not possibly be considered acceptable minima: the speed of 10 miles/hour requires 10 hours and would cost more than \(\$14\) and both of those values can be improved upon by a better choice of speed. Such points, where all components of the objective vector \(\uu\) are greater than different observed results \(\vv\), are called dominated. Any results which are not dominated are considered Pareto optimal (also called Pareto efficient), and the entire set of such points form the Pareto frontier, which is displayed for this travel problem in the figure below.

**Figure 2**: A zoomed-in version of the right half of Figure 1, where the Pareto frontier (manifold of Pareto optimal results) is colored in blue. All other possible outcomes are dominated by at least one point on the Pareto frontier. The two stars represent two speeds which have the same cost; the yellow star’s travel time is greater than the blue star, thus it is dominated.

The blue portion of the curve in this figure contains points which are all considered Pareto optimal, which implies
that further decreasing one of the components of the objective would cause an increase in the other. Because
we have limited our speed to 75 miles/hour^{(1)} our required time cannot approach 0 (which should happen as we drive
increasingly fast^{(2)}) which is why the curve ends where it does. Orange represents speeds which are inferior;
the two stars have equal cost but unequal time.

**Tradeoffs**

It should be clear from Figure 2 why speeds that are not Pareto optimal cannot be considered solutions to the
multicriteria optimization problem: there is little reason to drive 38.5 miles/hour and arrive in 2.6 hours
(orange star) when one could drive 61.2 miles/hour and arrive one hour earlier at the same cost. For this
example, the interface between the orange and blue sections represents the minimum cost to reach the destination;
any attempt to decrease the time of transit would result in an increase in the cost of the trip. How then
can we decide which of the results on the blue curve are "the solution"?

In truth, there is no single way to decide which of the speeds that represent the blue curve are the
solution. Driving 52.4 miles/hour represents the minimum cost, driving 75 miles/hour represents the
minimum time, and any speed in between could be considered optimal depending on personal preference. The
tradeoffs between these two criteria can either be managed by some
supervisory decision maker (the
driver of the car in this example) or by merging the multiple criteria into some single criteria and phrasing
the problem as a standard optimization problem. Either
way, once this Pareto frontier has been identified (or, more
likely, approximated), no
further refinement of the solution the multicriteria optimization is possible.

These types of tradeoffs appear in the real world all the time, as shown in this popular comic strip among graduate students; specifically, image below depicts (in an entirely non-rigorous way) multiple factors competing for the lunch dollars of a graduate student. The point below denoted "Optimal Mediocrity" is dependent on the degree to which "Gross", "Effort" and "Cost" are valued by the decision maker. It may, in fact, be very common for students to more greatly value "Cost" and prefer the Instant Noodles.

**Figure 3**: The tradeoffs under consideration for a graduate student at lunchtime. Graphic courtesy of www.phdcomics.com.

**Linear scalarization**

Each research field has its own domain experts that can serve as decision makers, and as such it is difficult to
discuss the decision maker strategy in generalities. The method of forming a scalar objective
function \(f:\Omega\to\RR\) from the vector objective function \(g:\Omega\to\RR^k\) is more universal, so we
restrict our discussion to this *linear scalarization*
methodology. Scalarization (discussed
nicely in chapter 2 of this text) is the process by
which some transition function \(T:\RR^k\to\RR\) is applied to the multiple objectives from \(g\) to rephrase the
multicriteria problem as
\begin{align*}
\xopt = \argmin_{\xx\in\Omega} T(g(\xx)).
\end{align*}
Unfortunately, all this does is transition the burden of decision making between the components
of \(g\) from some expert decision maker to the function \(T\); choosing an appropriate function \(T\) is important
because a poor choice can result in solutions to this scalar optimization problem which are not Pareto optimal
in the multicriteria problem.

A variety of scalarization strategies have been
developed for different circumstances, but the simplest among them is a weighted-sum strategy involving a
nonnegative weighted average of the components of \(g\): for weights \(\gamma_1,\ldots,\gamma_k>0\) such that
\(\gamma_1+\ldots+\gamma_k=1\), we define
\begin{align*}
T_\ggamma(g(\xx)) = T_\ggamma\begin{pmatrix}g_1(\xx)\\\\\vdots\\\\g_k(\xx)\end{pmatrix} =
\gamma_1 g_1(\xx) + \ldots + \gamma_k g_k(\xx).
\end{align*}
One benefit of this strategy is that any solution to this problem must be Pareto optimal^{(3)}. Another benefit
is its easy interpretability: larger \(\gamma\) values correspond to components which are more important to
minimize. Within the context of an industrial process, it would probably still take an expert to choose an
appropriate balance between these \(\gamma\) values, but this strategy provides a strategy to apply certain preferences
and the benefit of Pareto optimality.

Referring back to the car driving problem, \(\gamma_1\) and \(\gamma_2\) values could be chosen to scalarize the
vector-valued objective? Because \(\gamma_1+\gamma_2=1\), we actually need only choose one parameter, which
we will just call \(\gamma\), to create the scalar objective function
\begin{align*}
T_\gamma(g(\text{speed})) = \gamma (\text{time to destination}) + (1-\gamma) (\text{cost of trip}).
\end{align*}
So, what impact does the choice of \(\gamma\) have on the objective function? The figure below shows
the \(T_\gamma(g)\) objective for several choices of \(\gamma\), as well as the optimal speed as a function
of \(\gamma\).

**Figure 4**:

*(left)*The scalarized form of the objective function for various \(\gamma\) values.

*(right)*As explained by the equation above, smaller \(\gamma\) values place more emphasis on fuel efficiency and larger values prefer a shorter duration. Because we assume no one will violate traffic laws

^{(1)}, choosing \(\gamma>.91\) will always return 75 miles/hour.

Choosing \(\gamma\) large states a preference towards less time, and choosing \(\gamma\) small states a preference
towards less cost. The actual relationship, as depicted in the right half of Figure 4, is complicated by
the fact that the the components of the objective \(g\) do not have the same units and on the Pareto frontier they
are not on the same scale (the topic of scale is mentioned
in this post): cost
varies between \(\$9-\$14\) whereas time varies between 1-2 hours. From a physical sense, even adding values with
different units is inconsistent, though we skirt this issue by stating the \(T_\gamma(g)\) optimization problem in a
fully mathematical (read unphysical) setting. Given that, the role that \(\gamma\) plays is very much a function
of those units, which is why there is a 10 miles/hour range for \(\gamma\in[0, .8]\) and another 10 miles/hour range
for \(\gamma\in[.8, .9]\). Problems phrased in this way can be readily optimized by tools like SigOpt.

In a perfect world, all the components of a problem would be on exactly the same scale, allowing for an easier
prediction of the role that \(\ggamma\) plays. Below is a figure which shows the same optimal solution relationship,
but for slightly different versions of the \(g\) function. We can clearly see that rescaled versions of these
functions have a significant impact on the resulting optimum for a given \(\gamma\). On the left, we see, perhaps,
a more consistent transition as \(\gamma\) increases; this is the result of rescaling the objective components to
\([0,1]\), which is only possible for this simple problem because we know the exact values of the Pareto frontier. On
the right, we see an extreme transition period, caused by converting the currency of the gasoline purchase to
Chinese renminbi (multiplying \(g_2\) by 6.67 RMB/USD).

**Figure 5**:

*(left)*Objective values from \(g\) are scaled to \([0,1]\).

*(right)*\(g_2\) is multiplied by 6.67.

From a purely mathematical perspective, it does not matter what rescaling, if any, is done to this problem: for
all the values on the Pareto frontier, there is at least one \(\gamma\) for which it is the solution of the scalarized
problem^{(4)}. From a practical standpoint, it can be significant, though, because some expert must make a decision
regarding the appropriate \(\gamma\), and that decision is much easier in the left half of Figure 5 (where a small
change in \(\gamma\) has a small change in optimal speed) than in the right half of Figure 5 (where the impact of
a small change in \(\gamma\) has an unpredictable impact on the optimal speed).

**\(\epsilon\)-constraint scalarization**

In some circumstances, it may be the case that one component of \(g\), say the \(i\)th component \(g_i\),
is more important than the rest; as long as the other \(k-1\) components are "satisfactory" only the \(g_i\) component
needs to be minimized. Mathematically, this becomes a
*constrained* scalar optimization problem,
\begin{align*}
\xopt = \argmin_{\xx\in\cX} g_i(\xx),
\quad \newline\mbox{ subject to }\quad
g_1(\xx)\leq\epsilon_1,\;\ldots,\;g_{i-1}(\xx)\leq\epsilon_{i-1},\; g_{i+1}(\xx)\leq\epsilon_{i+1},\;\ldots,\;g_k(\xx)\leq\epsilon_k.
\end{align*}
This notation is somewhat dense, but can be interpreted more easily in the
context of the example above; suppose the driver has only \(\$11\) available and must reach the destination at a
cost below that. In that setting, \(\epsilon_2=11\) and the multicriteria optimization problem becomes the scalar
problem
\begin{align*}
\min_{0\leq\text{speed}\leq75}\; \text{time to destination}(\text{speed}),
\qquad\newline\mbox{ subject to }\qquad \text{cost of trip}(\text{speed})\leq 11.
\end{align*}
Graphically this can be seen in the figure below. Problems phrased in this way can be optimized by tools like SigOpt by taking advantage of reporting “failure” observations when criteria are not met.

**Figure 6**:

*(left)*If we constrain the cost to less than \(\$11\) and concern ourselves only with minimizing the time to destination, the red star represents the solution.

*(right)*If, instead, we simply must arrive in less than 90 minutes and want to incur the minimum cost, the solution is the red star.

**Conclusion**

Optimization is what we do at SigOpt, but not all optimization problems are created equally. Multicriteria
optimization problems, which involve the simultaneous optimization of multiple objective functions, require
additional specification to solve them uniquely. This post shows an example of such a problem, and walks
through a strategy for converting it into a standard problem which can be efficiently optimized using the
Bayesian optimization tools within SigOpt. Stay tuned for more updates on using SigOpt to tune these problems and others as we continue to expand the capabilities of our optimization platform.

**Special thanks**

I would like to extend my gratitude to Devon Sigler, a top-tier mathematician and multicriteria expert at the University of Colorado Denver, for his help refining some of the descriptions in this post and clearing up my own understanding of the implications of nonconvexity in weighted sum scalarization.

(1) SigOpt recommends following posted safety laws while driving, including driving under the speed limit.

(2) Of course limited by the speed of light, and any associated relativistic implications.

(3) The requirement that the \(\gamma\) values be strictly greater than zero is needed to guarantee Pareto optimality. If, instead, only weak Pareto optimality is required, it is possible to allow some \(\gamma\) values to be 0.

(4) This is only true if all objective functions are convex and the feasible region is convex. If this is not the case there can be points on the Pareto frontier that no value of gamma will find. This is one of the primary drawbacks for the weighted sum scalarization method. This is, however, not a problem for the \(\epsilon\)-constraint method---there are \(\epsilon\) choices that can be used to find each point on the Pareto frontier.