Approximation of Data

In previous posts on SigOpt Fundamentals, we introduced the concept of Gaussian processes and covariance kernels.  Using these and other tools, SigOpt helps companies efficiently optimize their critical metrics, including revenue in A/B tests, accuracy of machine learning models, and the output of complex simulations.  To do this for problems which can have many variables/parameters, we suggest a sequence of experiments for a company to conduct.  These experiments create data with as many dimensions as the company has parameters, and one of our goals at SigOpt is to use this observed data to predict unobserved values so as to determine the optimal values for these critical metrics.  We call this process of taking data and making predictions approximation.

The need for approximation may not immediately be obvious; if you are driving and you need to know your speed, you probably just check the speedometer rather than approximating based on the last five times you looked down.  Indeed, in a perfect world everyone could conduct arbitrarily many experiments, thereby easily producing data at any desired location and eliminating the need for approximation or optimization.  Unfortunately, the experiments most often of interest are costly, requiring a heavy commitment of time and/or resources.  SigOpt tries to minimize this expenditure by extracting as much information as possible from a limited number of experiments.  The figure below has a graph on the left showing observed data and three possible approximations to that data which can be used to make predictions.

Figure 1: Five observations appear in red, along with various predictive curves.  The graph on the left shows three approximations to the observed data, none of which passes through every observation.  The graph on the right shows three different approximations, all of which are interpolants (passing through the observed red circles).

When SigOpt wants to create an approximation, we first turn to Gaussian processes; they are only part of a larger scheme, but probably the most important part.  Gaussian processes are significant for approximation because their predictions interpolate the data.  We say that a process interpolates observed data if its predictions perfectly reproduce all observations.  The figure above on the right shows examples of predictions which interpolate the given data -- contrast these interpolants with the approximations on the left, which do not interpolate the data.  Fidelity is the term we use to describe how closely predictions match observations: an interpolant has perfect fidelity.
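To make the idea of interpolation concrete, here is a minimal sketch of a Gaussian process prediction for noise-free data.  The Gaussian (squared-exponential) kernel, the length scale of 1.0, the zero prior mean, and the five data values are all assumptions chosen for illustration; they are not the data from the figure, nor a description of what SigOpt runs in production.

```python
import numpy as np

def gaussian_kernel(x, z, length_scale=1.0):
    """Squared-exponential covariance between 1D point sets x and z."""
    diff = x[:, None] - z[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_mean(x_obs, y_obs, x_new, length_scale=1.0):
    """Zero-mean Gaussian process prediction for noise-free observations."""
    K = gaussian_kernel(x_obs, x_obs, length_scale)
    weights = np.linalg.solve(K, y_obs)
    return gaussian_kernel(x_new, x_obs, length_scale) @ weights

# Hypothetical observations (made up for this example)
x_obs = np.array([0.0, 1.0, 2.5, 3.0, 4.5])
y_obs = np.array([1.0, 0.2, 0.8, 1.5, 0.3])

# Predicting at the observed locations should reproduce the observations
fidelity_gap = np.max(np.abs(gp_mean(x_obs, y_obs, x_obs) - y_obs))

# Predictions at unobserved locations fill in the curve between the data
x_new = np.linspace(0.0, 4.5, 10)
y_new = gp_mean(x_obs, y_obs, x_new)
```

Here `fidelity_gap` is zero up to floating-point rounding, which is what perfect fidelity means in practice.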

As suggested in the figure above, there is no unique approximation that can be used to make predictions from given data; there are in fact infinitely many [1] predictive curves, even if we demand perfect fidelity.  How, then, can we decide which of them should be used to make predictions?  To do this, we use a mechanism called regularization [2], which gives us a criterion for distinguishing between interpolants with the same fidelity.  For this setting, the regularity of a prediction curve is defined as how “calm” it is; in the figure below, we depict predictions with a variety of regularities.  In general, curves with high regularity are preferable because their predictions are better behaved and less prone to erratic swings.

Figure 2: On the left are two interpolants and on the right are two approximations to the same data we saw earlier.  In both pictures, the pink curve has greater regularity than the black curve.

So, let’s rehash where we are:

We can produce infinitely many interpolating predictions to observed data.  Each of these has perfect fidelity.

We can (somehow) measure the regularity of a prediction curve.  We prefer high regularity curves to prevent erratic predictions.

High fidelity (predictions closely follow the data) and high regularity (predictions are not erratic) are not related.

Based on what we have seen thus far, one strategy to make effective predictions is to create an interpolant with high regularity.  Luckily, Gaussian processes (which interpolate observed data) automatically maximize a key measurement [3] of regularity, making them a great tool for this job.  Unfortunately, Gaussian process predictions are not appropriate in every situation.  We will see below that perfect fidelity is not always desirable, requiring a modification from interpolation to make viable predictions.
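As a rough illustration of the regularity claim, the sketch below compares two interpolants of the same data using the reproducing kernel Hilbert space norm mentioned in footnote [3], where a smaller norm corresponds to higher regularity.  For a kernel interpolant through data y with kernel matrix K, the squared norm is y^T K^{-1} y.  The Gaussian kernel, the length scale, the data values, and the extra "bump" point are assumptions made up for this example.

```python
import numpy as np

def gaussian_kernel(x, z, length_scale=1.0):
    """Squared-exponential covariance between 1D point sets x and z."""
    diff = x[:, None] - z[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def rkhs_norm_squared(x, y, length_scale=1.0):
    """Squared RKHS norm of the kernel interpolant through (x, y): y^T K^{-1} y."""
    K = gaussian_kernel(x, x, length_scale)
    return float(y @ np.linalg.solve(K, y))

# Hypothetical observations (made up for this example)
x_obs = np.array([0.0, 1.0, 2.5, 3.0, 4.5])
y_obs = np.array([1.0, 0.2, 0.8, 1.5, 0.3])

# A second interpolant of the same data, additionally forced through
# an arbitrary bump at x = 3.75 where nothing was observed
x_bump = np.append(x_obs, 3.75)
y_bump = np.append(y_obs, 3.0)

norm_gp = rkhs_norm_squared(x_obs, y_obs)      # the Gaussian process interpolant
norm_bump = rkhs_norm_squared(x_bump, y_bump)  # still interpolates, but less calm
```

Both curves pass through all five observations, yet `norm_bump` exceeds `norm_gp`: any interpolant that departs from the Gaussian process prediction at an unobserved location pays for it with a larger norm.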

One important reason that we modify our Gaussian process predictions is that SigOpt customers often make observations in the presence of uncertainty.  We refer to such observations as noisy [4] data, where any observed value has some variance and cannot be fully trusted.  The figure below introduces the idea that measurements can be accompanied by an estimate of variance, which indicates that an observed value is only a guide to the possible values at that location.

Figure 3: Here we consider the same observations as before, but now those observations have some uncertainty.  The blue bars should be determined by whoever made the observations.  Because of the variance in the observed data, it may not be clear whether a higher fidelity (orange) or higher regularity (purple) approximation is preferable.
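One common way to account for this kind of uncertainty, sketched below, is to add each observation's variance to the diagonal of the kernel matrix before solving for the prediction weights.  This is a standard technique and only an assumption about how such variances might be used; the kernel, the data, and the variance values are invented for the example.

```python
import numpy as np

def gaussian_kernel(x, z, length_scale=1.0):
    """Squared-exponential covariance between 1D point sets x and z."""
    diff = x[:, None] - z[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_mean_noisy(x_obs, y_obs, variances, x_new, length_scale=1.0):
    """Gaussian process prediction when each observation has its own variance."""
    K = gaussian_kernel(x_obs, x_obs, length_scale) + np.diag(variances)
    weights = np.linalg.solve(K, y_obs)
    return gaussian_kernel(x_new, x_obs, length_scale) @ weights

# Hypothetical observations and variances (the "blue bars"), made up here
x_obs = np.array([0.0, 1.0, 2.5, 3.0, 4.5])
y_obs = np.array([1.0, 0.2, 0.8, 1.5, 0.3])
variances = np.array([0.01, 0.04, 0.09, 0.04, 0.01])

# With nonzero variances the prediction no longer reproduces the data exactly
residuals = gp_mean_noisy(x_obs, y_obs, variances, x_obs) - y_obs

# With zero variances we recover the interpolant
exact = gp_mean_noisy(x_obs, y_obs, np.zeros(5), x_obs) - y_obs
```

The nonzero `residuals` show fidelity being traded away: the larger the variance attached to an observation, the less the prediction is forced to match it.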

When dealing with noisy data (as almost all of us are) the fidelity of predictions can be a tricky issue.  On one hand we want to respect the observed results, if for no other reason than because they were expensive to obtain and it is hard to justify an expensive experiment if the results are unused.  On the other hand, for an observation with uncertainty we should want to predict the most likely value at that location, not simply the value that we happened to observe.  Below we see a figure that depicts this conundrum; in particular, as more noisy data is included, interpolating predictions can have unreasonable oscillations.  In our search for optimal parameters, these oscillations could be misleading and cause unnecessary experimentation, thereby costing customers time and resources.

Figure 4: As more noisy data is observed (the new green points) we see the true value of higher regularity.  Compare the left and right images and notice that slight differences in the new values yield significantly different orange interpolants.  In contrast, the high regularity purple predictions have less fidelity but are much less affected by these small differences.
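The sensitivity described in this figure can be reproduced in a small experiment: change one new observation slightly and measure how far the whole prediction curve moves.  The sketch below is only illustrative; the kernel, the data, the size of the perturbation, and the noise variance of 0.25 are assumptions, and the data are not those plotted in the figure.

```python
import numpy as np

def gaussian_kernel(x, z, length_scale=1.0):
    """Squared-exponential covariance between 1D point sets x and z."""
    diff = x[:, None] - z[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def predict(x_obs, y_obs, x_new, noise_variance):
    """Kernel prediction; noise_variance = 0 gives the interpolant."""
    K = gaussian_kernel(x_obs, x_obs) + noise_variance * np.eye(len(x_obs))
    return gaussian_kernel(x_new, x_obs) @ np.linalg.solve(K, y_obs)

# Five original observations plus two hypothetical "new" points (last two)
x_obs = np.array([0.0, 1.0, 2.5, 3.0, 4.5, 1.25, 3.75])
y_left = np.array([1.0, 0.2, 0.8, 1.5, 0.3, 0.5, 0.9])
y_right = y_left.copy()
y_right[-1] += 0.2  # a slight difference in one new value

grid = np.linspace(0.0, 4.5, 91)  # evaluation points, including x = 3.75

def sensitivity(noise_variance):
    """Largest change in the prediction curve caused by the perturbation."""
    change = (predict(x_obs, y_right, grid, noise_variance)
              - predict(x_obs, y_left, grid, noise_variance))
    return float(np.max(np.abs(change)))

sens_interp = sensitivity(0.0)   # perfect fidelity
sens_smooth = sensitivity(0.25)  # fidelity relaxed in favor of regularity
```

The interpolant must move by at least the full perturbation, since it has to pass through the changed value, while the regularized prediction moves by less.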

Any attempt to predict unobserved outcomes from noisy data requires a balance between fidelity and regularity: a balance between believing your eyes and recognizing that they may lie to you [5].  Choosing this balance appropriately is itself an interesting problem which may be addressed with, e.g., cross-validation.  SigOpt provides customers the opportunity to define uncertainty with their observations, and using this knowledge we can balance observed results against their variance to make predictions and identify the true behavior behind the uncertainty. By making this tradeoff appropriately we can guide you to better results with less trial and error.  Sign up today for a free trial to help you cut through the noise and get to better results.
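For readers curious about the cross-validation idea mentioned above, here is a minimal leave-one-out sketch for choosing how much fidelity to give up.  Everything in it is an assumption for illustration: the synthetic sine data, the Gaussian kernel, the candidate regularization values, and the use of a single constant added to the kernel diagonal.  It is not a description of SigOpt's own method.

```python
import numpy as np

def gaussian_kernel(x, z, length_scale=1.0):
    """Squared-exponential covariance between 1D point sets x and z."""
    diff = x[:, None] - z[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def predict(x_obs, y_obs, x_new, regularization):
    """Kernel prediction; larger regularization means lower fidelity."""
    K = gaussian_kernel(x_obs, x_obs) + regularization * np.eye(len(x_obs))
    return gaussian_kernel(x_new, x_obs) @ np.linalg.solve(K, y_obs)

def loo_error(x, y, regularization):
    """Mean squared error when each point is predicted from all the others."""
    errors = []
    for i in range(len(x)):
        keep = np.arange(len(x)) != i
        pred = predict(x[keep], y[keep], x[i:i + 1], regularization)[0]
        errors.append((pred - y[i]) ** 2)
    return float(np.mean(errors))

# Synthetic noisy data (hypothetical): a sine curve plus random noise
rng = np.random.default_rng(0)
x = np.linspace(0.0, 4.5, 15)
y = np.sin(x) + 0.2 * rng.standard_normal(x.size)

# Score each candidate balance and keep the one that predicts held-out points best
candidates = [1e-6, 1e-4, 1e-2, 1e-1, 1.0]
scores = {mu: loo_error(x, y, mu) for mu in candidates}
best = min(scores, key=scores.get)
```

Small values of the regularization parameter approach interpolation, while large values favor calm curves; the leave-one-out score gives one data-driven way to choose between them.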

[1]: Being able to guarantee an interpolant involves the property of unisolvency, which is important but beyond the scope of this post.

[2]: Regularization is a broadly used term in mathematics.  In its most general sense, a regularized form of a problem is needed for a problem where a unique solution does not exist.  To fully understand that though, we would need to discuss the nontrivial topics of existence and uniqueness, which would require their own blog post.  For this post, it is sufficient to think of regularization as a rephrasing of the interpolation problem into a more general approximation framework, one which balances fidelity and regularity.

[3]: This measurement often involves a norm, which is also important but beyond the scope of this post.  The norm that I refer to as “a key measurement” is the reproducing kernel Hilbert space norm.

[4]: I have stumbled across definitions of noisy data which equate it to meaningless data, a comparison which I find grossly misleading.  One could say that corrupted data is meaningless: consider measurements of a person’s height which were accidentally replaced by their number of toes as an example of corrupt, and meaningless, data.  But most observations have some degree of variance/noise (thank you Heisenberg) and that has not prevented us from using data to make cars, phones and computers work.  When you think of noisy data, I hope you do not think of it as meaningless; it is simply data in the presence of uncertainty.

[5]: Designing this balancing strategy can be more crucial than finding the approximation that fits it.  Chapters 15 and 18 of my recent book provide some discussion and citations on common strategies with cleanly defined predictions.

Mike studies mathematical and statistical tools for interpolation and prediction. Prior to joining SigOpt, he spent time in the math and computer science division at Argonne National Laboratory and was a visiting assistant professor at the University of Colorado-Denver where he co-wrote a text on kernel-based approximation. Mike holds a PhD and MS in Applied Mathematics from Cornell and a BS in Applied Mathematics from Illinois Institute of Technology.
Mike McCourt, PhD