This is the first of three blog posts in which we explore the concept of uncertainty - or noise - and its implications for Bayesian optimization. This is part of our series of blog content on research that informs our product and methodologies.
A Stroll to Work
In everyday life we encounter uncertainty; for example, the trip to work might take between 15 and 30 minutes instead of a consistent 22 minutes each day. The exact time depends on a number of unpredictable factors: unexpected rain could cause a traffic jam, your train might be delayed because of a system failure, or you could simply walk at an inconsistent pace.
Of course, the idea that uncertainty exists is neither surprising nor intrinsically problematic. The complication from uncertainty appears when trying to predict the transit time so as to plan your morning out. If a 15 minute commute were predicted, planning a meeting in 15 minutes might be reasonable. Should the actual commute time turn out to be 30 minutes, the meeting may be concluded prior to arrival.
In this blog post, we briefly consider why uncertainty requires attention during modeling, and how predictions from such models will vary. In particular, when uncertainty is present, reliable predictions are those that follow the average, or expected, behavior of the process under observation, without chasing the uncertainty present in the observations. We also mention some ML-specific issues and common practices for building trustworthy models.
Sources of Uncertainty
As suggested earlier, one goal of better understanding the commute time is to be able to make predictions using that understanding. In that context, it can be beneficial to consider two possible sources of uncertainty.
- Intrinsic uncertainty - The uncertainty fundamentally present in the situation. This may result from uncertainty in the traffic density or the weather, potential car crashes, police activity, etc.(1) Another name for this might be process uncertainty.
- Observation uncertainty - The uncertainty which occurs during the measurement of the situation. For the commute time situation, looking at a mobile phone at the beginning and end of a trip would yield accuracy on the order of 1 minute; using a stopwatch might improve the accuracy to 1 second. Another name for this might be noise.
When recording outcomes, it is generally impossible(2) to distinguish between these two (though appropriate calibration of measurement equipment can help understand observation uncertainty). For modeling purposes, decomposing uncertainty into components can be beneficial.
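As a minimal sketch of why the two sources are hard to tell apart from recorded outcomes alone, the simulation below assumes (purely for illustration) that a recorded commute time is an average plus two independent Gaussian terms. The function name `simulate_commutes`, the 22 minute mean, and the standard deviations are hypothetical choices, not measurements.

```python
import numpy as np

def simulate_commutes(n_days, intrinsic_sd, observation_sd, mean_minutes=22.0, seed=0):
    """Simulate recorded commute times as mean + intrinsic + observation terms.

    All numbers here are illustrative assumptions, not real measurements.
    """
    rng = np.random.default_rng(seed)
    intrinsic = rng.normal(0.0, intrinsic_sd, size=n_days)      # traffic, weather, ...
    observation = rng.normal(0.0, observation_sd, size=n_days)  # clock-reading error
    return mean_minutes + intrinsic + observation

# Two different splits with the same total variance: 4^2 + 3^2 == 5^2 + 0^2.
recorded_a = simulate_commutes(100_000, intrinsic_sd=4.0, observation_sd=3.0, seed=1)
recorded_b = simulate_commutes(100_000, intrinsic_sd=5.0, observation_sd=0.0, seed=2)

# The recorded times alone show nearly the same spread in both cases.
print(round(float(recorded_a.std()), 2), round(float(recorded_b.std()), 2))
```

Under these assumptions both runs produce recorded times with a standard deviation near 5 minutes, so the recordings by themselves do not reveal how the uncertainty was split.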
Uncertainty in Approximation/Regression
In one of our previous blog posts, we talked about approximation of data, both with and without uncertainty. We revisit this now to see the potential impact of modeling the sources of uncertainty distinctly.
Imagine that on each of 18 days we measured the time required for us to walk to work; each day we leave 12 minutes later than the day before, starting at 6:30am. The results could be plotted in a graph like the one below.
Figure 1: We have walked to work several consecutive days and recorded the transit times as a function of when we departed. This data is interpolated by the simplest strategy possible(3), connect-the-dots.
In a deterministic world, it is perfectly reasonable to require our model to interpolate previously observed data. After all, absent any randomness, all observed results would recur in subsequent observations and an appropriate model should not want to contradict those observations.
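The connect-the-dots strategy can be sketched with a piecewise-linear interpolant. The departure schedule follows the description above; the transit times are synthetic stand-ins (the recorded values are not reproduced in this post), and `connect_the_dots` is a hypothetical helper name.

```python
import numpy as np

# Departure times in minutes after midnight: 18 days, starting at 6:30am,
# each day 12 minutes later than the day before.
departures = 6 * 60 + 30 + 12 * np.arange(18)

# Hypothetical transit times in minutes; these are synthetic values used
# only to have something to interpolate.
rng = np.random.default_rng(0)
transit = 22 + 4 * np.sin((departures - 390) / 60.0) + rng.normal(0, 1.5, size=18)

def connect_the_dots(query_departure):
    """Piecewise-linear interpolant through every observed point."""
    return np.interp(query_departure, departures, transit)

# At an observed departure time the model returns the observed transit time;
# in between (e.g. 7:00am) it predicts along the straight line joining neighbors.
print(connect_the_dots(departures[0]), transit[0])
print(connect_the_dots(7 * 60))
```

By construction this model reproduces every observation exactly, which is the behavior one would want only if the observations contained no randomness.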
As is the theme of the post, though, these observations were recorded with some uncertainty: each day will have a different transit time caused by random factors, and our observation accuracy is limited (off by as much as 1 minute). Thus, before considering the development of a model, we must ask the following question: “What is the purpose of this model?”
Our purpose is to be able to determine the duration of the commute to work given the time, so, if we recognize that randomness contributed to our observations, we might want a model that does not actually “pass through” the observed outcomes. The figure below gives a number of such models.
Figure 2: We have fit a selection of possible models of the observed transit times, one in each plot. The blue lines represent possible predicted outcomes, to be used for estimating the time of a future trip. Because these blue lines have some transparency, the more heavily blue regions are predicted to be more likely to occur.
Clearly, the models presented above have a wide variety of possible behaviors. In the top left, the model very closely follows the connect-the-dots plot from earlier; in the bottom right, the model seems to be nearly flat and apathetic to the observations. Of course, none of these models is "right", but some of them are more likely than others, given certain assumptions that could be made regarding the process of walking to work. Additionally, a desire for good predictions from such a statistical model is often formalized in the bias-variance tradeoff to prevent overfitting to randomness.
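One way to see the tradeoff is to fit models of increasing flexibility and compare the error on the observed trips with the error on fresh trips. The sketch below uses least-squares polynomials as a stand-in for the models in the figure (the post does not say which models were used), and the data-generating function `noisy_commute` is a hypothetical assumption.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def noisy_commute(departure_hours, rng):
    """Hypothetical process: a smooth trend plus random variation (in minutes)."""
    trend = 22 + 4 * np.sin(departure_hours - 6.5)
    return trend + rng.normal(0, 1.5, size=departure_hours.shape)

x_train = 6.5 + 0.2 * np.arange(18)       # 18 departures, 12 minutes apart
y_train = noisy_commute(x_train, rng)
x_test = np.linspace(6.6, 9.8, 200)       # fresh trips, observed separately
y_test = noisy_commute(x_test, rng)

def rmse(model, x, y):
    """Root-mean-square error of the model's predictions, in minutes."""
    return float(np.sqrt(np.mean((model(x) - y) ** 2)))

results = {}
for degree in (0, 1, 3, 9):
    model = Polynomial.fit(x_train, y_train, deg=degree)
    results[degree] = (rmse(model, x_train, y_train), rmse(model, x_test, y_test))
    print(degree, results[degree])
```

The error on the observed trips can only shrink as the degree grows, since each model contains the previous one. The error on fresh trips carries no such guarantee, which is why it, rather than the training error, is the quantity to watch when choosing among models.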
In the context of creating models of appropriate complexity for a given purpose, the scale at which predictions are desired is relevant. The figure below shows data for which different goals can yield different models. In particular, stating the expected certainty with which measurements can be made (or with which predictions must be made) can help inform the modeling (and data collection) process and potentially any subsequent statistical analysis (discussed here in a case study in an adjacent context).
Figure 3: Data is observed at random times from a voltmeter. Left: If the goal is to generate a model for this data on the order of 1 V, then the simplest strategy is to fit a constant model and assume variations are only observation uncertainty. Right: If the goal is to generate a model on the order of 0.01 V, then a more interesting model may be warranted.
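The idea in the caption can be sketched as a simple check: does a constant model already describe the readings to within the resolution we care about? The readings below are synthetic, and `constant_model_suffices` is a hypothetical helper, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical voltmeter readings taken at random times (in seconds).
times = np.sort(rng.uniform(0.0, 10.0, size=50))
volts = 5.0 + 0.05 * np.sin(times) + rng.normal(0.0, 0.002, size=times.size)

def constant_model_suffices(readings, resolution):
    """True if every reading is within `resolution` of the mean reading."""
    residuals = readings - readings.mean()
    return bool(np.max(np.abs(residuals)) <= resolution)

print(constant_model_suffices(volts, resolution=1.0))    # goal on the order of 1 V
print(constant_model_suffices(volts, resolution=0.01))   # goal on the order of 0.01 V
```

With these assumed readings the constant model is adequate at the 1 V scale, while at the 0.01 V scale the small oscillation is no longer negligible and a richer model is worth considering.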
Uncertainty in Machine Learning
In the context of discriminative(4) machine learning for classification, uncertainty plays a significant role. The bias-variance tradeoff must still be considered, but it can be less easily analyzed (see this article for a discussion on the topic). The sense of accuracy or fidelity of the model is often complicated by three sources of uncertainty.
- How representative are the individuals comprising the dataset of the actual population of interest? Any finite dataset is just an approximation of the true population upon which we hope to make predictions. An appropriate model should have a good generalization capacity.
- Much as was described in the approximation discussion above, actual measurements are subject to uncertainty (someone’s height, the temperature, etc).
- Even the approximation situation above will have inaccuracy in the desired outcome on the order of at least machine precision, if not much higher. If a method such as stochastic gradient descent is used to fit a model, the fitted model will depend on random choices made during training and thus will itself have some amount of uncertainty.
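The last point can be illustrated with a small experiment: fit the same fixed dataset several times with stochastic gradient descent, changing only the random order in which examples are visited. This is a minimal sketch on a hypothetical one-dimensional regression problem (chosen for brevity rather than a classifier); `fit_sgd`, the learning rate, and the number of epochs are illustrative assumptions.

```python
import numpy as np

# Fixed, hypothetical dataset: y = 2x + 1 plus measurement noise.
data_rng = np.random.default_rng(0)
x = data_rng.uniform(0.0, 1.0, size=200)
y = 2.0 * x + 1.0 + data_rng.normal(0.0, 0.1, size=x.size)

def fit_sgd(x, y, seed, learning_rate=0.1, epochs=20):
    """Fit y ~ w*x + b by plain stochastic gradient descent on squared error."""
    rng = np.random.default_rng(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for i in rng.permutation(x.size):      # random visiting order
            error = (w * x[i] + b) - y[i]
            w -= learning_rate * error * x[i]  # gradient of 0.5 * error**2 w.r.t. w
            b -= learning_rate * error         # gradient of 0.5 * error**2 w.r.t. b
    return w, b

# Same data, same hyperparameters; only the seed (visiting order) changes.
fits = [fit_sgd(x, y, seed) for seed in range(5)]
slopes = [w for w, _ in fits]
print(slopes)
```

The fitted slopes all land near the underlying value of 2 but do not agree exactly with one another, so the training procedure contributes its own uncertainty on top of the uncertainty in the data.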
Numerous strategies exist to keep classification models from overfitting to this uncertainty. Some of them have a theoretical genesis (such as adding a penalty to the residual/loss function for models which are too complex(5), e.g., Tikhonov regularization). Others have a very practical motivation (such as using dropout, which deliberately miscomputes quantities during training and eventually learns a better model). A method such as cross-validation may have both practical motivations and theoretical support. For probabilistic models, Bayesian decision theory might suggest the model evidence be used to prefer simpler models when possible. This website provides a good writeup and literature review for mining noisy data.
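Two of these strategies can be sketched together: Tikhonov regularization and cross-validation to choose the strength of the penalty. For brevity the sketch uses linear least squares rather than a classifier, and the data, the helper names `fit_ridge` and `cross_validated_error`, and the list of candidate penalties are all hypothetical.

```python
import numpy as np

def fit_ridge(X, y, penalty):
    """Tikhonov-regularized least squares: minimize ||Xw - y||^2 + penalty * ||w||^2."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + penalty * np.eye(n_features), X.T @ y)

def cross_validated_error(X, y, penalty, n_folds=5):
    """Mean squared error on held-out folds, averaged over the folds."""
    folds = np.array_split(np.arange(len(y)), n_folds)
    errors = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(y)), held_out)
        w = fit_ridge(X[train], y[train], penalty)
        errors.append(np.mean((X[held_out] @ w - y[held_out]) ** 2))
    return float(np.mean(errors))

# Hypothetical data: 40 noisy observations, 10 features, only 2 of which matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 10))
true_w = np.zeros(10)
true_w[:2] = [3.0, -2.0]
y = X @ true_w + rng.normal(0.0, 1.0, size=40)

penalties = [0.01, 0.1, 1.0, 10.0, 100.0]
norms = [float(np.linalg.norm(fit_ridge(X, y, p))) for p in penalties]
best_penalty = min(penalties, key=lambda p: cross_validated_error(X, y, p))
print(norms, best_penalty)
```

A larger penalty always shrinks the norm of the fitted weights, which is the sense in which it prefers simpler models; cross-validation then picks the penalty based on held-out error rather than on how closely the training data is matched.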
In an interesting twist, adding noise/mistakes to datasets (data augmentation) can actually improve the performance of models trained on them. On one hand, it can increase the size of the training data (now including both the original observation as well as several copies with perturbed features). On the other hand, as shown notably in analyzing ImageNet, data augmentation can also produce models which are less prone to overfitting the actual data present in the dataset.
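The first effect, enlarging the training data with perturbed copies, can be sketched in a few lines. This is a minimal illustration with additive Gaussian noise on the features; `augment_with_noise`, the number of copies, and the noise level are hypothetical choices, and image pipelines typically use other perturbations such as crops or flips.

```python
import numpy as np

def augment_with_noise(features, labels, n_copies=3, noise_sd=0.05, seed=0):
    """Return the original rows plus `n_copies` noisy copies of each row.

    Labels are repeated unchanged; only the features are perturbed.
    """
    rng = np.random.default_rng(seed)
    noisy_copies = [
        features + rng.normal(0.0, noise_sd, size=features.shape)
        for _ in range(n_copies)
    ]
    augmented_features = np.concatenate([features, *noisy_copies], axis=0)
    augmented_labels = np.tile(labels, n_copies + 1)
    return augmented_features, augmented_labels

# Hypothetical toy dataset: 4 examples with 2 features each.
features = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
labels = np.array([0, 1, 1, 0])
aug_features, aug_labels = augment_with_noise(features, labels)
print(aug_features.shape, aug_labels.shape)  # prints "(16, 2) (16,)"
```

The original observations are kept as the first rows, so the augmented set contains the real data alongside the perturbed copies.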
Here at SigOpt, our customers live in the real world. Data in the real world has uncertainty: surveys have uncertainty, user opinions have uncertainty, political polls have uncertainty, even an adult’s height could have uncertainty. Rather than being crippled by this uncertainty, we want to help our customers produce models which are robust enough that they can be used to make effective decisions based on reliable predictions.
We have discussed the need to consider the uncertainty present in data when producing models. In both approximation/regression and classification, building statistical models which are accurate and well-behaved will yield better predictions. Our following post, the second in the uncertainty series, focuses on the implications of uncertainty in Bayesian optimization, a key engine powering SigOpt's enterprise solution. We hope that you will continue on next week to learn how we benefit from effective modeling in the presence of uncertainty.
(1) An interesting topic to consider is the tradeoff between deterministic and stochastic modeling. We discussed this briefly in this post, and will not delve any further here. However, we use this opportunity to mention the concept of a filtration. This is a mathematical mechanism for defining things that are known and/or measurable (related to being adapted) prior to an outcome being recorded/realized. In this example, before leaving for work we may know the weather but may not know if a bank robbery is in progress and thus would be unable to account for the police presence outside the bank. Short of philosophical discussions, some amount of uncertainty is always present in practical modeling.
(2) The inability to distinguish between sources of uncertainty may seem unsettling. Part of the reason is that the interaction between the trend or expected value of a process, intrinsic uncertainty and observation uncertainty can be arbitrarily complicated. But even assuming a simple additive structure, separating the components would still be infeasible. Consider the number 5; if told that three numbers x, y, z were summed to recover this number, we could not uniquely identify x, y, z. Perhaps x=1, y=1, z=3 or x=0, y=0, z=5, or x=-10, y=15, z=0.
(3) Some may argue that a piecewise constant function would be the simplest interpolant of this data, but we do not believe that a child, presented with those dots, would implement the nearest neighbor strategy.
(4) Generative and unsupervised machine learning must also consider the impact of uncertainty, but the factors at play are too numerous to consider here. As such, we limit ourselves to only the discriminative case for this post.
(5) A discussion of complexity is, ironically, too complex to have here. At its most abstract, it involves the concept of sample complexity and the VC dimension. It has a very practical impact on model selection, but can be difficult to quantify in a general sense. For linear regression, complexity might be related to the model weights. For an SVM, it might be related to the number of support vectors. For a neural network, it is perhaps the number of layers or nodes, the model weights, or any number of other things.