By Kasirat Turfi Kasfi

When I sat down to brainstorm a topic for the DEEP blog, I asked myself: what is the one thing that interests me and the rest of the DEEP members? Data, right! We all work with data, and we want to find meaning and patterns in it. We want to be able to make inferences, make decisions, or make predictions from our data. We do that by using statistical learning methods.

I believe that most of us are familiar with one or more statistical learning models, and there seem to be quite a few of them out there! It can be difficult to decide which one to use, or to know which one is best. There is no single method that fits all kinds of data perfectly, no one method “to rule them all” (if only Tolkien had invented the models!). I would therefore like to explain, intuitively, the bias-variance analysis, which helps us understand whether a model will capture the true pattern in the “seen” data and also generalize well to “unseen” data, and thus helps us select the best model for the given data.

Before diving into what the “bias-variance trade-off” means, let’s get a little background on “statistical learning”. Any supervised statistical learning model tries to find a hypothesis function ħ (i.e. a mathematical representation) that approximates the relationship between the predictors (independent variables) and the response (dependent variable). How well the hypothesis function ħ fits the given data is measured by the error E between the output of ħ and the output of the target function ƒ that describes the data. It is called an error because it represents the gap between the hypothesis function and the target function. The smaller the error E, the better the learning, meaning that the hypothesis has approximated the target function ƒ well. There are therefore two objectives: a) finding a good approximation of the target function ƒ, and b) having that approximation hold for out-of-sample data.

A more complex (meaning bigger) hypothesis set has a better chance of approximating the target function, because it is more likely to contain the target function, but it becomes increasingly hard to find that needle in the haystack! On the other hand, a simpler (smaller) hypothesis set may not contain the target function, but if it luckily does, the function is easier to find. To find the best candidate function, we must navigate the hypothesis set using the sample data provided, which is our only resource for choosing one hypothesis over another.
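
To make the error E a little more concrete, here is a minimal sketch of measuring the gap between a hypothesis and a target as a mean squared error. Everything specific in it is an assumption made for illustration: the target sin(πx) on [-1, 1], the candidate line with slope 0.8, and the helper names are not taken from the article.

```python
import numpy as np

def target(x):
    """Assumed target function f for illustration: a sinusoid."""
    return np.sin(np.pi * x)

def hypothesis(x, slope=0.8, intercept=0.0):
    """A hypothetical candidate h picked from a linear hypothesis set."""
    return slope * x + intercept

def mean_squared_error(h, f, x):
    """Average squared gap between h and f over the points x."""
    return float(np.mean((h(x) - f(x)) ** 2))

# Evaluate the gap on a dense grid over the assumed domain [-1, 1].
x = np.linspace(-1.0, 1.0, 1001)
error = mean_squared_error(hypothesis, target, x)
perfect = mean_squared_error(target, target, x)

print(f"E(h, f) = {error:.3f}")    # positive: h only approximates f
print(f"E(f, f) = {perfect:.3f}")  # zero: f matches itself exactly
```

The squared difference is one common choice of error measure; other measures would work in the same way.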

Bear with me, I will soon get to an explanation of what I am talking about with an illustration! Just two more paragraphs to go!

Now, back to bias and variance: these two quantities are inherent properties of a learning model. Mathematically, when the error term E is decomposed, we get bias and variance [2]. Simply put, the trade-off is between approximation and generalization, between bias and variance. The total error E measures how far the hypothesis function ħ learned from the given data is from the target function ƒ. Of the two components, bias is a measure of approximation ability: it measures how far the best approximation available in the hypothesis set H is from the target function ƒ. Variance measures how far the hypothesis function ħ learned from a particular data set is from that best possible candidate in H. The hypothesis chosen from H depends on the data provided, so a different data set will lead to a different choice of ħ. (This dependency is very important in the bias-variance analysis.)
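
For readers who like to see it written out, one common way to express this decomposition for a noiseless target is sketched below. The notation is an assumption for illustration: D is a data set, ħ^(D) is the hypothesis learned from D, and the average hypothesis over data sets stands in for the "best" candidate in H. When the responses are noisy, an additional irreducible error term appears on the right-hand side.

```latex
\mathbb{E}_{D}\!\left[\mathbb{E}_{x}\!\left[\big(\hbar^{(D)}(x) - f(x)\big)^{2}\right]\right]
= \underbrace{\mathbb{E}_{x}\!\left[\big(\bar{\hbar}(x) - f(x)\big)^{2}\right]}_{\text{bias}}
+ \underbrace{\mathbb{E}_{x}\!\left[\mathbb{E}_{D}\!\left[\big(\hbar^{(D)}(x) - \bar{\hbar}(x)\big)^{2}\right]\right]}_{\text{variance}},
\qquad
\bar{\hbar}(x) = \mathbb{E}_{D}\!\left[\hbar^{(D)}(x)\right]
```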

The trade-off is that if the bias goes up, the variance goes down, and if the bias goes down, the variance goes up. If the hypothesis set H gets bigger, the bias gets smaller because H gets closer to ƒ, but there is then a greater variety of functions to choose from, which increases the variance. Below is the graph [1] of the error E and the relationship between bias and variance. This relationship is a general one: it does not depend on a particular data set, and it applies to statistical learning models broadly. Model complexity here corresponds to the size of the hypothesis set: a bigger set holds functions of greater complexity.

Finally, if you are still with me, here is an example with illustrations!

Let me explain this trade-off using an example. Imagine that we have to find a target function ƒ. In real life the target function is what we try to find by learning, but here, for the purpose of illustration, let’s assume we know it. Assume the target function ƒ is a sinusoid (displayed as an orange line in the figures throughout the rest of the article). Our objective is to find the best approximation of the target function ƒ given some data points. Also assume that we are using two hypothesis sets, H0 and H1, where H0 is the constant model and H1 is the linear model. Again, for illustration purposes, assume that we only have these two hypothesis sets, to keep things simple! Both hypothesis sets will give an approximating function, and we will compare which one is better using the bias-variance analysis.

We start off with approximation, before doing any learning. The H0 hypothesis set contains only constants, and the H1 hypothesis set contains only lines. In the two graphs below, the light grey shaded regions represent all the possible functions (constants and lines) that the H0 and H1 models can generate from the range of data points available. The shapes of the grey shaded regions are a result of the model complexity and the available data points. The olive lines are the means of the functions in each hypothesis set, representing the best that each hypothesis set can do.

The linear model will choose the line that gets as close as it possibly can to the data points. The constant model is better off choosing zero, because the error is squared and the sinusoid is symmetric about zero. As expected, we can see from the figures below (the shaded areas show the errors) that the linear model is clearly the winner: it is the better approximation, as its error is the smaller of the two. In fact, for the constant model, all of it is error!
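
The approximation step can be checked numerically with a small sketch. It assumes, purely for illustration, that the sinusoid is sin(πx) on the interval [-1, 1] and that "best" means least squares over a dense grid; neither detail is stated in the article.

```python
import numpy as np

# Dense grid standing in for "all the data" on the assumed domain [-1, 1].
x = np.linspace(-1.0, 1.0, 2001)
f = np.sin(np.pi * x)  # assumed target function

# Best constant in H0: the least-squares constant is the mean of the target.
best_constant = f.mean()

# Best line in H1: least-squares fit of a degree-1 polynomial.
# np.polyfit returns coefficients from highest degree to lowest.
slope, intercept = np.polyfit(x, f, deg=1)

# Mean squared approximation error of each best candidate.
err_constant = np.mean((f - best_constant) ** 2)
err_line = np.mean((f - (slope * x + intercept)) ** 2)

print(f"best constant: {best_constant:.3f}, error: {err_constant:.3f}")
print(f"best line: {slope:.3f} * x + {intercept:.3f}, error: {err_line:.3f}")
```

Under these assumptions the best constant comes out at zero with an error close to 0.5, while the best line has an error close to 0.2, which agrees with the linear model being the better approximation.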

Now, let’s look at the generalization ability of the two models. Let’s say we have only two example data points (yes, we are stingy!) that we will use to approximate the sinusoid using the H0 and H1 hypothesis sets. The first figure below shows the points. The second shows the points with the target function. The third shows the points fitted with the functions from each hypothesis set that best fit the data points provided.

Let’s bring back the target function and see how much error we get for the constant and for the line. As expected, the error is smaller for the linear model.
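
Fitting two points and comparing against the target can be sketched as follows. The two sample locations (x = 0.2 and x = 0.7), the target sin(πx), and the domain [-1, 1] are arbitrary assumptions for illustration; the points used in the article's figures are not known.

```python
import numpy as np

def target(x):
    """Assumed target function: a sinusoid on [-1, 1]."""
    return np.sin(np.pi * x)

# Two hypothetical training points sampled from the target.
x_train = np.array([0.2, 0.7])
y_train = target(x_train)

# H0: the least-squares constant for two points is their average height.
constant = y_train.mean()

# H1: the line passing exactly through both points.
slope = (y_train[1] - y_train[0]) / (x_train[1] - x_train[0])
intercept = y_train[0] - slope * x_train[0]

# Compare each fitted function with the target over the assumed domain.
x_grid = np.linspace(-1.0, 1.0, 2001)
err_constant = np.mean((target(x_grid) - constant) ** 2)
err_line = np.mean((target(x_grid) - (slope * x_grid + intercept)) ** 2)

print(f"constant: h(x) = {constant:.3f}, error = {err_constant:.3f}")
print(f"line: h(x) = {slope:.3f} * x + {intercept:.3f}, error = {err_line:.3f}")
```

For this particular pair of points the line has the smaller error, but a different pair would give a different constant, a different line, and possibly a different outcome.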

The constant and the line we have here depend on the data set: if we had another two points, the fitted constant and line would be different. From a learning perspective, which model generalizes well? Until now we have been looking at a subset of the data set that defines the target function (the sinusoid). Now, if we look at more data points from the population, we are in for a surprise!

For all the unseen data points stretching infinitely before and after the data points we were working with, the constant model at least makes the right prediction periodically, but the linear model is a complete disaster! The variance error of the constant model stays constant, whereas the variance error of the linear model keeps increasing as more data points are included. In conclusion, the bias-variance analysis helped us figure out that, for this particular target function (a sinusoid), given the available data points and a choice between a constant and a linear model, we are better off choosing the constant model: it is not the best approximation of the target, but it is the model that generalizes best. In this case we traded approximation (bias) for generalization (variance).
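
The whole analysis can also be estimated by simulation. The sketch below is one possible setup, not necessarily the one behind the figures: it assumes the target sin(πx), draws many two-point data sets uniformly from [-1, 1], fits both models to each data set, and estimates bias and variance against the average fitted function. The number of data sets, the seed, and the helper name bias_variance are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_datasets = 10_000

# Grid on which every fitted function is compared with the target.
x_grid = np.linspace(-1.0, 1.0, 201)
f_grid = np.sin(np.pi * x_grid)  # assumed target function

# Each row is one data set of two points sampled from the target.
x = rng.uniform(-1.0, 1.0, size=(n_datasets, 2))
y = np.sin(np.pi * x)

# H0: the constant fitted to each data set is the mean of its two y-values.
const = y.mean(axis=1)
h0 = np.repeat(const[:, None], x_grid.size, axis=1)

# H1: the line through the two points of each data set.
slope = (y[:, 1] - y[:, 0]) / (x[:, 1] - x[:, 0])
intercept = y[:, 0] - slope * x[:, 0]
h1 = slope[:, None] * x_grid[None, :] + intercept[:, None]

def bias_variance(h, f):
    """Estimate bias and variance from fitted functions h (one per row)."""
    h_bar = h.mean(axis=0)                # average hypothesis over data sets
    bias = np.mean((h_bar - f) ** 2)      # average hypothesis vs target
    variance = np.mean((h - h_bar) ** 2)  # spread around the average
    return float(bias), float(variance)

bias0, var0 = bias_variance(h0, f_grid)
bias1, var1 = bias_variance(h1, f_grid)

print(f"H0 (constant): bias={bias0:.2f} var={var0:.2f} total={bias0 + var0:.2f}")
print(f"H1 (linear):   bias={bias1:.2f} var={var1:.2f} total={bias1 + var1:.2f}")
```

Under these assumptions the constant model should show a bias close to 0.5 and a variance close to 0.25, while the linear model has a lower bias but a much larger variance, so its total error ends up higher. That is the same conclusion as above: with only two points, the simpler model generalizes better.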

References and Sources:

[1]

[2] James, G., Witten, D., Hastie, T. and Tibshirani, R., 2013. An Introduction to Statistical Learning (Vol. 112, p. 18). New York: Springer.

[3] All graphs were plotted using matplotlib.pyplot.
