It is often in the case of building a predictive model using an API, we readily take for granted the regularization that comes out of the box for that particular model. It is also the case that people new to machine learning treat regularization as a black box model. Now granted the fact that the mathematics behind regularization is not something every data scientist should know (not every driver has to know how the transmission system works to drive a car…as long as it works) but there are a few not-so-subtle ideas and reasoning that go behind the very cause of having regularization in the first place. In this post I intend to shed some light regarding these concepts which I believe to be crucial knowledge to have for all data scientists (by training or by practice) out there.

Overfitting

Now we all know regularization is performed so that a model doesn’t “overfit”, but what really is overfitting? Is it the exact opposite of underfitting? Let us walk through the answers to these questions without going into too much of mathematics.

Suppose we have a learning model $M$, like any other model it will have a learning algorithm $A$ and a hypothesis set $H$$A$ goes through the available data while picking a hypothesis $h$ from $H$ and finally arrives at a final hypothesis $g$ which best maps the pattern in the given data (according to our choice of $A$). This $g$ can be a straight line equation for a linear regression problem with a single predictor while $H$ is the set of all straight line equations, $g$ can be very complex or extremely simple (such as a constant value) depending upon our choice of $H$. It is this $g$ which tends to overfit the data and the only thing we should be worried about for now. We say $g$ has overfitted the data when it tries to capture the noise present in the training data. Noise is the set of all data points that deviate from the general pattern in the training data. A way to think about is by imagining the sales record of a hypothetical grocery store. Lets assume people generally but a lot of groceries at the beginning of each month and the number of items sold is pretty high for the first week or so. As the days of the month progresses, sales tend to go fall. This is the general pattern in sales of our grocery store. Now every now and then we can expect random spikes in sales caused by advertising or due to discounts on particular items. These spikes would be random and would deviate from the general pattern of sales record. These deviations are called noise and particularly this type of random noise is called “stochastic” noise. It is best not to have our $M$ capture stochastic noise unless there is a pattern in the advertising or discounts (in which case these would not be “noise” any more). When $g$ captures noise for 3 months in which there were decent amount of advertising and discounts on grocery products which stopped in the next month, the model’s prediction will be off by a high margin and the poor grocery store owner would think he is losing customers. As you might have guessed by now, stochastic noise is not the only cause for overfitting. As it turns out, there is another type of noise that is predominant in any kind of data. This noise is called “deterministic” noise. We can understand from the word “deterministic” that this noise is not random, we will even see that this noise is relative to our choice of $H$.

To understand deterministic noise, let us make an assumption that there exists a function $f$ in an arbitrary hypothesis set that gives the best possible fit of the data. In almost all the cases our model will not be good enough such that the models output $g = f$. This is because $f$ is not a member of $H$. Thus while training, our model will overfit parts of the data. This will have very less in sample error (training error) but have large out of sample error (test error). Now I guess some of my readers are thinking that my explanation of deterministic noise is very close to bias or underfitting. Infact the error due to deterministic noise is bias, but is not caused due to underfitting. We say underfitting occurs when there is large in sample error.

Another way to imagine deterministic noise is to imagine we have a data set with absolutely no stochastic noise and there exists $f$, a polynomial of degree 50 and is the perfect fit for the data. Now suppose we have two models, a quadratic model and the other one is a polynomial model of order 10. The quadratic model will have very poor in sample performance whereas the model with a 10th order polynomial may give very good in sample performance (maybe even zero training error). The first case is underfitting and the second case is overfitting as the out of sample performance of both the models will be poor since the true function is of degree 50.

Thus we can say deterministic noise depends upon our choice of the hypothesis set $H$. Assuming there is a $f$ that is a perfect fit of the data (both in sample and out of sample), we can say deterministic noise is the part of $f$  that $H$ cannot capture. We can now write the bias variance trade off in the overfitting regime as:

$Error_{out} = Variance + Bias + stochastic\ noise$

I intentionally didn’t give the derivation here, will put it up if requested.

Preventing Overfitting Without Regularization

The easiest way to avoid overfitting of your model is by increasing the number of training examples. Yes training a model with loads of data is always a good idea (unless the data has many outliers). This is simple and it works because the probability of a datapoint being noisy is less compared to the probability of it being noiseless. But how much data do we need? Will more data ensure that our model will not over fit? Does it take care of both stochastic noise and deterministic noise? Let’s have a look.

We cannot measure noise just by looking at the data. This proves to be the reason for not having a concrete relation between noise and size of the data set. Experiments show that the number of training examples needed to overcome the overfitting due to stochastic noise is directly proportional to the amount of stochastic noise. This poses a dilemma as we can’t tell the number of training examples we need since we don’t know the amount of stochastic noise present in the first place. Deterministic noise behaves a little different, the number of training examples needed is not completely proportional to the level of noise, this noise can be overcome with moderate amount of training examples when our hypothesis set is simple enough. When our hypothesis set consists of complex functions (such a high order polynomial), overcoming deterministic noise would require a larger amount of training examples.

Thus we can conclude that preventing overfitting by increasing the number of training examples is not an ideal solution. This is because we don’t know the amount of stochastic noise present and also because of the fact that more training data means more cost of training (both time and power). We need artificial ways to constraint our model so that it doesn’t overfit.

Regularization

One way of thinking about regularization is that you put an elastic leash on your model so that it doesn’t perfectly adjust it’s parameters or weights to fit all the data points that it sees. This way the model becomes a little imperfect so to speak and generalizes to out of sample data much better. The amount of regularization that is to be applied depends on the model that is being used and the data that is used for training. Let us now have brief overview of regularization techniques used in practice and see if we can find a well defined way to determine the amount of regularization that is to be applied in a particular situation.

There are two approaches one can take when applying regularization to a model. The first approach is the mathematical approach where the problem is well understood and we know the exactly how to regularize the model based on some assumptions (often unrealistic ones). The second approach is intuition or the heuristic approach where the model is way to complex or large (probably both) to know exactly how to regularize the model. The one thing that is common in both the approaches is the idea that in order to regularize the model we have to constraint the parameters or weights somehow. Let us go through some possibilities of weight constraining.

Hard Constraint

Suppose our model uses a polynomial hypothesis set $H_{n}$ where $n$ is the order of the polynomials. Now for a given data set our model gives a hypothesis $h$ which overfits the data. A harsh way to regularize this model would be to set the weights associated with the polynomial terms above order $m$ to $0$. This in effect is equivalent of using $H_{m}$ as the hypothesis set.

Soft Constraint

Another way to constraint the weights would be to fix a maximum threshold $C$ for the sum of the absolute value of the weights i.e. :

$\sum_{i}^{n} \left | w_{i} \right | \leq C$

A better way is constraining the sum of the squares of the weights i.e. :

$\sum_{i}^{n} w_{i}^{2} \leq C$

There are several variations of weight constraining, but almost all the methods are some sort of “weight decay” process. This means that in each step of the optimization algorithm, the values of the weights will tend to decay till a minimal constraint is satisfied.

How to Apply Regularization

As you might know or have guessed, regularization is done by adding the regularization term to the cost function $J$. I like to think of this as tricking the optimization algorithm (variations of gradient descent mostly) into thinking that the in sample error of the current hypothesis is greater than the actual in sample error. After adding the regularization term, the cost function of almost all models look more or less like:

$Cost(h) = Error_{in\ sample} + \lambda \Omega(h)$

where $\Omega$ is the regularizer and $\lambda$ is the regularization parameter and is usually a fraction. Some models may tie in the number of training examples into the regularization term by using $\frac{\lambda}{N}$ as the regularization parameter where $N$ is the number of training examples.

Choosing a Regularizer in Practice

Choosing a regularizer is not an exact science. Thus we only can rely on a couple of principles that can guide us hence choose the best one for the purpose.

• If the unconstrained model produces rough/jagged hypotheses, choose a regularizer that produces smoother hypotheses.
• If the model produces extremely complex hypotheses, choose a regularizer that leads to simpler hypothesis.

L1 vs L2 Regularization

The L1 norm and the L2 norm of the weights of a model are the two most popular methods of regularization is practice so I decided to discuss them a bit before moving on.

A norm in the most formal terms is a function that assigns a positive length/size to each vector in the vector space except for the zero vector. If we compute the norm of the weight vector(s) used in our model and try to minimize that norm along with the in sample error we get both weight decay and less in sample error at the same time.

The L1 norm or the Manhattan norm (also called the Taxicab norm) is defined as the sum of the absolute values of the coefficients of a vector:

$\left \| W \right \|_{1} = \sum_{i = 1}^{n} w_{i}$

L1 norm has limited popularity as a method for regularization, but is used when the feature space is highly sparse. L1 norm is non smooth and hence not differentiable (the left derivative is not equal to the right derivative), this means it doesn’t have a analytical solution. As a result, computation of the gradient of the cost function becomes much expensive. In terms of results, L1 seldom performs better than L2 and thus rarely used. It is only in the case of a very sparse feature set L1 out performs L2 as computation of the L2 norm becomes more expensive compared to L1.

The L2 norm or the Euclidean norm is defined as follows:

$\left \| W \right \|_{2} = \sqrt{\sum_{i}^{n} w_{i}^{2}}$

But the form that is used in regularization is the squared L2 norm:

$\lambda \frac{\sum_{i}^{n} w_{i}^{2}}{2}$

As you can see the derivative of this norm is simply the weight vector itself. This makes computing the gradient of the regularized cost function fairly simple.  L2 regularization has stood the test of time and is used in many ML algorithms.

Other Regularization Techniques

Some other regularization methods used in practice are:

• Weight emphasis: Here weights associated with a certain set of set of features are emphasized more by multiplying those weights with a small fraction while penalizing the weights associated with other features by multiplying them with a large number. This is a form of weight decay.
• Soft weight elimination: This method penalizes small weights highly, barely touching the large weights. The simplest version of this technique is $\Omega(W) = \sum_{i}^{n} \frac{w_{i}^{2}}{\beta^{2} + w_{i}^{2}}$ where $\beta$ is a fraction. This method has found use in neural networks.
• Dropouts: This one of the most recent technique developed to regularize neural networks. The idea is to randomly choose $\frac{1}{k}$ of the weights in a layer of the neural network and drop them (set them to zero), the other weights in that layer are scaled up by multiplying them with $k$. This leads to some interesting phenomena in the neural net but that’s another story. Dropouts is a version of weight elimination.

My Personal Views

I believe that the regularization term should be a measure of the complexity of the hypothesis set or the VC dimension. Thus in effect the complexity is penalized in each step of optimization.

Tuning The Regularization Parameter

Hyperparameter tuning is one of the crucial task of a data scientist, choosing the right amount of regularization can generalize the model such the out of sample error is very minimum. If a model is overfitting, the out of sample error can be written as the sum of in sample error and the overfit penalty:

$Error_{out}(h) = Error_{in}(h) + Overfit\ penalty$

Let us now see how we can estimate the out of sample error:

An out of sample data point simple means that it’s a data point that the model hasn’t trained on. If we randomly take out $K$ data points from the training set before training the model, we can use the set to measure the out of sample error. This sample of $K$ data points is called the validation set. Now this set is no way different than the test set except for the way we use it. We will use the validation set as a proxy for the test data set. Thus the validation set would estimate the out of sample error and the regularization term would estimate the overfit penalty. All that is left now is to choose the value of $\lambda$ for which the validation error is minimum provided the in sample error is also low.

Choosing $K$

Choosing the right size of the validation set could prove to be challenging. $K$ should be big enough so that it becomes a better proxy for the test set, hence bigger the validation set, better is the estimate of the out of sample error. $K$ should also be small enough so that predicting the target for the validation set doesn’t take up too much resources as the predictions would be made in each iteration. Another major factor that comes in the way of choosing a big enough validation set is that some times there is not enough data to split the training set further.

The solution is to use a K fold cross validation where K is usually a number from 3 to 10. In practice the 10 fold cross validation works remarkably well. The way this works is, the model is trained on 90% of the data in the train set and validation is done on the remaining 10% of the data. This is repeated for 10 iterations choosing a different set of data points for validation each time and finally taking the average of the out of sample errors obtained in each iteration. As a result we both train and validate on the entire data available to us. The average error of the 10 fold cross validation is very good estimate of the actual out of sample error.

Well that’s all for this post, please leave suggestions/reactions/questions in the comment box below. Cheers!