**Overfitting**

Now we all know regularization is performed so that a model doesn’t “overfit”, but what really is overfitting? Is it the exact opposite of underfitting? Let us walk through the answers to these questions without going into too much of mathematics.

Suppose we have a learning model , like any other model it will have a learning algorithm and a hypothesis set . goes through the available data while picking a hypothesis from and finally arrives at a final hypothesis which best maps the pattern in the given data (according to our choice of ). This can be a straight line equation for a linear regression problem with a single predictor while is the set of all straight line equations, can be very complex or extremely simple (such as a constant value) depending upon our choice of . It is this which tends to overfit the data and the only thing we should be worried about for now. We say has overfitted the data when it tries to capture the noise present in the training data. Noise is the set of all data points that deviate from the general pattern in the training data. A way to think about is by imagining the sales record of a hypothetical grocery store. Lets assume people generally but a lot of groceries at the beginning of each month and the number of items sold is pretty high for the first week or so. As the days of the month progresses, sales tend to go fall. This is the general pattern in sales of our grocery store. Now every now and then we can expect random spikes in sales caused by advertising or due to discounts on particular items. These spikes would be random and would deviate from the general pattern of sales record. These deviations are called noise and particularly this type of random noise is called “stochastic” noise. It is best not to have our capture stochastic noise unless there is a pattern in the advertising or discounts (in which case these would not be “noise” any more). When captures noise for 3 months in which there were decent amount of advertising and discounts on grocery products which stopped in the next month, the model’s prediction will be off by a high margin and the poor grocery store owner would think he is losing customers. As you might have guessed by now, stochastic noise is not the only cause for overfitting. As it turns out, there is another type of noise that is predominant in any kind of data. This noise is called “deterministic” noise. We can understand from the word “deterministic” that this noise is not random, we will even see that this noise is relative to our choice of .

To understand deterministic noise, let us make an assumption that there exists a function in an arbitrary hypothesis set that gives the best possible fit of the data. In almost all the cases our model will not be good enough such that the models output . This is because is not a member of . Thus while training, our model will overfit parts of the data. This will have very less in sample error (training error) but have large out of sample error (test error). Now I guess some of my readers are thinking that my explanation of deterministic noise is very close to bias or underfitting. Infact the error due to deterministic noise is bias, but is not caused due to underfitting. We say underfitting occurs when there is large in sample error.

Another way to imagine deterministic noise is to imagine we have a data set with absolutely no stochastic noise and there exists , a polynomial of degree 50 and is the perfect fit for the data. Now suppose we have two models, a quadratic model and the other one is a polynomial model of order 10. The quadratic model will have very poor in sample performance whereas the model with a 10th order polynomial may give very good in sample performance (maybe even zero training error). The first case is underfitting and the second case is overfitting as the out of sample performance of both the models will be poor since the true function is of degree 50.

Thus we can say deterministic noise depends upon our choice of the hypothesis set . Assuming there is a that is a perfect fit of the data (both in sample and out of sample), we can say deterministic noise is the part of that cannot capture. We can now write the bias variance trade off in the overfitting regime as:

I intentionally didn’t give the derivation here, will put it up if requested.

**Preventing Overfitting Without Regularization**

The easiest way to avoid overfitting of your model is by increasing the number of training examples. Yes training a model with loads of data is always a good idea (unless the data has many outliers). This is simple and it works because the probability of a datapoint being noisy is less compared to the probability of it being noiseless. But how much data do we need? Will more data ensure that our model will not over fit? Does it take care of both stochastic noise and deterministic noise? Let’s have a look.

We cannot measure noise just by looking at the data. This proves to be the reason for not having a concrete relation between noise and size of the data set. Experiments show that the number of training examples needed to overcome the overfitting due to stochastic noise is directly proportional to the amount of stochastic noise. This poses a dilemma as we can’t tell the number of training examples we need since we don’t know the amount of stochastic noise present in the first place. Deterministic noise behaves a little different, the number of training examples needed is not completely proportional to the level of noise, this noise can be overcome with moderate amount of training examples when our hypothesis set is simple enough. When our hypothesis set consists of complex functions (such a high order polynomial), overcoming deterministic noise would require a larger amount of training examples.

Thus we can conclude that preventing overfitting by increasing the number of training examples is not an ideal solution. This is because we don’t know the amount of stochastic noise present and also because of the fact that more training data means more cost of training (both time and power). We need artificial ways to constraint our model so that it doesn’t overfit.

**Regularization**

One way of thinking about regularization is that you put an elastic leash on your model so that it doesn’t perfectly adjust it’s parameters or weights to fit all the data points that it sees. This way the model becomes a little imperfect so to speak and generalizes to out of sample data much better. The amount of regularization that is to be applied depends on the model that is being used and the data that is used for training. Let us now have brief overview of regularization techniques used in practice and see if we can find a well defined way to determine the amount of regularization that is to be applied in a particular situation.

There are two approaches one can take when applying regularization to a model. The first approach is the mathematical approach where the problem is well understood and we know the exactly how to regularize the model based on some assumptions (often unrealistic ones). The second approach is intuition or the heuristic approach where the model is way to complex or large (probably both) to know exactly how to regularize the model. The one thing that is common in both the approaches is the idea that in order to regularize the model we have to constraint the parameters or weights somehow. Let us go through some possibilities of weight constraining.

**Hard Constraint**

Suppose our model uses a polynomial hypothesis set where is the order of the polynomials. Now for a given data set our model gives a hypothesis which overfits the data. A harsh way to regularize this model would be to set the weights associated with the polynomial terms above order to . This in effect is equivalent of using as the hypothesis set.

**Soft Constraint**

Another way to constraint the weights would be to fix a maximum threshold for the sum of the absolute value of the weights i.e. :

A better way is constraining the sum of the squares of the weights i.e. :

There are several variations of weight constraining, but almost all the methods are some sort of *“weight decay” *process. This means that in each step of the optimization algorithm, the values of the weights will tend to decay till a minimal constraint is satisfied.

**How to Apply Regularization**

As you might know or have guessed, regularization is done by adding the regularization term to the cost function . I like to think of this as tricking the optimization algorithm (variations of gradient descent mostly) into thinking that the in sample error of the current hypothesis is greater than the actual in sample error. After adding the regularization term, the cost function of almost all models look more or less like:

where is the regularizer and is the regularization parameter and is usually a fraction. Some models may tie in the number of training examples into the regularization term by using as the regularization parameter where is the number of training examples.

**Choosing a Regularizer in Practice**

Choosing a regularizer is not an exact science. Thus we only can rely on a couple of principles that can guide us hence choose the best one for the purpose.

- If the unconstrained model produces rough/jagged hypotheses, choose a regularizer that produces smoother hypotheses.
- If the model produces extremely complex hypotheses, choose a regularizer that leads to simpler hypothesis.

**L1 vs L2 Regularization**

The L1 norm and the L2 norm of the weights of a model are the two most popular methods of regularization is practice so I decided to discuss them a bit before moving on.

A norm in the most formal terms is a function that assigns a positive length/size to each vector in the vector space except for the zero vector. If we compute the norm of the weight vector(s) used in our model and try to minimize that norm along with the in sample error we get both weight decay and less in sample error at the same time.

The L1 norm or the Manhattan norm (also called the Taxicab norm) is defined as the sum of the absolute values of the coefficients of a vector:

L1 norm has limited popularity as a method for regularization, but is used when the feature space is highly sparse. L1 norm is non smooth and hence not differentiable (the left derivative is not equal to the right derivative), this means it doesn’t have a analytical solution. As a result, computation of the gradient of the cost function becomes much expensive. In terms of results, L1 seldom performs better than L2 and thus rarely used. It is only in the case of a very sparse feature set L1 out performs L2 as computation of the L2 norm becomes more expensive compared to L1.

The L2 norm or the Euclidean norm is defined as follows:

But the form that is used in regularization is the squared L2 norm:

As you can see the derivative of this norm is simply the weight vector itself. This makes computing the gradient of the regularized cost function fairly simple. L2 regularization has stood the test of time and is used in many ML algorithms.

**Other Regularization Techniques**

Some other regularization methods used in practice are:

- Weight emphasis: Here weights associated with a certain set of set of features are emphasized more by multiplying those weights with a small fraction while penalizing the weights associated with other features by multiplying them with a large number. This is a form of weight decay.
- Soft weight elimination: This method penalizes small weights highly, barely touching the large weights. The simplest version of this technique is where is a fraction. This method has found use in neural networks.
- Dropouts: This one of the most recent technique developed to regularize neural networks. The idea is to randomly choose of the weights in a layer of the neural network and drop them (set them to zero), the other weights in that layer are scaled up by multiplying them with . This leads to some interesting phenomena in the neural net but that’s another story. Dropouts is a version of weight elimination.

**My** **Personal Views**

I believe that the regularization term should be a measure of the complexity of the hypothesis set or the VC dimension. Thus in effect the complexity is penalized in each step of optimization.

**Tuning The Regularization Parameter**

Hyperparameter tuning is one of the crucial task of a data scientist, choosing the right amount of regularization can generalize the model such the out of sample error is very minimum. If a model is overfitting, the out of sample error can be written as the sum of in sample error and the overfit penalty:

Let us now see how we can estimate the out of sample error:

An out of sample data point simple means that it’s a data point that the model hasn’t trained on. If we randomly take out data points from the training set before training the model, we can use the set to measure the out of sample error. This sample of data points is called the validation set. Now this set is no way different than the test set except for the way we use it. We will use the validation set as a proxy for the test data set. Thus the validation set would estimate the out of sample error and the regularization term would estimate the overfit penalty. All that is left now is to choose the value of for which the validation error is minimum provided the in sample error is also low.

**Choosing **

Choosing the right size of the validation set could prove to be challenging. should be big enough so that it becomes a better proxy for the test set, hence bigger the validation set, better is the estimate of the out of sample error. should also be small enough so that predicting the target for the validation set doesn’t take up too much resources as the predictions would be made in each iteration. Another major factor that comes in the way of choosing a big enough validation set is that some times there is not enough data to split the training set further.

The solution is to use a K fold cross validation where K is usually a number from 3 to 10. In practice the 10 fold cross validation works remarkably well. The way this works is, the model is trained on 90% of the data in the train set and validation is done on the remaining 10% of the data. This is repeated for 10 iterations choosing a different set of data points for validation each time and finally taking the average of the out of sample errors obtained in each iteration. As a result we both train and validate on the entire data available to us. The average error of the 10 fold cross validation is very good estimate of the actual out of sample error.

Well that’s all for this post, please leave suggestions/reactions/questions in the comment box below. Cheers!

]]>In this blog post (which is also my first ever blog post) I will try to explain the concepts that work behind the functioning of XGBoost. I would expect my readers to know the very fundamentals of supervised learning before making an attempt to read my post.

XGBoost comprises of a tree based model, i.e. a collection of decision trees. The number of trees required for a particular problem can be passed as a parameter while building the model. Mathematically the model can be represented as:

where * *is a decision tree.

The prediction for the data point an be written as:

where is the feature vector of the data point.

The prediction at the step can be defined as:

Like any other gradient descent based machine learning algorithm, the goal of XGBoost is to minimize it’s error of prediction while training. The prediction error is called loss and is calculated using a function called the loss function or cost function. XGBoost tries to optimize the three standard cost functions:

- RMSE (Rooted Mean Squared Error)
- This is only used when we are trying to predict or score targets values with XGBoost. RMSE is expressed as:

- This is only used when we are trying to predict or score targets values with XGBoost. RMSE is expressed as:
- LogLoss (Logarithmic Loss)
- Logarithmic loss is measured to check the classification error of a binary classifier. LogLoss is the measure of cross entropy and acts as a soft measurement of accuracy. By minimizing cross entropy we can maximize the accuracy of a classifier. LogLoss cost function is expressed as:

where is the prediction

- Logarithmic loss is measured to check the classification error of a binary classifier. LogLoss is the measure of cross entropy and acts as a soft measurement of accuracy. By minimizing cross entropy we can maximize the accuracy of a classifier. LogLoss cost function is expressed as:
- mLogLoss (Multiclass Logarithmic Loss)
- This is used to measure the prediction loss of a multi class classifier and is essentially a generalized version of LogLoss which is used only for binary classifiers. The mLogLoss function is expressed as:

- This is used to measure the prediction loss of a multi class classifier and is essentially a generalized version of LogLoss which is used only for binary classifiers. The mLogLoss function is expressed as:

While optimizing the objective is our primary goal, that’s not the only thing that we should keep in mind while training a machine learning model. During training it is often the case where the model learns the training data by heart and fails to predict the target of any new data point. In other words the model builds intricate structures (or equations) within itself that can correctly predict the target for all (or most) of the data points in the training set but will fail when tested with “unseen” data. Think of this as connecting three collinear dots with an ‘S’ like curve instead of maybe a straight line. This problem is known as overfitting.

In order to make the model more generalized, we need to “relax” the model a bit in each training step, this process is called regularization. In order to prevent a model from over complicating itself, we must first define the model complexity. The complexity of XGBoost can be represented as:

Where is the number of leaves and is the score on the leaf. and are constants which can be passed as parameters while building the model and are used to tune the degree of regularization or relaxation.

Now that we have our cost function and complexity of the model defined, lets use them to redefine our goal:

Gradient descent is the chosen algorithm by XGBoost to optimize it’s goal or objective function. Let’s first define gradient descent:

Given an objective , for each iteration, the partial derivative is calculated. Then is improved along the direction of the gradient to minimize the objective. This process continues till there can be no further improvements i.e. till the global optima is reached (convergence).

For the gradient descent in XGBoost, the objective function that needs to be differentiated in the step is:

But we need not calculate at every step if we store the value of the loss function in the previous step. The equation then becomes:

The performance of gradient descent can be further improved if we consider both the first and the second order gradient of the objective function. Thus is in each step we should calculate:

This will lead to faster convergence as the second order derivative gives us a proper approximation of the direction of descent in each step.

Since derivatives of all functions can’t be calculated, we have to calculate the second order Taylor’s approximation of the derivative. Our objective function now becomes:

where:

- is the first order gradient and
- is the second order gradient of the cost function.

Now since becomes a constant, our objective at the step boils down to:

All that remains now is to find a proper to optimize it.

As easy as it sounds, finding a that optimizes a tree based model is not so trivial. Let’s state the problem of finding in another way:

**“How do we find a decision tree that improves it’s prediction along the gradient?”**

XGBoost has a very clever implementation of the solution, but before diving in let’s first understand the concepts that are used:

- Internal Nodes
- Each internal node splits the flow of data points by using one feature from the feature space.
- Each node selects a threshold value from a feature and based upon some conditions on that threshold, specifies the direction of the flow of data (either left child node or the right child node).
- Leaves
- The flow of data end at leaves. The leaves are assigned a weight.
- The weight of a leaf corresponds to the leaves prediction.

This, as you might have realized is the basic of any CART (Classification and Regression Tree) building algorithm, but the devil lies in the details. The two key questions XGBoost addresses are:

- How to assign prediction score to the leaves?
- How to find a proper tree structure?

Let us know try to understand how XGBoost answers these two questions.

Mathematically, a tree can be represented as:

where is a “directing” function which assigns every data point to the leaf.

With this definition of a tree, we can describe the weight assigning process as:

- Assign the data point to a leaf by .
- Assign the corresponding score on the leaf to the data point.

Where is the leaf index and the index set is defined as set of indices of the data points that are assigned to the leaf. We can write the index set as:

The objective function can (again) be re-written as:

where is the number of data points, but instead of writing the objective function as a summation of separate data points, we can write it in a fashion where we consider the summation of leaf weights. Thus the new representation becomes:

(Since all the data points on the same leaf share the same prediction)

As you might have noticed, the above equation is a equation is a quadratic equation of . All that remains now is to find a solution to the quadratic equation to get a vector which will optimize the objective. The solution is very simply:

Then the corresponding value of becomes:

The algorithm to find a proper tree structure is in each split, the best splitting point is found that can optimize the objective. This is implemented as-

For each feature:

- Sort the numbers.
- Scan the best splitting point (more on this in the next section)
- Choose the best feature.

Every time a split is made, a leaf node gets converted to an internal node. If a particular split improves the objective function, the gain for the split is positive, but if the split degrades the model, the gain is negative. So how do we calculate gain?

Let:

- be the set of indices of data points assigned to a node.
- and be the set of indices of data points assigned to the new left and right leaves respectively

So the gain becomes:

While building a tree, the best splitting point is found until the max depth (supplied as a parameter) is reached. The nodes with negative gain are deleted by pruning the tree in a bottom up manner. The intuition behind this is that it is okay to have a node with negative gain in the middle of the tree as the gain the might increase significantly down it’s child nodes.

Note that this is an unique feature of XGBoost as traditionally, gain is measured by calculating Gini coefficient or by entropy matrix.

The most unique feature of XGBoost is it’s inherent ability to handle missing values. For any kind of data set, handling missing values is always a challenge. There are both statistical and machine learning processes to impute missing values in a data set, but XGBoost avoids any such method all together. I will now make an attempt to describe how XGBoost handles missing values.

For each node, all the data points with missing values are:

- Put to the left child node and the maximum gain is calculated.
- Then they are put to the right child node and the maximum gain is calculated.
- The direction with larger gain is chosen.

Hence every node has a default direction for missing values.

To sum it up in a few lines, the work flow of XGboost is:

For times do:

- Grow the tree to the maximum possible dept.
- Find the best splitting point.
- Assign weights to the leaves.
- Prune the tree from bottom up to delete nodes with negative gain.