How to improve your Linear Models?

Part — 1: Subset Selection

Sangeet Aggarwal
6 min read · May 19, 2021
Photo by Birmingham Museums Trust on Unsplash

Regression models assume a linear relationship between the predictors and the response, and they estimate the coefficients of the following expression:

Y = β₀ + β₁X₁ + β₂X₂ + … + βₚXₚ + ε

The fitting is typically done using least squares. However, least squares does not always produce satisfactory results; in fact, there are two main reasons to consider other fitting procedures.

  1. Prediction Accuracy → If the number of observations is not much larger than the number of predictors, the least-squares fit can have high variance and give poor predictions on new data. And if the number of predictors exceeds the number of observations, there is no longer a unique least-squares solution: the variance is effectively infinite and the method cannot be used at all (a short sketch of this situation follows this list).
  2. Model Interpretability → It is often the case that some of the variables in a multiple regression model are not related to the response. Including them adds unnecessary complexity, and ideally their coefficients would be set exactly to zero, yet least squares is extremely unlikely to produce coefficient estimates that are exactly zero.
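
To make the first point concrete, here is a minimal sketch in Python (using NumPy, with made-up random data) of what goes wrong when the number of predictors exceeds the number of observations: least squares can reproduce the training responses almost perfectly even though the features are pure noise, so the fitted coefficients tell us nothing about new data.

import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 20                          # fewer observations than predictors
X = rng.normal(size=(n, p))            # pure-noise predictors
y = rng.normal(size=n)                 # response unrelated to X

# np.linalg.lstsq returns one (minimum-norm) solution among infinitely many
beta, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)

print("rank of X:", rank)                                      # 10, not 20: no unique solution
print("largest training error:", np.abs(X @ beta - y).max())   # essentially zero: a perfect fit to noise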

Hence to address the above problems, we can use the following three approaches.

  1. Subset Selection: We select only those predictors which are associated with the response variable. We then fit a model using least-squares on this reduced set of variables.
  2. Shrinkage: We fit a model involving all predictors. However, the estimated coefficients are shrunken towards zero. This shrinkage (regularization) reduces variance to some extent. Some of the coefficients may be estimated to be exactly zero, depending on what type of shrinkage is performed.
  3. Dimensionality Reduction: We project the p predictors into an M-dimensional subspace, where M is smaller than p. M different linear combinations are computed and used as final predictors to fit the linear model using least-squares.

Subset Selection

There are two major ways of selecting subsets of predictors.

  1. Best Subset Selection
  2. Stepwise Selection

Best Subset Selection

In this method, we fit a separate least-squares regression model for every possible combination of the p predictors and then identify the best model among them. The following algorithm makes this precise.

Algorithm (Best Subset Selection)

  1. Let M0 denote the null model, which contains no predictors. This model simply predicts the sample mean for each observation.
  2. For k = 1, 2, …, p:
     (a) Fit all (p C k) models that contain exactly k predictors.
     (b) Pick the best among these (p C k) models, and call it Mk. Here, best is defined as the model having the smallest RSS or, equivalently, the largest R².
  3. Select a single best model from among M0, …, Mp using cross-validated prediction error, Cp (AIC), BIC, or adjusted R².

In the algorithm above, Step 2 identifies the best model of each subset size, which reduces the problem from one of 2ᵖ possible models to one of p + 1 possible models.

Now, in order to select a single best model, we must simply choose among these p + 1 options. This task must be performed with care, because the RSS of these models decreases, and the R² increases, monotonically as the number of features grows. If we used these statistics to pick the winner, we would therefore always end up with the model containing all of the variables. The problem is that a low RSS or a high R² indicates a model with low training error, whereas we want a model with low test error. That is why, in Step 3, we use the cross-validated prediction error (or a criterion such as Cp, AIC, BIC, or adjusted R²) to select the best one.
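
To see why training-set RSS and R² cannot be used for this final choice, here is a tiny sketch (made-up random data, and variable names of my own): appending a predictor of pure noise can only lower the training RSS, so the largest model always looks best by these measures.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 100
X = rng.normal(size=(n, 3))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(size=n)   # only the first two columns matter

def rss(features):
    fit = LinearRegression().fit(features, y)
    return np.sum((y - fit.predict(features)) ** 2)

noise = rng.normal(size=(n, 1))                  # a predictor with no relation to y
print(rss(X))                                    # RSS of the 3-variable model
print(rss(np.hstack([X, noise])))                # never larger than the line above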

While best subset selection is a simple and appealing approach, it is computationally expensive: the number of candidate models grows exponentially with p. With p = 10 there are already 2¹⁰ = 1,024 models to consider, and with p = 20 the count jumps to over a million. Just imagine this number with 40 predictors.
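
Here is a minimal sketch of best subset selection, assuming a NumPy matrix X of shape (n, p) and a response vector y, and using scikit-learn for the least-squares fits and the cross-validation; the function and variable names are my own, not from any library. It follows the algorithm above: the best model of each size is chosen by RSS, and the final size by cross-validated error.

from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def best_subset_selection(X, y, cv=5):
    """Return the indices of the chosen predictors."""
    n, p = X.shape
    best_per_size = {}                     # k -> (columns, rss): the model M_k of each size

    # Step 2: for each size k, fit all C(p, k) models and keep the one with the smallest RSS
    for k in range(1, p + 1):
        for subset in combinations(range(p), k):
            cols = list(subset)
            fit = LinearRegression().fit(X[:, cols], y)
            rss = np.sum((y - fit.predict(X[:, cols])) ** 2)
            if k not in best_per_size or rss < best_per_size[k][1]:
                best_per_size[k] = (cols, rss)

    # Step 3: choose among M_1, ..., M_p by cross-validated MSE
    # (the null model M_0 is left out here to keep the sketch short)
    def cv_error(cols):
        scores = cross_val_score(LinearRegression(), X[:, cols], y,
                                 scoring="neg_mean_squared_error", cv=cv)
        return -scores.mean()

    return min((cols for cols, _ in best_per_size.values()), key=cv_error)

Even for p = 10 the double loop above fits 2¹⁰ − 1 = 1,023 models before the cross-validation step, which is exactly the explosion described above.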

This problem can be addressed to some extent by the stepwise selection methods discussed in the next section.

Stepwise Selection

There are two ways in which stepwise selection can be implemented.

  1. Forward Selection
  2. Backward Selection

Forward Selection

Forward stepwise selection begins with a model containing no predictors and then adds predictors to the model, one at a time. At each step, the variable that gives the greatest additional improvement to the fit is added. The procedure can be summarised by the following algorithm.

Algorithm (Forward Selection)

  1. Let M0 be the null model, containing no predictors.
  2. For k = 0, 1, …, p − 1:
     (a) Consider all p − k models that augment the predictors in Mk with one additional predictor.
     (b) Choose the best among these p − k models, and call it Mk+1. Here, best is defined as having the smallest RSS or the largest R².
  3. Select the single best model from among M0, …, Mp using cross-validated prediction error, Cp (AIC), BIC, or adjusted R².

Unlike the best subset method, which involves 2ᵖ models, forward stepwise selection requires fitting only 1 + p(p + 1)/2 models. This is a huge difference: for p = 20, forward selection fits just 211 models, compared with over a million for the best subset method.

In Step 2 of the above algorithm, we identify the best of the p − k models that augment Mk with one additional predictor, which we can do by simply choosing the one with the lowest RSS or the highest R². In Step 3, however, we must identify the best model among models with different numbers of variables, so cross-validated error (or Cp, AIC, BIC, or adjusted R²) is used instead. Overall, forward stepwise selection is computationally far cheaper than best subset selection.
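
A minimal sketch of this greedy search, under the same assumptions as before (a NumPy matrix X, a response y, scikit-learn for the fits; the names are my own):

import numpy as np
from sklearn.linear_model import LinearRegression

def forward_stepwise(X, y):
    """Return the sequence of models M_0, M_1, ..., M_p as lists of column indices."""
    n, p = X.shape
    selected = []                    # predictors currently in the model M_k
    models = [[]]                    # M_0: the null model with no predictors

    for k in range(p):               # k = 0, 1, ..., p - 1
        candidates = {}
        # Step 2(a): try the p - k models that add one more predictor to M_k ...
        for j in (j for j in range(p) if j not in selected):
            cols = selected + [j]
            fit = LinearRegression().fit(X[:, cols], y)
            candidates[j] = np.sum((y - fit.predict(X[:, cols])) ** 2)
        # Step 2(b): ... and keep the one with the smallest RSS as M_{k+1}
        selected = selected + [min(candidates, key=candidates.get)]
        models.append(selected)

    return models                    # Step 3: pick one of these by CV error, Cp (AIC), BIC, ...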

Though forward stepwise seems to do well in practice, it does not always find the best possible model. For instance, suppose that in a given data set with p = 3 predictors, the best possible one-variable model contains X1, and the best possible two-variable model contains X2 and X3. Then forward stepwise selection will fail to select the best possible two-variable model, because M1 will contain X1, so M2 must also contain X1 together with one additional variable.

Backward Selection

Backward selection works in roughly the opposite way to forward selection. It begins with the full least-squares model containing all p predictors and then removes the least useful predictor, one at a time. Let's look at the algorithm below to understand the method.

Algorithm (Backward Selection)

  1. Let Mp denote the full model, which contains all p predictors.
  2. For k = p, p − 1, …, 1:
     (a) Consider all k models that contain all but one of the predictors in Mk, for a total of k − 1 predictors.
     (b) Choose the best among these k models, and call it Mk−1. Here, best is defined as having the smallest RSS or the largest R².
  3. Select a single best model from among M0, …, Mp using cross-validated prediction error, Cp (AIC), BIC, or adjusted R².

The backward selection approach also searches through only 1 + p(p + 1)/2 models, and so can be applied in settings where p is too large for best subset selection. Like forward stepwise selection, backward selection is not guaranteed to find the best model containing a subset of the p predictors. Backward selection also requires that the number of samples n be larger than the number of variables p, so that the full model can be fit; in contrast, forward stepwise selection can be used even when n < p.
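
And a matching sketch of backward selection, again assuming a NumPy X and y with n > p, scikit-learn for the fits, and names of my own choosing:

import numpy as np
from sklearn.linear_model import LinearRegression

def backward_stepwise(X, y):
    """Return the sequence of models M_p, M_{p-1}, ..., M_0 as lists of column indices."""
    n, p = X.shape
    selected = list(range(p))        # start from the full model M_p (requires n > p)
    models = [selected]

    for k in range(p, 0, -1):        # k = p, p - 1, ..., 1
        candidates = {}
        # Step 2(a): try the k models that drop one predictor from M_k ...
        for j in selected:
            cols = [c for c in selected if c != j]
            if cols:
                fit = LinearRegression().fit(X[:, cols], y)
                pred = fit.predict(X[:, cols])
            else:
                pred = np.full_like(y, y.mean())   # dropping the last predictor leaves the null model M_0
            candidates[j] = np.sum((y - pred) ** 2)
        # Step 2(b): ... and keep the one with the smallest RSS as M_{k-1}
        drop = min(candidates, key=candidates.get)
        selected = [c for c in selected if c != drop]
        models.append(selected)

    return models                    # then select among these with CV error, Cp (AIC), BIC, or adjusted R²

For day-to-day use, recent versions of scikit-learn also provide sklearn.feature_selection.SequentialFeatureSelector, which performs essentially this greedy search in either direction (direction='forward' or 'backward'); it picks each addition or removal by cross-validated score rather than training RSS, but the spirit is the same.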

If you are new to Data Science and Machine Learning and wondering where to begin, do check the links below, where I have laid out a step-by-step way to learn Data Science, with plenty of resources to choose from.

The link below will help you choose from top Data Science courses on Coursera. For more such blogs, stay tuned. Happy Learning.
