Model assessment and variable selection

Whenever we add a variable in our model, we are not only adding information but also noise that clutter the information and makes the analysis difficult. Simpler model is always better since they contain less noise and they are easy to interpret. In real life, things are not that simple. But relax, there are statistical methods that make model assessment and perform variable selection and gives optimal set of variables for us.

Here, I will discuss how to compare model to know which one is better than other. This model assessment help us to select a model with few variable that can perform as good as a model with large number of variables. Here comes the idea of variable selection. I will try to explain these concepts with an example. For the example I will use mtcars data set which is available in R by default. Following are two well know procedure of selecting subset of variables,

Best subset method

Best subset procedure selects best regression model by running all possible subset of variables. When number of variable is large, the possible combinations of candidate subset become huge. For example, for 2 variable case (\(X_1\) and \(X_2\)), there are 4 possible subsets – 1 with no predictors, 2 with single predictors and 1 with both predictors. Similarly, if there are 4 variable case (\(X_1, X_2, X_3\) and \(X_4\)), there will be 16 possible subset. In general, fitting all possible subset of large number of predictors becomes computationally intensive.

Once, all the possible model is fitted, they are compared based on some criteria. This model assessment may based on various criteria such as coefficient of determination (\(R^2\)), adjusted \(R^2\), Mallow’s CP, AIC and BIC. In R, leaps package can be used for performing this operation. Lets dig into our mtcars example.

Fitting a complete model:

full.model <- lm(mpg ~ ., data = mtcars)
smry <- summary(full.model)
smry

Call:
lm(formula = mpg ~ ., data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4506 -1.6044 -0.1196  1.2193  4.6271 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept) 12.30337   18.71788   0.657   0.5181  
cyl         -0.11144    1.04502  -0.107   0.9161  
disp         0.01334    0.01786   0.747   0.4635  
hp          -0.02148    0.02177  -0.987   0.3350  
drat         0.78711    1.63537   0.481   0.6353  
wt          -3.71530    1.89441  -1.961   0.0633 .
qsec         0.82104    0.73084   1.123   0.2739  
vs           0.31776    2.10451   0.151   0.8814  
am           2.52023    2.05665   1.225   0.2340  
gear         0.65541    1.49326   0.439   0.6652  
carb        -0.19942    0.82875  -0.241   0.8122  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.65 on 21 degrees of freedom
Multiple R-squared:  0.869, Adjusted R-squared:  0.8066 
F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07

Here, we can see that the model has explained almost 86.9 percent of variation present in mpg, but non of the predictors are significant. This is a hint of having unnecessary variables that has increased model error. Using regsubsets function from leaps package, we can select a subset of predictors based on some criteria.

library(leaps)
best.subset <-
  regsubsets(
    x      = mtcars[, -1],     # predictor variables
    y      = mtcars[, 1],      # response variable (mpg)
    nbest  = 1,                # top 1 best model
    nvmax  = ncol(mtcars) - 1, # max. number of variable (all)
    method = "exhaustive"      # search all possible subset
  )
bs.smry <- summary(best.subset)

We can combine following summary output with a plot created from additional estimates to get some insight. These estimates are also found in the summary object. The output show which variables are included with a star(*).

pander::pander(bs.smry$outmat)
bs.est <- data.frame(
  nvar   = 1:(best.subset$nvmax - 1),
  adj.r2 = round(bs.smry$adjr2, 3),
  cp     = round(bs.smry$cp, 3),
  bic    = round(bs.smry$bic, 3)
)
bs.est <- tidyr::gather(bs.est, "estimates", "value", -nvar)
  cyl disp hp drat wt qsec vs am gear carb
1 ( 1 ) *
2 ( 1 ) * *
3 ( 1 ) * * *
4 ( 1 ) * * * *
5 ( 1 ) * * * * *
6 ( 1 ) * * * * * *
7 ( 1 ) * * * * * * *
8 ( 1 ) * * * * * * * *
9 ( 1 ) * * * * * * * * *
10 ( 1 ) * * * * * * * * * *
We can make a plot to visualise the properties of these individual models and select a model with specific number of predictor that can give minimum BIC, or minimum CP or maximum adjusted rsquared.
library(ggplot2)
library(dplyr)
bs.est.select <- bs.est %>%
  group_by(estimates) %>%
  filter(
    (value == max(value) & estimates == "adj.r2") |
    (value == min(value) & estimates != "adj.r2")
  )
assessment.plt <- ggplot(bs.est, aes(nvar, value, color = estimates)) +
  geom_point(shape = 21, fill = "lightgray") +
  geom_line() +
  facet_wrap(~estimates, scale = 'free_y') +
  theme(legend.position = "top") +
  labs(x = "Number of variables in the model",
       y = "Value of Estimate") +
  scale_x_continuous(breaks = seq(0, 10, 2)) +
  geom_point(data = bs.est.select, fill = "red",
             shape = 21) +
  geom_text(aes(label = paste0("nvar:", nvar, '\n', "value:", value)),
            data = bs.est.select, 
            size = 3, hjust = 0, vjust = c(1, -1, -1),
            color = "black", family = "monospace")

From these plots, we see that with 5 variables we will obtain maximum adjusted coefficient of determination (\(R^2\)). Similarly, both BIC and Mallow CP will be minimum for models with only 3 predictor variables. With the help of table above, we can identify these variables. From the table, row corresponding to 3 variables, we see that the three predictors are wt, qsec and am. To obtain maximum adjusted \(R^2\), disp and hp should be added to the previous 3 predictors.

This way, we can reduce a model to few variables optimising different assessment criteria. Let look at the fit of these reduced models:

model.3 <- lm(mpg ~ wt + qsec + am, data = mtcars)
model.5 <- update(model.3, . ~ . + disp + hp)

Summaries of the models

3 Variable Model

summary(model.3)

Call:
lm(formula = mpg ~ wt + qsec + am, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4811 -1.5555 -0.7257  1.4110  4.6610 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   9.6178     6.9596   1.382 0.177915    
wt           -3.9165     0.7112  -5.507 6.95e-06 ***
qsec          1.2259     0.2887   4.247 0.000216 ***
am            2.9358     1.4109   2.081 0.046716 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.459 on 28 degrees of freedom
Multiple R-squared:  0.8497,    Adjusted R-squared:  0.8336 
F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

5 Variable Model

summary(model.5)

Call:
lm(formula = mpg ~ wt + qsec + am + disp + hp, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5399 -1.7398 -0.3196  1.1676  4.5534 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept) 14.36190    9.74079   1.474  0.15238   
wt          -4.08433    1.19410  -3.420  0.00208 **
qsec         1.00690    0.47543   2.118  0.04391 * 
am           3.47045    1.48578   2.336  0.02749 * 
disp         0.01124    0.01060   1.060  0.29897   
hp          -0.02117    0.01450  -1.460  0.15639   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.429 on 26 degrees of freedom
Multiple R-squared:  0.8637,    Adjusted R-squared:  0.8375 
F-statistic: 32.96 on 5 and 26 DF,  p-value: 1.844e-10

From these output, it seems that although adjusted \(R^2\) has increased in later model, the additional variables are not significant. we can compare these two model with an ANOVA test which compares the residual variance between these two models. We can write the hypothesis as,

\(H_0:\) Model 1 and Model 2 are same vs \(H_1:\) Model 1 and Model 2 are different

where, Model 1 and Model 2 represents 3 variable and 5 variable model

anova(model.3, model.5)
Analysis of Variance Table

Model 1: mpg ~ wt + qsec + am
Model 2: mpg ~ wt + qsec + am + disp + hp
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     28 169.29                           
2     26 153.44  2    15.848 1.3427 0.2786

The ANOVA result can not reject the hypothesis so claim that Model 1 and Model 2 are same. So, it is better to select the simpler model with 3 predictor variables.

Step-wise selection

When the number of increases, a exhaustive search of all possible subset is computationally intensive. This disadvantage can be overcome by using step-wise selection procedure. A step-wise variable selection procedure can be,

Forward Selection Procedure
In this procedure …
Backward Elimination Procedure
Here …