Linear regression

Statistical Laboratory

Alessandro Ortis - University of Catania

This lesson is about linear regression, a very simple approach for supervised learning. In particular, linear regression is a useful tool for predicting a quantitative response.

A linear regression analysis assumes that there is a linear relationship between the response variable and the predictors. Specifically, a linear regression assumes that a response variable y is a linear function of a set of predictor variables x1, x2, ..., xn.
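This assumption can be sketched with a small simulation (the coefficients and sample size below are arbitrary, chosen only for illustration):

```r
# Simulate the assumed linear form y = b0 + b1*x1 + b2*x2 (+ noise)
set.seed(1)
x1 <- runif(50)
x2 <- runif(50)
y  <- 1.5 + 2.0 * x1 - 0.5 * x2 + rnorm(50, sd = 0.1)
# Fitting a linear model recovers coefficients close to 1.5, 2.0 and -0.5
coef(lm(y ~ x1 + x2))
```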

Recall the Advertising data (Advertising.csv). Consider the sales (in thousands of units) for a particular product as a function of advertising budgets (in thousands of dollars) for TV, radio, and newspaper media.

In [3]:
# Load the dataset
adv_data=read.csv('../Datasets/Advertising.csv')
names(adv_data)
head(adv_data)
  1. 'X'
  2. 'TV'
  3. 'radio'
  4. 'newspaper'
  5. 'sales'
  X    TV radio newspaper sales
1 1 230.1  37.8      69.2  22.1
2 2  44.5  39.3      45.1  10.4
3 3  17.2  45.9      69.3   9.3
4 4 151.5  41.3      58.5  18.5
5 5 180.8  10.8      58.4  12.9
6 6   8.7  48.9      75.0   7.2
In [4]:
# Plot Sales as a function of TV advertisement budget
TV_values = adv_data[,'TV']
Sales_values = adv_data[,'sales']
plot(TV_values, Sales_values, xlab = 'TV', ylab= 'Sales', cex.lab=2, cex.axis=1.2)

We can put multiple graphs in a single plot by setting some graphical parameters with the par() function. This allows us to see how sales change as a function of multiple types of advertising media (TV, radio, newspaper).

In [5]:
TV = adv_data[,'TV']
sales = adv_data[,'sales']
radio = adv_data[,'radio']
newspaper = adv_data[,'newspaper']

par(mfrow = c(1,3)) # Multiple plots arranged in a 1 by 3 grid.
# Plot Sales as a function of TV advertisement budget
plot(TV, sales, cex.lab=2, cex.axis=1.2)
# Plot Sales as a function of Radio advertisement budget
plot(radio, sales, cex.lab=2,cex.axis=1.2)
title("Advertising data",cex.main = 2,font.main= 4, col.main= "blue")
# Plot Sales as a function of Newspaper advertisement budget
plot(newspaper,sales,cex.lab=2,cex.axis=1.2)

Possible questions:

  1. Is there a relationship between advertising budget and sales?
  2. How strong is the relationship between advertising budget and sales?
  3. Which media contribute to sales?
  4. How accurately can we estimate the effect of each medium on sales?
  5. How accurately can we predict future sales?
  6. Is the relationship linear?
  7. Is there synergy (interaction) among the advertising media?

It is possible to quickly build a linear model by using the function lm().

In [1]:
?lm
In [7]:
x_values = 1:1000
y_values = 8.0125457*x_values + 42.11245863
mydata <-  data.frame(
   x = x_values, 
   y = y_values)

linear_model = lm(y~x, mydata)
linear_model
Call:
lm(formula = y ~ x, data = mydata)

Coefficients:
(Intercept)            x  
     42.112        8.013  

With summary() we can analyse the regression model of y_values versus x_values (a perfect fit in this case).

In [8]:
summary(linear_model)
plot(x_values, y_values)
abline(linear_model, col = "blue", lty = 1, lwd = 2) #lty: The line type (0=blank, 1=solid, 2=dashed, 3=dotted, ..., 6)
Warning message in summary.lm(linear_model):
"essentially perfect fit: summary may be unreliable"
Call:
lm(formula = y ~ x, data = mydata)

Residuals:
       Min         1Q     Median         3Q        Max 
-5.818e-11 -1.050e-13  1.150e-13  2.870e-13  7.840e-13 

Coefficients:
             Estimate Std. Error   t value Pr(>|t|)    
(Intercept) 4.211e+01  1.238e-13 3.403e+14   <2e-16 ***
x           8.013e+00  2.142e-16 3.741e+16   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.955e-12 on 998 degrees of freedom
Multiple R-squared:      1,	Adjusted R-squared:      1 
F-statistic: 1.399e+33 on 1 and 998 DF,  p-value: < 2.2e-16

The model perfectly estimated the coefficient of 'x' and the intercept (Estimate column). However, this was a simple example where x and y data have a perfect linear relationship.

Now add some noise, and fit the model using only a subset of the data.

In [5]:
x_values = 1:1000
y_values = 8.0125457*x_values + 42.11245863

# Add some noise
noise = rnorm(1000, sd = 100)
y_values = y_values + noise

x_train = x_values[200:220]
y_train = y_values[200:220]
mydata <-  data.frame(
   x = x_train, 
   y = y_train)

linear_model = lm(y~x, mydata)
summary(linear_model)
plot(x_values, y_values)
points(x_train, y_train, col="blue")
abline(linear_model, col = "red", lty = 1, lwd = 3)
Call:
lm(formula = y ~ x, data = mydata)

Residuals:
     Min       1Q   Median       3Q      Max 
-149.814  -75.628    8.174   64.598  150.008 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept) -293.996    657.386  -0.447  0.65977   
x              9.706      3.129   3.102  0.00587 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 86.83 on 19 degrees of freedom
Multiple R-squared:  0.3362,	Adjusted R-squared:  0.3012 
F-statistic: 9.622 on 1 and 19 DF,  p-value: 0.005869

In the plot above, the training data are drawn in blue. With so few training examples, the model is not able to estimate the linear relationship between x and y accurately. Try augmenting the number of training examples.

In [6]:
x_train = x_values[200:500]
y_train = y_values[200:500]
mydata <-  data.frame(
   x = x_train, 
   y = y_train)

linear_model = lm(y~x, mydata)
summary(linear_model)
plot(x_values, y_values)
points(x_train, y_train, col="blue")
abline(linear_model, col = "red", lty = 1, lwd = 3)
Call:
lm(formula = y ~ x, data = mydata)

Residuals:
     Min       1Q   Median       3Q      Max 
-274.691  -55.660   -0.448   60.841  252.695 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 67.19649   23.10676   2.908  0.00391 ** 
x            7.96636    0.06407 124.330  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 96.59 on 299 degrees of freedom
Multiple R-squared:  0.981,	Adjusted R-squared:  0.981 
F-statistic: 1.546e+04 on 1 and 299 DF,  p-value: < 2.2e-16

Now, fit a linear model to represent the relationship between sales and each of TV, radio, and newspaper.

In [11]:
tv_model = lm(sales ~ TV)
radio_model=lm(sales ~ radio)
news_model = lm(sales ~ newspaper)

par(mfrow = c(1,3))
plot(TV, sales, cex.lab = 2, cex.axis = 1)
abline(tv_model, col = "blue", lty = 1, lwd = 2) #lty: The line type (0=blank, 1=solid, 2=dashed, 3=dotted, ..., 6)
plot(radio, sales,cex.lab=2,cex.axis=1.2)        #lwd: The line width.
abline(radio_model, col="blue", lty=1, lwd=2)
plot(newspaper, sales,cex.lab=2,cex.axis=1.2)
abline(news_model, col="blue", lty=1, lwd=2)

Correlation

The variable TV seems to be the most correlated with sales. Let's see the correlation between different variables.

In [12]:
cor_matrix = cor(adv_data[,-1],
                 use="complete.obs") # skip the first column. use="complete.obs" deletes missing values.
cor_matrix
                  TV      radio  newspaper     sales
TV        1.00000000 0.05480866 0.05664787 0.7822244
radio     0.05480866 1.00000000 0.35410375 0.5762226
newspaper 0.05664787 0.35410375 1.00000000 0.2282990
sales     0.78222442 0.57622257 0.22829903 1.0000000

The table above shows the correlation coefficients between all pairs of variables. We can also visualize such a correlation matrix. There are different ways to visualize a correlation matrix in R; we will use the following:

  1. symnum() function

  2. corrplot() function to plot a correlogram

symnum()

The R function symnum() replaces correlation coefficients by symbols according to the level of the correlation. It takes the correlation matrix as an argument.

In [13]:
symnum(cor_matrix, abbr.colnames = FALSE)
          TV radio newspaper sales
TV        1                       
radio        1                    
newspaper    .     1              
sales     ,  .               1    
attr(,"legend")
[1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1

As indicated in the legend, correlation coefficients between 0 and 0.3 are replaced by a space (" "), coefficients between 0.3 and 0.6 are replaced by ".", and so on.
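symnum() also accepts custom cutpoints and symbols. A minimal sketch on a hand-made 2x2 correlation matrix (the thresholds and symbols below are chosen arbitrarily, for illustration):

```r
# Map correlations to symbols using custom breakpoints:
# |r| in (0, 0.4] -> " ", (0.4, 0.7] -> "+", (0.7, 1) -> "*"
m <- matrix(c(1, 0.45, 0.45, 1), nrow = 2,
            dimnames = list(c("a", "b"), c("a", "b")))
symnum(m, cutpoints = c(0.4, 0.7), symbols = c(" ", "+", "*"))
```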

Correlation matrix with p-values

The function cor() returns only the correlation coefficients between variables. We can use the Hmisc R package to calculate the correlation p-values.

In [18]:
install.packages("Hmisc")
In [20]:
library("Hmisc")
cor_with_pvalues <- rcorr(as.matrix(adv_data[,-1]))
cor_with_pvalues
            TV radio newspaper sales
TV        1.00  0.05      0.06  0.78
radio     0.05  1.00      0.35  0.58
newspaper 0.06  0.35      1.00  0.23
sales     0.78  0.58      0.23  1.00

n= 200 


P
          TV     radio  newspaper sales 
TV               0.4408 0.4256    0.0000
radio     0.4408        0.0000    0.0000
newspaper 0.4256 0.0000           0.0011
sales     0.0000 0.0000 0.0011          

The output of the function rcorr() is a list containing the following elements:

  • r: the correlation matrix

  • n: the matrix of the number of observations used in analyzing each pair of variables

  • P: the p-values corresponding to the significance levels of correlations.

They can also be extracted individually.

In [22]:
# Extract the correlation coefficients
r = cor_with_pvalues$r
print(r)

# Extract the matrix of observations
n = cor_with_pvalues$n
print(n)

# Extract p-values
p = cor_with_pvalues$P
print(p)
                  TV      radio  newspaper     sales
TV        1.00000000 0.05480866 0.05664787 0.7822244
radio     0.05480866 1.00000000 0.35410375 0.5762226
newspaper 0.05664787 0.35410375 1.00000000 0.2282990
sales     0.78222442 0.57622257 0.22829903 1.0000000
           TV radio newspaper sales
TV        200   200       200   200
radio     200   200       200   200
newspaper 200   200       200   200
sales     200   200       200   200
                 TV        radio    newspaper       sales
TV               NA 4.408061e-01 4.256018e-01 0.000000000
radio     0.4408061           NA 2.688835e-07 0.000000000
newspaper 0.4256018 2.688835e-07           NA 0.001148196
sales     0.0000000 0.000000e+00 1.148196e-03          NA

corrplot()

The function corrplot(), in the package of the same name, creates a graphical display of a correlation matrix, highlighting the most correlated variables in a data table.

In this plot, correlation coefficients are colored according to the value.

In [14]:
# Install corrplot (one time)
#install.packages("corrplot")
# Import corrplot
library(corrplot)
corrplot 0.84 loaded
In [15]:
corrplot(cor_matrix, type = "upper", order = "hclust", 
         tl.col = "black", tl.srt = 45)

Positive correlations are displayed in blue and negative correlations in red color. Color intensity and the size of the circle are proportional to the correlation coefficients. In the right side of the correlogram, the legend color shows the correlation coefficients and the corresponding colors.
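corrplot() also supports other display styles via its method argument (for instance "number", which prints the coefficients directly in the cells). A self-contained sketch using the built-in mtcars dataset, assuming the corrplot package is installed:

```r
# Correlation matrix on a built-in dataset (for a self-contained demo)
cm <- cor(mtcars[, c("mpg", "disp", "hp", "wt")])
# Draw the correlogram with numeric coefficients instead of circles
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cm, method = "number", type = "upper", tl.col = "black")
}
```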

In [1]:
#install.packages("PerformanceAnalytics")
In [17]:
library("PerformanceAnalytics")
my_data <- adv_data[,-1]
chart.Correlation(my_data, histogram=TRUE, pch=19)

Details of the above visualization:

  • The distribution of each variable is shown on the diagonal.

  • Below the diagonal, the bivariate scatter plots with a fitted line are shown.

  • Above the diagonal, the value of the correlation plus the significance level as stars. Each significance level is associated with a symbol: p-values (0, 0.001, 0.01, 0.05, 0.1, 1) correspond to the symbols ("***", "**", "*", ".", " ").

Linear model of TV vs Sales

Now observe the linear model that represents the variable sales as a (linear) function of the variable TV.

In [12]:
summary(tv_model)
Call:
lm(formula = sales ~ TV)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.3860 -1.9545 -0.1913  2.0671  7.2124 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 7.032594   0.457843   15.36   <2e-16 ***
TV          0.047537   0.002691   17.67   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.259 on 198 degrees of freedom
Multiple R-squared:  0.6119,	Adjusted R-squared:  0.6099 
F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16

The TV coefficient B_1 is 0.047; this means that an increment of 1000 dollars in the TV advertising budget is associated with an increase of about 47 units of sales (0.047 * 1000 units).

In this case, the null hypothesis H0 is that the true coefficient is zero. A low p-value suggests that it would be very unlikely to observe an estimate this far from zero if the true coefficient were really zero.
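The estimates, standard errors, t statistics and p-values printed by summary() can also be extracted programmatically via coef(summary(model)). A sketch on simulated data (since tv_model requires the Advertising dataset; the coefficients below are arbitrary):

```r
# Build a small simulated model, then pull out its coefficient table
set.seed(42)
x <- 1:100
y <- 5 + 0.5 * x + rnorm(100, sd = 2)
fit <- lm(y ~ x)
ctab <- coef(summary(fit))   # numeric matrix: Estimate, Std. Error, t value, Pr(>|t|)
ctab["x", "Pr(>|t|)"]        # p-value for the slope
```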

Question:

What are the confidence intervals for B_0 and B_1 ? (TODO at home)

In [50]:
names(tv_model)
  1. 'coefficients'
  2. 'residuals'
  3. 'effects'
  4. 'rank'
  5. 'fitted.values'
  6. 'assign'
  7. 'qr'
  8. 'df.residual'
  9. 'xlevels'
  10. 'call'
  11. 'terms'
  12. 'model'
In [6]:
res = summary(tv_model)
In [13]:
# Extract the Residual Standard Error
names(res)
RSE = res$sigma
print(paste("Residual Standard Error: ", format(RSE, digits = 2)))
mean_sales = mean(sales)
perc_error = RSE / mean_sales
print(paste("Relative error: ", format(perc_error, digits=2), " (relative deviation from true line)."))
  1. 'call'
  2. 'terms'
  3. 'residuals'
  4. 'coefficients'
  5. 'aliased'
  6. 'sigma'
  7. 'df'
  8. 'r.squared'
  9. 'adj.r.squared'
  10. 'fstatistic'
  11. 'cov.unscaled'
[1] "Residual Standard Error:  3.3"
[1] "Relative error:  0.23  (relative deviation from true line)."

Exercise

Create 100 random (x, y) data points. The x values are drawn from a uniform distribution; the y values are generated by the following equation:

y = 2 + 3x + e

where the values 'e' are drawn from a Gaussian distribution with mean 0 and an arbitrary standard deviation.

1) Plot the 100 generated data points and the function y = 2 + 3x

2) Fit a linear model and plot the inferred line.

3) Try to fit the model with different amounts of data provided to the regressor.

4) Observe how the performance changes as the standard deviation of the distribution of the variable 'e' increases.