This lesson is about linear regression, a very simple approach for supervised learning. In particular, linear regression is a useful tool for predicting a quantitative response.
A linear regression analysis assumes that there is a linear relationship between the response variable and the predictors. Specifically, linear regression assumes that a response variable y is a linear function of a set of predictor variables x1, x2, ..., xn:

y = b0 + b1*x1 + b2*x2 + ... + bn*xn + e

where b0, b1, ..., bn are the coefficients to be estimated and e is an error term.
Recall the Advertising data (Advertising.csv). Consider the sales (in thousands of units) for a particular product as a function of advertising budgets (in thousands of dollars) for TV, radio, and newspaper media.
# Load the dataset
adv_data=read.csv('../Datasets/Advertising.csv')
names(adv_data)
head(adv_data)
# Plot Sales as a function of TV advertisement budget
TV_values = adv_data[,'TV']
Sales_values = adv_data[,'sales']
plot(TV_values, Sales_values, xlab = 'TV', ylab= 'Sales', cex.lab=2, cex.axis=1.2)
We can put multiple graphs in a single plot by setting some graphical parameters with the help of the par() function. This allows us to consider how the value of Sales changes as a function of multiple types of advertising media (TV, radio, newspaper).
TV = adv_data[,'TV']
sales = adv_data[,'sales']
radio = adv_data[,'radio']
newspaper = adv_data[,'newspaper']
par(mfrow = c(1,3)) # Multiple plots arranged in a 1 by 3 grid.
# Plot Sales as a function of TV advertisement budget
plot(TV, sales, cex.lab=2, cex.axis=1.2)
# Plot Sales as a function of Radio advertisement budget
plot(radio, sales, cex.lab=2,cex.axis=1.2)
title("Advertising data",cex.main = 2,font.main= 4, col.main= "blue")
# Plot Sales as a function of Newspaper advertisement budget
plot(newspaper,sales,cex.lab=2,cex.axis=1.2)
It is possible to quickly build a linear model by using the function 'lm()'.
?lm
x_values = 1:1000
y_values = 8.0125457*x_values + 42.11245863
mydata <- data.frame(
x = x_values,
y = y_values)
linear_model = lm(y~x, mydata)
linear_model
With summary() we can analyse the regression model of y_values versus x_values (a perfect fit in this case).
summary(linear_model)
plot(x_values, y_values)
abline(linear_model, col = "blue", lty = 1, lwd = 2) #lty: The line type (0=blank, 1=solid, 2=dashed, 3=dotted, ..., 6)
The model perfectly estimated the coefficient of 'x' and the intercept (Estimate column). However, this was a simple example where x and y data have a perfect linear relationship.
Now try adding some noise, and fitting the model on only a subset of the data.
x_values = 1:1000
y_values = 8.0125457*x_values + 42.11245863
# Add some noise
noise = rnorm(1000, sd = 100)
y_values = y_values + noise
x_train = x_values[200:220]
y_train = y_values[200:220]
mydata <- data.frame(
x = x_train,
y = y_train)
linear_model = lm(y~x, mydata)
summary(linear_model)
plot(x_values, y_values)
points(x_train, y_train, col="blue")
abline(linear_model, col = "red", lty = 1, lwd = 3)
In the plot above, the training data are drawn in blue. The model is not able to estimate the linear relationship between x and y. Try to augment the number of training examples.
x_train = x_values[200:500]
y_train = y_values[200:500]
mydata <- data.frame(
x = x_train,
y = y_train)
linear_model = lm(y~x, mydata)
summary(linear_model)
plot(x_values, y_values)
points(x_train, y_train, col="blue")
abline(linear_model, col = "red", lty = 1, lwd = 3)
Now, fit a linear model to represent the relationship between sales and each of TV, radio, and newspaper respectively.
tv_model = lm(sales ~ TV)
radio_model=lm(sales ~ radio)
news_model = lm(sales ~ newspaper)
par(mfrow = c(1,3))
plot(TV, sales, cex.lab = 2, cex.axis = 1)
abline(tv_model, col = "blue", lty = 1, lwd = 2) #lty: The line type (0=blank, 1=solid, 2=dashed, 3=dotted, ..., 6)
plot(radio, sales,cex.lab=2,cex.axis=1.2) #lwd: The line width.
abline(radio_model, col="blue", lty=1, lwd=2)
plot(newspaper, sales,cex.lab=2,cex.axis=1.2)
abline(news_model, col="blue", lty=1, lwd=2)
The variable TV seems to be the most correlated with sales. Let's see the correlation between different variables.
cor_matrix = cor(adv_data[,-1],
use="complete.obs") # skip the first column. use="complete.obs" deletes missing values.
cor_matrix
In the table above, the correlation coefficients between all possible pairs of variables are shown. However, we can also visualize such a correlation matrix. There are different ways of visualizing a correlation matrix in R. We will use the following:
symnum() function
corrplot() function to plot a correlogram
The R function symnum() replaces correlation coefficients by symbols according to the level of the correlation. It takes the correlation matrix as an argument.
symnum(cor_matrix, abbr.colnames = FALSE)
As indicated in the legend, correlation coefficients between 0 and 0.3 are replaced by a space (" "); correlation coefficients between 0.3 and 0.6 are replaced by "."; etc.
The function cor() returns only the correlation coefficients between variables. We can use Hmisc R package to calculate the correlation p-values.
install.packages("Hmisc")
library("Hmisc")
cor_with_pvalues <- rcorr(as.matrix(adv_data[,-1]))
cor_with_pvalues
The output of the function rcorr() is a list containing the following elements:
r: the correlation matrix
n: the matrix of the number of observations used in analyzing each pair of variables
P: the p-values corresponding to the significance levels of correlations.
They can also be extracted individually.
# Extract the correlation coefficients
r = cor_with_pvalues$r
print(r)
# Extract the matrix of observations
n = cor_with_pvalues$n
print(n)
# Extract p-values
p = cor_with_pvalues$P
print(p)
The function corrplot(), in the package of the same name, creates a graphical display of a correlation matrix, highlighting the most correlated variables in a data table.
In this plot, correlation coefficients are colored according to the value.
# Install corrplot (one time)
#install.packages("corrplot")
# Import corrplot
library(corrplot)
corrplot(cor_matrix, type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45)
Positive correlations are displayed in blue and negative correlations in red. Colour intensity and the size of the circle are proportional to the correlation coefficients. On the right side of the correlogram, the legend colour bar shows the correlation coefficients and the corresponding colours.
#install.packages("PerformanceAnalytics")
library("PerformanceAnalytics")
my_data <- adv_data[,-1]
chart.Correlation(my_data, histogram=TRUE, pch=19)
Details of the above visualization:
The distribution of each variable is shown on the diagonal.
Below the diagonal, the bivariate scatter plots with a fitted line are shown.
Above the diagonal, the value of the correlation is shown, plus the significance level as stars. Each significance level is associated with a symbol: the p-value cut points (0, 0.001, 0.01, 0.05, 0.1, 1) correspond to the symbols ("***", "**", "*", ".", " ").
Now observe the linear model that represents the variable sales as a (linear) function of the variable TV.
summary(tv_model)
The TV coefficient B_1 is 0.047; this means that an increment of 1000 dollars in the TV advertising budget is associated with an increase of about 47 units of sales (0.047*1000 units).
In this case, the null hypothesis H0 is that the true coefficient is zero. A low p-value suggests that it would be rare to observe a result as unusual as this if the coefficient were really zero.
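The coefficient estimates, standard errors, t-values, and p-values can also be extracted programmatically from the summary object (a short sketch, assuming tv_model has been fitted as above):

```r
# The coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coef_table = summary(tv_model)$coefficients
print(coef_table)

# p-value of the test for the TV coefficient (H0: true coefficient is zero)
p_value_TV = coef_table["TV", "Pr(>|t|)"]
print(p_value_TV)
```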
What are the confidence intervals for B_0 and B_1? (TODO at home)
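As a hint for the exercise above: R provides the confint() function, which computes confidence intervals for the fitted coefficients (by default at the 95% level). A minimal sketch, assuming tv_model has been fitted as above:

```r
# 95% confidence intervals for the intercept B_0 and the slope B_1
confint(tv_model, level = 0.95)
```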
names(tv_model)
res = summary(tv_model)
# Extract the Residual Standard Error
names(res)
RSE = res$sigma
print(paste("Residual Standard Error: ", format(RSE, digits = 2)))
mean_sales = mean(sales)
perc_error = RSE / mean_sales
print(paste("Relative error: ", format(perc_error, digits=2), " (relative deviation from true line)."))
Create 100 random (x,y) data points. The x values are drawn from a uniform distribution; the y values are generated by the following equation:
y = 2 + 3x + e
where the values 'e' are generated from a Gaussian distribution with mean 0 and arbitrary standard deviation.
1) Plot the 100 generated data points and the function y = 2 + 3x
2) Fit a linear model and plot the inferred line.
3) Try to fit the model with different amounts of data provided to the regressor.
4) Observe how performance changes as the standard deviation of the distribution of the variable 'e' increases.
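A minimal sketch for getting started with the exercise (the uniform range, the standard deviation of 'e', and the seed are arbitrary choices; vary them for points 3 and 4):

```r
set.seed(42)                       # for reproducibility
n = 100
x = runif(n, min = 0, max = 10)    # x values from a uniform distribution
e = rnorm(n, mean = 0, sd = 2)     # Gaussian noise with mean 0
y = 2 + 3*x + e

# 1) Plot the generated data points and the true function y = 2 + 3x
plot(x, y)
abline(a = 2, b = 3, col = "blue", lwd = 2)    # true line

# 2) Fit a linear model and plot the inferred line
exercise_model = lm(y ~ x)
abline(exercise_model, col = "red", lty = 2, lwd = 2)

# 3) and 4): refit using a subset of the points, or regenerate 'e'
# with a larger sd, and compare the inferred (red) line with the
# true (blue) line.
```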