Data observation and preliminary considerations. The first step is to observe the type and number of available features, and to formulate hypotheses about their possible connections with the nature of the problem to be addressed.
The Boston data frame has 506 rows and 14 columns (13 predictors plus the response). The variables are described as follows:
crim: per capita crime rate by town.
zn: proportion of residential land zoned for lots over 25,000 sq.ft.
indus: proportion of non-retail business acres per town.
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox: nitrogen oxides concentration (parts per 10 million).
rm: average number of rooms per dwelling.
age: proportion of owner-occupied units built prior to 1940.
dis: weighted mean of distances to five Boston employment centres.
rad: index of accessibility to radial highways.
tax: full-value property-tax rate per $ 10,000.
ptratio: pupil-teacher ratio by town.
black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
lstat: lower status of the population (percent).
medv: median value of owner-occupied homes in $1000s.
#install.packages("MASS") # only if you don't have the MASS lib installed
library(MASS)
attach(Boston) # after this command all the variables related to Boston dataset are set
# Thanks to the 'attach()' function, I have set one variable for each column of the dataset
length(age)
class(crim)
# Without 'attach()', I should create each single variable as follows...
crim = Boston[,1]
zn = Boston[,2]
# ecc...
To find out more about the dataset we can type '?Boston'.
?Boston
head(Boston)
First, we can summarise the main statistical indices of each variable.
summary(Boston)
A more intuitive visualization of the feature distributions is given by the histograms. We are going to visualize both the histogram of each feature and an estimation of its distribution.
To do so, we will combine two commands:
hist(): this function computes and visualizes the histogram of the data. In particular, we will plot densities rather than raw counts, so that the total area of the histogram equals 1 (note that this does not force the individual bar heights to lie between 0 and 1). This behaviour is obtained by setting the parameter 'freq' to FALSE (see the documentation for more details), and it puts the histogram on the same scale as a density estimate.
density(): this function performs a kernel density estimation of the data. The output is a smooth model of the data distribution, built by summing a kernel function centered at each observation. This is useful for continuous variables.
hist(nox, freq=FALSE)
lines(density(nox), col="red")
To manage multiple plots we can use the par() function.
par(mfrow=c(2,2)) # set the plotting area into a 2*2 array
# First plot
hist(chas, freq=FALSE)
lines(density(chas), col="red") # note: 'chas' is a 0/1 dummy variable, so a kernel density estimate is not very meaningful here
# Second plot
hist(crim, freq=FALSE)
lines(density(crim), col="red")
# Third plot
hist(zn, freq=FALSE)
lines(density(zn), col="red")
# Fourth plot
hist(indus, freq=FALSE)
lines(density(indus), col="red")
par(mfrow=c(4,2)) # set the plotting area into a 4*2 array
# First plot
hist(rm, freq=FALSE)
lines(density(rm), col="red")
# Second plot
hist(age, freq=FALSE)
lines(density(age), col="red")
# Third plot
hist(dis, freq=FALSE)
lines(density(dis), col="red")
# Fourth plot
hist(rad, freq=FALSE)
lines(density(rad), col="red")
hist(tax, freq=FALSE)
lines(density(tax), col="red")
hist(ptratio, freq=FALSE)
lines(density(ptratio), col="red")
hist(black, freq=FALSE)
lines(density(black), col="red")
hist(medv, freq=FALSE)
lines(density(medv), col="red")
If we are interested in knowing more details about a specific distribution, it is possible to print the summary of the density estimate.
?density
density(nox)
It is possible to perform several statistical inferences over such data. In particular, we may be interested in predicting the median house value (medv) using all, or a subset of, the 13 available features (i.e., predictors).
It is therefore worth taking a closer look at the medv variable.
density(medv)
With this specific dataset, we may be interested in modelling a predictor for house prices (medv). Therefore, we start by focusing on the distribution of the medv values.
par(mfrow=c(1,2))
hist(medv, freq=FALSE)
lines(density(medv), col="red")
boxplot(medv)
quantile(medv, 0.25)
quantile(medv, 0.75)
We see that the values of medv are approximately normally distributed, with a few high-value outliers mostly above roughly 37, up to the maximum of 50.
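The "few outliers" claim can be checked directly. As a small addition (not part of the original analysis), boxplot.stats() applies the same 1.5*IQR rule that boxplot() uses for its whiskers:

```r
library(MASS)  # for the Boston data
# flag observations beyond 1.5*IQR from the quartiles
out = boxplot.stats(Boston$medv)$out
length(out)  # number of observations flagged as outliers
range(out)   # most sit in the upper tail, up to the maximum of 50
```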
In order to select proper predictors, we can first compute the pair-wise correlations.
#install.packages('corrplot')
library(corrplot)
cov_matrix <- cov(Boston)
cov_matrix
cor_matrix <- cor(Boston)
corrplot(cor_matrix, type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45)
The variable 'medv' has a high positive correlation with the variable 'rm' (average number of rooms per dwelling) and a strong negative correlation with 'lstat' (percentage of lower-status population). Since we are interested in predicting the variable 'medv', consider only the correlations in the 'medv' column of the correlation matrix.
cor_matrix[,'medv']
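To rank candidate predictors at a glance (an extra step, not in the original), we can sort those correlations by absolute value:

```r
library(MASS)
cor_matrix <- cor(Boston)
# rank variables by the absolute value of their correlation with medv
corr_rank = sort(abs(cor_matrix[, 'medv']), decreasing = TRUE)
corr_rank  # medv itself comes first (correlation 1), then lstat and rm
```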
The correlation values confirm a strong negative relationship between 'lstat' and 'medv' (about -0.74). Therefore, we will fit a linear regressor using 'lstat' as the predictor.
# Note: without the above 'attach()', this call would raise an error (alternatively, pass 'data = Boston' to lm()).
medv_lstat = lm(medv~lstat)
summary(medv_lstat)
# Extract the coefficient of the model
coef(medv_lstat)
# Extract the confidence interval for the coefficient estimates
# (lower and upper confidence limits for each parameter)
confint(medv_lstat)
Note that the 'lstat' confidence interval is narrow and does not include zero, so the estimated coefficient is significantly different from zero.
We can use 'names' to find out the information that can be extracted from the fitted model, and use the '$' operator to extract them.
names(medv_lstat)
medv_lstat$coefficients
The predict() function can be used to produce confidence intervals and prediction intervals for the prediction of the 'medv' value for a given 'lstat'.
In the following example, we input three values of lstat (5, 10 and 15) and compute the model confidence intervals and prediction intervals.
new_input_data = data.frame(
lstat = (c(5,10,15)) )
# confidence intervals
predict(medv_lstat, new_input_data, interval="confidence")
# prediction intervals
predict(medv_lstat, new_input_data, interval="prediction")
For instance, the 95% confidence interval associated with an lstat value of 10 is (24.47, 25.63), and the 95% prediction interval is (12.83, 37.28). As expected, the confidence and prediction intervals are centered around the same point (a predicted value of 25.05 for medv when lstat equals 10), but the prediction interval is substantially wider, since it also accounts for the variability of individual observations around the regression line.
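The gap between the two intervals can be inspected directly. As a small illustration (not in the original code), predict() with se.fit = TRUE returns the standard error of the estimated mean, which drives the confidence interval; the prediction interval additionally incorporates the residual standard deviation, which is much larger here:

```r
library(MASS)
medv_lstat = lm(medv ~ lstat, data = Boston)
pred = predict(medv_lstat, data.frame(lstat = 10), se.fit = TRUE)
pred$se.fit               # standard error of the fitted mean (small)
summary(medv_lstat)$sigma # residual standard deviation (much larger)
```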
We will now plot medv and lstat along with the least squares regression line using the plot() and abline() functions.
plot(lstat,medv)
abline(medv_lstat, col="red", lwd=3)
The Boston data set contains thirteen predictors; to perform a regression using all of them, we can use the following shorthand.
medv_all = lm(medv~., data = Boston)
summary(medv_all)
# It is possible to extract each single output from the summary
names(summary(medv_all))
summary(medv_all)$r.sq
summary(medv_all)$sigma
summary(medv_all)$fstatistic
# It is easier if we first assign it to a variable
summ_out= summary(medv_all)
summ_out$r.sq
summ_out$sigma
summ_out$fstatistic
What if we would like to perform a regression using all of the variables but one or two? For example, in the above regression output, the predictors 'age' and 'indus' have high p-values (i.e., we cannot reject the hypothesis that their true coefficients are zero).
So we may want to run a regression excluding these predictors. The following syntax fits a regression using all predictors except 'age' and 'indus'.
medv_exAgeIndus=lm(medv~.-age-indus, data= Boston)
summary(medv_exAgeIndus)
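An equivalent, often convenient alternative is update(), which refits an existing model with a modified formula (a sketch; the result is identical to the explicit call above):

```r
library(MASS)
medv_all = lm(medv ~ ., data = Boston)
# refit the full model, dropping age and indus from the formula
medv_exAgeIndus2 = update(medv_all, ~ . - age - indus)
length(coef(medv_exAgeIndus2))  # 12: intercept + 11 remaining predictors
```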
# Scale the data for PCA (standardize each column to zero mean and unit variance)
boston_scaled = scale(Boston)
# Apply PCA (equivalently: prcomp(Boston, scale. = TRUE))
pca_boston = prcomp(boston_scaled)
names(pca_boston)
# Components (all columns)
head(pca_boston$x)
# Components vs. variables (i.e., variables' coefficients)
pca_boston$rotation
?prcomp
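Before keeping only the first two components, it is worth checking how much variance they capture (an extra sanity check, not part of the original walkthrough). summary() reports the proportion of variance explained by each component:

```r
library(MASS)
pca_boston = prcomp(scale(Boston))
# importance matrix: std dev, proportion of variance, cumulative proportion
imp = summary(pca_boston)$importance
imp["Proportion of Variance", ]
imp["Cumulative Proportion", 2]  # variance retained by the first two components
```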
# Plot the first two components
pca_scores = data.frame(pca_boston$x[,1:2])
plot(pca_scores)
data = pca_boston$x[,1:2]
# K-Means Clustering with k clusters
k = 3
fit <- kmeans(data, k)
pairs(data, col=c(1:k)[fit$cluster])
k = 4
fit <- kmeans(data, k)
pairs(data, col=c(1:k)[fit$cluster])
k = 5
fit <- kmeans(data, k)
pairs(data, col=c(1:k)[fit$cluster])
The elbow method consists of plotting the "sparsity" of a clustering (usually measured as the within-cluster sum of squares, WCSS) against the number of clusters, and picking the "elbow" of the curve as the best number of clusters to use.
Try to implement the elbow method to determine the best choice for k, the number of clusters to be used by the kmeans model. Then, perform the clustering analysis using that k value.
Hint: if the analysis is correct, the elbow plot will be a non-increasing function.
names(fit)
max_k = 50
dist_sequence = c()
for(k in seq(max_k)){
  fit <- kmeans(data, k, nstart = 10) # multiple random starts give a more stable fit
  labels = fit$cluster
  centroids = fit$centers
  tot_distance = c()
  for(i in seq(k)){
    cent_x = centroids[i,1]
    cent_y = centroids[i,2]
    # within-cluster squared distances from each point to its centroid
    cluster_data = data[which(labels==i), , drop=FALSE] # drop=FALSE keeps the matrix shape even for singleton clusters
    for (j in seq_len(nrow(cluster_data))){
      sq_dist = (cluster_data[j,1] - cent_x)^2 + (cluster_data[j,2] - cent_y)^2
      tot_distance = c(tot_distance, sq_dist)
    }
  }
  dist_sequence = c(dist_sequence, sum(tot_distance)) # WCSS: total within-cluster sum of squares (equals fit$tot.withinss)
}
plot(seq(max_k),dist_sequence,
xlab = '# of clusters',
ylab = 'WCSS',
type = "l", lty = 1)
# TODO: add also the BCSS to this plot
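As a sketch for the TODO above, kmeans' built-in fields can replace the manual loop: fit$tot.withinss is the WCSS and fit$betweenss the BCSS. The two always sum to the constant total sum of squares, so the BCSS curve is non-decreasing wherever the WCSS curve is non-increasing.

```r
library(MASS)
set.seed(1)                       # kmeans uses random initial centroids
data = prcomp(scale(Boston))$x[, 1:2]
max_k = 50
wcss = numeric(max_k)
bcss = numeric(max_k)
for (k in seq(max_k)) {
  fit = kmeans(data, k, nstart = 10)
  wcss[k] = fit$tot.withinss      # within-cluster sum of squares
  bcss[k] = fit$betweenss         # between-cluster sum of squares
}
plot(seq(max_k), wcss, type = "l", xlab = "# of clusters", ylab = "Sum of squares")
lines(seq(max_k), bcss, col = "blue", lty = 2)
legend("right", legend = c("WCSS", "BCSS"), col = c("black", "blue"), lty = c(1, 2))
```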