Analysis Example: Boston Houses

Statistical Laboratory

Alessandro Ortis - University of Catania

Step 1: feature analysis

Data observation and preliminary considerations. The first step is to observe the type and amount of available features, and to make hypotheses about possible connections with the nature of the problem to be addressed.

The Boston data frame has 506 rows and 14 columns (13 predictors plus the response variable). The variables are described as follows:

  • crim: per capita crime rate by town.

  • zn: proportion of residential land zoned for lots over 25,000 sq.ft.

  • indus: proportion of non-retail business acres per town.

  • chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

  • nox: nitrogen oxides concentration (parts per 10 million).

  • rm: average number of rooms per dwelling.

  • age: proportion of owner-occupied units built prior to 1940.

  • dis: weighted mean of distances to five Boston employment centres.

  • rad: index of accessibility to radial highways.

  • tax: full-value property-tax rate per $10,000.

  • ptratio: pupil-teacher ratio by town.

  • black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.

  • lstat: lower status of the population (percent).

  • medv: median value of owner-occupied homes in $1000s.

In [1]:
#install.packages("MASS") # only if you don't have the MASS lib installed
library(MASS)
attach(Boston) # after this command, all the columns of the Boston dataset become accessible as variables
Warning message:
"package 'MASS' was built under R version 3.6.3"
In [2]:
# Thanks to the 'attach()' function, I have set one variable for each column of the dataset
length(age)
class(crim)
506
'numeric'
In [3]:
# Without 'attach()', we would have to create each variable individually, as follows...
crim = Boston[,1]
zn = Boston[,2]
# etc...

To find out more about the dataset we can type '?Boston'.

In [5]:
?Boston
In [5]:
head(Boston)
     crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat medv
1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98 24.0
2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14 21.6
3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03 34.7
4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94 33.4
5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33 36.2
6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21 28.7

First, we can summarise the main statistical indices of each variable.

In [6]:
summary(Boston)
      crim                zn             indus            chas        
 Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
 1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
 Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
 Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
 3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
 Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
      nox               rm             age              dis        
 Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
 1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
 Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
 Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
 3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
 Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
      rad              tax           ptratio          black       
 Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
 1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
 Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
 Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
 3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
 Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
     lstat            medv      
 Min.   : 1.73   Min.   : 5.00  
 1st Qu.: 6.95   1st Qu.:17.02  
 Median :11.36   Median :21.20  
 Mean   :12.65   Mean   :22.53  
 3rd Qu.:16.95   3rd Qu.:25.00  
 Max.   :37.97   Max.   :50.00  
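Note that summary() does not report the standard deviation of each variable; a quick way to obtain it is sapply(), which applies a function to every column (a small sketch):

```r
library(MASS)

# Standard deviation of every column of the data frame
round(sapply(Boston, sd), 3)
```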

A more intuitive visualization of the feature distributions is given by histograms. We are going to visualize both the histogram of each feature and an estimate of its distribution.

To do so, we will combine two commands:

  • hist(): this function computes and visualizes the histogram of the data. We will set the parameter 'freq' to FALSE, so that the counts are rescaled and the histogram has total area 1 (a density scale); this makes the histogram directly comparable with the density estimate (see the documentation for more details).

  • density(): this function performs a kernel density estimation of the data. The output is a smooth model of the data distribution, built as a mixture of kernel functions centred at the observed values. This is useful for continuous variables.

In [7]:
hist(nox, freq=FALSE)
lines(density(nox), col="red")

To manage multiple plots we can use the par() function.

In [8]:
par(mfrow=c(2,2))    # set the plotting area into a 2*2 array

# First plot
hist(chas, freq=FALSE)
lines(density(chas), col="red")
# Second plot
hist(crim, freq=FALSE)
lines(density(crim), col="red")
# Third plot
hist(zn, freq=FALSE)
lines(density(zn), col="red")
# Fourth plot
hist(indus, freq=FALSE)
lines(density(indus), col="red")
In [9]:
par(mfrow=c(4,2))    # set the plotting area into a 4*2 array

# First plot
hist(rm, freq=FALSE)
lines(density(rm), col="red")
# Second plot
hist(age, freq=FALSE)
lines(density(age), col="red")
# Third plot
hist(dis, freq=FALSE)
lines(density(dis), col="red")
# Fourth plot
hist(rad, freq=FALSE)
lines(density(rad), col="red")

hist(tax, freq=FALSE)
lines(density(tax), col="red")

hist(ptratio, freq=FALSE)
lines(density(ptratio), col="red")

hist(black, freq=FALSE)
lines(density(black), col="red")

hist(medv, freq=FALSE)
lines(density(medv), col="red")

If we are interested in knowing more details about a specific distribution, it is possible to print the summary indices of the density estimate.

In [8]:
?density
In [6]:
density(nox)
Call:
	density.default(x = nox)

Data: nox (506 obs.);	Bandwidth 'bw' = 0.03002

       x                y           
 Min.   :0.2949   Min.   :0.002129  
 1st Qu.:0.4615   1st Qu.:0.270183  
 Median :0.6280   Median :1.585293  
 Mean   :0.6280   Mean   :1.499679  
 3rd Qu.:0.7945   3rd Qu.:2.566825  
 Max.   :0.9611   Max.   :3.594489  

It is possible to perform several statistical inferences on these data. In particular, we may be interested in predicting the median house value (medv) from all, or a subset of, the 13 available features (i.e., predictors).

It is therefore worth taking a look at the medv variable.

In [7]:
density(medv)
Call:
	density.default(x = medv)

Data: medv (506 obs.);	Bandwidth 'bw' = 1.542

       x                 y            
 Min.   : 0.3745   Min.   :1.383e-05  
 1st Qu.:13.9373   1st Qu.:3.813e-03  
 Median :27.5000   Median :1.112e-02  
 Mean   :27.5000   Mean   :1.841e-02  
 3rd Qu.:41.0628   3rd Qu.:2.713e-02  
 Max.   :54.6255   Max.   :6.646e-02  

With this specific dataset, we may be interested in modelling a predictor for house prices (medv). Therefore, we start by focusing on the distribution of the medv values.

In [9]:
par(mfrow=c(1,2)) 

hist(medv, freq=FALSE)
lines(density(medv), col="red")

boxplot(medv)
quantile(medv, 0.25)
quantile(medv, 0.75)
25%: 17.025
75%: 25

We see that the medv values are approximately normally distributed, with a few outliers between 35 and 55.
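This impression can be checked directly: a normal Q-Q plot shows the deviation in the upper tail, and the Shapiro-Wilk test gives a formal assessment (a sketch; the test is expected to reject strict normality, since medv is right-skewed and capped at 50):

```r
library(MASS)

# Q-Q plot: points far from the line reveal the high-value outliers
qqnorm(Boston$medv)
qqline(Boston$medv, col = "red")

# Shapiro-Wilk normality test: a small p-value rejects normality
shapiro.test(Boston$medv)
```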

In order to select proper predictors, we can first compute the pair-wise correlations.

In [10]:
#install.packages('corrplot')
library(corrplot)
corrplot 0.92 loaded
In [11]:
cov_matrix <- cov(Boston)
cov_matrix
              crim            zn         indus          chas           nox            rm           age           dis           rad           tax       ptratio         black         lstat          medv
crim     73.9865782  -40.2159560    23.9923388  -0.122108643   0.419593894   -1.32503785    85.4053223   -6.87672154   46.84776101    844.821538    5.39933079   -302.381816   27.98616788   -30.7185080
zn      -40.2159560  543.9368137   -85.4126481  -0.252925293  -1.396148200    5.11251341  -373.9015482   32.62930405  -63.34869487  -1236.453735  -19.77657066    373.721402  -68.78303690    77.3151755
indus    23.9923388  -85.4126481    47.0644425   0.109668806   0.607073693   -1.88795657   124.5139031  -10.22809746   35.54997135    833.360290    5.69210400   -223.579756   29.58027028   -30.5208228
chas     -0.1221086   -0.2529253     0.1096688   0.064512973   0.002684303    0.01628475     0.6185712   -0.05304296   -0.01629554     -1.523367   -0.06681916      1.131325   -0.09781626     0.4094095
nox       0.4195939   -1.3961482     0.6070737   0.002684303   0.013427636   -0.02460345     2.3859272   -0.18769584    0.61692945     13.046286    0.04739732     -4.020570    0.48894617    -0.4554124
rm       -1.3250378    5.1125134    -1.8879566   0.016284745  -0.024603450    0.49367085    -4.7519292    0.30366342   -1.28381457    -34.583448   -0.54076322      8.215006   -3.07974141     4.4934459
age      85.4053223 -373.9015482   124.5139031   0.618571205   2.385927202   -4.75192919   792.3583985  -44.32937946  111.77084648   2402.690122   15.93692134   -702.940328  121.07772456   -97.5890166
dis      -6.8767215   32.6293041   -10.2280975  -0.053042959  -0.187695836    0.30366342   -44.3293795    4.43401514   -9.06825201   -189.664592   -1.05977455     56.040356   -7.47332906     4.8402286
rad      46.8477610  -63.3486949    35.5499714  -0.016295543   0.616929453   -1.28381457   111.7708465   -9.06825201   75.81636598   1335.756577    8.76071616   -353.276219   30.38544241   -30.5612280
tax     844.8215381 -1236.4537354  833.3602902  -1.523367119  13.046285530  -34.58344778  2402.6901225 -189.66459173 1335.75657653  28404.759488  168.15314053  -6797.911215  654.71451963  -726.2557164
ptratio   5.3993308  -19.7765707     5.6921040  -0.066819160   0.047397324   -0.54076322    15.9369213   -1.05977455    8.76071616    168.153141    4.68698912    -35.059527    5.78272856   -10.1106571
black  -302.3818163  373.7214023  -223.5797555   1.131324541  -4.020569639    8.21500573  -702.9403283   56.04035576 -353.27621939  -6797.911215  -35.05952730   8334.752263 -238.66751554   279.9898338
lstat    27.9861679  -68.7830369    29.5802703  -0.097816264   0.488946168   -3.07974141   121.0777246   -7.47332906   30.38544241    654.714520    5.78272856   -238.667516   50.99475951   -48.4475383
medv    -30.7185080   77.3151755   -30.5208228   0.409409463  -0.455412432    4.49344588   -97.5890166    4.84022864  -30.56122804   -726.255716  -10.11065714    279.989834  -48.44753832    84.5867236
In [12]:
cor_matrix <- cor(Boston)
corrplot(cor_matrix, type = "upper", order = "hclust", 
         tl.col = "black", tl.srt = 45)

The variable 'medv' has a high positive correlation with 'rm' (average number of rooms per dwelling) and a strong negative correlation with 'lstat' (percentage of lower-status population). Since we are interested in predicting 'medv', we consider only the correlations in the 'medv' column of the correlation matrix.

In [16]:
cor_matrix[,'medv']
crim    -0.388304608586812
zn       0.360445342450543
indus   -0.483725160028373
chas     0.175260177190298
nox     -0.427320772373283
rm       0.695359947071539
age     -0.376954565004596
dis      0.249928734085904
rad     -0.381626230639778
tax     -0.468535933567767
ptratio -0.507786685537562
black    0.333460819657067
lstat   -0.737662726174015
medv     1

The correlation values confirm a strong negative correlation (about -0.74) between 'lstat' and 'medv'. Therefore, we will fit a linear regressor using the predictor 'lstat'.

Step 2: regression analysis (medv)

In [13]:
# Note: without the above 'attach()', this would raise an error.
medv_lstat = lm(medv~lstat)
In [14]:
summary(medv_lstat)
Call:
lm(formula = medv ~ lstat)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.168  -3.990  -1.318   2.034  24.500 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 34.55384    0.56263   61.41   <2e-16 ***
lstat       -0.95005    0.03873  -24.53   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.216 on 504 degrees of freedom
Multiple R-squared:  0.5441,	Adjusted R-squared:  0.5432 
F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16
In [19]:
# Extract the coefficient of the model
coef(medv_lstat)
(Intercept)
34.5538408793831
lstat
-0.950049353757991
In [20]:
# Extract the confidence interval for the coefficient estimates
# (lower and upper confidence limits for each parameter)
confint(medv_lstat)
                2.5 %     97.5 %
(Intercept) 33.448457 35.6592247
lstat       -1.026148 -0.8739505

Note that the 'lstat' confidence interval is close to zero, even though zero is not included.
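The interval can also be reconstructed by hand from the coefficient estimate and its standard error, which clarifies where confint() gets its numbers (a sketch; 'fit', 'est', 'se' and 't_crit' are illustrative names):

```r
library(MASS)
fit <- lm(medv ~ lstat, data = Boston)

# 95% CI: estimate +/- t-quantile * standard error
est    <- coef(summary(fit))["lstat", "Estimate"]
se     <- coef(summary(fit))["lstat", "Std. Error"]
t_crit <- qt(0.975, df = fit$df.residual)  # 504 residual degrees of freedom
c(lower = est - t_crit * se, upper = est + t_crit * se)
```

The result matches confint(fit)["lstat", ].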

We can use 'names' to find out the information that can be extracted from the fitted model, and use the '$' operator to extract them.

In [21]:
names(medv_lstat)
  1. 'coefficients'
  2. 'residuals'
  3. 'effects'
  4. 'rank'
  5. 'fitted.values'
  6. 'assign'
  7. 'qr'
  8. 'df.residual'
  9. 'xlevels'
  10. 'call'
  11. 'terms'
  12. 'model'
In [22]:
medv_lstat$coefficients
(Intercept)
34.5538408793831
lstat
-0.950049353757991

Predictions

The predict() function can be used to produce confidence intervals and prediction intervals for the prediction of the 'medv' value for a given 'lstat'.

In the following example, we input three values of lstat (5, 10 and 15) and compute the model confidence intervals and prediction intervals.

In [23]:
new_input_data = data.frame(
                        lstat = (c(5,10,15)) )

# confidence intervals
predict(medv_lstat, new_input_data, interval="confidence")
       fit      lwr      upr
1 29.80359 29.00741 30.59978
2 25.05335 24.47413 25.63256
3 20.30310 19.73159 20.87461
In [24]:
# prediction intervals
predict(medv_lstat, new_input_data, interval="prediction")
       fit       lwr      upr
1 29.80359 17.565675 42.04151
2 25.05335 12.827626 37.27907
3 20.30310  8.077742 32.52846

For instance, the 95% confidence interval associated with an lstat value of 10 is (24.47, 25.63), and the 95% prediction interval is (12.83, 37.28). As expected, both intervals are centered around the same point (a predicted medv of 25.05 when lstat equals 10), but the prediction interval is substantially wider, since it also accounts for the variability of individual observations around the regression line.
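The width difference can be made explicit by computing both intervals on the same inputs (a sketch; the model is refitted here so the snippet is self-contained):

```r
library(MASS)
fit <- lm(medv ~ lstat, data = Boston)
new_input <- data.frame(lstat = c(5, 10, 15))

ci <- predict(fit, new_input, interval = "confidence")
pi <- predict(fit, new_input, interval = "prediction")

# Same centre, different widths: the prediction interval also covers
# the noise of individual observations around the regression line
cbind(fit = ci[, "fit"],
      ci_width = ci[, "upr"] - ci[, "lwr"],
      pi_width = pi[, "upr"] - pi[, "lwr"])
```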

We will now plot medv and lstat along with the least squares regression line using the plot() and abline() functions.

In [25]:
plot(lstat,medv)
abline(medv_lstat, col="red", lwd=3)

The Boston data set contains thirteen predictors; to perform a regression using all of them we can use the following shorthand.

In [26]:
medv_all = lm(medv~., data = Boston)
summary(medv_all)
Call:
lm(formula = medv ~ ., data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.595  -2.730  -0.518   1.777  26.199 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
crim        -1.080e-01  3.286e-02  -3.287 0.001087 ** 
zn           4.642e-02  1.373e-02   3.382 0.000778 ***
indus        2.056e-02  6.150e-02   0.334 0.738288    
chas         2.687e+00  8.616e-01   3.118 0.001925 ** 
nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
age          6.922e-04  1.321e-02   0.052 0.958229    
dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
tax         -1.233e-02  3.760e-03  -3.280 0.001112 ** 
ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
black        9.312e-03  2.686e-03   3.467 0.000573 ***
lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.745 on 492 degrees of freedom
Multiple R-squared:  0.7406,	Adjusted R-squared:  0.7338 
F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16
In [27]:
# It is possible to extract each single output from the summary
names(summary(medv_all))
summary(medv_all)$r.sq
summary(medv_all)$sigma
summary(medv_all)$fstatistic

# It is easier if we first assign the summary to a variable
summ_out = summary(medv_all)
summ_out$r.sq
summ_out$sigma
summ_out$fstatistic
  1. 'call'
  2. 'terms'
  3. 'residuals'
  4. 'coefficients'
  5. 'aliased'
  6. 'sigma'
  7. 'df'
  8. 'r.squared'
  9. 'adj.r.squared'
  10. 'fstatistic'
  11. 'cov.unscaled'
0.740642664109409
4.74529818169963
value
108.076666174326
numdf
13
dendf
492
0.740642664109409
4.74529818169963
value
108.076666174326
numdf
13
dendf
492

What if we would like to perform a regression using all of the variables but a few? For example, in the above regression output, the predictors 'age' and 'indus' have high p-values (i.e., we cannot reject the hypothesis that their true coefficients are zero).

So we may want to run a regression excluding these predictors. The following syntax fits a regression using all predictors except 'age' and 'indus'.

In [28]:
medv_exAgeIndus=lm(medv~.-age-indus, data= Boston)
summary(medv_exAgeIndus)
Call:
lm(formula = medv ~ . - age - indus, data = Boston)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.5984  -2.7386  -0.5046   1.7273  26.2373 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  36.341145   5.067492   7.171 2.73e-12 ***
crim         -0.108413   0.032779  -3.307 0.001010 ** 
zn            0.045845   0.013523   3.390 0.000754 ***
chas          2.718716   0.854240   3.183 0.001551 ** 
nox         -17.376023   3.535243  -4.915 1.21e-06 ***
rm            3.801579   0.406316   9.356  < 2e-16 ***
dis          -1.492711   0.185731  -8.037 6.84e-15 ***
rad           0.299608   0.063402   4.726 3.00e-06 ***
tax          -0.011778   0.003372  -3.493 0.000521 ***
ptratio      -0.946525   0.129066  -7.334 9.24e-13 ***
black         0.009291   0.002674   3.475 0.000557 ***
lstat        -0.522553   0.047424 -11.019  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.736 on 494 degrees of freedom
Multiple R-squared:  0.7406,	Adjusted R-squared:  0.7348 
F-statistic: 128.2 on 11 and 494 DF,  p-value: < 2.2e-16
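An alternative to retyping the formula is the update() function, which refits an existing model with a modified formula (a sketch; 'medv_upd' is an illustrative name):

```r
library(MASS)
medv_all <- lm(medv ~ ., data = Boston)

# Refit the full model without 'age' and 'indus'
medv_upd <- update(medv_all, ~ . - age - indus)
summary(medv_upd)$adj.r.squared
```

This produces the same fit as lm(medv ~ . - age - indus, data = Boston).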

Step 3: PCA and clustering analysis

In [18]:
# Scale the data for PCA (standardize)
boston_scaled = scale(Boston)

# Apply PCA
pca_boston = prcomp(boston_scaled)
names(pca_boston)
  1. 'sdev'
  2. 'rotation'
  3. 'center'
  4. 'scale'
  5. 'x'
In [19]:
# Components (all columns)
head(pca_boston$x)
        PC1        PC2         PC3        PC4         PC5        PC6       PC7         PC8        PC9          PC10        PC11       PC12       PC13        PC14
  -2.085280  0.4923660 -0.33565863 0.02806062  1.01179950 -0.2618327 0.3275539 -0.16005808 -0.4706823 -0.2055695943 -0.77929505 -0.1095206 -0.4941907  0.24768597
  -1.372024 -0.1707548 -0.96500929 0.43197898  0.25439282  0.3034788 0.5585569  0.28836506 -0.1956362 -0.2459951705 -0.27726872  0.5881189 -0.1129447 -0.11300499
  -2.374204  0.9131235 -0.08993678 1.12280202 -0.03275573  0.5083969 0.4870519 -0.08240874  0.0541731 -0.1948119933  0.02887660  0.4160065  0.3566767  0.05125674
  -2.834974  0.1946773  0.06048549 1.06462885 -0.45987914  0.7133022 0.6227126 -0.23948531 -0.3582658 -0.1557370949 -0.24432275  0.1345659  0.5772376  0.08953590
  -2.770174  0.4328708  0.06397874 1.12852010 -0.38180095  0.6552062 0.7038004  0.10252408 -0.4083493 -0.0004210339  0.00775053  0.2213430  0.7776655  0.14816451
  -2.298644 -0.3283548 -0.44993218 0.69305853 -0.29988892  0.5822707 0.6480631 -0.13247191 -0.4656342  0.1100695972 -0.48317508  0.3511512  0.4333900  0.02416410
In [20]:
# Components vs. variables (i.e., variables' coefficients)
pca_boston$rotation
                PC1          PC2          PC3          PC4          PC5         PC6          PC7         PC8          PC9         PC10         PC11        PC12         PC13         PC14
crim    0.242284451 -0.065873108  0.395077419 -0.100366211  0.004957659 -0.22462703  0.777083366 -0.15740140  0.254211798 -0.071384615 -0.071068781  0.06327612  0.097032312  0.059114176
zn     -0.245435005 -0.148002653  0.394545713 -0.342958421  0.114495002 -0.33574694 -0.274178365  0.38031404  0.382899480  0.245579673 -0.127709065 -0.22112210 -0.132375830 -0.096296807
indus   0.331859746  0.127075668 -0.066081913  0.009626936 -0.022583692 -0.08082495 -0.340273839 -0.17174578  0.627048264 -0.254827026  0.273797614  0.34840828  0.083716854 -0.235472877
chas   -0.005027133  0.410668763 -0.125305293 -0.700406497 -0.535197817  0.16264906  0.074075775  0.03292700 -0.018642967 -0.041706916 -0.009968402 -0.01903975 -0.049917454  0.023488966
nox     0.325193880  0.254276363 -0.046475549 -0.053707583  0.194605570 -0.14899191 -0.198092965 -0.04745838 -0.043024391 -0.211620349 -0.437475550 -0.44909357  0.524974687  0.087649148
rm     -0.202816554  0.434005810  0.353406095  0.293357309 -0.008320481  0.13108056  0.074084938  0.43761566 -0.003666947 -0.526133916  0.223951923 -0.12560554 -0.049893596  0.007190515
age     0.296976574  0.260303205 -0.200823078  0.078426326  0.149750092 -0.06086960  0.118580363  0.58810569 -0.043265822  0.245647942 -0.329630928  0.48633905 -0.051462562 -0.038227027
dis    -0.298169809 -0.359149977  0.157068710 -0.184747787 -0.106219480  0.01162335 -0.104397844  0.12823060 -0.175802196 -0.299412026 -0.114600078  0.49356822  0.552292172  0.047124029
rad     0.303412754  0.031149596  0.418510334  0.051374381 -0.230352185 -0.13493732 -0.137080107 -0.07464872 -0.463439397  0.115793486  0.042213348  0.01863641 -0.006278474 -0.634975332
tax     0.324033052  0.008851406  0.343232194  0.026810695 -0.163425820 -0.18847146 -0.313984433 -0.07099212 -0.179446555 -0.008366413  0.042794054  0.17042179 -0.242987756  0.698822190
ptratio 0.207679535 -0.314623061  0.000399092  0.342036328 -0.615707380  0.27901731  0.001485608  0.28346960  0.274525949  0.160474164 -0.099991841 -0.23214842  0.188347079  0.055738160
black  -0.196638358  0.026481032 -0.361375914  0.201741185 -0.367460674 -0.78590728  0.074842780  0.04444175 -0.060975651 -0.146292237  0.039194858 -0.04152885 -0.021078199 -0.016165280
lstat   0.311397955 -0.201245177 -0.161060336 -0.242621217  0.178358870 -0.09197211  0.083213083  0.35748247 -0.171810921  0.066647267  0.683032690 -0.18189209  0.249489863  0.083143795
medv   -0.266636396  0.444924411  0.163188735  0.180297553 -0.050659893 -0.05402804 -0.009964973 -0.15230879  0.070751083  0.575547284  0.242001064  0.09828580  0.469629324  0.134127182
In [15]:
?prcomp
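Before restricting the analysis to the first two components, it is worth checking how much of the total variance they explain; the squared standard deviations give the component variances (a sketch; summary(pca_boston) reports the same figures):

```r
library(MASS)
pca_boston <- prcomp(scale(Boston))

# Proportion of variance explained (PVE) by each component
pve <- pca_boston$sdev^2 / sum(pca_boston$sdev^2)
round(cumsum(pve), 3)

# Scree plot
plot(pve, type = "b",
     xlab = "Principal component",
     ylab = "Proportion of variance explained")
```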
In [21]:
# Plot the first two components
pca_scores = data.frame(pca_boston$x[,1:2])
plot(pca_scores)
In [22]:
data = pca_boston$x[,1:2]

# K-Means Clustering with k clusters
k = 3
fit <- kmeans(data, k)
pairs(data, col=c(1:k)[fit$cluster])
In [24]:
k = 4
fit <- kmeans(data, k)
pairs(data, col=c(1:k)[fit$cluster])
In [36]:
k = 5
fit <- kmeans(data, k)
pairs(data, col=c(1:k)[fit$cluster])

Elbow method

The elbow method consists of plotting the within-cluster dispersion of a clustering (usually the within-cluster sum of squares, WCSS) against the number of clusters, and picking the ''elbow'' of the curve as the best number of clusters to use.

Try to implement the elbow method to determine the best choice of k, i.e., the number of clusters to be found by the kmeans model. Then, perform the clustering analysis using that k value.

Hint: if the analysis is correct, the elbow plot will be a non-increasing function.

In [40]:
names(fit)
  1. 'cluster'
  2. 'centers'
  3. 'totss'
  4. 'withinss'
  5. 'tot.withinss'
  6. 'betweenss'
  7. 'size'
  8. 'iter'
  9. 'ifault'
In [86]:
max_k = 50

dist_sequence = c()
for(k in seq(max_k)){
    tot_distance = c()
    fit <- kmeans(data, k, nstart = 10)  # multiple restarts reduce the risk of a bad local optimum
    labels = fit$cluster
    centroids = fit$centers
    for(i in seq(k)){
        cent_x = centroids[i,1]
        cent_y = centroids[i,2]
        
        # within-cluster squared differences from the centroid
        # ('drop = FALSE' keeps a one-row matrix when a cluster contains a single point)
        cluster_data = data[which(labels==i), , drop = FALSE]
        for (j in seq(nrow(cluster_data))){
            sq_dist = (cluster_data[j,1] - cent_x)^2 + (cluster_data[j,2] - cent_y)^2
            tot_distance = c(tot_distance, sq_dist)
        }
    }
    dist_sequence = c(dist_sequence, sum(tot_distance)) # WCSS: total within-cluster sum of squares

}
In [ ]:

In [87]:
plot(seq(max_k),dist_sequence, 
     xlab = '# of clusters',
     ylab = 'WCSS',
     type = "l", lty = 1)
# TODO: add also the BCSS to this plot
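Note that kmeans() already returns the quantities needed here: 'tot.withinss' is the WCSS and 'betweenss' is the BCSS. The same elbow curve can therefore be obtained without the manual loop (a sketch; 'nstart' restarts reduce the chance of a bad local optimum):

```r
library(MASS)
pca_boston <- prcomp(scale(Boston))
data <- pca_boston$x[, 1:2]

max_k <- 50
wcss <- sapply(seq(max_k), function(k) {
    kmeans(data, centers = k, nstart = 10, iter.max = 25)$tot.withinss
})
plot(seq(max_k), wcss, type = "l",
     xlab = "# of clusters", ylab = "WCSS")
```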
In [ ]: