Bivariate Analysis

Statistical Laboratory

Alessandro Ortis - University of Catania

Bivariate analysis is the simultaneous analysis of two attributes. It aims to explore the relationship between a pair of variables and to assess the presence and strength of that relationship. There are different types of bivariate analysis, depending on the nature of the variables (numerical or categorical).

First, load the iris dataset, which will be used in the following examples.

In [2]:
library(datasets)
data(iris)
summary(iris)
head(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
                
                
                
Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
         5.1          3.5           1.4          0.2  setosa
         4.9          3.0           1.4          0.2  setosa
         4.7          3.2           1.3          0.2  setosa
         4.6          3.1           1.5          0.2  setosa
         5.0          3.6           1.4          0.2  setosa
         5.4          3.9           1.7          0.4  setosa
In [2]:
names(iris)
  1. 'Sepal.Length'
  2. 'Sepal.Width'
  3. 'Petal.Length'
  4. 'Petal.Width'
  5. 'Species'

Numerical vs. Numerical

When both variables are numerical, it is useful to visualize them in a scatterplot. The plot can suggest the shape of the function that relates the two variables (e.g., linear, polynomial, trigonometric).

In [3]:
pairs(iris[1:4]) # Takes all the columns except the column 'Species'.

By combining the parameters 'pch' and 'bg', it is possible to distinguish data belonging to different categories in the same scatterplot. With pch = 21 each point is drawn as a circle whose fill color is given by 'bg'.

In [4]:
#?unclass
iris$Species
unclass(iris$Species)
iris$Species (factor, 150 values):
  entries   1-50 : setosa
  entries  51-100: versicolor
  entries 101-150: virginica
Levels: 'setosa', 'versicolor', 'virginica'

unclass(iris$Species) (integer codes, 150 values):
  entries   1-50 : 1
  entries  51-100: 2
  entries 101-150: 3
In [5]:
pairs(iris[1:4], main = "Iris data species", 
      pch = 21, bg = c("red", "green", "blue")[unclass(iris$Species)]) # Assigns a color based on the Species value.
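
The color assignment in the cell above works because a factor is stored as integer codes, one per level, and those codes can be used to index a vector of colors. The following is a minimal sketch of that mechanism; the names `codes`, `palette_cols` and `point_cols` are illustrative and do not appear in the original notebook.

```r
# Integer codes behind the factor: 1 = setosa, 2 = versicolor, 3 = virginica
codes <- unclass(iris$Species)

# One color per level, in the same order as levels(iris$Species)
palette_cols <- c("red", "green", "blue")

# Indexing the palette by the codes yields one color per observation
point_cols <- palette_cols[codes]

# Cross-tabulate species and colors to check the mapping
table(iris$Species, point_cols)
```

Each species should fall entirely into a single color column of the resulting table.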

Linear correlation

Linear correlation quantifies the strength of a linear relationship between two numerical variables. The correlation coefficient ranges from -1 to +1: -1 means perfect negative linear correlation, +1 means perfect positive linear correlation, and 0 means no linear correlation. When two variables are uncorrelated, there is no tendency for the values of one quantity to increase or decrease linearly with the values of the other. By default, cor() computes the Pearson correlation coefficient.

In [7]:
cor(iris[1:4])
             Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
Sepal.Length    1.0000000   -0.1175698     0.8717538    0.8179411
Sepal.Width    -0.1175698    1.0000000    -0.4284401   -0.3661259
Petal.Length    0.8717538   -0.4284401     1.0000000    0.9628654
Petal.Width     0.8179411   -0.3661259     0.9628654    1.0000000
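
To make the coefficient less of a black box, the sketch below recomputes one entry of the matrix by hand, assuming the Pearson definition: the sum of the products of the deviations from the means, divided by the square root of the product of the sums of squared deviations. The variable names are illustrative.

```r
x <- iris$Petal.Length
y <- iris$Petal.Width

# Deviations of each observation from its variable's mean
dx <- x - mean(x)
dy <- y - mean(y)

# Pearson correlation computed from its definition
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# Built-in computation, for comparison
r_builtin <- cor(x, y)

r_manual
all.equal(r_manual, r_builtin)
```

The manual value should match the Petal.Length / Petal.Width entry of the matrix above.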

Numerical vs. Categorical

We can add error bars to any existing plot. Error bars display information about the variability of the data, such as the minimum and maximum values, the variance, the standard error or the standard deviation. For instance, a line chart may draw a line through the averages of the numerical data; the error bars then show how well each average summarizes its data.

I want to plot the average values of Sepal.Length, after grouping the data by the class. Therefore, I will obtain an average value for each species. To this aim, I need the function aggregate().

Let's first take a look at this function.

In [7]:
?aggregate
In [9]:
# we can aggregate each of the 4 numerical variables, grouping by species
aggregate(iris[1:4],
          list(iris$Species),
          mean)
Group.1     Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
setosa             5.006        3.428         1.462        0.246
versicolor         5.936        2.770         4.260        1.326
virginica          6.588        2.974         5.552        2.026
In [9]:
# consider only the Sepal.Length variable
sep_len_avg <- aggregate(iris$Sepal.Length,
          list(iris$Species),
          mean)
sep_len_avg

group_1 = sep_len_avg$Group.1
group_names = levels(group_1)
group_names
Group.1         x
setosa      5.006
versicolor  5.936
virginica   6.588
  1. 'setosa'
  2. 'versicolor'
  3. 'virginica'
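
As an aside, aggregate() also has a formula interface, which produces meaningful column names instead of 'Group.1' and 'x'. The sketch below shows it next to tapply(), which returns the same averages as a named vector. The object names `avg_by_species` and `avg_vec` are illustrative.

```r
# Formula interface: response ~ grouping variable
avg_by_species <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
avg_by_species

# tapply() gives the same group means, named by species
avg_vec <- tapply(iris$Sepal.Length, iris$Species, mean)
avg_vec
```

With this form the group labels are available as avg_by_species$Species and the averages as avg_by_species$Sepal.Length.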
In [10]:
plot(c(1,2,3), 
     sep_len_avg$x,
     xlab="Iris species",
     ylab="Avg Sepal Length",
     type = "b",   # 'l' for lines, 'b' for both lines and points
     xaxt = "n",   # suppress the default numeric x-axis; a labeled one is added below
     xlim = c(0, 4), 
     ylim = c(0, 10),
     panel.first = grid(),
     lwd=3,
     col = "blue")

# Change the x-axis values
axis(side = 1,
    at=c(1,2,3),
    labels = group_names)

With the aggregate function we can apply any function to groups of data split by category. The following cells show the most common examples, which apply the functions min, max and sd, as well as a user-defined function for the standard error.

In [11]:
# Compute the min and max of each group
sep_len_min <- aggregate(iris$Sepal.Length,
          list(iris$Species),
          min)
sep_len_min

sep_len_max <- aggregate(iris$Sepal.Length,
          list(iris$Species),
          max)
sep_len_max
Group.1       x
setosa      4.3
versicolor  4.9
virginica   4.9

Group.1       x
setosa      5.8
versicolor  7.0
virginica   7.9
In [12]:
#computation of the standard deviation
sep_len_sd <-aggregate(iris$Sepal.Length,
          list(iris$Species),
          sd)
sep_len_sd
Group.1             x
setosa      0.3524897
versicolor  0.5161711
virginica   0.6358796
In [13]:
# define a function to compute the standard error of the mean
std_error <- function(x) {
    sd(x)/sqrt(length(x))
    }
In [14]:
#computation of the standard error
sep_len_stderr <- aggregate(iris$Sepal.Length,
          list(iris$Species),
          std_error)
sep_len_stderr
Group.1              x
setosa      0.04984957
versicolor  0.07299762
virginica   0.08992695
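
The separate aggregate() calls above can also be collected into a single table. The following is a sketch of one way to do so with split() and sapply(); the column names 'lower' and 'upper' (mean minus and plus one standard deviation) and the object name `stats_by_species` are illustrative choices, not part of the original notebook.

```r
# Standard error of the mean, as defined earlier
std_error <- function(x) sd(x) / sqrt(length(x))

# Split Sepal.Length into one numeric vector per species
groups <- split(iris$Sepal.Length, iris$Species)

# Compute several statistics per group; t() puts one species per row
stats_by_species <- t(sapply(groups, function(x) {
  c(mean  = mean(x),
    sd    = sd(x),
    se    = std_error(x),
    lower = mean(x) - sd(x),
    upper = mean(x) + sd(x))
}))

stats_by_species
```

The 'lower' and 'upper' columns correspond to the bounds that an error-bar plot based on the standard deviation would draw.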

Now, use the function 'errbar' from the Hmisc library to visualize the standard deviation of the data grouped by Species.

In [16]:
#install.packages("Hmisc") #only once
library(Hmisc)
Loading required package: lattice
Loading required package: survival
Loading required package: Formula
Loading required package: ggplot2

Attaching package: 'Hmisc'

The following objects are masked from 'package:base':

    format.pval, units

In [17]:
errbar(x = c(1,2,3),
       y = sep_len_avg$x,
       yplus= sep_len_avg$x + sep_len_sd$x,
       yminus=sep_len_avg$x - sep_len_sd$x,
       xlab="Iris species",
       ylab="Avg Sepal Length",
       type = "b",   # 'l' for lines, 'b' for both lines and points
       xlim = c(0, 4), 
       ylim = c(4, 8),
       panel.first = grid(),
       lwd=3,
       col = "blue"    )
     
axis(side = 1,
    at=c(1,2,3),
    labels = group_names)

The t-test assesses whether the averages of two groups are statistically different from each other. This analysis is appropriate for comparing the averages of a numerical variable for two categories of a categorical variable. Examples of this procedure will be illustrated during the lecture on statistical inference.
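
As a brief preview only, the sketch below compares the average Sepal.Length of two species with t.test(). The choice of setosa and versicolor is an illustrative assumption; by default t.test() runs the Welch variant, which does not assume equal variances.

```r
# Sepal.Length values of the two groups being compared
setosa     <- iris$Sepal.Length[iris$Species == "setosa"]
versicolor <- iris$Sepal.Length[iris$Species == "versicolor"]

# Two-sample t-test (Welch variant by default)
res <- t.test(setosa, versicolor)

res$statistic  # t statistic
res$p.value    # a small p-value suggests the two averages differ
```

The interpretation of the statistic and of the p-value is left to the lecture on statistical inference.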

Categorical vs. Categorical

The stacked column plot shows how much each category of one variable contributes to the total within each category of the second variable. In this case we can't use the iris dataset because we need two categorical variables. In particular, we are going to compare the 'vs' (engine shape: 0 = V-shaped, 1 = straight) and 'gear' (number of forward gears) variables of the 'mtcars' dataset. The plot in the following cell stacks the raw counts.

In [18]:
head(mtcars)
                    mpg  cyl  disp   hp  drat     wt   qsec  vs  am  gear  carb
Mazda RX4          21.0    6   160  110  3.90  2.620  16.46   0   1     4     4
Mazda RX4 Wag      21.0    6   160  110  3.90  2.875  17.02   0   1     4     4
Datsun 710         22.8    4   108   93  3.85  2.320  18.61   1   1     4     1
Hornet 4 Drive     21.4    6   258  110  3.08  3.215  19.44   1   0     3     1
Hornet Sportabout  18.7    8   360  175  3.15  3.440  17.02   0   0     3     2
Valiant            18.1    6   225  105  2.76  3.460  20.22   1   0     3     1
In [19]:
counts <- table(mtcars$vs, mtcars$gear)
barplot(counts,
        main="Car distribution by Gears and VS",
        xlab="Number of Gears",
        col=c("blue","red"),
        legend = rownames(counts))
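
To read the table in terms of percentages rather than counts, the contingency table can be normalized with prop.table(). The following sketch normalizes within each gear category (margin = 2, i.e. by column); the object name `props` is illustrative.

```r
# Contingency table: rows = vs, columns = gear
counts <- table(mtcars$vs, mtcars$gear)

# Column-wise proportions: each gear category sums to 1
props <- prop.table(counts, margin = 2)

# Express as percentages, rounded for readability
round(100 * props, 1)
```

Passing `100 * props` instead of `counts` to barplot() would give a stacked plot in which every column has height 100.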

The chi-square test can be used to determine the association between categorical variables. This procedure will be illustrated during the lecture on statistical inference.
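
As a brief preview only, the sketch below applies chisq.test() to the same contingency table of 'vs' and 'gear'. Some expected counts in this table are small, so R is expected to warn that the chi-squared approximation may be inaccurate; the object names are illustrative.

```r
# Contingency table of the two categorical variables
counts <- table(vs = mtcars$vs, gear = mtcars$gear)

# Pearson's chi-squared test of independence
res <- chisq.test(counts)

res$statistic  # chi-squared statistic
res$parameter  # degrees of freedom: (rows - 1) * (columns - 1)
res$p.value    # a small p-value suggests an association
res$expected   # expected counts under independence
```

Inspecting res$expected is a quick way to check whether the approximation is trustworthy for a given table.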
