Bivariate analysis is the simultaneous analysis of two variables, aimed at exploring the relationship between a pair of variables and assessing the presence and strength of such a relationship. There are different types of bivariate analysis depending on the nature of the variables (numerical or categorical).
First, load the iris dataset, which will be used in the following examples.
library(datasets)
data(iris)
summary(iris)
head(iris)
names(iris)
When both variables are numerical, it can be useful to visualize their scatterplot representation. It may suggest the shape of the function that relates the two variables (e.g., linear, polynomial, trigonometric, etc.).
pairs(iris[1:4]) # Takes all the columns except the column 'Species'.
By combining the parameters 'pch' and 'bg' it is possible to distinguish data belonging to different categories within the same scatterplot.
#?unclass
iris$Species
unclass(iris$Species)
pairs(iris[1:4], main = "Iris data species",
pch = 21, bg = c("red", "green", "blue")[unclass(iris$Species)]) # Assigns a color based on the Species value.
Linear correlation quantifies the strength of a linear relationship between two numerical variables. The correlation coefficient ranges from -1 to +1: -1 means perfect negative linear correlation, +1 means perfect positive linear correlation, and 0 means no linear correlation, i.e., no tendency for the values of one quantity to increase or decrease with the values of the other.
cor(iris[1:4])
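As a quick illustration, the correlation can also be computed for a single pair of variables; Petal.Length and Petal.Width, for instance, are strongly positively correlated:

```r
# Pearson correlation between two numerical variables
r <- cor(iris$Petal.Length, iris$Petal.Width)
r # close to +1: strong positive linear relationship

# cor.test() additionally returns a p-value for the
# null hypothesis of zero correlation
cor.test(iris$Petal.Length, iris$Petal.Width)
```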
We can add error bars to any existing plot. Error bars can display information about the variability of the data, such as the minimum and maximum values, the variance, the standard error, or the standard deviation. For instance, when a line chart draws a line through the averages of the numerical data, error bars provide a way to understand how well the average summarizes the data.
I want to plot the average values of Sepal.Length, after grouping the data by class. Therefore, I will obtain an average value for each species. To this aim, I need the function aggregate().
Let's first take a look at this function.
?aggregate
# we can aggregate each of the 4 numerical variables by species
aggregate(iris[1:4],
list(iris$Species),
mean)
# consider only the Sepal.Length variable
sep_len_avg <- aggregate(iris$Sepal.Length,
list(iris$Species),
mean)
sep_len_avg
group_1 <- sep_len_avg$Group.1
group_names <- levels(group_1)
group_names
plot(c(1,2,3),
sep_len_avg$x,
xlab="Iris species",
ylab="Avg Sepal Length",
type = "b", # 'l' for lines, 'b' for both lines and points
xlim = c(0, 4),
ylim = c(0, 10),
xaxt = "n", # suppress the default numeric x-axis; species labels are added below
panel.first = grid(),
lwd=3,
col = "blue")
# Change the x-axis values
axis(side = 1,
at=c(1,2,3),
labels = group_names)
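As a side note, tapply() is a compact base-R alternative to aggregate() when a single variable is grouped by a single factor; the result here should match sep_len_avg:

```r
# named vector of per-species means of Sepal.Length
avg_by_species <- tapply(iris$Sepal.Length, iris$Species, mean)
avg_by_species
```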
With the aggregate() function we can apply any function to groups of data split by their category. In the following, the most common examples apply the functions min, max, and sd, as well as a user-defined function for the standard error.
# Compute the min and max of each group
sep_len_min <- aggregate(iris$Sepal.Length,
list(iris$Species),
min)
sep_len_min
sep_len_max <- aggregate(iris$Sepal.Length,
list(iris$Species),
max)
sep_len_max
#computation of the standard deviation
sep_len_sd <- aggregate(iris$Sepal.Length,
list(iris$Species),
sd)
sep_len_sd
# define a function to compute the standard error of the mean
std_error <- function(x) {
sd(x)/sqrt(length(x))
}
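A quick sanity check of the function on a small vector (redefined here so the snippet is self-contained):

```r
std_error <- function(x) {
sd(x)/sqrt(length(x))
}
# for c(2,4,6,8,10): sd = sqrt(10), n = 5, so the standard error is sqrt(2)
std_error(c(2, 4, 6, 8, 10)) # ~1.4142
```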
#computation of the standard error
sep_len_stderr <- aggregate(iris$Sepal.Length,
list(iris$Species),
std_error)
sep_len_stderr
Now, exploit the function 'errbar' from the Hmisc library to visualize the mean plus or minus the standard deviation of the data grouped by Species.
#install.packages("Hmisc") #only once
library(Hmisc)
errbar(x = c(1,2,3),
y = sep_len_avg$x,
yplus = sep_len_avg$x + sep_len_sd$x,
yminus = sep_len_avg$x - sep_len_sd$x,
xlab="Iris species",
ylab="Avg Sepal Length",
type = "b", # 'l' for lines, 'b' for both lines and points
xlim = c(0, 4),
ylim = c(4, 8),
xaxt = "n", # suppress the default numeric x-axis; species labels are added below
panel.first = grid(),
lwd=3,
col = "blue" )
axis(side = 1,
at=c(1,2,3),
labels = group_names)
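If Hmisc is not available, a similar plot can be sketched in base R with arrows() using flat caps as error bars; this self-contained sketch recomputes the per-species mean and standard deviation with aggregate(), as above:

```r
# per-species mean and standard deviation of Sepal.Length (base R only)
avg <- aggregate(iris$Sepal.Length, list(iris$Species), mean)
std <- aggregate(iris$Sepal.Length, list(iris$Species), sd)

plot(1:3, avg$x,
xlab = "Iris species", ylab = "Avg Sepal Length",
xlim = c(0, 4), ylim = c(4, 8),
pch = 16, xaxt = "n", panel.first = grid())
# vertical segments with flat caps, from mean - sd up to mean + sd
arrows(1:3, avg$x - std$x, 1:3, avg$x + std$x,
angle = 90, code = 3, length = 0.05, col = "blue", lwd = 2)
axis(side = 1, at = 1:3, labels = levels(avg$Group.1))
```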
The t-test assesses whether the averages of two groups are statistically different from each other. This analysis is appropriate for comparing the averages of a numerical variable for two categories of a categorical variable. Examples of this procedure will be illustrated during the lecture on statistical inference.
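As a minimal preview (the details are left to the lecture on statistical inference), a two-sample t-test comparing Sepal.Length between setosa and versicolor could look like this:

```r
# keep only two species; droplevels() removes the unused factor level,
# since t.test() with a formula requires exactly two groups
two_species <- droplevels(subset(iris, Species != "virginica"))
res <- t.test(Sepal.Length ~ Species, data = two_species)
res$p.value # very small: the two averages differ significantly
```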
The stacked column plot compares the percentage that each category of one variable contributes to a total across the categories of a second variable. In this case we can't use the iris dataset because we need two categorical variables. In particular, we are going to compare the 'vs' (engine type) and 'gear' (number of forward gears) variables of the 'mtcars' dataset.
head(mtcars)
counts <- table(mtcars$vs, mtcars$gear)
barplot(counts,
main="Car distribution by Gears and VS",
xlab="Number of Gears",
col=c("blue","red"),
legend = rownames(counts))
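Since a stacked column plot is about percentage contributions, it may help to normalize the counts column-wise with prop.table() before plotting, so that every bar has the same height:

```r
counts <- table(mtcars$vs, mtcars$gear)
# margin = 2: each column (number of gears) sums to 1
percentages <- prop.table(counts, margin = 2)
round(percentages * 100, 1)
barplot(percentages,
main = "Share of VS by number of gears",
xlab = "Number of gears",
col = c("blue", "red"),
legend = rownames(counts))
```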
The chi-square test can be used to determine the association between categorical variables. This procedure will be illustrated during the lecture on statistical inference.
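As a minimal preview (again, the details are left to the statistical-inference lecture), the test can be run directly on the contingency table built above; with such small cell counts R may warn that the chi-squared approximation is inaccurate, which is suppressed here:

```r
counts <- table(mtcars$vs, mtcars$gear)
res <- suppressWarnings(chisq.test(counts))
res # reports the chi-squared statistic, degrees of freedom, and p-value
```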