Bivariate analysis is the simultaneous analysis of two variables, aimed at exploring the relationship between a pair of variables and assessing the presence and strength of such a relationship. There are different types of bivariate analysis depending on the nature of the variables (numerical or categorical).
First, load the iris dataset, which will be used in the following examples.
library(datasets)
data(iris)
summary(iris)
head(iris)
names(iris)
When both variables are numerical, it can be useful to visualize their scatterplot representation. It may suggest the shape of the function that relates the two variables (e.g., linear, polynomial, trigonometric, etc.).
pairs(iris[1:4]) # Takes all the columns except the column 'Species'.
By combining the parameters 'pch' and 'bg' it is possible to distinguish data belonging to different categories within the same scatterplot.
#?unclass
iris$Species
unclass(iris$Species)
pairs(iris[1:4], main = "Iris data species",
pch = 21, bg = c("red", "green", "blue")[unclass(iris$Species)]) # Assigns a color based on the Species value.
Linear correlation quantifies the strength of a linear relationship between two numerical variables. The correlation coefficient ranges from -1 to +1: -1 means perfect negative linear correlation, +1 means perfect positive linear correlation, and 0 means no linear correlation, i.e., no tendency for the values of one quantity to increase or decrease with the values of the other.
cor(iris[1:4])
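As a quick illustration, the correlation can also be computed for a single pair of variables; Petal.Length and Petal.Width, for instance, are strongly positively correlated:

```r
# Pearson correlation between two numerical variables
r <- cor(iris$Petal.Length, iris$Petal.Width)
r # close to +1: strong positive linear relationship

# cor.test() additionally returns a p-value for the
# null hypothesis of zero correlation
cor.test(iris$Petal.Length, iris$Petal.Width)
```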
We can add error bars to any existing plot. Error bars can display information about the variability of the data, such as the minimum and maximum values, the variance, the standard error, or the standard deviation. For instance, when a line chart draws a line through the averages of the numerical data, error bars provide a way to understand how well the average summarizes the data.
I want to plot the average values of Sepal.Length, after grouping the data by class. Therefore, I will obtain an average value for each species. To this aim, I need the function aggregate().
Let's first take a look at this function.
?aggregate
# we can aggregate each of the 4 numerical variables by species
aggregate(iris[1:4],
list(iris$Species),
mean)
# consider only the Sepal.Length variable
sep_len_avg <- aggregate(iris$Sepal.Length,
list(iris$Species),
mean)
sep_len_avg
group_1 <- sep_len_avg$Group.1
group_names <- levels(group_1)
group_names
plot(c(1,2,3),
sep_len_avg$x,
xlab="Iris species",
ylab="Avg Sepal Length",
type = "b", # 'l' for lines, 'b' for both lines and points
xlim = c(0, 4),
ylim = c(0, 10),
xaxt = "n", # suppress the default numeric x-axis; species labels are added below
panel.first = grid(),
lwd=3,
col = "blue")
# Change the x-axis values
axis(side = 1,
at=c(1,2,3),
labels = group_names)
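As a side note, tapply() is a compact base-R alternative to aggregate() when a single variable is grouped by a single factor; the result here should match sep_len_avg:

```r
# named vector of per-species means of Sepal.Length
avg_by_species <- tapply(iris$Sepal.Length, iris$Species, mean)
avg_by_species
```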
With the aggregate() function we can apply any function to groups of data split by their category. In the following, the most common examples apply the functions min, max, and sd, as well as a user-defined function for the standard error.
# Compute the min and max of each group
sep_len_min <- aggregate(iris$Sepal.Length,
list(iris$Species),
min)
sep_len_min
sep_len_max <- aggregate(iris$Sepal.Length,
list(iris$Species),
max)
sep_len_max
#computation of the standard deviation
sep_len_sd <- aggregate(iris$Sepal.Length,
list(iris$Species),
sd)
sep_len_sd
# define a function to compute the standard error of the mean
std_error <- function(x) {
sd(x)/sqrt(length(x))
}
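A quick sanity check of the function on a small vector (redefined here so the snippet is self-contained):

```r
std_error <- function(x) {
sd(x)/sqrt(length(x))
}
# for c(2,4,6,8,10): sd = sqrt(10), n = 5, so the standard error is sqrt(2)
std_error(c(2, 4, 6, 8, 10)) # ~1.4142
```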
#computation of the standard error
sep_len_stderr <- aggregate(iris$Sepal.Length,
list(iris$Species),
std_error)
sep_len_stderr
Now, exploit the function 'errbar' from the Hmisc library to visualize the mean plus or minus the standard deviation of the data grouped by Species.
#install.packages("Hmisc") #only once
library(Hmisc)
errbar(x = c(1,2,3),
y = sep_len_avg$x,
yplus = sep_len_avg$x + sep_len_sd$x,
yminus = sep_len_avg$x - sep_len_sd$x,
xlab="Iris species",
ylab="Avg Sepal Length",
type = "b", # 'l' for lines, 'b' for both lines and points
xlim = c(0, 4),
ylim = c(4, 8),
xaxt = "n", # suppress the default numeric x-axis; species labels are added below
panel.first = grid(),
lwd=3,
col = "blue" )
axis(side = 1,
at=c(1,2,3),
labels = group_names)
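If Hmisc is not available, a similar plot can be sketched in base R with arrows() using flat caps as error bars; this self-contained sketch recomputes the per-species mean and standard deviation with aggregate(), as above:

```r
# per-species mean and standard deviation of Sepal.Length (base R only)
avg <- aggregate(iris$Sepal.Length, list(iris$Species), mean)
std <- aggregate(iris$Sepal.Length, list(iris$Species), sd)

plot(1:3, avg$x,
xlab = "Iris species", ylab = "Avg Sepal Length",
xlim = c(0, 4), ylim = c(4, 8),
pch = 16, xaxt = "n", panel.first = grid())
# vertical segments with flat caps, from mean - sd up to mean + sd
arrows(1:3, avg$x - std$x, 1:3, avg$x + std$x,
angle = 90, code = 3, length = 0.05, col = "blue", lwd = 2)
axis(side = 1, at = 1:3, labels = levels(avg$Group.1))
```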
The t-test assesses whether the averages of two groups are statistically different from each other. This analysis is appropriate for comparing the averages of a numerical variable for two categories of a categorical variable. Examples of this procedure will be illustrated during the lecture on statistical inference.
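As a minimal preview (the details are left to the lecture on statistical inference), a two-sample t-test comparing Sepal.Length between setosa and versicolor could look like this:

```r
# keep only two species; droplevels() removes the unused factor level,
# since t.test() with a formula requires exactly two groups
two_species <- droplevels(subset(iris, Species != "virginica"))
res <- t.test(Sepal.Length ~ Species, data = two_species)
res$p.value # very small: the two averages differ significantly
```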
The stacked column plot compares the percentage that each category of one variable contributes to a total across the categories of a second variable. In this case we can't use the iris dataset because we need two categorical variables. In particular, we are going to compare the 'vs' (engine type) and 'gear' (number of forward gears) variables of the 'mtcars' dataset.
head(mtcars)
counts <- table(mtcars$vs, mtcars$gear)
barplot(counts,
main="Car distribution by Gears and VS",
xlab="Number of Gears",
col=c("blue","red"),
legend = rownames(counts))
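Since a stacked column plot is about percentage contributions, it may help to normalize the counts column-wise with prop.table() before plotting, so that every bar has the same height:

```r
counts <- table(mtcars$vs, mtcars$gear)
# margin = 2: each column (number of gears) sums to 1
percentages <- prop.table(counts, margin = 2)
round(percentages * 100, 1)
barplot(percentages,
main = "Share of VS by number of gears",
xlab = "Number of gears",
col = c("blue", "red"),
legend = rownames(counts))
```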
The chi-square test can be used to determine the association between categorical variables. This procedure will be illustrated during the lecture on statistical inference.
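As a minimal preview (again, the details are left to the statistical-inference lecture), the test can be run directly on the contingency table built above; with such small cell counts R may warn that the chi-squared approximation is inaccurate, which is suppressed here:

```r
counts <- table(mtcars$vs, mtcars$gear)
res <- suppressWarnings(chisq.test(counts))
res # reports the chi-squared statistic, degrees of freedom, and p-value
```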