Statistical Laboratory
Alessandro Ortis - University of Catania
A very common problem in data analysis is to distinguish useful inputs from irrelevant ones, depending on the desired results. This problem is known as feature selection.
In the case of categorical variables, we can use statistical tests to determine whether an output variable depends on the input variables. If we assess independence from a specific input variable, we may choose to exclude it from subsequent analysis.
A contingency table is a table that displays the (multivariate) frequency distribution of the variables, used in statistics to present categorical data in terms of frequency counts.
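As a minimal illustration (with made-up data), the `table()` function builds a contingency table directly from two categorical vectors:

```r
# Two hypothetical categorical variables observed on the same 6 subjects
x <- c("A", "A", "B", "B", "B", "A")
y <- c("yes", "no", "yes", "yes", "no", "yes")
table(x, y)  # cross-tabulation of x by y
```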
Consider the example of a dataset which contains the favourite sport of a group of people. We are interested in the relationship between the favourite sport and the gender.
Suppose we have the raw data (as in most real cases); then the frequencies are not already grouped by category. First, let's see how to create such raw data. An example row is defined by a data frame with two columns: gender and sport.
df = data.frame(gender = 'M', sport = 'Soccer')
df
We can use 'replicate' to create multiple copies of the same row in a data frame. As an example, let's create a data frame with 5 rows equal to the one created above.
?replicate
do.call("rbind", replicate(5, df, simplify = FALSE))
We can create a function to do the work every time we need it.
repl_row <- function(n, row) {
do.call("rbind", replicate(n, row, simplify = FALSE))
}
# Example of usage
repl_row(3,data.frame(gender = 'M', sport = 'Soccer'))
# Create a dataset gender vs. favourite sport
# STEP 1: create the examples for all possible combinations of gender and sport.
male_soccer = data.frame(gender = 'M', sport = 'Soccer')
male_running = data.frame(gender = 'M', sport = 'Running')
male_dance = data.frame(gender = 'M', sport = 'Dance')
female_soccer = data.frame(gender = 'F', sport = 'Soccer')
female_running = data.frame(gender = 'F', sport = 'Running')
female_dance = data.frame(gender = 'F', sport = 'Dance')
# STEP 2: replicate the rows of each category, depending on the frequencies of the data we want to represent
# 60 females dancing
df_1 = repl_row(60, female_dance)
# 30 females running
df_2 = repl_row(30, female_running)
# 10 females play soccer
df_3 = repl_row(10, female_soccer)
# 10 males dancing
df_4 = repl_row(10, male_dance)
# 40 males running
df_5 = repl_row(40, male_running)
# 50 males play soccer
df_6 = repl_row(50, male_soccer)
# STEP 3: bind all the records and output a summary of the data
my_data <- rbind(df_1, df_2, df_3, df_4, df_5, df_6)
summary(my_data)
# Create the contingency table of our data
table(my_data)
The table above shows the joint distribution of two categorical variables (gender and sport). Such tables are called contingency tables.
Suppose instead that we already know the frequencies grouped by both categories. In that case, a simpler alternative is to create the contingency table directly, as in the following code.
observed_table <- matrix(c(60, 30, 10, 10, 40, 50), nrow = 2, ncol = 3, byrow = T)
rownames(observed_table) <- c('F', 'M')
colnames(observed_table) <- c('Dance', 'Running', 'Soccer')
observed_table
cont_table <- table(my_data)
rowSums(cont_table)
colSums(cont_table)
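As a possible shortcut, the `addmargins()` function (from base R's stats package) appends both the row and column totals to the table in a single call:

```r
# Same frequencies as above, entered directly as a matrix
tab <- matrix(c(60, 30, 10, 10, 40, 50), nrow = 2, byrow = TRUE,
              dimnames = list(c('F', 'M'), c('Dance', 'Running', 'Soccer')))
addmargins(tab)  # adds a "Sum" row and a "Sum" column
```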
The function 'prop.table()' gives the relative frequencies; percentages are then obtained by multiplying by 100.
prop.table(cont_table) *100
This is a joint probability distribution: for example, we can see that 15% of the people in the dataset are females who like running.
Often we are interested in conditional distributions. A conditional distribution is a probability distribution for a sub-population. In other words, we are interested in the distribution of one variable within the groups created by another. To this aim we use the 'margin' argument of the prop.table() function.
With margin = 1 we are grouping the data by the rows values of the contingency table (i.e., gender). In the following example, the percentages are expressed with respect to the total of males/females (i.e., summing by rows gives 100).
prop.table(cont_table, margin = 1) *100
With margin = 2 we are grouping the data by the columns values of the contingency table (i.e., sport). In the following example, the percentages are expressed with respect to the total of people that like the same sport category.
prop.table(cont_table, margin = 2) *100
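A quick sanity check on the two conditional tables (rebuilding the same frequencies as a matrix): with `margin = 1` every row sums to 100, while with `margin = 2` every column does.

```r
# Same gender-by-sport frequencies used above
cont <- matrix(c(60, 30, 10, 10, 40, 50), nrow = 2, byrow = TRUE,
               dimnames = list(c('F', 'M'), c('Dance', 'Running', 'Soccer')))
rowSums(prop.table(cont, margin = 1) * 100)  # each row totals 100
colSums(prop.table(cont, margin = 2) * 100)  # each column totals 100
```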
The most common question that arises from contingency tables is whether the row and column variables are independent. The most basic way to answer it is to run a chi-squared test.
X <- chisq.test(cont_table)
X
From the above result, we can observe that the p-value is less than the significance level (0.05). Hence, we can reject the null hypothesis and conclude that the two variables are not independent.
Reminder:
The chi-squared test is based on the difference between the observed data table and an expected table (computed under the independence assumption), which can be visualized after the test is performed. In addition, we can print the observed table for comparison.
X$expected
cont_table
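The object returned by `chisq.test()` is a list, so its components can also be read programmatically (re-running the test on the same frequencies here for completeness):

```r
# Same gender-by-sport frequencies as above
obs <- matrix(c(60, 30, 10, 10, 40, 50), nrow = 2, byrow = TRUE)
X <- chisq.test(obs)
X$statistic  # the chi-squared value
X$parameter  # degrees of freedom
X$p.value    # the p-value
X$expected   # expected counts under independence
```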
It is important to notice that the chi-squared value is affected by the sample size and by the frequencies in the contingency table. Indeed, if we repeat the experiment on a smaller subset of the data, we may obtain unreliable results.
sub_table = floor(cont_table/5)
sub_table
X <- chisq.test(sub_table)
X
We can again reject the null hypothesis, but now with a p-value of 0.16%, whereas in the previous example the probability of making a wrong decision by rejecting the null hypothesis was 0.0000000000139%.
In general, the chi-squared approximation is reliable only when the frequencies are sufficiently large (a common rule of thumb requires all expected cell counts to be at least 5; larger samples make the approximation better).
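For tables with small counts, a common alternative is Fisher's exact test, which does not rely on the large-sample approximation. A sketch on the same reduced frequencies as the subset above (each cell divided by 5):

```r
# Reduced gender-by-sport frequencies (original counts divided by 5)
sub <- matrix(c(12, 6, 2, 2, 8, 10), nrow = 2, byrow = TRUE)
fisher.test(sub)  # exact p-value, valid even with small cells
```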
Now let's consider an example where H0 is not rejected.
observed_table <- matrix(c(10, 20, 30, 6, 9, 17), nrow = 2, ncol = 3, byrow = T)
rownames(observed_table) <- c('F', 'M')
colnames(observed_table) <- c('Science', 'Math', 'StatLab')
observed_table
X <- chisq.test(observed_table)
X
From the above result, we can observe that the probability of making a mistake by rejecting H0 is 87% (the probability of obtaining a chi-squared statistic at least as large as 0.27 when H0 is true). Therefore we cannot reject H0, and the data are compatible with the two variables being independent.
The t-test is one of the most common tests in statistics. We use it to determine whether the means of two groups are significantly different from each other. The assumption of the classic test is that both groups are sampled from normal distributions with equal variances. The null hypothesis is that the two means are equal, and the alternative is that they are not. It is known that under the null hypothesis, we can calculate a t statistic that follows a t-distribution.
If the probability of observing such a t value under the null hypothesis is small, the difference between the two means is significant.
The p-value is the probability of wrongly rejecting the null hypothesis when it is true. The p-value is always compared with the significance level of the test. For instance, at a 95% level of confidence, the significance level is 5% and a significant result is reported as p < 0.05. Small p-values suggest that the null hypothesis is unlikely to be true: the smaller the p-value, the more confidently we can reject the null hypothesis.
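The two-sided p-value can also be computed by hand from the t statistic and the degrees of freedom, using the `pt()` distribution function (the values below are purely hypothetical):

```r
t_stat <- 2.5   # hypothetical t statistic
dof    <- 18    # hypothetical degrees of freedom
p_value <- 2 * pt(-abs(t_stat), dof)  # area in both tails of the t-distribution
p_value
```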
Welch’s t-test is a modification of the t-test that adjusts the number of degrees of freedom when the variances are thought not to be equal to each other.
The R function t.test() provides a variety of t-tests:
t.test(y~x) # where y is numeric and x is a binary factor
t.test(y1,y2) # where y1 and y2 are numeric
t.test(y1,y2,paired=TRUE) # paired t-test, where y1 & y2 are numeric
t.test(y,mu=3) # In this case, the null hypothesis Ho is that the mean is 3
set.seed(1234)
n = 10 #500
# n random samples for men's weights
dist_1 = rnorm(n, mean = 85.0, sd = 8)
# n random samples for women's weights
dist_2 = rnorm(n, mean = 60.0, sd = 5)
my_data = data.frame(gender = c(rep("M",n), rep("F",n)),
weight = c(dist_1, dist_2))
my_data
boxplot(weight ~ gender,
data = my_data,
xlab = "Gender",
ylab="Weight",
main = "Weight distributions")
An unpaired t-test can be performed using the following code.
# Example of using the which function to select the men's rows
men = my_data[which(my_data$gender == 'M'),2] # men is equal to dist_1
# Select the women's rows
women = my_data[which(my_data$gender == 'F'),2] # women is equal to dist_2
t_res = t.test(men,women)
names(t_res)
t_res
In the above result:
t is the Student t-test statistics value
df is the degrees of freedom
p-value is the significance level of the t-test
The confidence interval of the difference of the means at 95% is [16.09, 28.96], i.e., the interval in which the difference between the two means lies, with a probability equal to 95%
We also have the means of the two groups of samples
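All of these quantities are stored in the result object and can be accessed by name (re-creating the same simulated samples here for completeness):

```r
set.seed(1234)
men   <- rnorm(10, mean = 85, sd = 8)  # same simulated men's weights as above
women <- rnorm(10, mean = 60, sd = 5)  # same simulated women's weights as above
t_res <- t.test(men, women)
t_res$statistic  # t value
t_res$parameter  # degrees of freedom
t_res$p.value    # significance level
t_res$conf.int   # 95% confidence interval for the difference of the means
t_res$estimate   # the two sample means
```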
The two hypotheses for this particular two sample t-test are as follows:
H0: µ1 = µ2 (the two population means are equal)
H1: µ1 ≠µ2 (the two population means are not equal)
Since the p-value of our test is less than the significance level alpha = 0.05, we reject the null hypothesis. This means we have sufficient evidence to say that the mean weight differs between the two populations, with a p-value of 0.000001591 (i.e., the probability of making a wrong decision).
Remember that the independent t-test can be used only when the two sets of data follow normal distributions with equal variances. However, by default the R t.test() function assumes that the variances of the two groups may differ. Therefore, the Welch t-test is performed by default (as specified in the first row of the above output). The Welch t-test is just an adaptation of the t-test for samples with possibly unequal variances.
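If the two variances can be assumed equal, the classic pooled-variance Student t-test is obtained by setting `var.equal = TRUE` (shown here on the same simulated samples, re-created for completeness):

```r
set.seed(1234)
men   <- rnorm(10, mean = 85, sd = 8)  # same simulated samples as above
women <- rnorm(10, mean = 60, sd = 5)
t.test(men, women, var.equal = TRUE)  # "Two Sample t-test" instead of Welch
```

With equal group sizes n1 = n2 = 10, the pooled test uses n1 + n2 - 2 = 18 degrees of freedom.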
Repeat the t-test by comparing two samples extracted from the same PDF (with a high standard deviation) and comment on the results. Is the test able to assess that the data come from the same distribution?
set.seed(123)
a = rnorm(1000, mean=50, sd=300)
s1 = a[1:500]
s2 = a[501:1000]
t_res = t.test(s1,s2)
t_res
set.seed(123)
a = rnorm(1000, mean=8, sd=10)
s1 = a[1:100]
s2 = a[501:600]
boxplot(s1,s2)
t_res = t.test(s1,s2)
# Example of the sample() function: draw 5 values without replacement from 1:10
sample(1:10, 5)
idx = sample(1:1000, 200)
s1 = a[idx[1:100]]
s2 = a[idx[101:200]]
t_res = t.test(s1,s2)
boxplot(s1,s2)
t_res