Functions and CSV files

Statistical Laboratory

Alessandro Ortis - University of Catania

An R function is created with the keyword 'function'. The basic syntax of a function definition is as follows.

In [ ]:
function_name <- function(arg_1, arg_2, ...) {
   # Function body
}
In [1]:
# Define the function (in R)
print_list <- function(ls) { 
    for(x in ls) {  
        print(x) 
    }
}
In [2]:
# Use the function
y = list(2,3,4,2,10,44,4)
print_list(y)
[1] 2
[1] 3
[1] 4
[1] 2
[1] 10
[1] 44
[1] 4
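
As a further illustration of the syntax above, the following sketch defines a function with a default argument value and an explicit return value. The name 'scale_values' and its arguments are hypothetical, chosen only for this example.

```r
# Hypothetical example: multiply every element by a multiplier (default 2)
scale_values <- function(values, multiplier = 2) {
    result <- values * multiplier
    return(result)   # without return(), the last evaluated expression is returned
}

print(scale_values(c(1, 2, 3)))                   # uses the default multiplier
print(scale_values(c(1, 2, 3), multiplier = 10))  # overrides the default
```

Arguments with a default value can be omitted in the call, while the others must always be supplied.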

Exercise

Implement a function named 'my_mean' that takes a list of numbers as input and returns the mean of the list's elements, without using the built-in 'mean' function.

In [13]:
# Class solutions:

my_mean <- function(ls){
    total = 0
    num = 0
    for(x in ls){
        total = total + x
        num = num + 1
    }
    mean_val = total/num
    print(mean_val)
}

my_mean_2 <- function(ls){
    total = 0
    for(x in ls){
        total = x + total
    }
    return (total/length(ls))
}
    
my_mean_3 <- function(ls) return (sum(unlist(ls))/length(ls))

y = list(2,3,4,2,10,44,4)
my_mean(y)
res <- my_mean_2(y)
print(res)
print(my_mean_3(y))
[1] 9.857143
[1] 9.857143
[1] 9.857143
In [1]:
# Define the function
my_mean <- function(data) {
    total <- 0          # accumulator (a name other than 'sum' avoids hiding the built-in)
    for (x in data)
        total <- total + x
    total/length(data)
}
In [2]:
# Use the function
ls = c(2,4,5,2,1,4,5,3)
m = my_mean(ls)
print(m)
[1] 3.25
In [6]:
# Define the function
my_mean <- function(data)    {
    s <- sum(data)
    s/length(data)
    # or..
    # sum(data)/length(data)
}
In [8]:
# Use the function
d = c(2,2,1,2,3,3,1,3,3,1)
print(paste("The mean is: ",format(my_mean(d),digits= 2)))
[1] "The mean is:  2.1"

Exercise

Implement a function named 'my_var' that takes a list of numbers as input and returns the sample variance of the list's elements (the sum of the squared deviations from the mean, divided by n-1). To do so, use the 'my_mean' function defined above to compute the mean. Then, use 'my_var' to compute both the variance and the standard deviation of the data.

In [ ]:
# Class solutions:
my_mean_3 <- function(ls) return (sum(unlist(ls))/length(ls))
my_var <- function(ls) return (sum((unlist(ls)-my_mean_3(ls))^2)/(length(ls)-1))
In [33]:
# Unfold the function                               
my_var <- function(ls) return (
                            sum(
                                    (
                                     unlist(ls)-my_mean_3(ls)
                                    )**2
                                )/(length(ls)-1))

y = list(2,3,4,2,10,44,4)
variance = my_var(y)
print(variance)
var(unlist(y)) # test
[1] 234.1429
234.142857142857
In [22]:
y = list(2,3,4,2,10,44,4)
mean(unlist(y))
9.85714285714286
In [31]:
# Define the function
my_var <- function(data)    {
   
    n <- length(data)
    m <- my_mean(data)
    
    s = 0
    for (x in data)
        s = s + (x - m)^2
    
    s/(n-1)
}
In [32]:
# Use the function
d = c(2,2,1,2,3,3,1,3,3,1)

print(paste("The mean is:          ",format(my_mean(d),digits= 2)))
print(paste("The variance is:      ",format(my_var(d),digits= 2)))
print(paste("The variance is:      ",format(var(d),digits= 2)))
print(paste("The standard dev. is: ",format(sqrt(my_var(d)),digits= 2)))
[1] "The mean is:           2.1"
[1] "The variance is:       0.77"
[1] "The variance is:       0.77"
[1] "The standard dev. is:  0.88"
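
The loop-based solution can also be written with vectorised arithmetic. The sketch below is one possible variant, assuming the input is a numeric vector with at least two elements; the names 'my_var_vec' and 'my_sd' are hypothetical and not part of the class solutions.

```r
# Vectorised sketch: assumes 'data' is a numeric vector with at least two elements
my_var_vec <- function(data) {
    m <- sum(data) / length(data)             # mean, without calling mean()
    sum((data - m)^2) / (length(data) - 1)    # sample variance (n - 1 denominator)
}

# Standard deviation as the square root of the variance
my_sd <- function(data) sqrt(my_var_vec(data))

d <- c(2,2,1,2,3,3,1,3,3,1)
print(all.equal(my_var_vec(d), var(d)))   # compare with the built-in var
print(all.equal(my_sd(d), sd(d)))         # compare with the built-in sd
```

Comparing against the built-in 'var' and 'sd' functions is a quick way to check a hand-written implementation.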

Mode

R does not have a built-in function to compute the statistical mode (the built-in 'mode' function returns the storage mode of an object instead), but we can write one ourselves. To implement it, we will use some built-in R functions that are useful for exploring and filtering data.

Unique function

The function 'unique' takes a vector, data frame or array and returns a copy of it with the duplicate elements (or rows) removed.

In [34]:
d = c(2,2,1,2,3,3,1,3,3,1)
unique(d)
  1. 2
  2. 1
  3. 3
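
The example above uses a vector. As a sketch of the data frame case mentioned in the description, the following applies 'unique' to a small, hypothetical data frame that contains one repeated row.

```r
# Hypothetical data frame with one repeated row (rows 2 and 3 are equal)
df <- data.frame(id = c(1, 2, 2, 3), label = c("a", "b", "b", "c"))

unique_rows <- unique(df)   # keeps the first copy of each distinct row
print(unique_rows)
print(nrow(unique_rows))
```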

Tabulate function

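The 'tabulate' function takes a vector of positive integers and counts how many times each integer occurs in it. The i-th element of the result is the count of the value i, so values that never occur get a count of 0.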

In [40]:
d = c(2,2,1,2,3,3,1,3,3,3,3,3,3,3,1,1)
tabulate(d)
  1. 4
  2. 3
  3. 9
In [41]:
d = c(d,8,8,8)
tabulate(d)
  1. 4
  2. 3
  3. 9
  4. 0
  5. 0
  6. 0
  7. 0
  8. 3
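
Since 'tabulate' only has bins for the values 1, 2, 3 and so on, values outside that range are not counted. The sketch below illustrates this with a hypothetical vector, together with the 'nbins' argument and the built-in 'table' function, which counts every distinct value.

```r
v <- c(0, 2, 2, 5, -1)

counts_default <- tabulate(v)          # 0 and -1 are ignored; bins 1 to 5
counts_3bins <- tabulate(v, nbins = 3) # only bins 1 to 3, so 5 is ignored too
counts_table <- table(v)               # one named count per distinct value

print(counts_default)
print(counts_3bins)
print(counts_table)
```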

Which function

The 'which' function returns the positions (i.e., the indices) of the elements of a logical vector that are TRUE.

In [19]:
letters <- c('a','b','c','b','e','b')
which(letters == 'b')
  1. 2
  2. 4
  3. 6
In [44]:
numbers <- c(12,43,3,1,6)
which(numbers == 3)
which(numbers != 3)
3
  1. 1
  2. 2
  3. 4
  4. 5
In [45]:
which.max(numbers)
which.min(numbers)
2
4
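
The positions returned by 'which' are often used to extract the matching elements. A minimal sketch, reusing the same numbers as above; the threshold 5 is an arbitrary choice for illustration.

```r
numbers <- c(12, 43, 3, 1, 6)

idx <- which(numbers > 5)     # positions of the elements greater than 5
print(idx)
print(numbers[idx])           # the elements themselves

# which.max reports only the first position when the maximum is repeated
first_max <- which.max(c(1, 5, 5))
print(first_max)
```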

Match function

The 'match' function returns, for each element of a first vector, the position of its first occurrence in a second vector. If an element of the first vector does not match any element of the second one, the result for that element is NA. The output is a vector with the same length as the first vector.

In [46]:
print(match(5, c(1,2,9,5,3,6,7,4,5)))
[1] 4
In [48]:
v1 <- c('d','b','c','a')
v2 <- c('x','x','x','d','c')
print(match(v1,v2))
[1]  4 NA  5 NA
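
The NA values can be handled in a few ways. The sketch below, which reuses the two vectors above, shows the 'nomatch' argument of 'match' and the related '%in%' operator, which only reports whether a match exists.

```r
v1 <- c('d','b','c','a')
v2 <- c('x','x','x','d','c')

found <- v1 %in% v2                       # TRUE where match() finds a position
positions <- match(v1, v2, nomatch = 0)   # 0 instead of NA for missing elements
common <- v1[!is.na(match(v1, v2))]       # elements of v1 that occur in v2

print(found)
print(positions)
print(common)
```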

getmode function

Now we have all the elements we need to define a function that takes a vector and returns its mode.

In [55]:
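# Create the 'getmode' function
# (this version assumes 'data' contains only positive integers)
getmode <- function(data) {
   tab_d <- tabulate(data)        # tab_d[i] counts the occurrences of the value i
   index_d <- which.max(tab_d)    # position of the first highest count
   return (index_d)               # that position is the mode itself
}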
In [56]:
d = c(2,2,2,2,2,2,2,3,3,3,1,4)
getmode(d)
2
In [57]:
# Alternative (using match and unique)
getmode <- function(data) {
   uniq_d <- unique(data)
   tab_d <- tabulate(match(data, uniq_d))
   index_d <- which.max(tab_d)
   return (uniq_d[index_d])
}
In [52]:
d = c(2,2,2,2,2,2,2,3,3,3,1,4)
getmode(d)
2
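
When several values share the highest count, 'which.max' keeps only the first one. The sketch below is a possible variant of the alternative version that returns every mode; the name 'get_all_modes' is hypothetical.

```r
# Hypothetical helper: return every value that reaches the highest count
get_all_modes <- function(data) {
    uniq_d <- unique(data)
    tab_d <- tabulate(match(data, uniq_d))   # count of each distinct value
    uniq_d[tab_d == max(tab_d)]              # keep all values with the top count
}

print(get_all_modes(c(2, 2, 3, 3, 1)))       # two values share the top count
print(get_all_modes(c('a', 'b', 'b', 'c')))  # also works on character data
```

Because it relies on 'match' and 'unique' rather than on the values themselves, this approach is not limited to positive integers.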

Useful file functions

In [30]:
# Get the current working directory
getwd()
'C:/Users/aless/OneDrive - Università degli Studi di Catania/Didattica/Statistica (economia)/LABs/StatsLab'
In [ ]:
# Set the working directory
setwd("YOUR WANTED WORKING DIRECTORY PATH")
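
A sketch of how these two functions can be combined safely: 'setwd' returns the previous working directory, so it can be restored afterwards. The temporary directory and the file name 'no_such_file.csv' are used here only as an illustration.

```r
old_wd <- setwd(tempdir())   # setwd returns the previous directory invisibly
print(getwd())               # now the session's temporary directory

# Check whether a file exists before trying to read it
print(file.exists("no_such_file.csv"))

# Build a path from its components instead of pasting strings by hand
csv_path <- file.path("..", "Datasets", "Auto.csv")
print(csv_path)

setwd(old_wd)                # restore the original working directory
```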
In [58]:
# We have already seen how to read an existing CSV file
# '..' means 'parent directory'
auto = read.csv("../Datasets/Auto.csv")
head(auto)
mpg  cylinders  displacement  horsepower  weight  acceleration  year  origin  name
18   8          307           130         3504    12.0          70    1       chevrolet chevelle malibu
15   8          350           165         3693    11.5          70    1       buick skylark 320
18   8          318           150         3436    11.0          70    1       plymouth satellite
16   8          304           150         3433    12.0          70    1       amc rebel sst
17   8          302           140         3449    10.5          70    1       ford torino
15   8          429           198         4341    10.0          70    1       ford galaxie 500
In [59]:
# When needed, we can also create new CSV 
# files with our data
my_data = data.frame(
    "mpg" = auto$mpg, 
    "name" = auto$name)

head(my_data)

# Write the CSV file
write.csv(my_data,"mpgdata.csv")
mpg  name
18   chevrolet chevelle malibu
15   buick skylark 320
18   plymouth satellite
16   amc rebel sst
17   ford torino
15   ford galaxie 500
In [60]:
data = read.csv("mpgdata.csv")
head(data)
X  mpg  name
1  18   chevrolet chevelle malibu
2  15   buick skylark 320
3  18   plymouth satellite
4  16   amc rebel sst
5  17   ford torino
6  15   ford galaxie 500

Here the column X contains the row names of the data frame, which write.csv stores in the file by default. It can be dropped by passing row.names=FALSE when writing the file.

In [61]:
write.csv(my_data,"mpgdata.csv", row.names=FALSE)
In [62]:
data = read.csv("mpgdata.csv")
head(data)
mpg  name
18   chevrolet chevelle malibu
15   buick skylark 320
18   plymouth satellite
16   amc rebel sst
17   ford torino
15   ford galaxie 500
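
The effect of the row.names argument can be reproduced without the Auto data set. The sketch below uses a small, hypothetical data frame and a temporary file, so it can be run from any working directory.

```r
# Hypothetical toy data standing in for the Auto data set
toy <- data.frame(mpg = c(18, 15), name = c("car a", "car b"))
path <- tempfile(fileext = ".csv")

write.csv(toy, path)                      # row names are written by default
with_rownames <- names(read.csv(path))
print(with_rownames)                      # includes the extra column X

write.csv(toy, path, row.names = FALSE)   # no row-name column
without_rownames <- names(read.csv(path))
print(without_rownames)

unlink(path)                              # remove the temporary file
```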