Data types and data structures

Statistical Laboratory

Alessandro Ortis - University of Catania

Data Types

Variables are assigned R objects, and the data type of the R object becomes the data type of the variable. There are many types of R objects; the most frequently used are:

Vectors (indexed arrays)
Lists (ordered collections whose elements may have different types and, optionally, names)
Matrices (2D vectors)
Arrays (nD vectors)
Factors
Data Frames

The values contained in R-Objects belong to one of the following types.

# Boolean/logical
v <- TRUE       # assign to the new variable 'v' the boolean value TRUE
print(v)        # print the content of 'v'
print(class(v)) # print the class name of the value in 'v'

[1] TRUE
[1] "logical"

# Numeric
v <- 2.55
print(class(v))

[1] "numeric"

# Integer (the L suffix requests integer storage; R integers are 32-bit, so numeric values cover a larger range)
v <- 2L
print(class(v))

[1] "integer"

# Complex
v <- 3 + 2i
print(class(v))

[1] "complex"

# Character
v <- 'a'
print(class(v))
v <- 'hello'
print(class(v))

[1] "character"
[1] "character"

# Raw (i.e., the bytes in which the characters are actually stored)
v <- charToRaw('a')
print(v)     # 61 is the hexadecimal code for the character 'a'
print(class(v))
print(charToRaw('b'))
print(charToRaw('A'))

[1] 61
[1] "raw"
[1] 62
[1] 41
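As a small complement to the types above, the following sketch contrasts class() with typeof() and shows explicit conversion with the as.*() family. The variable names x_int and x_dbl are chosen only for this illustration.

```r
# typeof() reports the internal storage type, class() the higher-level class.
x_int <- 2L     # the L suffix requests integer storage
x_dbl <- 2      # without the suffix the value is stored as a double
print(typeof(x_int))   # "integer"
print(typeof(x_dbl))   # "double"
print(class(x_dbl))    # "numeric"

# R integers are 32-bit: this is the largest value they can hold.
print(.Machine$integer.max)   # 2147483647

# Explicit conversion between types with the as.*() functions.
print(as.integer("42") + 1L)  # 43
print(as.character(3.5))      # "3.5"
print(as.numeric(TRUE))       # 1
```

Note that class() returns "numeric" for a double, while typeof() returns "double"; both describe the same value at different levels.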

Vectors vs. Lists

As we previously observed, to create a vector with more than one element we can use the c() function, which combines its arguments into a vector. Vectors can hold numeric, character or logical values.

However, all the elements of a vector must have the same data type, while a list can contain elements of different data types, such as strings and numbers. A list can also contain vectors, matrices, functions or other lists.

# vector of numeric
a <- c(2,3,3,4,5,6)
print(a)
print(class(a))

[1] 2 3 3 4 5 6
[1] "numeric"

Vector arithmetic

# Create two vectors.
v1 <- c(3,8,4,5,0,11)
v2 <- c(4,11,0,8,1,2)

# Vector addition.
add <- v1+v2
print(add)

# Vector subtraction.
sub <- v1-v2
print(sub)

# Vector multiplication.
multi <- v1*v2
print(multi)

# Vector division.
div <- v1/v2
print(div)

[1]  7 19  4 13  1 13
[1] -1 -3  4 -3 -1  9
[1] 12 88  0 40  0 22
[1] 0.7500000 0.7272727       Inf 0.6250000 0.0000000 5.5000000

Vector sorting

# Create a vector
v <- c(3,8,4,5,0,11, -9, 300)

# Sort the elements of the vector.
sorted <- sort(v)
print(sorted)

# Sort the elements in the reverse order.
revsort <- sort(v, decreasing = TRUE)
print(revsort)

# Sorting character vectors.
v <- c("Red","Blue","yellow","violet")
sort_c <- sort(v)
print(sort_c)

# Sorting character vectors in reverse order.
revsort_c <- sort(v, decreasing = TRUE)
print(revsort_c)

[1]  -9   0   3   4   5   8  11 300
[1] 300  11   8   5   4   3   0  -9
[1] "Blue"   "Red"    "violet" "yellow"
[1] "yellow" "violet" "Red"    "Blue"

# But if we try to mix data types inside a vector...
a <- c(2,3,4,'hello',2.5, TRUE)
print(a)
#...all elements are converted to character strings, and 'a' is now a character vector.
print(class(a))

[1] "2"     "3"     "4"     "hello" "2.5"   "TRUE" 
[1] "character"
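The conversion seen above follows a fixed order (logical, then numeric, then character): c() picks the most general type among its arguments. The sketch below illustrates this with a few hypothetical vectors, and shows that a list keeps each element's own type.

```r
# logical + numeric -> numeric (TRUE becomes 1, FALSE becomes 0)
mixed_lgl_num <- c(TRUE, 2, FALSE)
print(mixed_lgl_num)          # 1 2 0
print(class(mixed_lgl_num))   # "numeric"

# integer + double -> numeric
mixed_int_dbl <- c(1L, 2.5)
print(class(mixed_int_dbl))   # "numeric"

# numeric + character -> character
mixed_num_chr <- c(1, "a")
print(class(mixed_num_chr))   # "character"

# A list does not coerce: each element keeps its own type.
kept <- list(2, "hello", TRUE)
print(sapply(kept, class))    # "numeric" "character" "logical"
```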

a <- list(2,3,4)
a

a <- list(1,2,'a',3,'b')
a

a <- list(2,3,'hello', c(5,6), 34.5) # 'a' contains 5 elements; the fourth is a vector
print(a)
print(class(a))

[[1]]
[1] 2

[[2]]
[1] 3

[[3]]
[1] "hello"

[[4]]
[1] 5 6

[[5]]
[1] 34.5

[1] "list"

a <- list(2,    # element with index 1
          list(3,  # element with index 2,1
               3,  # element with index 2,2
               4,  # element with index 2,3
               5),   # element with index 2,4
          4)    # element with index 3
print(a)
print(class(a))

[[1]]
[1] 2

[[2]]
[[2]][[1]]
[1] 3

[[2]][[2]]
[1] 3

[[2]][[3]]
[1] 4

[[2]][[4]]
[1] 5


[[3]]
[1] 4

[1] "list"
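The indices shown in the printed output can be used directly to access elements. The following sketch, using a hypothetical list named nested with the same shape as the one above, shows the difference between double brackets (which extract an element) and single brackets (which return a sub-list), plus access by name.

```r
nested <- list(2, list(3, 3, 4, 5), 4)

print(nested[[1]])          # 2: the first element itself
print(nested[[2]][[3]])     # 4: third element of the inner list
print(class(nested[[1]]))   # "numeric"
print(class(nested[1]))     # "list": single brackets return a sub-list
print(length(nested))       # 3
print(length(nested[[2]]))  # 4

# List elements can also be named and accessed with $ or [["name"]].
person <- list(name = "Ada", scores = c(5, 6))
print(person$name)            # "Ada"
print(person[["scores"]][2])  # 6
```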

Matrices vs. Arrays

?matrix

# matrices are 2 dimensional arrays
M = matrix( c('a','a','b','c','b','a'), nrow = 2, ncol = 3, byrow = TRUE)
print(M)

     [,1] [,2] [,3]
[1,] "a"  "a"  "b" 
[2,] "c"  "b"  "a"

# arrays can have any number of dimensions
A = array(c('hello','world'), dim = c(2,2))
print(A)

     [,1]    [,2]   
[1,] "hello" "hello"
[2,] "world" "world"

If there are too few elements in data to fill the array, then the elements in data are recycled.

A = array(c('hello','world'), dim = c(2,4))
print(A)

     [,1]    [,2]    [,3]    [,4]   
[1,] "hello" "hello" "hello" "hello"
[2,] "world" "world" "world" "world"

A = array(c('hello','world','today',
            'will','be','a',
            'very','long','day'), dim = c(2,3,4))
print(A)

, , 1

     [,1]    [,2]    [,3]
[1,] "hello" "today" "be"
[2,] "world" "will"  "a" 

, , 2

     [,1]   [,2]    [,3]   
[1,] "very" "day"   "world"
[2,] "long" "hello" "today"

, , 3

     [,1]   [,2]   [,3]  
[1,] "will" "a"    "long"
[2,] "be"   "very" "day" 

, , 4

     [,1]    [,2]    [,3]
[1,] "hello" "today" "be"
[2,] "world" "will"  "a"
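Elements of matrices and arrays are accessed with one index per dimension. The sketch below rebuilds the same 2 x 3 x 4 array and reads a few entries; the names words and M_num are introduced only for this example.

```r
words <- c('hello','world','today','will','be','a','very','long','day')
A <- array(words, dim = c(2, 3, 4))

print(dim(A))       # 2 3 4
print(length(A))    # 24: the 9 words were recycled to fill 24 cells
print(A[1, 2, 3])   # "a": row 1, column 2 of the third slice
print(A[, , 1])     # the whole first slice, a 2 x 3 matrix

# The same indexing works for matrices, with two indices.
M_num <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
print(M_num[2, 3])    # 6
print(M_num[, 2])     # 2 5: the second column
print(dim(t(M_num)))  # 3 2: t() transposes the matrix
```

Leaving an index empty, as in A[, , 1], selects everything along that dimension.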

Factors

A factor is an R object created from a vector. It stores the vector together with the distinct values of its elements as labels (called levels). The levels are always stored as character strings, regardless of whether the input vector is numeric, character or logical. Factors are useful for columns that have a limited number of unique values, such as "Male"/"Female" or TRUE/FALSE. Factors are created using the factor() function. The nlevels() function gives the number of levels.

# Create a vector.
weather <- c('sunny', 'sunny', 
             'sunny', 'cloudy',
             'rain', 'rain',
             'rain')

# Create a factor object.
factor_weather <- factor(weather)

# Print the factor and its number of levels.
print(factor_weather)
print(nlevels(factor_weather))

[1] sunny  sunny  sunny  cloudy rain   rain   rain  
Levels: cloudy rain sunny
[1] 3
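To see what a factor stores, the sketch below inspects the levels and the integer codes behind them, and counts the observations per level. The names weather_obs, weather_f and ordered_f are hypothetical and used only here.

```r
weather_obs <- c('sunny', 'sunny', 'sunny', 'cloudy', 'rain', 'rain', 'rain')
weather_f <- factor(weather_obs)

# Levels are the distinct values, sorted alphabetically by default.
print(levels(weather_f))      # "cloudy" "rain" "sunny"

# Internally each element is an integer code pointing into the levels.
print(as.integer(weather_f))  # 3 3 3 1 2 2 2

# table() counts how many observations fall in each level.
print(table(weather_f))

# The order of the levels can be set explicitly with the levels argument.
ordered_f <- factor(weather_obs, levels = c('rain', 'cloudy', 'sunny'))
print(levels(ordered_f))      # "rain" "cloudy" "sunny"
```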

Data Frames

We have seen data frames when we explored the dataset 'Auto'.

Data frames are tabular data objects. Unlike in a matrix, each column of a data frame can contain a different mode of data: the first column can be numeric while the second is character and the third is logical. A data frame is a list of vectors of equal length.

Data Frames are created using the data.frame() function.

?data.frame

# Create a data frame
students <- data.frame(
   gender = c("Male", "Male","Female"), 
   height = c(177, 175, 165), 
   weight = c(81,78,50),
   age = c(30L,27L,26L)
)
print(students)
# Transpose the students data frame
t(students)

  gender height weight age
1   Male    177     81  30
2   Male    175     78  27
3 Female    165     50  26

print(students$age)

[1] 30 27 26

# Select the students' weight and compute the mean weight
mw = mean(students$weight)
print(mw)

[1] 69.66667

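# Summarize each column. Here gender is counted by level because data.frame()
# converted strings to factors by default before R 4.0.
summary(students)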

    gender      height          weight           age       
 Female:1   Min.   :165.0   Min.   :50.00   Min.   :26.00  
 Male  :2   1st Qu.:170.0   1st Qu.:64.00   1st Qu.:26.50  
            Median :175.0   Median :78.00   Median :27.00  
            Mean   :172.3   Mean   :69.67   Mean   :27.67  
            3rd Qu.:176.0   3rd Qu.:79.50   3rd Qu.:28.50  
            Max.   :177.0   Max.   :81.00   Max.   :30.00
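Because a data frame is a list of equal-length columns, list tools and matrix-style indexing both apply to it. The sketch below recreates the students data frame (with stringsAsFactors = FALSE set explicitly, an assumption made here so the result is the same across R versions) and selects rows by condition; the name tall is hypothetical.

```r
students <- data.frame(
   gender = c("Male", "Male", "Female"),
   height = c(177, 175, 165),
   weight = c(81, 78, 50),
   age = c(30L, 27L, 26L),
   stringsAsFactors = FALSE
)

print(is.list(students))        # TRUE: a data frame is a list of columns
print(sapply(students, class))  # class of each column
print(dim(students))            # 3 4: rows, columns

# Select the rows that satisfy a condition (all columns are kept).
tall <- students[students$height > 170, ]
print(tall)

# Select one column for the rows that satisfy a condition.
print(students[students$gender == "Female", "age"])  # 26
```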

Strings

Any value written within a pair of single quotes or double quotes in R is treated as a string. When printing, R displays every string within double quotes.

General rules: The quotes at the beginning and end of a string must be both double quotes or both single quotes. They cannot be mixed.

Examples of valid strings:

# a quote of the same kind as the delimiters can be included by escaping it with '\'
a <- 'Start and end \' with single quote'
print(a)

b <- "Start and end with double quotes"
print(b)

c <- "single quote ' in between double quotes"
print(c)

d <- 'Double quotes " in between single quote'
print(d)

[1] "Start and end ' with single quote"
[1] "Start and end with double quotes"
[1] "single quote ' in between double quotes"
[1] "Double quotes \" in between single quote"

Examples of invalid strings:

e <- 'Mixed quotes" 
print(e)

f <- 'Single quote ' inside single quote'
print(f)

g <- "Double quotes " inside double quotes"
print(g)

Error in parse(text = x, srcfile = src): <text>:4:7: unexpected symbol
3: 
4: f <- 'Single
         ^
Traceback:

String manipulation functions

paste(): concatenates strings
format(): formats numbers and strings
substring(): extracts parts of a string

s <- paste("hello", "world")
print(s)

[1] "hello world"

# optionally, we can specify a separator
s <- paste("hello", "world","more", "words", sep="---")
print(s)

[1] "hello---world---more---words"
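Two related features of paste() are worth a brief sketch: it is vectorized over its arguments, and the collapse argument joins the resulting strings into a single one. paste0() is a shorthand for paste(..., sep = ""). The file names below are made up for the example.

```r
# paste0() concatenates without any separator; 1:3 is recycled against "file_".
print(paste0("file_", 1:3, ".csv"))   # "file_1.csv" "file_2.csv" "file_3.csv"

# collapse joins the elements of a vector into a single string.
print(paste(c("a", "b", "c"), collapse = "-"))   # "a-b-c"

# sep is applied within each element, collapse between the elements.
print(paste("x", 1:2, sep = "_", collapse = "+"))   # "x_1+x_2"
```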

Numbers and strings can be formatted to a specific style using the format() function.

# Number of significant digits displayed. The last digit is rounded.
result <- format(23.123456789, digits = 4)
print(result)
result <- format(23.123456789, digits = 9)
print(result)
out_str <- paste("The performance is: ", result)
print(out_str)

[1] "23.12"
[1] "23.1234568"
[1] "The performance is:  23.1234568"

# Display numbers in scientific notation.
result <- format(6, scientific = TRUE)
print(result)
result <- format(0.001314521, scientific = TRUE)
print(result)
result <- format(123.998, scientific = TRUE)
print(result)
# you can also pass a vector of numbers...
result <- format(c(6, 123.345), scientific = TRUE)
print(result)

[1] "6e+00"
[1] "1.314521e-03"
[1] "1.23998e+02"
[1] "6.00000e+00" "1.23345e+02"

# The minimum number of digits to the right of the decimal point.
result <- format(c(4,
                   2,
                   1.41,
                   99.2,
                   12.21548772,
                   23.47),
                   nsmall = 5)
print(result)

[1] " 4.00000" " 2.00000" " 1.41000" "99.20000" "12.21549" "23.47000"

# format() always returns a character string.
result <- format(6)
print(result)

[1] "6"

?format

# Numbers are padded with leading blanks to reach the requested width.
result <- format(13.7, width = 6)
print(result)
result <- format(123456, width = 6)
print(result)
result <- format(123456.789, width = 6)
print(result)
# to have the same format for all three...
result <- format(c(13.7,
                  123456,
                  123456.789), width = 6)
print(result)

[1] "  13.7"
[1] "123456"
[1] "123456.8"
[1] "    13.7" "123456.0" "123456.8"

# Left justify strings.
result <- format("Hello", width = 8, justify = "l")
print(result)

[1] "Hello   "

# Center-justify the string.
result <- format("Hello", width = 8, justify = "c")
print(result)

[1] " Hello  "

# Extract characters from 5th to 7th position.
result <- substring("StatisticalLaboratory", 5, 7)
print(result)
print(substring("HelloWorld", 6,10))

[1] "ist"
[1] "World"

Data Reshaping

Data reshaping changes the way data is organized into rows and columns.

Most data processing in R takes its input as a data frame. It is easy to extract data from the rows and columns of a data frame, but there are cases when we need the data frame in a format different from the one in which we received it.

R has many functions to split and merge data frames and to turn their rows into columns and vice versa.

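# Create vector objects.
city <- c("Catania", "Seattle", "Boston")
state <- c("IT",     "WA",      "MA")
# zipcode is numeric, so the leading zero of 02101 is lost (see the output).
zipcode <- c(95030,98104,02101)

# Combine the three vectors column-wise. cbind() on vectors returns
# a character matrix, not a data frame.
addresses <- cbind(city,state,zipcode)

# Print the matrix.
print(addresses)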

     city      state zipcode
[1,] "Catania" "IT"  "95030"
[2,] "Seattle" "WA"  "98104"
[3,] "Boston"  "MA"  "2101"

# Create another data frame with similar columns
new_address <- data.frame(
   city = c("Lowry","Charlotte"),
   state = c("CO","FL"),
   zipcode = c("80230","33949")
#   stringsAsFactors = FALSE
)

# Print a header.
cat("# # # The Second data frame\n") 

# Print the data frame.
print(new_address)

# # # The Second data frame
       city state zipcode
1     Lowry    CO   80230
2 Charlotte    FL   33949

?rbind

# Combine the rows of both objects. Because new_address is a data frame, the result is a data frame.
combined_addresses <- rbind(addresses,new_address)

# Print a header.
cat("# # # The combined data frame\n") 

# Print the result.
print(combined_addresses)

# # # The combined data frame
       city state zipcode
1   Catania    IT   95030
2   Seattle    WA   98104
3    Boston    MA    2101
4     Lowry    CO   80230
5 Charlotte    FL   33949
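In the output above the Boston zip code appears as 2101 because it was entered as a number. A possible variant, sketched below, stores zip codes as character strings and builds both tables as data frames before combining them. The names addresses_df, new_address_df, all_addresses, regions and merged, as well as the region labels, are assumptions made for this illustration. The last step uses merge() to join two data frames on a shared column.

```r
# Store zip codes as strings so that leading zeros are preserved.
addresses_df <- data.frame(
   city = c("Catania", "Seattle", "Boston"),
   state = c("IT", "WA", "MA"),
   zipcode = c("95030", "98104", "02101"),
   stringsAsFactors = FALSE
)
new_address_df <- data.frame(
   city = c("Lowry", "Charlotte"),
   state = c("CO", "FL"),
   zipcode = c("80230", "33949"),
   stringsAsFactors = FALSE
)

# rbind() stacks the rows of data frames that have the same columns.
all_addresses <- rbind(addresses_df, new_address_df)
print(all_addresses)
print(all_addresses$zipcode[3])   # "02101"

# merge() joins two data frames on a common column (here 'state').
# Only states present in both tables are kept by default.
regions <- data.frame(
   state = c("WA", "MA", "CO"),
   region = c("West", "Northeast", "West"),
   stringsAsFactors = FALSE
)
merged <- merge(all_addresses, regions, by = "state")
print(merged)
```

By default merge() keeps only the rows whose key appears in both data frames; passing all.x = TRUE would keep every row of the first one, filling the missing region with NA.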