Data types and data structures

Statistical Laboratory

Alessandro Ortis - University of Catania

Data Types

Variables in R are bound to R objects, and a variable takes the data type of the object assigned to it. There are many kinds of R objects; the most frequently used are:

  • Vectors (indexed arrays)
  • Lists (collections of elements, optionally named)
  • Matrices (2D vectors)
  • Arrays (nD vectors)
  • Factors
  • Data Frames

The values contained in R-Objects belong to one of the following types.

In [1]:
# Boolean/logical
v <- TRUE       # assign to the new variable 'v' the boolean value TRUE
print(v)        # print the content of 'v'
print(class(v)) # print the class name of the value in 'v'
[1] TRUE
[1] "logical"
In [2]:
# Numeric
v <- 2.55
print(class(v))
[1] "numeric"
In [3]:
# Integer (the L suffix creates a 32-bit integer; numeric values are doubles and cover a larger range)
v <- 2L
print(class(v))
[1] "integer"
In [4]:
# Complex
v <- 3 + 2i
print(class(v))
[1] "complex"
In [5]:
# Char
v <- 'a'
print(class(v))
v <- 'hello'
print(class(v))
[1] "character"
[1] "character"
In [8]:
# Raw (i.e., the bytes in which the characters are actually stored)
v <- charToRaw('a')
print(v)     # 61 is the hexadecimal code for the character 'a'
print(class(v))
print(charToRaw('b'))
print(charToRaw('A'))
[1] 61
[1] "raw"
[1] 62
[1] 41
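As a complement to the cells above, the following minimal sketch (not part of the original notebook) inspects and converts basic types using only base R. The variable names are illustrative.

```r
# Sketch: inspecting and converting basic types (base R only).
x <- 2L                       # integer literal
y <- 2.55                     # numeric (double) literal
print(typeof(x))              # storage type of x: "integer"
print(typeof(y))              # storage type of y: "double"

# Integers are stored on 32 bits, so their range is limited.
print(.Machine$integer.max)   # largest representable integer

# Explicit conversions between types.
z <- as.integer(y)            # the fractional part is dropped
print(z)
print(as.character(y))
print(as.numeric("3.14"))
print(is.numeric(x))          # integers also count as numeric
```

Note that `class()` reports "numeric" for a double, while `typeof()` reports the underlying storage type "double".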

Vectors vs. Lists

As we previously observed, a vector with more than one element can be created with the c() function, which combines its arguments into a vector. Vectors can hold numeric, character or logical values.

However, all the elements of a vector share the same data type, whereas a list can contain elements of different types, such as strings, numbers and logical values. A list can also contain vectors, matrices, functions or other lists.

In [6]:
# vector of numeric
a <- c(2,3,3,4,5,6)
print(a)
print(class(a))
[1] 2 3 3 4 5 6
[1] "numeric"

Vector arithmetic

In [9]:
# Create two vectors.
v1 <- c(3,8,4,5,0,11)
v2 <- c(4,11,0,8,1,2)

# Vector addition.
add <- v1+v2
print(add)

# Vector subtraction.
sub <- v1-v2
print(sub)

# Vector multiplication.
multi <- v1*v2
print(multi)

# Vector division.
div <- v1/v2
print(div)
[1]  7 19  4 13  1 13
[1] -1 -3  4 -3 -1  9
[1] 12 88  0 40  0 22
[1] 0.7500000 0.7272727       Inf 0.6250000 0.0000000 5.5000000
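The operations above are element-wise, and dividing by zero produces Inf rather than an error. The sketch below, which assumes a hypothetical shorter vector `short`, shows what happens when the two operands differ in length: the shorter one is recycled.

```r
# Sketch: recycling in vector arithmetic (base R only).
v1 <- c(3, 8, 4, 5, 0, 11)
short <- c(1, 2)              # hypothetical shorter vector

# 'short' is reused as 1 2 1 2 1 2 to match the length of v1.
res <- v1 + short
print(res)                    # 4 10 5 7 1 13

# Division by zero does not stop the computation.
d <- c(4, 0) / 0
print(d)                      # Inf for 4/0, NaN for 0/0
print(is.finite(d))           # both FALSE
```

R emits a warning when the longer length is not a multiple of the shorter one, which is a useful hint that the recycling was probably unintended.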

Vector sorting

In [8]:
# Create a vector
v <- c(3,8,4,5,0,11, -9, 300)

# Sort the elements of the vector.
sorted <- sort(v)
print(sorted)

# Sort the elements in the reverse order.
revsort <- sort(v, decreasing = TRUE)
print(revsort)

# Sorting character vectors.
v <- c("Red","Blue","yellow","violet")
sort_c <- sort(v)
print(sort_c)

# Sorting character vectors in reverse order.
revsort_c <- sort(v, decreasing = TRUE)
print(revsort_c)
[1]  -9   0   3   4   5   8  11 300
[1] 300  11   8   5   4   3   0  -9
[1] "Blue"   "Red"    "violet" "yellow"
[1] "yellow" "violet" "Red"    "Blue"  
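When a second vector has to be reordered in the same way, the base R function `order()` can be used instead of `sort()`. The following is a minimal sketch; the `labels` vector is a hypothetical companion vector introduced only for illustration.

```r
# Sketch: order() returns the positions that would sort the vector.
v <- c(3, 8, 4, 5, 0, 11, -9, 300)
labels <- letters[1:8]        # hypothetical labels, one per element of v

idx <- order(v)               # positions in ascending order of v
print(idx)                    # 7 5 1 3 4 2 6 8
print(v[idx])                 # same result as sort(v)
print(labels[idx])            # labels rearranged to follow the sorted values

# Descending order works as for sort().
print(v[order(v, decreasing = TRUE)])
```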
In [10]:
# but, if I want to mix data types inside a vector...
a <- c(2,3,4,'hello',2.5, TRUE)
print(a)
#...all elements are converted into characters and 'a' is now a vector of chars.
print(class(a))
[1] "2"     "3"     "4"     "hello" "2.5"   "TRUE" 
[1] "character"
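The conversion shown above follows a fixed hierarchy: logical values are promoted to integer, integer to numeric, and numeric to character. A minimal sketch of this behaviour, using only `c()` and `class()`:

```r
# Sketch: implicit coercion when mixing types in a vector.
print(class(c(TRUE, 2L)))          # "integer"
print(class(c(TRUE, 2L, 2.5)))     # "numeric"
print(class(c(2.5, "a")))          # "character"

# A practical consequence: logical values behave as 0 and 1 in arithmetic.
s <- sum(c(TRUE, FALSE, TRUE))     # counts the TRUE values
print(s)
```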
In [10]:
a <- list(2,3,4)
a
  1. 2
  2. 3
  3. 4
In [11]:
a <- list(1,2,'a',3,'b')
a
  1. 1
  2. 2
  3. 'a'
  4. 3
  5. 'b'
In [11]:
a <- list(2,3,'hello', c(5,6), 34.5) # 'a' contains 5 elements; the fourth one is a vector
print(a)
print(class(a))
[[1]]
[1] 2

[[2]]
[1] 3

[[3]]
[1] "hello"

[[4]]
[1] 5 6

[[5]]
[1] 34.5

[1] "list"
In [12]:
a <- list(2,    # element with index 1
          list(3,  # element with index 2,1
               3,  # element with index 2,2
               4,  # element with index 2,3
               5),   # element with index 2,4
          4)    # element with index 3
print(a)
print(class(a))
[[1]]
[1] 2

[[2]]
[[2]][[1]]
[1] 3

[[2]][[2]]
[1] 3

[[2]][[3]]
[1] 4

[[2]][[4]]
[1] 5


[[3]]
[1] 4

[1] "list"
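The indices in the comments above correspond to chained double brackets. The sketch below shows how the elements can be accessed; the named list `person` is a hypothetical example added only to illustrate access by name.

```r
# Sketch: accessing list elements (base R only).
a <- list(2, list(3, 3, 4, 5), 4)

print(a[[2]][[3]])        # element with index 2,3, i.e. 4
print(class(a[1]))        # single brackets return a list
print(class(a[[1]]))      # double brackets return the element itself
print(length(a))          # 3 top-level elements

# Hypothetical named list: elements can be accessed with $.
person <- list(name = "Ada", scores = c(28, 30))
print(person$name)
print(person$scores[2])
```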

Matrices vs. Arrays

In [1]:
?matrix
In [13]:
# matrices are 2 dimensional arrays
M = matrix( c('a','a','b','c','b','a'), nrow = 2, ncol = 3, byrow = TRUE)
print(M)
     [,1] [,2] [,3]
[1,] "a"  "a"  "b" 
[2,] "c"  "b"  "a" 
In [14]:
# arrays can have any number of dimensions
A = array(c('hello','world'), dim = c(2,2))
print(A)
     [,1]    [,2]   
[1,] "hello" "hello"
[2,] "world" "world"

If there are too few elements in data to fill the array, then the elements in data are recycled.

In [15]:
A = array(c('hello','world'), dim = c(2,4))
print(A)
     [,1]    [,2]    [,3]    [,4]   
[1,] "hello" "hello" "hello" "hello"
[2,] "world" "world" "world" "world"
In [16]:
A = array(c('hello','world','today',
            'will','be','a',
            'very','long','day'), dim = c(2,3,4))
print(A)
, , 1

     [,1]    [,2]    [,3]
[1,] "hello" "today" "be"
[2,] "world" "will"  "a" 

, , 2

     [,1]   [,2]    [,3]   
[1,] "very" "day"   "world"
[2,] "long" "hello" "today"

, , 3

     [,1]   [,2]   [,3]  
[1,] "will" "a"    "long"
[2,] "be"   "very" "day" 

, , 4

     [,1]    [,2]    [,3]
[1,] "hello" "today" "be"
[2,] "world" "will"  "a" 
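Elements of an array are addressed with one index per dimension, and data are filled column by column. The following sketch rebuilds the same array and checks a few positions against the printed output above.

```r
# Sketch: dimensions and indexing of a 3D array (base R only).
A <- array(c('hello', 'world', 'today',
             'will', 'be', 'a',
             'very', 'long', 'day'), dim = c(2, 3, 4))

print(dim(A))         # 2 3 4
print(length(A))      # 24 cells filled from 9 values by recycling
print(A[1, 2, 1])     # row 1, column 2, slice 1: "today"
print(A[2, 1, 2])     # row 2, column 1, slice 2: "long"

# Leaving an index empty selects the whole dimension.
print(A[, , 4])       # fourth slice, a 2 x 3 matrix
```

Since 18 is a multiple of 9, the recycling restarts exactly at the fourth slice, which is why slice 4 repeats slice 1.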

Factors

A factor is an R object created from a vector. It stores the vector together with its distinct values, which are called levels. The levels are always stored as character strings, regardless of whether the input vector is numeric, character or logical. Factors are useful for columns that have a limited number of unique values, such as "Male"/"Female" or TRUE/FALSE. Factors are created with the factor() function. The nlevels() function returns the number of levels.

In [17]:
# Create a vector.
weather <- c('sunny', 'sunny', 
               'sunny', 'cloudy',
               'rain', 'rain',
               'rain')

# Create a factor object.
factor_weather <- factor(weather)

# Print the factor and the number of its levels.
print(factor_weather)
print(nlevels(factor_weather))
[1] sunny  sunny  sunny  cloudy rain   rain   rain  
Levels: cloudy rain sunny
[1] 3
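To see what a factor stores, the sketch below prints the levels and the underlying integer codes, and shows one way to impose a custom level order. The names `w`, `f` and `f2` are illustrative and do not appear in the original notebook.

```r
# Sketch: levels and integer codes of a factor (base R only).
w <- c('sunny', 'sunny', 'sunny', 'cloudy', 'rain', 'rain', 'rain')
f <- factor(w)

print(levels(f))        # levels are sorted alphabetically by default
print(as.integer(f))    # each value is stored as the index of its level
print(table(f))         # number of observations per level

# The order of the levels can be set explicitly.
f2 <- factor(w, levels = c('sunny', 'cloudy', 'rain'))
print(levels(f2))
```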

Data Frames

We have seen data frames when we explored the dataset 'Auto'.

Data frames are tabular data objects. Unlike in a matrix, each column of a data frame can contain a different mode of data: the first column can be numeric, the second character and the third logical. A data frame is a list of vectors of equal length.

Data Frames are created using the data.frame() function.

In [49]:
?data.frame
In [18]:
# Create a data frame
students <- data.frame(
   gender = c("Male", "Male","Female"), 
   height = c(177, 175, 165), 
   weight = c(81,78,50),
   age = c(30L,27L,26L)
)
print(students)
# Transpose the students data frame
t(students)
  gender height weight age
1   Male    177     81  30
2   Male    175     78  27
3 Female    165     50  26
gender  Male  Male  Female
height  177   175   165
weight  81    78    50
age     30    27    26
In [17]:
print(students$age)
[1] 30 27 26
In [18]:
# Select the students' weight and compute the mean weight
mw = mean(students$weight)
print(mw)
[1] 69.66667
In [21]:
summary(students)
    gender      height          weight           age       
 Female:1   Min.   :165.0   Min.   :50.00   Min.   :26.00  
 Male  :2   1st Qu.:170.0   1st Qu.:64.00   1st Qu.:26.50  
            Median :175.0   Median :78.00   Median :27.00  
            Mean   :172.3   Mean   :69.67   Mean   :27.67  
            3rd Qu.:176.0   3rd Qu.:79.50   3rd Qu.:28.50  
            Max.   :177.0   Max.   :81.00   Max.   :30.00  
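Because a data frame is a list of equal-length vectors, it can be inspected column by column and filtered by row. The following is a minimal sketch on the same data; `stringsAsFactors = FALSE` is set explicitly here as an assumption, so that `gender` stays a character column on every R version.

```r
# Sketch: inspecting and subsetting a data frame (base R only).
students <- data.frame(
   gender = c("Male", "Male", "Female"),
   height = c(177, 175, 165),
   weight = c(81, 78, 50),
   age = c(30L, 27L, 26L),
   stringsAsFactors = FALSE
)

print(is.list(students))            # a data frame is a list of columns
print(sapply(students, class))      # class of each column
print(dim(students))                # number of rows and columns

# Select the rows that satisfy a condition, keeping all columns.
tall <- students[students$height > 170, ]
print(tall)

# Select a single cell by row index and column name.
print(students[2, "weight"])
```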

Strings

Any value written within a pair of single or double quotes in R is treated as a string. Internally, R stores every string within double quotes, even when it was created with single quotes.

General rule: the quotes at the beginning and at the end of a string must be either both double quotes or both single quotes. They cannot be mixed.

Examples of valid strings:

In [35]:
# a quote inside a string can be escaped with a backslash (\)
a <- 'Start and end \' with single quote'
print(a)

b <- "Start and end with double quotes"
print(b)

c <- "single quote ' in between double quotes"
print(c)

d <- 'Double quotes " in between single quote'
print(d)
[1] "Start and end ' with single quote"
[1] "Start and end with double quotes"
[1] "single quote ' in between double quotes"
[1] "Double quotes \" in between single quote"

Examples of invalid strings:

In [36]:
e <- 'Mixed quotes" 
print(e)

f <- 'Single quote ' inside single quote'
print(f)

g <- "Double quotes " inside double quotes"
print(g)
Error in parse(text = x, srcfile = src): <text>:4:7: unexpected symbol
3: 
4: f <- 'Single
         ^
Traceback:

String manipulation functions

  • paste(): concatenates strings
  • format(): formats numbers and strings
  • substring(): extracts parts of a string
In [37]:
s <- paste("hello", "world")
print(s)
[1] "hello world"
In [25]:
# optionally, we can specify a separator
s <- paste("hello", "world","more", "words", sep="---")
print(s)
[1] "hello---world---more---words"

Numbers and strings can be formatted to a specific style using the format() function.

In [19]:
# Total number of digits displayed. Last digit rounded off.
result <- format(23.123456789, digits = 4)
print(result)
result <- format(23.123456789, digits = 9)
print(result)
out_str <- paste("The performance is: ", result)
print(out_str)
[1] "23.12"
[1] "23.1234568"
[1] "The performance is:  23.1234568"
In [40]:
# Display numbers in scientific notation.
result <- format(6, scientific = TRUE)
print(result)
result <- format(0.001314521, scientific = TRUE)
print(result)
result <- format(123.998, scientific = TRUE)
print(result)
# you can also pass a vector of numbers...
result <- format(c(6, 123.345), scientific = TRUE)
print(result)
[1] "6e+00"
[1] "1.314521e-03"
[1] "1.23998e+02"
[1] "6.00000e+00" "1.23345e+02"
In [20]:
# The minimum number of digits to the right of the decimal point.
result <- format(c(4,
                   2,
                   1.41,
                   99.2,
                   12.21548772,
                   23.47),
                   nsmall = 5)
print(result)
[1] " 4.00000" " 2.00000" " 1.41000" "99.20000" "12.21549" "23.47000"
In [42]:
# Format treats everything as a string.
result <- format(6)
print(result)
[1] "6"
In [26]:
?format
In [22]:
# Numbers are padded with leading blanks to reach the requested width.
result <- format(13.7, width = 6)
print(result)
result <- format(123456, width = 6)
print(result)
result <- format(123456.789, width = 6)
print(result)
# to have the same format for all three...
result <- format(c(13.7,
                  123456,
                  123456.789), width = 6)
print(result)
[1] "  13.7"
[1] "123456"
[1] "123456.8"
[1] "    13.7" "123456.0" "123456.8"
In [45]:
# Left justify strings.
result <- format("Hello", width = 8, justify = "l")
print(result)
[1] "Hello   "
In [46]:
# Center strings.
result <- format("Hello", width = 8, justify = "c")
print(result)
[1] " Hello  "
In [23]:
# Extract characters from 5th to 7th position.
result <- substring("StatisticalLaboratory", 5, 7)
print(result)
print(substring("HelloWorld", 6,10))
[1] "ist"
[1] "World"
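The three functions above are vectorized, so they can process several strings in a single call. The sketch below illustrates this with hypothetical inputs; the `collapse` argument of `paste()` is used in addition to the `sep` argument shown earlier.

```r
# Sketch: vectorized use of paste(), format() and substring().
ids <- paste("sample", 1:3, sep = "_")      # one string per number
print(ids)

joined <- paste(ids, collapse = ", ")       # merge a vector into one string
print(joined)

padded <- format(c("a", "bbb"), width = 5)  # pad every string to 5 characters
print(padded)

# Extract two parts at once by passing vectors of start and end positions.
parts <- substring("StatisticalLaboratory", c(1, 12), c(11, 21))
print(parts)
```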

Data Reshaping

Data reshaping changes the way data is organized into rows and columns.

Most of the time, data processing in R takes the input data as a data frame. It is easy to extract data from the rows and columns of a data frame, but there are cases when we need the data frame in a format that is different from the format in which we received it.

R has many functions to split and merge data frames and to turn rows into columns and vice versa.

In [24]:
# Create vector objects.
city <- c("Catania", "Seattle", "Boston")
state <- c("IT",     "WA",      "MA")
zipcode <- c(95030,98104,02101)   # numeric values: the leading zero of 02101 is dropped

# Combine the three vectors column-wise. Note that cbind() on vectors
# returns a character matrix, not a data frame.
addresses <- cbind(city,state,zipcode)

# Print the resulting matrix.
print(addresses)
     city      state zipcode
[1,] "Catania" "IT"  "95030"
[2,] "Seattle" "WA"  "98104"
[3,] "Boston"  "MA"  "2101" 
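The output above is a character matrix, and the Boston zip code has lost its leading zero because it was entered as a number. A possible alternative, sketched below under the assumption that zip codes should be kept as text, builds a data frame directly. The name `addresses_df` is illustrative.

```r
# Sketch: building a data frame and keeping zip codes as strings.
city <- c("Catania", "Seattle", "Boston")
state <- c("IT", "WA", "MA")
zipcode <- c("95030", "98104", "02101")   # strings preserve the leading zero

m <- cbind(city, state, zipcode)          # character matrix
addresses_df <- data.frame(city, state, zipcode,
                           stringsAsFactors = FALSE)

print(is.matrix(m))
print(is.data.frame(addresses_df))
print(addresses_df$zipcode[3])            # "02101"
```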
In [25]:
# Create a data frame with the same columns
new_address <- data.frame(
   city = c("Lowry","Charlotte"),
   state = c("CO","FL"),
   zipcode = c("80230","33949")
#   stringsAsFactors = FALSE
)

# Print a header.
cat("# # # The Second data frame\n") 

# Print the data frame.
print(new_address)
# # # The Second data frame
       city state zipcode
1     Lowry    CO   80230
2 Charlotte    FL   33949
In [27]:
?rbind
In [27]:
# Combine the rows from both objects.
combined_addresses <- rbind(addresses,new_address)

# Print a header.
cat("# # # The combined data frame\n") 

# Print the result.
print(combined_addresses)
# # # The combined data frame
       city state zipcode
1   Catania    IT   95030
2   Seattle    WA   98104
3    Boston    MA    2101
4     Lowry    CO   80230
5 Charlotte    FL   33949
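Besides stacking rows with rbind(), two data frames that share a key column can be joined with the base R function `merge()`. The sketch below uses a hypothetical second table `visits`, whose values are invented purely for illustration.

```r
# Sketch: joining two data frames on a common column with merge().
addresses <- data.frame(
   city = c("Catania", "Seattle", "Boston"),
   state = c("IT", "WA", "MA"),
   stringsAsFactors = FALSE
)

# Hypothetical table with illustrative values only.
visits <- data.frame(
   city = c("Boston", "Catania"),
   visited = c(TRUE, FALSE),
   stringsAsFactors = FALSE
)

# Keep only the cities present in both tables.
merged <- merge(addresses, visits, by = "city")
print(merged)

# Keep every row of 'addresses'; missing matches become NA.
merged_all <- merge(addresses, visits, by = "city", all.x = TRUE)
print(merged_all)
```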