Names and values

Binding names to values

The <- assignment operator creates a binding (a reference) from the name on the left to the object on the right. The lobstr package is helpful for investigating R data structures. For example, its obj_addr() and obj_addrs() functions show the memory address of objects:

x <- "hello"
y <- x
lobstr::obj_addr(x)
#> [1] "0x5581cdefd650"
lobstr::obj_addr(y)
#> [1] "0x5581cdefd650"
lobstr::obj_addrs(list(x, y))
#> [1] "0x5581cdefd650" "0x5581cdefd650"

Syntactically valid names

A syntactically valid name must consist of only:

letters (depends on system locale, generally advised to stick to a-zA-Z)
numbers (only ASCII digits, 0-9)
dots/periods (.)
underlines/underscores (_)

Syntactically valid names must start with:

A letter
A dot/period (but not followed by a number)

Names must not be any of the words reserved by R’s parser (see ?Reserved for the full list).

_aVar <- "hello"
#> Error: unexpected input in "_"
.1var <- "hello"
#> Error: unexpected symbol in ".1var"
TRUE <- "hello"
#> Error in TRUE <- "hello" : invalid (do_set) left-hand side to assignment

See ?make.names for more detail.
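
The rules above can also be checked programmatically. A minimal sketch, assuming a helper called is_valid_name() (a hypothetical name, not part of base R) that compares each name with what make.names() would turn it into:

```r
# Hypothetical helper: a name is treated as syntactically valid
# when make.names() leaves it unchanged.
is_valid_name <- function(x) {
    x == make.names(x)
}

candidates <- c("_aVar", ".1var", "TRUE", "a_var", ".var1")

# make.names() prepends "X" to an invalid start and
# appends "." to a reserved word
fixed <- make.names(candidates)
fixed
#> [1] "X_aVar" "X.1var" "TRUE."  "a_var"  ".var1"

valid <- is_valid_name(candidates)
valid
#> [1] FALSE FALSE FALSE  TRUE  TRUE
```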

Non-syntactically valid names

You can use backticks if you need to use names that are not syntactically valid.

`_aVar` <- "hello"
`_aVar`
#> [1] "hello"
`.1var` <- "hello"
`.1var`
#> [1] "hello"
`TRUE` <- "hello"
`TRUE`
#> [1] "hello"

This can sometimes be helpful if you are loading external data, for example a CSV whose header names start with an underscore.

Notice the check.names = FALSE argument to read.table(). This prevents automatic conversion to syntactically valid names.

df1 <- read.table(
    text = "_var1,_var_2\n0.01,1\n0.05,0",
    sep = ",",
    check.names = FALSE,
    header = TRUE
)
df1
#>   _var1 _var_2
#> 1  0.01      1
#> 2  0.05      0
df1$`_var1`
#> [1] 0.01 0.05
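
For comparison, here is a short sketch of the default behavior (check.names = TRUE), where read.table() passes the header through make.names(). The name df2 is only used for illustration:

```r
# same input as above, but with the default check.names = TRUE
df2 <- read.table(
    text = "_var1,_var_2\n0.01,1\n0.05,0",
    sep = ",",
    header = TRUE
)
# the leading underscores are not valid, so "X" is prepended
names(df2)
#> [1] "X_var1"  "X_var_2"
# no backticks are needed to access the converted names
df2$X_var1
#> [1] 0.01 0.05
```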

Copying objects

Generally, objects are only copied when they are modified (copy-on-modify):

x <- c("hello", "world")
lobstr::obj_addr(x)
#> [1] "0x5581cf7376b8"
y <- x
# `x` and `y` reference the same memory address
lobstr::obj_addrs(list(x, y))
#> [1] "0x5581cf7376b8" "0x5581cf7376b8"
# modify `y`
y[[1]] <- "hi"
lobstr::obj_addrs(list(x, y))
#> [1] "0x5581cf7376b8" "0x5581cf737538"
# `y` has been copied to a new memory address
# modify `y` again
y[[2]] <- "everyone"
lobstr::obj_addr(y)
#> [1] "0x5581cf736338"
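# `y` now has only one name bound to it, so this change would normally
# happen in place; the address recorded here differs from the previous one,
# which can happen when the session (for example RStudio's environment
# pane) holds an extra reference to `y`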
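
Because addresses change from session to session, it can be easier to compare them in code than by eye. A small sketch of the same copy-on-modify behavior, assuming lobstr is installed (sharedBefore and sharedAfter are illustrative names):

```r
x <- c("hello", "world")
y <- x
# before modification both names point at the same object
sharedBefore <- lobstr::obj_addr(x) == lobstr::obj_addr(y)
sharedBefore
#> [1] TRUE

# modifying `y` triggers a copy, so the addresses now differ
y[[1]] <- "hi"
sharedAfter <- lobstr::obj_addr(x) == lobstr::obj_addr(y)
sharedAfter
#> [1] FALSE

# `x` itself is untouched
x
#> [1] "hello" "world"
```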

This is also true for objects that are used as arguments to functions:

aFunc <- function(arg1) {
    # a simple function that just returns the input
    return(arg1)
}
input <- c("hello", "world")
lobstr::obj_addr(input)
#> [1] "0x5581cfe5d988"
output <- aFunc(input)
lobstr::obj_addrs(list(input, output))
#> [1] "0x5581cfe5d988" "0x5581cfe5d988"
# `input` and `output` reference the same memory address
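
The copy only happens once the function modifies its argument. A sketch with a hypothetical function modifyFirst(), again assuming lobstr is installed:

```r
modifyFirst <- function(arg1) {
    # modifying the argument copies it inside the function
    arg1[[1]] <- "hi"
    return(arg1)
}
input <- c("hello", "world")
output <- modifyFirst(input)

# the caller's object is unchanged
input
#> [1] "hello" "world"

# `output` is a different object from `input`
sameObject <- lobstr::obj_addr(input) == lobstr::obj_addr(output)
sameObject
#> [1] FALSE
```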

Modifying objects

An object with a single name referencing it can usually be modified in place:

# if run in RStudio all lines need to be run together
x <- c(1L, 2L, 3L)
lobstr::obj_addr(x)
#> [1] "0x55ffb8d5b618"
x[[2]] <- 9L
lobstr::obj_addr(x)
#> [1] "0x55ffb8d5b618"

An object with more than one name referencing it will not be modified in place:

x <- c(1L, 2L, 3L)
y <- x
lobstr::obj_addrs(list(x, y))
#> [1] "0x55ffb9328018" "0x55ffb9328018"
x[[2]] <- 9L
lobstr::obj_addrs(list(x, y))
# `x` has been copied to a new memory address
# `y` still references the original address
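
Base R's tracemem() is another way to watch for copies without comparing addresses by hand. A hedged sketch: tracemem() only works when R was built with memory profiling, so the example checks capabilities("profmem") first. The message it prints is not shown because the addresses vary between sessions (canTrace is an illustrative name):

```r
x <- c(1L, 2L, 3L)

# only trace when this build of R supports memory profiling
canTrace <- unname(capabilities("profmem"))
if (canTrace) tracemem(x)

# a second name now references the same object
y <- x

# `x` is copied here; with tracing on, R prints a "tracemem[...]" line
x[[2]] <- 9L

if (canTrace) untracemem(x)

x
#> [1] 1 9 3
y
#> [1] 1 2 3
```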

Data structures

Vectors

Lists

Lists do not store values. They store references to values. The lobstr::ref() function can demonstrate that:

aVector <- c("hello", "world")
lobstr::obj_addr(aVector)
#> [1] "0x5581d159bae8"
lobstr::ref(aVector)
#> [1:0x5581d159bae8] <character>
# a single reference (the start of the vector)

aList <- list("hello", "world")
lobstr::obj_addr(aList)
#> [1] "0x5581d1704528"
lobstr::ref(aList)
#> █ [1:0x5581d1704528] <list> 
#> ├─[2:0x5581d0a7bbf0] <character> 
#> └─[3:0x5581d0a7bbb8] <character>
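
Because list elements are references, two elements can point at the same object. A small sketch, assuming lobstr is installed (sharedList and addrs are illustrative names):

```r
aVector <- c("hello", "world")
# both elements reference the object bound to `aVector`
sharedList <- list(aVector, aVector)
addrs <- lobstr::obj_addrs(sharedList)

# the two elements share one address
addrs[[1]] == addrs[[2]]
#> [1] TRUE

# and that address is the address of `aVector` itself
addrs[[1]] == lobstr::obj_addr(aVector)
#> [1] TRUE
```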

Lists are shallow copied. The list object and its references are copied. The values referenced are not copied.

When the copied list is modified, the list object itself is copied to a new address. Only the references to modified elements are updated; the other elements still point at the original values:

aList <- list("hello", "world")
lobstr::ref(aList)
#> █ [1:0x5581d2bb1738] <list> 
#> ├─[2:0x5581d13ed0b8] <character> 
#> └─[3:0x5581d13ed080] <character> 
# copy `aList`
anotherList <- aList
lobstr::obj_addr(anotherList)
#> [1] "0x5581d2bb1738"
# modify `anotherList`
anotherList[[1]] <- "hi"
lobstr::ref(aList, anotherList)
#> █ [1:0x5581d2bb1738] <list> 
#> ├─[2:0x5581d13ed0b8] <character> 
#> └─[3:0x5581d13ed080] <character> 
#>  
#> █ [4:0x5581d2bc7238] <list> 
#> ├─[5:0x5581d156eae8] <character> 
#> └─[3:0x5581d13ed080] 
# `anotherList` gets a new memory address
# the second object of `anotherList` still references the same
# memory address as the second object of `aList`

Memory is used efficiently when lists are just references to values:

x <- c(1L, 2L, 3L)
lobstr::obj_size(x)
#> 64 B
y <- list(x, x, x, x)
lobstr::obj_size(y)
#> 144 B
lobstr::obj_size(list(NULL, NULL, NULL, NULL))
#> 80 B
# the size of `y` is:
# the size of `x` (64 B) +
# the size of a 4 element list (80 B)
# = 144 B
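
As a point of comparison, base R's utils::object.size() does not detect references shared between list elements, so it counts `x` once per element. A sketch that only compares the two figures, since exact byte counts depend on the platform (baseSize and sharedSize are illustrative names):

```r
x <- c(1L, 2L, 3L)
y <- list(x, x, x, x)

# object.size() counts `x` four times
baseSize <- as.numeric(utils::object.size(y))
# obj_size() counts the shared `x` only once
sharedSize <- as.numeric(lobstr::obj_size(y))

baseSize > sharedSize
#> [1] TRUE
```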

The total size of multiple lists will not be the sum of the individual lists if they share references:

x <- list(1L, 2L, 3L)
lobstr::obj_size(x)
y <- list(x, 4L)
lobstr::obj_size(y)
lobstr::obj_size(x, y)
# the total size of `x` and `y` is just the size of `y`
# `y` contains all the references of `x`
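
The same claim can be checked without relying on specific byte counts. A minimal sketch, assuming lobstr is installed (sizeY, sizeBoth and sizeSum are illustrative names):

```r
x <- list(1L, 2L, 3L)
y <- list(x, 4L)

sizeY <- as.numeric(lobstr::obj_size(y))
sizeBoth <- as.numeric(lobstr::obj_size(x, y))
sizeSum <- as.numeric(lobstr::obj_size(x)) + sizeY

# measuring `x` and `y` together adds nothing to the size of `y`
sizeBoth == sizeY
#> [1] TRUE

# and it is smaller than adding the two sizes separately
sizeBoth < sizeSum
#> [1] TRUE
```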

Data frames

Data frames are just lists of equal-length vectors. Their class attribute is data.frame and they have a row.names attribute.

You can construct them yourself instead of using data.frame():

diyDataFrame <- list(
  var1 = c(1, 2),
  var2 = c("hello", "world")
)
attr(diyDataFrame, "class") <- "data.frame"
attr(diyDataFrame, "row.names") <- c(1L, 2L)

aDataFrame <- data.frame(
  var1 = c(1, 2),
  var2 = c("hello", "world")
)

identical(diyDataFrame, aDataFrame)
#> [1] TRUE
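
The list structure underneath a data frame can be inspected with base R alone. A short sketch (columns is an illustrative name):

```r
aDataFrame <- data.frame(
  var1 = c(1, 2),
  var2 = c("hello", "world")
)

# the underlying type is a list
typeof(aDataFrame)
#> [1] "list"
is.list(aDataFrame)
#> [1] TRUE

# the attributes that make it a data frame
names(attributes(aDataFrame))
#> [1] "names"     "class"     "row.names"

# removing the class leaves a plain list of columns
columns <- unclass(aDataFrame)
class(columns)
#> [1] "list"
```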

Because data frames are lists, the copy-on-modify behavior of lists applies to data frames.

Modifying columns of copied data frames

If you modify a column, only the reference to that column needs to change:

dataFrame1 <- data.frame(
  var1 = c("hello", "world"),
  var2 = c(0.01, 0.03)
)
lobstr::ref(dataFrame1)
#> █ [1:0x5581d2b94da8] <list> 
#> ├─var1 = [2:0x5581d2b95de8] <character> 
#> └─var2 = [3:0x5581d2b95d68] <double> 

# copy the data.frame
dataFrame2 <- dataFrame1
lobstr::ref(dataFrame1, dataFrame2)
#> █ [1:0x5581d2b94da8] <list> 
#> ├─var1 = [2:0x5581d2b95de8] <character> 
#> └─var2 = [3:0x5581d2b95d68] <double> 
#>  
#> [1:0x5581d2b94da8]
# `dataFrame2` has the same memory address as `dataFrame1`

# modify a column
dataFrame2$var2 <- c(0.05, 0.01)
lobstr::ref(dataFrame1, dataFrame2)
#> █ [1:0x5581d2b94da8] <list> 
#> ├─var1 = [2:0x5581d2b95de8] <character> 
#> └─var2 = [3:0x5581d2b95d68] <double> 
#>  
#> █ [4:0x5581d2cc22e8] <list> 
#> ├─var1 = [2:0x5581d2b95de8] 
#> └─var2 = [5:0x5581d2cc2428] <double> 
# `dataFrame2` gets a new memory address
# the first object of `dataFrame2` still references the same
# memory address as the first object of `dataFrame1`

Modifying rows of copied data frames

If you modify a row then every reference needs to change. Every column will be copied to a new location in memory.

dataFrame1 <- data.frame(
  var1 = c("hello", "world"),
  var2 = c(0.01, 0.03)
)

# copy the data.frame
dataFrame2 <- dataFrame1
lobstr::ref(dataFrame1, dataFrame2)
#> █ [1:0x5581cf243c38] <list> 
#> ├─var1 = [2:0x5581cf7578d8] <character> 
#> └─var2 = [3:0x5581cf7579d8] <double> 
#>  
#> [1:0x5581cf243c38] 

# modify a row
dataFrame2[1, ] <- list("hi", 0.9)
lobstr::ref(dataFrame1, dataFrame2)
#> █ [1:0x5581cf243c38] <list> 
#> ├─var1 = [2:0x5581cf7578d8] <character> 
#> └─var2 = [3:0x5581cf7579d8] <double> 
#>  
#> █ [4:0x5581cf5f9458] <list> 
#> ├─var1 = [5:0x5581cdb11e40] <character> 
#> └─var2 = [6:0x5581cdb11e78] <double>
# every reference in `dataFrame2` has changed
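
The difference between the two cases can be summarised by comparing column addresses directly. A sketch, assuming lobstr is installed; colCopy, rowCopy and the shared* names are illustrative:

```r
dataFrame1 <- data.frame(
  var1 = c("hello", "world"),
  var2 = c(0.01, 0.03)
)

# column modification: only `var2` gets a new address
colCopy <- dataFrame1
colCopy$var2 <- c(0.05, 0.01)
sharedAfterCol <- unname(
  lobstr::obj_addrs(dataFrame1) == lobstr::obj_addrs(colCopy)
)
sharedAfterCol
#> [1]  TRUE FALSE

# row modification: every column gets a new address
rowCopy <- dataFrame1
rowCopy[1, ] <- list("hi", 0.9)
sharedAfterRow <- unname(
  lobstr::obj_addrs(dataFrame1) == lobstr::obj_addrs(rowCopy)
)
sharedAfterRow
#> [1] FALSE FALSE
```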

The global string pool

I think the “global string pool” concept refers to the CHARSXP cache.

All elements of character vectors point to unique values in the global string pool:

x <- c("hello", "world", "hello")
lobstr::ref(
  x = x,
  character = TRUE
)
#> █ [1:0x5581ce078278] <character> 
#> ├─[2:0x5581c9b6e798] <string: "hello"> 
#> ├─[3:0x5581d0d616b8] <string: "world"> 
#> └─[2:0x5581c9b6e798] 
# the third element has the same memory
# reference as the first

This means repetition in character vectors uses less memory:

x <- c("hello", "world")
lobstr::obj_size(x)
#> 176 B
lobstr::obj_size(rep(x, 10))
#> 320 B
# the character vector repeated 10 times does not use
# 10 times the memory
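
The saving comes from repeated strings, not from character vectors in general. A sketch comparing two vectors of the same length, one with a single repeated string and one with distinct strings. Only the comparison is shown because exact sizes depend on the platform (repeated and distinct are illustrative names):

```r
# 20 references to one string in the pool
repeated <- rep("hello", 20)
# 20 references to 20 different strings
distinct <- sprintf("word%02d", 1:20)

length(repeated) == length(distinct)
#> [1] TRUE

# the repeated vector is smaller because it stores "hello" once
as.numeric(lobstr::obj_size(repeated)) < as.numeric(lobstr::obj_size(distinct))
#> [1] TRUE
```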

Functional programming

Memory management

Object-oriented programming

There are multiple OOP systems in R:

OOP System	Description
`S3`	Provided by base R. Allows functions to return rich results with a nicely formatted display. Used throughout base R. Needed if you are extending base R functions to work with different inputs.
`S4`	Provided by base R.
`R6`	Provided by the `R6` package. Similar to reference classes in base R (`setRefClass()`, `getRefClass()`). Allows you to avoid R’s copy-on-modify behaviour.

People have different preferences for the three systems.

Some OOP terminology

Polymorphism

polymorphism - consider a function’s interface separately from its implementation. Different types of input can use the same function form. For example, summary() gives different outputs depending on the type of variable provided (numeric or factor).
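
A quick sketch of the summary() example, using made-up inputs (numbers and groups are illustrative names):

```r
numbers <- c(1, 2, 3, 4)
groups <- factor(c("a", "b", "a"))

# a numeric vector gets a summary of its quantiles and mean
numSummary <- summary(numbers)
names(numSummary)
#> [1] "Min."    "1st Qu." "Median"  "Mean"    "3rd Qu." "Max."

# a factor gets a count for each level
facSummary <- summary(groups)
facSummary
#> a b 
#> 2 1 
```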

Encapsulation

encapsulation - provide users with an interface that is independent of how an object is internally implemented.

Class

A class describes what an object is.

Method

A method describes what an object does.
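
To make the terms concrete, here is a minimal S3 sketch. All of the names (newGreeting, describe, the "greeting" class) are hypothetical and only serve to illustrate a class, a generic and its methods:

```r
# Class: a constructor that says what the object is
newGreeting <- function(text) {
    structure(list(text = text), class = "greeting")
}

# Generic: the interface, independent of any implementation
describe <- function(x, ...) {
    UseMethod("describe")
}

# Method for any object without a more specific method
describe.default <- function(x, ...) {
    paste("an object of class", class(x)[[1]])
}

# Method: what a "greeting" object does when described
describe.greeting <- function(x, ...) {
    paste("a greeting that says", x$text)
}

g <- newGreeting("hello")
describe(g)
#> [1] "a greeting that says hello"
describe(1L)
#> [1] "an object of class integer"
```

The caller only uses describe(); which implementation runs depends on the class of the input, and the internal list behind a "greeting" could change without affecting that interface.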