2.3 Objects and functions

There are two fundamental concepts in statistical programming that are important to get straight - objects and functions. The most common object you will be working with is a dataset. This is usually something with rows and columns much like the example in Table 2.3.

TABLE 2.3: Example of data in columns and rows, including missing values denoted NA (Not applicable/Not available). Once this dataset has been read into R, it is called a data frame or tibble.
id	sex	var1	var2	var3
1	Male	4	NA	2
2	Female	1	4	1
3	Female	2	5	NA
4	Male	3	NA	NA

To get the small and made-up “dataset” into your Environment, copy and run this code⁵:

library(tidyverse)
mydata <- tibble(
  id   = 1:4,
  sex  = c("Male", "Female", "Female", "Male"),
  var1 = c(4, 1, 2, 3),
  var2 = c(NA, 4, 5, NA),
  var3 = c(2, 1, NA, NA)
)

Data can live anywhere: on paper, in a spreadsheet, in an SQL database, or in your R Environment. We usually start and interact with R through RStudio, but everything we talk about here (objects, functions, environment) also works when R is available and RStudio is not. This can be the case if you are working on a supercomputer that can only serve the R Console and not RStudio.

2.3.1 `data frame/tibble`

So, regularly shaped data in rows and columns is called a table when it lives outside R, but once you read/import it into R it is called a tibble. If you’ve used R before, or are given a piece of code that uses read.csv() instead of read_csv(), you’ll have come across the term data frame.⁶

A tibble is the modern/tidyverse version of a data frame in R. In most cases, data frames and tibbles work interchangeably, but tibbles often work better. Another great alternative to base R data frames is the data table (from the data.table package). In this book, and for most of our day-to-day work these days, we will use tibbles.
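As a small illustration of how closely the two are related, the sketch below converts a base R data frame into a tibble and checks the class of each. The object names small_df and small_tbl are made up for this example, and it assumes the tibble package (part of the tidyverse) is installed.

```r
library(tibble)

# A base R data frame (made-up example data)
small_df <- data.frame(id = 1:2, sex = c("Male", "Female"))

# Convert the data frame into a tibble
small_tbl <- as_tibble(small_df)

class(small_df)   # "data.frame"
class(small_tbl)  # "tbl_df" "tbl" "data.frame"

# A tibble is still a data frame underneath, which is why
# most functions accept either
is.data.frame(small_tbl)  # TRUE
```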

2.3.2 Naming objects

When you read data into R, you want it to show up in the Environment tab. Everything in your Environment needs to have a name. You will likely have many objects such as tibbles going on at the same time. Note that tibble is what the thing is, rather than its name. This is the ‘class’ of an object.

To keep our code examples easy to follow, we call our example tibble mydata. In a real analysis, you should give your tibbles meaningful names, e.g., patient_data, lab_results, annual_totals, etc. Object names can’t have spaces in them, which is why we use the underscore (_) to separate words. Object names can include numbers, but they can’t start with a number: so labdata2019 works, but 2019labdata does not.
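One way to check these naming rules without triggering an error is sketched below. It relies on the base R function make.names(), which turns any string into a syntactically valid name; if the string comes back unchanged, it was already valid. The candidate names are made up for illustration.

```r
# Candidate object names (made up for this example)
candidate_names <- c("labdata2019", "2019labdata", "lab data", "lab_data")

# make.names() repairs invalid names, so a name that comes back
# unchanged was already a valid object name
is_valid <- make.names(candidate_names) == candidate_names
is_valid
## [1]  TRUE FALSE FALSE  TRUE
```

The name starting with a number and the name containing a space are the two that fail.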

So, the tibble named mydata is an example of an object that can be in the Environment of your R Session:

mydata

## # A tibble: 4 x 5
##      id sex     var1  var2  var3
##   <int> <chr>  <dbl> <dbl> <dbl>
## 1     1 Male       4    NA     2
## 2     2 Female     1     4     1
## 3     3 Female     2     5    NA
## 4     4 Male       3    NA    NA

2.3.3 Function and its arguments

A function is a procedure which takes some information (input), does something to it, and passes back the modified information (output).

A simple function that can be applied to numeric data is mean().

R functions always have round brackets after their name. This is for two reasons. First, it easily differentiates them as functions - you will get used to reading them like this.
Second, and more importantly, we can put arguments in these brackets.

Arguments can also be thought of as input. In data analysis, the most common input for a function is data. For instance, we need to give mean() some data to average over. It does not make sense (nor will it work) to feed mean() the whole tibble with multiple columns, including patient IDs and a categorical variable (sex).
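To see what "will not work" looks like in practice, here is a minimal sketch using a small made-up base R data frame (small_data is a hypothetical name, and the values mirror part of Table 2.3). Passing the whole table to mean() produces a warning and a missing value rather than a number.

```r
# A small made-up table with an ID column and a categorical column
small_data <- data.frame(
  id   = 1:4,
  sex  = c("Male", "Female", "Female", "Male"),
  var1 = c(4, 1, 2, 3)
)

# mean() expects a numeric (or logical) vector, not a whole table,
# so this gives a warning and returns NA
whole_table_mean <- mean(small_data)
whole_table_mean
## [1] NA
```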

To quickly extract a single column, we use the $ symbol like this:

mydata$var1

## [1] 4 1 2 3

You can ignore the ## [1] at the beginning of the extracted values. It becomes more useful when the output runs over multiple lines, as the number in the square brackets gives the position of the first value shown on each line.

We can then use mydata$var1 as the first argument of mean() by putting it inside its brackets:

mean(mydata$var1)

## [1] 2.5

which tells us that the mean of var1 (4, 1, 2, 3) is 2.5. In this example, mydata$var1 is the first and only argument to mean().

But what happens if we try to calculate the average value of var2 (NA, 4, 5, NA)? Remember, NA stands for Not Applicable/Available and is used to denote missing data:

mean(mydata$var2)

## [1] NA

So why does mean(mydata$var2) return NA (“not available”) rather than the mean of the values included in this column? That is because the column includes missing values (NAs), and R does not want to average over NAs implicitly. It is being cautious - what if you didn’t know there were missing values for some patients? If you wanted to compare the means of var1 and var2 without any further filtering, you would be comparing samples of different sizes.
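Before deciding how to handle missing values, it helps to know how many there are. The sketch below uses a plain vector with the same values as mydata$var2 so that it runs on its own; is.na(), sum() and anyNA() are all base R functions.

```r
# Same values as the var2 column of mydata
var2 <- c(NA, 4, 5, NA)

# TRUE marks a missing value
is.na(var2)
## [1]  TRUE FALSE FALSE  TRUE

# TRUE counts as 1 and FALSE as 0, so sum() counts the missing values
n_missing  <- sum(is.na(var2))   # 2
n_observed <- sum(!is.na(var2))  # 2

# Quick yes/no check for any missing values
anyNA(var2)  # TRUE
```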

We might expect to see an NA if we tried to, for example, calculate the average of sex. And this is indeed the case:

mean(mydata$sex)

## Warning in mean.default(mydata$sex): argument is not numeric or logical:
## returning NA

## [1] NA

Furthermore, R gives us a pretty clear Warning that it can’t compute the mean of an argument that is not numeric or logical. The sentence reads as a bit of a joke, as if R were saying it is not logical to calculate the mean of something that is not numeric.

What R is actually saying is that it is happy to calculate the mean of two types of variables, numerics or logicals, but what you have passed is neither.
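A logical variable holds TRUE/FALSE values, and its mean is the proportion of TRUE values. As a brief sketch, the character values of sex can be turned into a logical vector with a comparison; the names sex and is_female are chosen for this example only.

```r
# Same values as the sex column of mydata
sex <- c("Male", "Female", "Female", "Male")

# A comparison returns a logical vector
is_female <- sex == "Female"
is_female
## [1] FALSE  TRUE  TRUE FALSE

# The mean of a logical vector is the proportion of TRUE values
proportion_female <- mean(is_female)
proportion_female
## [1] 0.5
```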

If you decide to ignore the NAs and want to calculate the mean anyway, you can do so by adding this argument to mean():

mean(mydata$var2, na.rm = TRUE)

## [1] 4.5

Adding na.rm = TRUE tells R that you are happy for it to calculate the mean of any existing values (but to remove - rm - the NA values). This ‘removal’ only excludes the NAs from the calculation; it does not affect the actual tibble (mydata) holding the dataset.
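The following sketch confirms that the data are left untouched, again using a stand-alone vector with the same values as mydata$var2. The name var2_complete is hypothetical and shows what an explicit, permanent removal would look like instead.

```r
var2 <- c(NA, 4, 5, NA)

# NAs are ignored for this one calculation only
mean_without_na <- mean(var2, na.rm = TRUE)
mean_without_na
## [1] 4.5

# The original values still include both NAs
var2
## [1] NA  4  5 NA

# Dropping NAs for good requires saving the result as an object
var2_complete <- var2[!is.na(var2)]
var2_complete
## [1] 4 5
```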

R is case sensitive, so you must type na.rm, not NA.rm. There is, however, no need to memorize exactly how the arguments of functions are spelled - this is what the Help tab is for (press F1 when the cursor is on the name of the function). Help pages are built into R, so an internet connection is not required for this.
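The same information can be reached from the Console. The sketch below opens the help page with help() and lists argument names with formals(). It looks at mean.default, the version of mean() that base R uses for ordinary vectors, because that is where na.rm is defined.

```r
# Open the help page for mean(); only done in an interactive session
if (interactive()) {
  help("mean")  # the shorthand ?mean does the same
}

# List the argument names of the default mean() method
mean_arguments <- names(formals(mean.default))
mean_arguments
## [1] "x"     "trim"  "na.rm" "..."
```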

Make sure to separate multiple arguments with commas, or R will give you the error message Error: unexpected symbol.
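To reproduce this error safely, the sketch below stores a line of code with a missing comma as text and asks R to check its syntax with parse(), catching the error with tryCatch(). The names broken_code and error_message are made up for the example.

```r
# Code with a missing comma, kept as text so this script still runs
broken_code <- "mean(mydata$var2 na.rm = TRUE)"

# parse() checks the syntax without running the code;
# tryCatch() captures the error message instead of stopping
error_message <- tryCatch(
  {
    parse(text = broken_code)
    "no error"
  },
  error = function(e) conditionMessage(e)
)

# The message points at the place where the comma is missing
grepl("unexpected symbol", error_message)
## [1] TRUE
```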

Finally, some functions do not need any arguments to work. A good example is Sys.time(), which returns the current time and date. This is useful when using R to generate and update reports automatically. Including this means you can always be clear on when the results were last updated.

Sys.time()

## [1] "2021-01-15 11:56:13 GMT"
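For a report, the time stamp can be turned into a short label. This is only a sketch: the wording of the label and the object names are made up, and format() with these date codes is one of several ways to do it. No expected output is shown because it depends on when the code is run.

```r
# Save the current date and time as an object
now <- Sys.time()

# Format it as year-month-day hour:minute and build a label
last_updated <- paste("Results last updated:", format(now, "%Y-%m-%d %H:%M"))
last_updated
```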

2.3.4 Working with objects

To save an object in our Environment we use the assignment arrow:

a <- 103

This reads: the object a is assigned the value 103. <- is called “the arrow assignment operator”, or “assignment arrow” for short.

Keyboard shortcuts to insert <-:
Windows: Alt and - (the minus key)
macOS: Option and - (the minus key)

You know that the assignment worked when it shows up in the Environment tab. If we now run a on its own, it gets printed back to us:

a

## [1] 103

Similarly, if we run a function without assignment to an object, it gets printed but not saved in your Environment:

seq(15, 30)

##  [1] 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

seq() is a function that creates a sequence of numbers (+1 by default) between the two arguments you pass to it in its brackets. We can assign the result of seq(15, 30) into an object, let’s call it example_sequence:

example_sequence <- seq(15, 30)

Doing this creates example_sequence in our Environment, but it does not print it. To get it printed, run it on a separate line like this:

example_sequence

##  [1] 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

If you save the results of an R function in an object, it does not get printed. If you run a function without the assignment (<-), its results get printed, but not saved as an object.
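The step of +1 mentioned above is only the default. As a short sketch, seq() also accepts a third argument, by, to change the step size; the object name steps_of_five is made up for this example.

```r
# Same start and end as before, but in steps of 5
steps_of_five <- seq(15, 30, by = 5)
steps_of_five
## [1] 15 20 25 30

# For steps of 1, the colon is a common shorthand for the same numbers
15:30
##  [1] 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
```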

Finally, R doesn’t mind overwriting an existing object, for example:

example_sequence <- example_sequence/2

example_sequence

##  [1]  7.5  8.0  8.5  9.0  9.5 10.0 10.5 11.0 11.5 12.0 12.5 13.0 13.5 14.0 14.5
## [16] 15.0

Notice how we then include the variable on a new line to get it printed as well as overwritten.
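Overwriting means the original values are gone unless the earlier code is run again. If both versions are needed, a simple alternative is to save the result under a new name, sketched here with the hypothetical name halved_sequence.

```r
example_sequence <- seq(15, 30)

# Save the result under a new name; example_sequence is left unchanged
halved_sequence <- example_sequence / 2

example_sequence[1]  # first value of the original
## [1] 15
halved_sequence[1]   # first value of the new object
## [1] 7.5
```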

2.3.5 `<-` and `=`

Note that many people use = instead of <-. Both <- and = can save what is on the right into the object named on the left. Although <- and = are interchangeable when saving an object into your Environment, they are not interchangeable when used to set a function argument. For example, remember how we used the na.rm argument in the mean() function, and the result got printed immediately? If we want to save the result into an object, we’ll do this, where mean_result could be any name you choose:

mean_result <- mean(mydata$var2, na.rm = TRUE)

Note how the example above uses both operators: the assignment arrow for saving the result to the Environment, the = equals operator for setting an argument in the mean() function (na.rm = TRUE).
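To show why the two operators are not interchangeable inside the brackets, the sketch below deliberately makes the mistake. It uses sum() rather than mean(), purely to keep the illustration simple; sum() is a base R function that also has an na.rm argument. The object names are made up, and the example assumes no object called na.rm exists beforehand.

```r
var2 <- c(NA, 4, 5, NA)

# Correct: = sets the na.rm argument, so the NAs are skipped
correct_sum <- sum(var2, na.rm = TRUE)
correct_sum
## [1] 9

# Mistake: <- creates a new object called na.rm and passes TRUE to
# sum() as one more value to add up, so the NAs are not removed
wrong_sum <- sum(var2, na.rm <- TRUE)
wrong_sum
## [1] NA

# The mistake also left an unwanted object behind
na.rm
## [1] TRUE
```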

2.3.6 Recap: object, function, input, argument

To summarise, objects and functions work hand in hand. Objects can be both the input and the output of a function (what the function returns).
When passing data to a function, it is usually the first argument, with further arguments used to specify behaviour.
When we say “the function returns”, we are referring to its output (or an Error if it’s one of those days).
The returned object can be different to its input object. In our mean() examples above, the input object was a column (mydata$var1: 4, 1, 2, 3), whereas the output was a single value: 2.5.
If you’ve written a line of code that doesn’t include the assignment arrow (<-), its results will get printed. If you use the assignment arrow, an object holding the results will get saved into the Environment.

⁵ c() stands for combine and will be introduced in more detail later in this chapter.
⁶ read.csv() comes with base R, whereas read_csv() comes from the readr package within the tidyverse. We recommend using read_csv().