2.3 Objects and functions
There are two fundamental concepts in statistical programming that are important to get straight: objects and functions. The most common object you will be working with is a dataset. This is usually something with rows and columns, much like the example in Table 2.3.
To get the small, made-up “dataset” into your Environment, copy and run this code:
library(tidyverse)

mydata <- tibble(
  id   = 1:4,
  sex  = c("Male", "Female", "Female", "Male"),
  var1 = c(4, 1, 2, 3),
  var2 = c(NA, 4, 5, NA),
  var3 = c(2, 1, NA, NA)
)
Data can live anywhere: on paper, in a spreadsheet, in an SQL database, or in your R Environment. We usually start and interact with R through RStudio, but everything we talk about here (objects, functions, environment) also works when R is available but RStudio is not. This can be the case if you are working on a supercomputer that can only serve the R Console and not RStudio.
So, regularly shaped data in rows and columns is called a table when it lives outside R, but once you read/import it into R it gets called a tibble.
If you’ve used R before, or get given a piece of code that uses read.csv() instead of read_csv(), you’ll have come across the term data frame.
A tibble is the modern/tidyverse version of a data frame in R.
In most cases, data frames and tibbles work interchangeably, but tibbles often work better.
Another great alternative to base R data frames is the data.table, provided by the data.table package.
In this book, and for most of our day-to-day work these days, we will use tibbles.
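To see how closely the two classes are related, here is a minimal sketch. The object names df and tb and the toy columns are made up for illustration, and the code assumes the tibble package (installed as part of the tidyverse) is available.

```r
library(tibble)

# A base R data frame with two made-up columns (hypothetical example data)
df <- data.frame(id = 1:3, score = c(2.5, 3.1, 4.0))

# Convert it to a tibble; the data stay the same, only the class changes
tb <- as_tibble(df)

class(df)           # a plain data frame
class(tb)           # a tibble, which is still a data frame underneath
is.data.frame(tb)   # TRUE, which is why most functions accept either
```

Because a tibble keeps the data frame class, code written for data frames will usually run on tibbles too.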
2.3.2 Naming objects
When you read data into R, you want it to show up in the Environment tab. Everything in your Environment needs to have a name. You will likely have many objects, such as tibbles, in use at the same time. Note that tibble is what the thing is, rather than its name. This is the ‘class’ of an object.
To keep our code examples easy to follow, we call our example tibble mydata.
In a real analysis, you should give your tibbles meaningful names, e.g., patient_data or lab_results.
Object names can’t have spaces in them, which is why we use the underscore (_) to separate words.
Object names can include numbers, but they can’t start with a number: so labdata2019 works, 2019labdata does not.
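The naming rules can be checked directly. The sketch below uses made-up object names and values; parse() reads a line of code without running it, and try() catches the error so that the script keeps going.

```r
# Valid: the name starts with a letter; underscores separate words
labdata2019 <- c(5.1, 4.8)
lab_data_2019 <- c(5.1, 4.8)

# Invalid: a name starting with a number cannot even be parsed.
# parse() reads code without running it; try() catches the resulting error.
attempt <- try(parse(text = "2019labdata <- c(5.1, 4.8)"), silent = TRUE)
inherits(attempt, "try-error")   # TRUE: R rejected the name
```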
So, the tibble named
mydata is an example of an object that can be in the Environment of your R Session:
mydata
## # A tibble: 4 x 5
##      id sex     var1  var2  var3
##   <int> <chr>  <dbl> <dbl> <dbl>
## 1     1 Male       4    NA     2
## 2     2 Female     1     4     1
## 3     3 Female     2     5    NA
## 4     4 Male       3    NA    NA
2.3.3 Function and its arguments
A function is a procedure which takes some information (input), does something to it, and passes back the modified information (output).
A simple function that can be applied to numeric data is mean().
R functions always have round brackets after their name.
This is for two reasons.
First, it easily differentiates them as functions - you will get used to reading them like this.
Second, and more importantly, we can put arguments in these brackets.
Arguments can also be thought of as input.
In data analysis, the most common input for a function is data.
For instance, we need to give
mean() some data to average over.
It does not make sense (nor will it work) to feed
mean() the whole tibble with multiple columns, including patient IDs and a categorical variable (sex).
To quickly extract a single column, we use the $ symbol like this:
mydata$var1
## [1] 4 1 2 3
You can ignore the ## [1] at the beginning of the extracted values. It becomes more useful when printing multiple lines of data, as the number in the square brackets keeps count of how many values we are seeing.
We can then use
mydata$var1 as the first argument of
mean() by putting it inside its brackets:
mean(mydata$var1)
## [1] 2.5
which tells us that the mean of
var1 (4, 1, 2, 3) is 2.5.
In this example, mydata$var1 is the first and only argument to mean().
But what happens if we try to calculate the average value of
var2 (NA, 4, 5, NA) (remember,
NA stands for Not Applicable/Available and is used to denote missing data):
mean(mydata$var2)
## [1] NA
So why does mean(mydata$var2) return NA (“not available”) rather than the mean of the values included in this column?
That is because the column includes missing values (NAs), and R does not want to average over NAs implicitly.
It is being cautious: what if you didn’t know there were missing values for some patients?
If you wanted to compare the means of var1 and var2 without any further filtering, you would be comparing samples of different sizes.
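One way to see the difference in sample sizes is to count the missing values before averaging. This is a minimal sketch that copies the values of var1 and var2 into plain vectors so that it runs on its own; is.na() and sum() are base R functions.

```r
# The same values as mydata$var1 and mydata$var2, as plain vectors
var1 <- c(4, 1, 2, 3)
var2 <- c(NA, 4, 5, NA)

is.na(var2)         # TRUE where a value is missing
sum(is.na(var2))    # number of missing values: 2
sum(!is.na(var1))   # values available in var1: 4
sum(!is.na(var2))   # values available in var2: 2
```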
We might expect to see an NA if we tried to, for example, calculate the average of sex.
And this is indeed the case:
mean(mydata$sex)
## Warning in mean.default(mydata$sex): argument is not numeric or logical:
## returning NA
## [1] NA
R also gives us a pretty clear Warning saying that it can’t compute the mean of an argument that is not numeric or logical. The sentence makes for fun reading, as if R were saying that it is not logical to calculate the mean of something that is not numeric.
What R is actually saying is that it is happy to calculate the mean of two types of variables, numerics or logicals, but what you have passed is neither.
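To illustrate the second type that mean() accepts, here is a small sketch with a made-up logical vector (responded is a hypothetical name, not part of mydata).

```r
# Hypothetical logical vector: did each of four patients respond?
responded <- c(TRUE, FALSE, TRUE, TRUE)

# TRUE counts as 1 and FALSE as 0, so the mean is the proportion of TRUEs
mean(responded)   # 0.75
```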
If you decide to ignore the NAs and want to calculate the mean anyway, you can do so by adding this argument to mean():
mean(mydata$var2, na.rm = TRUE)
## [1] 4.5
na.rm = TRUE tells R that you are happy for it to calculate the mean of any existing values (but to remove, rm, the NA values).
This ‘removal’ excludes the NAs from the calculation only; it does not affect the actual tibble (mydata) holding the dataset.
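A quick way to convince yourself that nothing is deleted is to look at the data again after the calculation. The sketch below copies the values of var2 into a plain vector so that it runs on its own.

```r
var2 <- c(NA, 4, 5, NA)   # same values as mydata$var2

mean(var2, na.rm = TRUE)  # 4.5: the NAs are skipped for this calculation only
var2                      # still NA 4 5 NA: the stored values are untouched
```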
R is case sensitive, so the argument must be typed as na.rm, not NA.rm.
There is, however, no need to memorize exactly how the arguments of functions are spelled: this is what the Help tab is for (press F1 when the cursor is on the name of the function).
Help pages are built into R, so an internet connection is not required for this.
Make sure to separate multiple arguments with commas or R will give you an error of
Error: unexpected symbol.
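The sketch below shows the same call with and without the comma. It is only an illustration: the faulty line is passed to parse() as text, so R reads it without running it, and try() captures the error rather than stopping the script.

```r
var2 <- c(NA, 4, 5, NA)   # same values as mydata$var2

# Correct: the two arguments are separated by a comma
mean(var2, na.rm = TRUE)

# Incorrect: the comma is missing. parse() only reads the code, so we can
# capture the error message with try() instead of stopping the script.
attempt <- try(parse(text = "mean(var2 na.rm = TRUE)"), silent = TRUE)
cat(attempt)   # the message mentions "unexpected symbol"
```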
Finally, some functions do not need any arguments to work.
A good example is Sys.time(), which returns the current time and date.
This is useful when using R to generate and update reports automatically.
Including this means you can always be clear on when the results were last updated.
Sys.time()
## [1] "2021-01-15 11:56:13 GMT"
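As a sketch of how this might look in a report, the time can be formatted and combined with a label. The object name timestamp and the wording of the label are made up for illustration; format() and paste() are base R functions.

```r
# Format the current time as text, e.g. "2021-01-15 11:56"
timestamp <- format(Sys.time(), "%Y-%m-%d %H:%M")

# Combine it with a label to print in a report
paste("Results last updated:", timestamp)
```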
2.3.4 Working with objects
To save an object in our Environment we use the assignment arrow:
a <- 103
This reads: the object a is assigned the value 103.
<- is called “the arrow assignment operator”, or “assignment arrow” for short.
Keyboard shortcuts to insert <- in RStudio: Alt and - (minus) on Windows and Linux, Option and - on macOS.
You know that the assignment worked when it shows up in the Environment tab.
If we now run
a just on its own, it gets printed back to us:
a
## [1] 103
Similarly, if we run a function without assignment to an object, it gets printed but not saved in your Environment:
seq(15, 30)
##  [1] 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
seq() is a function that creates a sequence of numbers (+1 by default) between the two arguments you pass to it in its brackets.
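The step size can be changed with the by argument of seq(). A short sketch:

```r
seq(15, 30)            # default step of 1: 15, 16, ..., 30
seq(15, 30, by = 5)    # step of 5: 15 20 25 30
seq(30, 15, by = -5)   # counting down: 30 25 20 15
```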
We can assign the result of seq(15, 30) into an object, let’s call it example_sequence:
example_sequence <- seq(15, 30)
Doing this creates
example_sequence in our Environment, but it does not print it.
To get it printed, run it on a separate line like this:
example_sequence
##  [1] 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
If you save the results of an R function in an object, it does not get printed. If you run a function without the assignment (
<-), its results get printed, but not saved as an object.
Finally, R doesn’t mind overwriting an existing object, for example:
example_sequence <- example_sequence/2
example_sequence
##  [1]  7.5  8.0  8.5  9.0  9.5 10.0 10.5 11.0 11.5 12.0 12.5 13.0 13.5 14.0 14.5
## [16] 15.0
Notice how we then include the variable on a new line to get it printed as well as overwritten.
Note that many people use = instead of <-.
Both <- and = can save what is on the right into an object named on the left.
Although <- and = are interchangeable when saving an object into your Environment, they are not interchangeable when used as a function argument.
For example, remember how we used the
na.rm argument in the
mean() function, and the result got printed immediately?
If we want to save the result into an object, we’ll do this, where
mean_result could be any name you choose:
mean_result <- mean(mydata$var2, na.rm = TRUE)
Note how the example above uses both operators: the assignment arrow for saving the result to the Environment, the
= equals operator for setting an argument in the
mean() function (
na.rm = TRUE).
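The difference between the two uses of = can be checked with a short sketch. The values of var2 are copied into a plain vector so that the code runs on its own, and mean_result2 is a made-up name used only for the comparison.

```r
var2 <- c(NA, 4, 5, NA)   # same values as mydata$var2

# Both lines save the result; <- is the convention used in this book
mean_result  <- mean(var2, na.rm = TRUE)
mean_result2 = mean(var2, na.rm = TRUE)
identical(mean_result, mean_result2)   # TRUE

# Inside the brackets, = only sets the argument: no object called na.rm
# is created by the calls above
exists("na.rm")   # FALSE in a fresh session
```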
2.3.6 Recap: object, function, input, argument
To summarise, objects and functions work hand in hand. Objects are both the input and the output of a function (what the function returns).
When passing data to a function, it is usually the first argument, with further arguments used to specify behaviour.
When we say “the function returns”, we are referring to its output (or an Error if it’s one of those days).
The returned object can be different to its input object. In our
mean() examples above, the input object was a column (
mydata$var1: 4, 1, 2, 3), whereas the output was a single value: 2.5.
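The difference between input and output can be inspected with length(). A minimal sketch, again with the values of var1 copied into a plain vector and result as a made-up object name:

```r
var1 <- c(4, 1, 2, 3)       # same values as mydata$var1
result <- mean(var1)

length(var1)     # 4: the input is a vector of four values
length(result)   # 1: the output is a single value
result           # 2.5
```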
If you’ve written a line of code that doesn’t include the assignment arrow (
<-), its results would get printed. If you use the assignment arrow, an object holding the results will get saved into the Environment.