2.1 Reading data into R

Data usually comes in the form of a table, such as a spreadsheet or database. In the world of the tidyverse, a table read into R gets called a tibble.

A common format in which to receive data is CSV (comma separated values). CSV is an uncomplicated spreadsheet with no formatting. It is just a single table with rows and columns (no worksheets or formulas). Furthermore, you don’t need special software to quickly view a CSV file - a text editor will do, and that includes RStudio.
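To see that a CSV file really is just text, the short sketch below writes a tiny made-up table to a temporary file and prints its raw lines. The data and the name tmp_csv are illustrative assumptions, not part of the healthyr project.

```r
# Write a tiny, made-up CSV to a temporary file (illustrative data only)
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("id,sex,age",
             "1,Female,34",
             "2,Male,51"),
           tmp_csv)

# A CSV is plain text: each line is a row, commas separate the columns
raw_lines <- readLines(tmp_csv)
print(raw_lines)
```

This is the same content you would see when opening the file in a text editor.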

For example, look at “example_data.csv” in the healthyr project’s folder in Figure 2.1 (this is the Files pane at the bottom-right corner of your RStudio).

FIGURE 2.1: View or import a data file.

Clicking on a data file gives us two options: “View File” or “Import Dataset”.

We will show you how to use the Import Dataset interface in a bit, but for standard CSV files, we don’t usually bother with the Import interface and just type in (or copy from a previous script):

library(tidyverse)
example_data <- read_csv("example_data.csv")
View(example_data)

There are a couple of things to say about the first R code chunk of this book. First and foremost: do not panic. Yes, if you’re used to interacting with data by double-clicking on a spreadsheet that just opens up, then the above R code does seem a bit involved.

However, running the example above also has an immediate visual effect. As soon as you click Run (or press Ctrl+Enter/Command+Enter), the dataset shows up in your Environment and opens in a Viewer. You can have a look and scroll through it the same way you would in Excel or similar.

Here is what is actually going on in the R code above:

  • We load the tidyverse packages (as covered in the first chapter of this book).
  • We have a CSV file called “example_data.csv” and are using read_csv() to read it into R.
  • We are using the assignment arrow <- to save it into our Environment using the same name: example_data.
  • The View(example_data) line makes it pop up for us to view it. Alternatively, click on example_data in the Environment to achieve the exact same thing.

More about the assignment arrow (<-) and naming things in R is covered later in this chapter. Do not worry if everything is not crystal clear just now.
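If you do not have “example_data.csv” to hand, the following self-contained sketch goes through the same steps on a made-up file. The file contents and the temporary path are assumptions for illustration only; readr is the tidyverse package that provides read_csv().

```r
library(readr)  # loaded automatically by library(tidyverse)

# Hypothetical stand-in for "example_data.csv", written to a temporary file
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("id,sex,age", "1,Female,34", "2,Male,51"), tmp_csv)

# read_csv() returns a tibble; the arrow stores it under a name of our choice
example_data <- read_csv(tmp_csv, show_col_types = FALSE)

class(example_data)  # includes "tbl_df", i.e., a tibble
dim(example_data)    # number of rows and columns
example_data         # printing shows the column names and types
```

In an interactive RStudio session, View(example_data) would then open the same table in the Viewer.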

2.1.1 Import Dataset interface

In the read_csv() example above, we read in a file that was in a specific (but common) format.

However, if your file uses semicolons instead of commas as separators, or commas instead of dots as decimal marks, or a special number for missing values (e.g., 99), or anything else weird or complicated, then we need a different approach.

RStudio’s Import Dataset interface (Figure 2.1) can handle all of these and more.
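For orientation, here is a rough sketch of how such settings can be written directly in readr code. The file contents, the column names, and the choice of read_delim() with these particular arguments are assumptions for illustration; the code the Import Dataset window produces for your own file may look different.

```r
library(readr)

# Made-up file: semicolon-separated, decimal commas, 99 meaning "missing"
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("id;weight_kg;score",
             "1;70,5;4",
             "2;82,1;99"),
           tmp_csv)

special_data <- read_delim(
  tmp_csv,
  delim = ";",                                # columns separated by semicolons
  locale = locale(decimal_mark = ","),        # 70,5 is read as 70.5
  na = c("", "NA", "99"),                     # 99 is treated as missing in every column
  col_types = cols(.default = col_double())   # all three columns are numeric
)
special_data
```

Note that na applies to the whole file, so a genuine value of 99 in any column would also become missing; this is worth checking before relying on it.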

FIGURE 2.2: Import: Some of the special settings your data file might have.

FIGURE 2.3: After using the Import Dataset window, copy-paste the resulting code into your script.

After selecting the specific options to import a particular file, a friendly preview window will show whether R properly understands the format of your data.

DO NOT BE tempted to press the Import button.

Yes, this will read in your dataset once, but it means you have to reselect the options every time you come back to RStudio. Instead, copy-paste the code (e.g., Figure 2.3) into your R script. This way you can use it over and over again.

Ensuring that all steps of an analysis are recorded in scripts makes your workflow reproducible by your future self, colleagues, supervisors, and extraterrestrials.

The Import Dataset button can also help you to read in Excel, SPSS, Stata, or SAS files (instead of read_csv(), it will give you read_excel(), read_sav(), read_stata(), or read_sas()).
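As a small illustration of one of these functions, the sketch below writes a made-up table to a temporary Stata file and reads it back with read_stata(). It assumes the haven package is installed (it is installed alongside the tidyverse but has to be loaded separately); the data and the file path are hypothetical.

```r
library(haven)  # provides read_sav(), read_stata() and read_sas()

# Made-up data written to a temporary Stata file so the example is self-contained
tmp_dta <- tempfile(fileext = ".dta")
write_dta(data.frame(id = c(1, 2), age = c(34, 51)), tmp_dta)

# read_stata() reads the file back in as a tibble
stata_data <- read_stata(tmp_dta)
stata_data
```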

If you’ve used R before or are using older scripts passed by colleagues, you might see read.csv() rather than read_csv(). Note the dot rather than the underscore.

In short, read_csv() is faster and more predictable, and we recommend it for all new scripts.

In existing scripts that work and are tested, we do not recommend that you start replacing read.csv() with read_csv(). For instance, read_csv() handles categorical variables differently.2 An R script written using read.csv() might not work as expected any more if it is simply replaced with read_csv().

Do not start updating and possibly breaking existing R scripts by replacing base R functions with the tidyverse equivalents we show here. Do use the modern functions in any new code you write.
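The difference in how categorical variables are handled can be seen in a small sketch on a made-up file. Here stringsAsFactors = TRUE is set explicitly to mimic older scripts, because the read.csv() default changed to FALSE in R 4.0.0; the file contents are an illustrative assumption.

```r
library(readr)

# Made-up file with one text column
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("id,sex", "1,Female", "2,Male"), tmp_csv)

# Base R with the old default made explicit: strings become factors
old_style <- read.csv(tmp_csv, stringsAsFactors = TRUE)
class(old_style$sex)

# readr: strings stay as character vectors
new_style <- read_csv(tmp_csv, show_col_types = FALSE)
class(new_style$sex)
```

Code further down an old script that relies on sex being a factor (for example, by calling levels() on it) would behave differently after such a swap.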

2.1.2 Reading in the Global Burden of Disease example dataset

In the next few chapters of this book, we will be using the Global Burden of Disease datasets. The Global Burden of Disease Study (GBD) is the most comprehensive worldwide observational epidemiological study to date. It describes mortality and morbidity from major diseases, injuries and risk factors to health at global, national and regional levels.3

GBD data are publicly available from the IHME website. Table 2.1 and Figure 2.4 show a high-level version of the project data with just 3 variables: cause, year, and deaths_millions (the number of people, in millions, who die of each cause in a given year). Later, we will be using a longer dataset with different subgroups and we will show you how to summarise comprehensive datasets yourself.

gbd_short <- read_csv("data/global_burden_disease_cause-year.csv")

TABLE 2.1: Deaths per year from three broad disease categories (short version of the Global Burden of Disease example dataset).

year  cause                      deaths_millions
1990  Communicable diseases      15.36
1990  Injuries                    4.25
1990  Non-communicable diseases  26.71
1995  Communicable diseases      15.11
1995  Injuries                    4.53
1995  Non-communicable diseases  29.27
2000  Communicable diseases      14.81
2000  Injuries                    4.56
2000  Non-communicable diseases  31.01
2005  Communicable diseases      13.89
2005  Injuries                    4.49
2005  Non-communicable diseases  32.87
2010  Communicable diseases      12.51
2010  Injuries                    4.69
2010  Non-communicable diseases  35.43
2015  Communicable diseases      10.88
2015  Injuries                    4.46
2015  Non-communicable diseases  39.28
2017  Communicable diseases      10.38
2017  Injuries                    4.47
2017  Non-communicable diseases  40.89

FIGURE 2.4: Line and bar charts: Cause of death by year (GBD). Data in (B) are the same as (A) but stacked to show the total of all causes.
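If you do not have the GBD file downloaded yet, the sketch below reads the three 2017 rows of Table 2.1 from literal text instead of a file path, so you can see how the three variables arrive in R. The object name gbd_2017 and the use of I() to pass literal data (available in readr 2.0 and later) are choices made for this illustration.

```r
library(readr)

# The 2017 rows of Table 2.1, supplied as literal text instead of a file path
gbd_2017 <- read_csv(
  I("year,cause,deaths_millions
2017,Communicable diseases,10.38
2017,Injuries,4.47
2017,Non-communicable diseases,40.89"),
  show_col_types = FALSE
)

gbd_2017
sapply(gbd_2017, class)        # year and deaths_millions are numeric, cause is character
sum(gbd_2017$deaths_millions)  # total deaths (millions) across the three causes
```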

  2. read_csv() does not silently convert strings to factors, i.e., it behaves like read.csv() with stringsAsFactors = FALSE. For those not familiar with the terminology here - don’t worry, we will cover this in just a few sections.

  3. Global Burden of Disease Collaborative Network. Global Burden of Disease Study 2017 (GBD 2017) Results. Seattle, United States: Institute for Health Metrics and Evaluation (IHME), 2018. Available from http://ghdx.healthdata.org/gbd-results-tool.