12.7 File structure and workflow

As projects get bigger, it is important that they are well organised. This helps to avoid errors and makes collaboration easier.

What is absolutely compulsory is that your analysis resides within an RStudio Project and has a meaningful name (not MyProject! or Analysis1). Creating a New Project in RStudio will automatically create a new folder for it (unless you choose “Existing Folder”). Never work within a generic Home or Documents directory. Furthermore, do not change the working directory using setwd(): there is no reason to do this, and it usually makes your analysis less reproducible. Once you’re starting to get the hang of R, you should initiate all Projects with a Git repository for version control (see Chapter 13).

For smaller projects with 1-2 data files, a couple of scripts and an R Markdown document, it is fine to keep them all in the Project folder (but we repeat, each Project must have its own folder). Once the number of files grows beyond that, you should add separate folders for different types of files.

Here is our suggested approach. Based on the nature of your analyses, the number of folders may be smaller or greater than this, and they may be called something different.

proj/
- scripts/
- data_raw/
- data_processed/
- figures/
- 00_analysis.Rmd

- scripts/ contains all the .R script files used for data cleaning/preparation. If you only have a few scripts, it’s fine to not have this one and just keep the .R files in the project folder (where 00_analysis.Rmd is in the above example).
- data_raw/ contains all raw data, such as .csv files.
- data_processed/ contains data you’ve taken from raw, cleaned, modified, joined or otherwise changed using R scripts.
- figures/ may contain plots (e.g., .png, .jpg, .pdf).
- 00_analysis.Rmd or 00_analysis.R is the actual main working file, and we keep this in the main project directory.
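The layout above can also be set up from the R console rather than by hand. The following is a minimal sketch using base R only; the project name melanoma_analysis and the use of a temporary directory as the project root are assumptions made so the example can be run anywhere. In practice the root would be your RStudio Project folder.

```r
# Hypothetical project root: a temporary folder stands in for a real
# RStudio Project directory so the sketch can be run anywhere.
proj_root <- file.path(tempdir(), "melanoma_analysis")

# Folder names follow the suggested layout
folders <- c("scripts", "data_raw", "data_processed", "figures")

for (folder in folders) {
  dir.create(
    file.path(proj_root, folder),
    recursive = TRUE,      # also creates proj_root if it is missing
    showWarnings = FALSE   # stay quiet if the folder already exists
  )
}

# Create an empty main analysis file in the project root
analysis_file <- file.path(proj_root, "00_analysis.Rmd")
if (!file.exists(analysis_file)) file.create(analysis_file)

# Check the result
list.files(proj_root)
```

Because existing folders are left untouched, the sketch can be re-run safely if you add more folder names later.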

Your R scripts should be numbered using double digits, and they should have meaningful names, for example:

scripts/00_source_all.R
scripts/01_read_data.R
scripts/02_make_factors.R
scripts/03_duplicate_records.R
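One reason for double digits is that file browsers and list.files() order names as text, not as numbers. The sketch below illustrates this with a hypothetical tenth script, 10_final_checks.R, which is not part of the example project.

```r
# Script names without zero-padding (the third name is hypothetical)
unpadded <- c("1_read_data.R", "2_make_factors.R", "10_final_checks.R")

# method = "radix" sorts character vectors in the C locale, so the
# result does not depend on the machine's language settings
sort(unpadded, method = "radix")
# script 10 is listed before scripts 1 and 2

# Build double-digit prefixes with sprintf()
step  <- c(1L, 2L, 10L)
label <- c("read_data", "make_factors", "final_checks")
padded <- sprintf("%02d_%s.R", step, label)

sorted_padded <- sort(padded, method = "radix")
sorted_padded
# the scripts are now listed in the order they are meant to run
```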

For instance, 01_read_data.R may look like this.

# Melanoma project
## Data pull

# Get data
library(readr)
melanoma <- read_csv(
  here::here("data_raw", "melanoma.csv")
)

# Other basic recoding or renaming functions here

# Save
save(melanoma, file = 
  here::here("data_processed", "melanoma_working.rda")
)

Note the use of here::here(). RStudio projects manage working directories in a better way than setwd(). here::here() is useful when sharing projects between Linux, Mac and Windows machines, which have different conventions for file paths.

For instance, on a Mac you might otherwise write read_csv("data_raw/melanoma.csv"), whereas Windows paths are conventionally written with backslashes, which must be doubled inside an R string: read_csv("data_raw\\melanoma.csv"). Hard-coding either / (GNU/Linux, macOS) or \ (Windows) in your script means it may have to be changed by hand when running on a different system. here::here("data_raw", "melanoma.csv"), however, works on any system, as it inserts an appropriate separator ‘behind the scenes’ without you having to change anything.
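To see what building a path from its components looks like, here is a small sketch. It first uses base R's file.path(), which is our addition for illustration and joins components with a separator that R accepts on every platform. It then calls here::here(), which takes the components in the same way but starts from the project root. The here::here() call is guarded so that the example still runs if the here package is not installed.

```r
# Join path components without typing a separator by hand
rel_path <- file.path("data_raw", "melanoma.csv")
rel_path

# here::here() takes the same components but returns a path that
# starts at the project root
if (requireNamespace("here", quietly = TRUE)) {
  full_path <- here::here("data_raw", "melanoma.csv")
  print(full_path)
}
```

Note that nothing is read from disk here; the functions only construct the path, so the file does not need to exist for the sketch to run.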

02_make_factors.R is our example second file, but it could be anything you want. It could look something like this.

# Melanoma project
## Create factors
library(tidyverse)

load(
  here::here("data_processed", "melanoma_working.rda")
)

## Recode variables
melanoma <- melanoma %>%
  mutate(
    sex = factor(sex) %>% 
      fct_recode("Male" = "1", 
                 "Female" = "0")
  )

# Save
save(melanoma, file = 
  here::here("data_processed", "melanoma_working.rda")
)

All these scripts can then be run in order from a single file using source(), a function that runs the code contained in a file.

00_source_all.R might look like this:

# Melanoma project
## Source all

source( here::here("scripts", "01_read_data.R") )
source( here::here("scripts", "02_make_factors.R") ) 
source( here::here("scripts", "03_duplicate_records.R") ) 

# Save
save(melanoma, file = 
  here::here("data_processed", "melanoma_final.rda")
)
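If source() is new to you, the following toy example may help show how objects are passed from one script to the next. It is a sketch under stated assumptions: the two scripts are written to a temporary folder, the object toy is made up for illustration, and base R's factor() is used in place of fct_recode() so that no packages are needed.

```r
# Hypothetical scripts folder inside a temporary directory
scripts_dir <- file.path(tempdir(), "source_demo")
dir.create(scripts_dir, showWarnings = FALSE)

# Two toy scripts: the first creates an object, the second modifies it
writeLines(
  "toy <- data.frame(sex = c(1, 0, 1))",
  file.path(scripts_dir, "01_read_data.R")
)
writeLines(
  'toy$sex <- factor(toy$sex, levels = c(1, 0), labels = c("Male", "Female"))',
  file.path(scripts_dir, "02_make_factors.R")
)

# Run the scripts in order. local = demo_env keeps the toy objects out
# of your workspace; the scripts in this chapter use the default, which
# evaluates the code in the global environment.
demo_env <- new.env()
source(file.path(scripts_dir, "01_read_data.R"), local = demo_env)
source(file.path(scripts_dir, "02_make_factors.R"), local = demo_env)

# The object created by the first script has been changed by the second
demo_env$toy
```

The order matters: running 02_make_factors.R first would fail, because toy does not exist until 01_read_data.R has been sourced.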

You can now bring your robustly prepared data into your analysis file, which can be an .R script, or an .Rmd file if you are working in a Notebook. We call this 00_analysis.Rmd and it always sits in the project root directory. You have two options for bringing in the data.

- source("00_source_all.R") to re-load and process the data again
  - this is useful if the data is changing
  - may take a long time if it is a large dataset with lots of manipulations
- load("melanoma_final.rda") from the data_processed/ folder
  - usually quicker, but loads the dataset which was created the last time you ran 00_source_all.R

Remember: For .R files use source(), for .rda files use load().
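The save() and load() pair can be tried out safely on a toy object. In this sketch the data frame melanoma_toy and the temporary file are assumptions made for illustration; they stand in for the cleaned dataset and for data_processed/melanoma_final.rda.

```r
# Toy data standing in for the cleaned dataset
melanoma_toy <- data.frame(id = 1:3, thickness = c(0.8, 1.2, 3.5))

# Temporary file in place of data_processed/melanoma_final.rda
rda_file <- file.path(tempdir(), "melanoma_toy.rda")
save(melanoma_toy, file = rda_file)

# Remove the object, then restore it from the file
rm(melanoma_toy)
loaded_names <- load(rda_file)

# load() returns the names of the objects it restored
loaded_names

# The object is back under its original name
melanoma_toy
```

An .rda file stores objects together with their names, so load() recreates melanoma_toy without any assignment. This also means it will silently overwrite an existing object of the same name.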

The two options look like this:

---
title: "Melanoma analysis"
output: html_notebook
---

```{r get-data-option-1, echo=FALSE}
source(
  here::here("scripts", "00_source_all.R")
)
```

```{r get-data-option-2, echo=FALSE}
load(
  here::here("data_processed", "melanoma_final.rda")
)
```

12.7.1 Why go to all this bother?

This approach comes from many years of finding errors due to badly organised projects. It is not needed for a small quick project, but is essential for any major work.

At the very start of an analysis (as in the first day), we will start working in a single file. We will quickly move chunks of data cleaning / preparation code into separate files as we go.

Compartmentalising the data cleaning helps with finding and dealing with errors (‘debugging’). Sourced files can be ‘commented out’ (adding a # to a line in the 00_source_all.R file) if you wish to exclude the manipulations in that particular file.

Most importantly, it helps with collaboration. When multiple people are working on a project, it is essential that communication is good and everybody is working to the same overall plan.