11.7 File structure and workflow

As projects grow, good organisation becomes essential: it reduces errors and makes collaboration easier.

Here is our suggested approach.

All projects must reside within an RStudio Project that has a meaningful name (not MyProject!). Never work within the (Home) or root directory. Projects should be initiated with a Git repository for version control (see Chapter 13).

Structure your directory sensibly.

proj/
- scripts/
- data/
- doc/
- figs/
- 00_analysis.Rmd

scripts/ contains all the .R script files used for data cleaning/preparation.
data/ contains all raw data, such as .csv files.
doc/ contains all output documents and reports, such as .doc and .PDF files.
figs/ contains all figures and plots.
00_analysis.Rmd or 00_analysis.R is the actual main working file, and we keep this in the main project directory. We prefix this with 00 so it is always obvious which is the main file.
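
The layout above can be set up by hand, but a few lines of base R will also do it. The following is only a sketch: it assumes the working directory is already the project root (which is the case when the RStudio Project is open), and the variable names proj_root, folders and main_file are ours.

```r
# Assumption: the working directory is the project root
proj_root <- "."

# Folder names taken from the layout described above
folders <- c("scripts", "data", "doc", "figs")

for (folder in folders) {
  # showWarnings = FALSE keeps this quiet if the folder already exists
  dir.create(file.path(proj_root, folder), showWarnings = FALSE)
}

# Create an empty main analysis file only if one is not already there
main_file <- file.path(proj_root, "00_analysis.Rmd")
if (!file.exists(main_file)) {
  file.create(main_file)
}
```

Because existing folders and files are left alone, this is safe to re-run in a project that is already partly set up.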

11.7.1 Data cleaning

This is all done in specific .R files, which are kept in the scripts/ folder, and are clearly labelled, e.g.

scripts/0_source_all.R
scripts/01_data_upload.R
scripts/02_make_factors.R
scripts/03_duplicate_records.R

For instance, 01_data_upload.R may look like this.
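
A minimal sketch of such a file is given here. The object name mydata and the file names mydata.csv and mydata_working.rda are hypothetical placeholders, and the readr and here packages are assumed to be installed.

```r
# 01_data_upload.R
# Read the raw data and save a working copy.
# Assumption: data/mydata.csv is a hypothetical raw file.

library(readr) # provides read_csv()
library(here)  # provides here()

# here() builds the path starting from the project root
mydata <- read_csv(here("data", "mydata.csv"))

# Other basic recoding or renaming could go here

# Save a working copy for the next script to pick up
save(mydata, file = here("data", "mydata_working.rda"))
```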

Note the use of here::here(). The here package “just works” when it comes to finding your files. Do not change the working directory (using setwd()) from the project root - there is never a reason to do this. here::here() is particularly useful when sharing projects between Linux, Mac and Windows machines, which have different conventions for file paths.

02_make_factors.R is our example second file, but it could be anything you want. It could look something like this.
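
One possible version is sketched here. It assumes the hypothetical working file mydata_working.rda from the upload step, and a column called sex coded 0 and 1; both the column and its coding are illustrative assumptions, not something fixed by the project layout.

```r
# 02_make_factors.R
# Turn coded variables into labelled factors.

library(here)

# Pick up the working copy saved by 01_data_upload.R
load(here("data", "mydata_working.rda"))

# Assumption: sex is stored as 0/1 in the raw data
mydata$sex <- factor(
  mydata$sex,
  levels = c(0, 1),
  labels = c("Female", "Male")
)

# Overwrite the working copy so later scripts see the factors
save(mydata, file = here("data", "mydata_working.rda"))
```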

All these files can then be brought together in a single file using source(), a function that runs the code in a file.

0_source_all.R might look like this:
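
The sketch uses the script names listed earlier; the object mydata and the output file mydata_final.rda are hypothetical.

```r
# 0_source_all.R
# Run every cleaning script in order, then save the final dataset.

library(here)

source(here("scripts", "01_data_upload.R"))
source(here("scripts", "02_make_factors.R"))
source(here("scripts", "03_duplicate_records.R"))

# Assumption: the scripts above leave an object called mydata
# in the workspace
save(mydata, file = here("data", "mydata_final.rda"))
```

By default, source() evaluates each file in the global environment, so objects created by one script are available to the next.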

You can now bring your robustly prepared data into your analysis file, which can be .R or .Rmd if you are working in a Notebook. We call this 00_analysis.Rmd and it always sits in the project root directory. You have two options in bringing in the data.

  1. source() the data again
  • this is useful if the data is changing
  • may take a long time if it is a large dataset with lots of manipulations
  2. load() from the data/ folder
  • usually quicker, but loads the static dataset that was created the last time you ran 0_source_all.R

The two options look like this:
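
In this sketch, mydata_final.rda is a hypothetical file name standing in for whatever 0_source_all.R saves. In practice you would keep one option and comment out the other.

```r
library(here)

# Option 1: re-run the whole cleaning pipeline
source(here("scripts", "0_source_all.R"))

# Option 2: load the dataset saved by the last pipeline run
load(here("data", "mydata_final.rda"))
```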

Why go to all this bother?

It comes from many years of finding errors due to badly organised projects. It is clearly not needed for a small quick project, but is essential for any major work.

At the very start of an analysis (as in the first day), we will start working in a single file. We will quickly move chunks of data cleaning / preparation code into single files as we go.

Compartmentalising the data cleaning helps in debugging. Sourced files can be ‘commented out’ (adding a # to a line in the 0_source_all.R file) if you wish to exclude the manipulations in that particular file.

Most importantly, it helps with collaboration. When multiple people are working on a project, it is essential that communication is good and everybody is working to the same overall plan.