RAP in Practice

Fit for the Future - Day 4

Alice Byers

Data Innovation Team, Data Division

16 November 2023

About me

  • Reproducible Analytical Pipeline (RAP) developer

  • Background in statistics and data analysis

Is RAP relevant to me and my team?

  • Would it be a nightmare to have to go back and rerun your process from the beginning if you found a mistake?

  • Do you have to make a lot of manual edits to code before each run?

  • Is there a lot of repetition in your code?

  • Would a new person find it difficult to understand the process?

Features of RAP

To achieve the full benefits, a RAP must, at a minimum:

  • Minimise manual steps

  • Be built using open-source software; e.g. R, Python

  • Be peer reviewed by colleagues

  • Be version controlled; e.g. Git

  • Be open to anyone; e.g. code published on GitHub

  • Follow good practice for quality assurance

  • Contain well-commented code and have documentation embedded

Organised folder structure


  • Keep everything you need to run your code within one repository

  • Plan ahead for how you’re going to organise future data submissions, outputs, etc.

Organised folder structure

A good place to start:

    my-project/
    ├── code/
    │   ├── 00_setup.R
    │   └── 01_clean_data.R
    ├── functions/
    ├── data/
    ├── lookups/
    ├── outputs/
    │   ├── 2022/
    │   └── 2023/
    └── README.md

Standardised naming conventions

  • Underscores and dashes instead of spaces

  • All lower case

  • Date stamp data and output files

  • Number R scripts in the order they run

  • Document the agreed naming convention

Standardised naming conventions


Examples:

  • 2023-11-16_attendance.rds

  • 2022_school-report.html

  • 01_prepare-data.R
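
One way to apply the convention consistently is to generate file names in code rather than typing them. The sketch below assumes a hypothetical helper, `stamped_name()`, which is not part of any package; it lower-cases the name, swaps spaces for dashes, and prefixes an ISO date.

```r
# Hypothetical helper: build a lower-case, date-stamped file name.
stamped_name <- function(name, ext, date = Sys.Date()) {
  # Trim the name, replace runs of spaces with a dash, convert to lower case
  clean <- tolower(gsub("\\s+", "-", trimws(name)))
  # Prefix the date in YYYY-MM-DD form so files sort chronologically
  paste0(format(as.Date(date), "%Y-%m-%d"), "_", clean, ".", ext)
}

stamped_name("attendance", "rds", date = "2023-11-16")
# "2023-11-16_attendance.rds"
stamped_name("School Report", "html", date = "2023-11-16")
# "2023-11-16_school-report.html"
```

Leaving `date` at its default stamps the file with today's date.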

Relative file paths

  • Avoid hard-coding file paths

  • Use RStudio Projects

  • Use the here package to define file paths relative to your project folder
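
As a minimal sketch, assuming the here package is installed and the script is run from inside the project, a path can be built from folder and file names. The file name is taken from the naming examples; the hard-coded path in the comment is made up for contrast.

```r
library(here)  # install.packages("here") if it is not yet installed

# Avoid (hypothetical hard-coded path that only works on one machine):
# readRDS("C:/Users/me/Documents/my-project/data/2023-11-16_attendance.rds")

# here() builds the path from the project root, which it finds by looking
# for markers such as an RStudio project file
attendance_path <- here("data", "2023-11-16_attendance.rds")

# The path can then be passed to a reader, for example:
# attendance <- readRDS(attendance_path)
```

Because the path is relative to the project root, the same script should run unchanged for a colleague who has the project saved somewhere else.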

Writing manual processes as code

  • Creating new folders

  • Updating parameters; e.g. dates, geographies

  • Creating outputs; e.g. data visualisation, reports, spreadsheets
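
The sketch below shows one way a manual step (creating this year's output folder, then saving a file into it) might be written as code. `save_output()` is a hypothetical helper, the data are made up, and a temporary folder stands in for the project folder.

```r
# Hypothetical helper: write a data frame to <root>/outputs/<year>/,
# creating the folder first if it does not exist yet.
save_output <- function(data, name, year, root = ".") {
  out_dir <- file.path(root, "outputs", as.character(year))
  # recursive = TRUE also creates "outputs/" if it is missing;
  # showWarnings = FALSE keeps re-runs quiet when the folder already exists
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
  out_path <- file.path(out_dir, paste0(year, "_", name, ".csv"))
  write.csv(data, out_path, row.names = FALSE)
  invisible(out_path)
}

# Made-up example data, written under a temporary folder
example_data <- data.frame(school = c("A", "B"), pupils = c(120, 95))
path <- save_output(example_data, "school-summary", year = 2023, root = tempdir())
file.exists(path)
# TRUE
```

Running the same call with `year = 2024` would create the new folder automatically, so nobody has to remember to do it by hand.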

Functions


  • Don’t repeat yourself

  • Use function arguments to re-use the function for ‘similar’ actions

  • Keep functions as simple as possible
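
As an illustration, a calculation that would otherwise be copied for each year can be wrapped in a single function. `attendance_rate()` and its arguments are hypothetical names chosen for this sketch.

```r
# Without a function, the same calculation is copied and edited each time:
# rate_2022 <- round(100 * attended_2022 / possible_2022, 1)
# rate_2023 <- round(100 * attended_2023 / possible_2023, 1)

# Hypothetical function: percentage of possible sessions attended.
# The digits argument lets the same function cover 'similar' cases.
attendance_rate <- function(attended, possible, digits = 1) {
  round(100 * attended / possible, digits)
}

attendance_rate(180, 190)
# 94.7
attendance_rate(45, 50, digits = 0)
# 90
```

If the calculation ever changes, it only needs to be updated in one place.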

Documentation

  • Code comments to provide context

  • Include a README file

    • Description of the process

    • Requirements and dependencies

    • Guidance to run the process

    • Contact details

Version control

  • Alternative to saving multiple copies of files to keep version history

  • When a change is made to a file, create a Git ‘commit’ to record:

    • what change was made,
    • when the change was made,
    • why the change was made, and
    • who made the change.

Open code

  • Host a Git repository on GitHub and make it public

  • Increase trust by making analysis transparent

  • Facilitate peer review

  • Make it easier for others to reuse code

Summary

  • Organised folder structure

  • Standardised naming conventions

  • Relative file paths

  • Writing manual processes as code

  • Functions

  • Documentation

  • Version control

  • Open code

Where to start

  • Would it be a nightmare to have to go back and rerun your process from the beginning if you found a mistake?

    • Open-source software
    • Git

  • Do you have to make a lot of manual edits to code before each run?

    • Set parameters in a setup script
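
A setup script might look something like the sketch below. All names and values are illustrative assumptions, including the reporting period being a calendar year; the idea is that anything that changes between runs is set once at the top and everything else is derived from it.

```r
# 00_setup.R (sketch): the only values to edit before each run
report_year <- 2023
geographies <- c("Area A", "Area B")  # made-up values

# Derived values: calculated from the parameters, never edited by hand
# (assumes the reporting period is the calendar year)
start_date <- as.Date(paste0(report_year, "-01-01"))
end_date   <- as.Date(paste0(report_year, "-12-31"))
output_dir <- file.path("outputs", as.character(report_year))
```

Later scripts can then refer to `report_year`, `start_date`, and `output_dir` instead of containing their own hard-coded values.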

Where to start

  • Is there a lot of repetition in your code?

    • Functions

  • Would a new person find it difficult to understand the process?

    • Documentation