Fit for the Future - Day 4
Alice Byers
Data Innovation Team, Data Division
16 November 2023
Reproducible Analytical Pipeline (RAP) developer
Background in statistics and data analysis
Would it be a nightmare to have to go back and rerun your process from the beginning if you found a mistake?
Do you have to make a lot of manual edits to code before each run?
Is there a lot of repetition in your code?
Would a new person find it difficult to understand the process?
In order to achieve the full benefits, at a minimum a RAP must:
Minimise manual steps
Be built using open-source software; e.g. R, Python
Be peer reviewed by colleagues
Be version controlled; e.g. Git
Be open to anyone; e.g. code published on GitHub
Follow good practice for quality assurance
Contain well-commented code and have documentation embedded
Keep everything you need to run your code within one repository
Plan ahead for how you’re going to organise future data submissions, outputs, etc.
A good place to start:
my-project/
├── code/
│ ├── 00_setup.R
│ └── 01_clean_data.R
├── functions/
├── data/
├── lookups/
├── outputs/
│ ├── 2022/
│ └── 2023/
└── README.md
Underscores and dashes instead of spaces
All lower case
Date stamp data and output files
Number R scripts
Document the agreed naming convention
Examples:
2023-11-16_attendance.rds
2022_school-report.html
01_prepare-data.R
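One way to apply the naming convention consistently is to build file names in code rather than typing them by hand. The sketch below is illustrative only: `stamp_file_name()` is a hypothetical helper, not part of any package, and it assumes the convention is ISO date, underscore, lower-case name with dashes.

```r
# Hypothetical helper: build a date-stamped, lower-case file name
stamp_file_name <- function(name, ext, date = Sys.Date()) {
  # Replace spaces with dashes and force lower case
  clean_name <- tolower(gsub("\\s+", "-", trimws(name)))
  # Prefix with the date in YYYY-MM-DD format so files sort chronologically
  paste0(format(date, "%Y-%m-%d"), "_", clean_name, ".", ext)
}

example_name <- stamp_file_name("Attendance", "rds", date = as.Date("2023-11-16"))
example_name
# "2023-11-16_attendance.rds"
```

Leaving `date` at its default stamps the file with today's date.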
Avoid hard-coding file paths
Use RStudio Projects
Use the here package to define file paths relative to your project folder
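As a minimal sketch of relative paths, assuming the here package is installed and the script is run from inside an RStudio Project laid out like the folder structure shown earlier (the file name is taken from the naming examples, not from a real dataset):

```r
library(here)

# Build a path relative to the project root instead of hard-coding
# something like "C:/Users/me/my-project/data/..."
attendance_path <- here("data", "2023-11-16_attendance.rds")

# Read the file only if it exists, so this sketch also runs in an empty project
if (file.exists(attendance_path)) {
  attendance <- readRDS(attendance_path)
}
```

Because the path is built from the project root, the same script should run unchanged on a colleague's machine.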
Creating new folders
Updating parameters; e.g. dates, geographies
Creating outputs; e.g. data visualisation, reports, spreadsheets
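The first two of these steps can be sketched in a few lines of base R. The names `report_year` and `outputs_root` are illustrative assumptions; the idea is that a parameter is set once at the top of a script and everything else is derived from it.

```r
# Parameters set once at the top of the script (example values)
report_year <- 2023
outputs_root <- "outputs"  # assumed to match the project's folder structure

# Create the output folder for this run if it does not already exist
output_dir <- file.path(outputs_root, report_year)
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)

# Build the report path from the same parameter, so nothing is edited by hand
report_path <- file.path(output_dir, paste0(report_year, "_school-report.html"))
```

For next year's run, only `report_year` needs to change.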
Don’t repeat yourself
Use function arguments to re-use the function for ‘similar’ actions
Keep functions as simple as possible
Use code comments to provide context
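A minimal sketch of these points, using base R only. `summarise_by()` is a hypothetical helper and the data is made up for illustration; the arguments let one function replace several near-identical copies of the same code.

```r
# Hypothetical helper: mean of a numeric column within each group.
# The column names are arguments, so the same function can be reused
# for 'similar' summaries instead of copying and editing code.
summarise_by <- function(data, group_col, value_col) {
  result <- aggregate(data[[value_col]], by = list(data[[group_col]]), FUN = mean)
  names(result) <- c(group_col, paste0("mean_", value_col))
  result
}

# Made-up example data
attendance <- data.frame(
  school = c("A", "A", "B"),
  year = c(2022, 2023, 2023),
  rate = c(90, 94, 80)
)

by_school <- summarise_by(attendance, "school", "rate")
by_year <- summarise_by(attendance, "year", "rate")
```

The comments explain why the function exists rather than restating each line, which is the kind of context a new person needs.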
Include a README file
Description of the process
Requirements and dependencies
Guidance to run the process
Contact details
Version control with Git is an alternative to saving multiple copies of files to keep version history
When a change is made to a file, create a Git ‘commit’ to record what changed, who changed it, and when, along with a message explaining why
Host a Git repository on GitHub and make it public
Increase trust by making analysis transparent
Facilitate peer review
Make it easier for others to reuse code
Organised folder structure
Standardised naming conventions
Relative file paths
Writing manual processes as code
Civil Service RAP Strategy and Scottish Government Implementation Plan
Blog: How we saved 3 analysts 6 weeks of copying and pasting
Email me – I’m always happy to talk about RAP!