Reproducible Analytical Pipelines
Alice Hannah
Data Innovation Team, Scottish Government
23 October 2024
Is RAP relevant to me?
Would it be a nightmare to have to go back and rerun your process from the beginning if you found a mistake?
Do you have to make a lot of manual edits to code before each run?
Is there a lot of repetition in your code?
Would a new person find it difficult to understand the process?
What is RAP?
Automated statistical and analytical processes that are:
Reproducible
Auditable
Efficient
High quality
Features of RAP
To achieve these benefits, a RAP must, at a minimum:
Minimise manual steps
Be built using open-source software, e.g. R or Python
Be peer reviewed by colleagues
Be version controlled, e.g. with Git
Be open to anyone, e.g. with code published on GitHub
Follow good practice for quality assurance
Contain well-commented code and have documentation embedded
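As an illustration of the quality assurance point above, a check that was previously done by eye can be written as a small, documented function and rerun on every update. This is a minimal sketch only: the function name `check_rows` and the column names are hypothetical and are not taken from any Scottish Government pipeline.

```python
def check_rows(rows, required_columns):
    """Return a list of human-readable problems found in the data.

    rows: a list of dictionaries, one per record.
    required_columns: column names that must be present and non-empty.
    An empty list means every check passed.
    """
    problems = []
    for row_number, row in enumerate(rows, start=1):
        for column in required_columns:
            value = row.get(column)
            # Treat a missing key, None, or a blank string as a missing value.
            if value is None or str(value).strip() == "":
                problems.append(f"row {row_number}: missing value for '{column}'")
    return problems


# Hypothetical example data: the second record has no school identifier.
example_rows = [
    {"school_id": "1001", "pupils": "250"},
    {"school_id": "", "pupils": "180"},
]
example_problems = check_rows(example_rows, ["school_id", "pupils"])
```

Because the check is code, it leaves an auditable record of what was tested and can be peer reviewed like any other part of the pipeline.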
RAP Strategy
Case Study - Existing Process
School Information Dashboards
10 data sources
Data cleaned, linked and analysed manually in Excel
Dashboards created in Tableau
Updated twice a year; each update took three weeks of work for three statisticians, and longer if errors were found
Case Study - Planning
Engage with SG RAP support team
Define aims – what will success look like?
Create mock-ups of what the dashboards would look like
Plan how best to structure datasets
Work with data providers to improve process
Case Study - RAP Principles Applied
Organised folder structure
Writing manual processes as code
Standardised naming conventions
Functions
Open-source software
Version control using Git
Relative file paths
Open code on GitHub
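Several of the principles above (manual processes written as code, functions, relative file paths, an organised folder structure) can be sketched together in a few lines. The example below is an assumption-laden illustration, not the code used in the case study: the folder names, the `school_name` column and all function names are hypothetical.

```python
import csv
from pathlib import Path


def data_path(project_root: Path, *parts: str) -> Path:
    """Build a path inside the project's data folder.

    Paths are always relative to the project root, so the code runs
    unchanged on any colleague's machine (hypothetical folder layout).
    """
    return project_root.joinpath("data", *parts)


def clean_school_name(name: str) -> str:
    """Strip leading/trailing spaces and collapse repeated whitespace."""
    return " ".join(name.split())


def clean_rows(rows):
    """Apply the same cleaning rule to every record, replacing manual edits."""
    return [
        {**row, "school_name": clean_school_name(row["school_name"])}
        for row in rows
    ]


def run_cleaning_step(raw_file: Path, clean_file: Path) -> int:
    """Read a raw CSV, clean it, write the result, and return the row count."""
    with raw_file.open(newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames or []
        rows = list(reader)
    cleaned = clean_rows(rows)
    # Create the output folder if it does not exist yet.
    clean_file.parent.mkdir(parents=True, exist_ok=True)
    with clean_file.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(cleaned)
    return len(cleaned)
```

A run might then look like `run_cleaning_step(data_path(root, "raw", "schools.csv"), data_path(root, "clean", "schools.csv"))`, where `root` is the project folder. Keeping each rule in one function means a fix is made once and applied to every data source on the next run.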
Case Study - Result
Where to start