Reproducible Analytical Pipelines

Laurie Platt, Giles Robinson, & Roland Lovatt

January 2024

Show & Tell series

Show & Tell outline


  1. What is RAP?
  2. RAP history & resources
  3. SCC example (Giles)
  4. RAP & PAS (Roland)
  5. RAP at Sheffield City Council
  6. Q&A (facilitated by Giles)

a. What is RAP?



“Reproducible Analytical Pipelines (RAP) is a cross-government movement promoted by the Government Analysis Function. It is a way of doing analysis so that it meets these principles, making processes more open and robust, enabling better quality assurance, improving knowledge management and business continuity. This ultimately increases the quality and trustworthiness of our analytical publications and facilitates innovation and collaboration within and outside government.”1

What is an Analytical Pipeline?

The ideal scenario


What is an Analytical Pipeline?

The reality…

What are the issues?


  • Lots of manual steps
  • Hard to reproduce
  • Mistakes are easily made and hard to track
  • The steps aren’t recorded
  • Using multiple independent tools
  • How do we keep track of which file versions people have?

What do we mean by reproducible?


We want to look back and be able to repeat our work easily and quickly.

What are the benefits?

  • Helps build trust
  • Not reliant on a single individual
  • Can be adapted and re-used

What is a Reproducible Analytical Pipeline (RAP)?


  • It is easily repeatable
  • It is easily extendable
  • It is automated
  • It minimises mistakes
  • It is fast
  • It builds trust

Principles & practices


Baseline RAP - getting the fundamentals right

  1. Minimises manual steps, for example copy-paste, point-click or drag-drop operations.
  2. Built using an open source code language e.g. R, Python, SQL.
  3. Code is version controlled e.g. Git.
  4. Code is peer reviewed.
  5. Documentation embedded and version controlled within the product e.g. repository includes a README.md file.
  6. Code is published in the open e.g. GitHub.

Further levels of RAP


Silver - implementing best practice

Reusable functions
Testing framework
Coding standards
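
A minimal Python sketch of a reusable function paired with a test. The function name `percentage_change` and the figures are hypothetical and not taken from any SCC pipeline. In practice the test would probably live in its own file and be run by a testing framework such as pytest, rather than being called directly.

```python
def percentage_change(previous: float, current: float) -> float:
    """Return the percentage change from previous to current."""
    if previous == 0:
        raise ValueError("previous must be non-zero")
    return (current - previous) / previous * 100.0


def test_percentage_change() -> None:
    # 200 -> 250 is a 25% increase
    assert percentage_change(200, 250) == 25.0
    # 80 -> 60 is a 25% decrease
    assert percentage_change(80, 60) == -25.0


# Run the test directly; a test runner would normally discover and call it.
test_percentage_change()
```

Writing the calculation once and testing it means every report that needs it can reuse the same checked logic instead of a copied formula.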


Gold - analysis as a product

CI/CD (continuous integration and continuous delivery/deployment)
Semantic versioning (MAJOR.MINOR.PATCH e.g. 1.4.1 < 2.0.0)
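
The version ordering above can be sketched in Python by comparing the three parts as numbers. The helper `parse_version` is a hypothetical name, and this sketch assumes plain MAJOR.MINOR.PATCH strings with no pre-release or build suffixes.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Split 'MAJOR.MINOR.PATCH' into a tuple of three integers."""
    major, minor, patch = version.split(".")
    return int(major), int(minor), int(patch)


# Tuples compare element by element, so MAJOR outranks MINOR, which outranks PATCH.
assert parse_version("1.4.1") < parse_version("2.0.0")

# Comparing numbers avoids the trap of comparing version strings as text.
assert parse_version("1.9.0") < parse_version("1.10.0")
assert "1.10.0" < "1.9.0"  # plain string comparison gets the order wrong
```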

b. RAP history and resources


2017

A number of government articles promoting open code, and what should and should not be published.

Matt Upson wrote the original blog post on RAP and helped develop “the first RAP”.


2018

The Government Digital Service (GDS) Data Science team continued to develop RAP before it moved across to the ONS/Analysis Function.

Matt Gregory put together the original RAP companion and an Introduction to RAP online course.

DevOps and reproducible research


The original blog post on RAP took:

“inspiration from the fields of DevOps and reproducible research”.1


The principles and practices of reproducible research are superbly set out in the Turing Way book, from the Alan Turing Institute.


2022

The Goldacre Review discusses RAP extensively.

Government RAP strategy is launched. The vision includes:

“Analytical teams in public sector organisations choose to deliver their analysis using the RAP principles by default.”


2023

Government departments (MOD, DfE, ONS etc.) publish RAP implementation plans.

Resources


Government Analysis Function
analysisfunction.civilservice.gov.uk/support/reproducible-analytical-pipelines


NHS Digital
nhsdigital.github.io/rap-community-of-practice


Free book by Bruno Rodrigues
raps-with-r.dev


Some other resources are listed here
scc-pi.github.io/pinsheff/rap.html#further-resources-rap

c. SCC example

Children’s Early Help - positives


End to end data flow:

Data model script: ingest -> combine -> clean -> process -> save
Analysis script: load -> analyse -> visualise -> publish / share
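
The two-script flow above could be sketched in Python roughly as follows. This is an illustration only, not the Children's Early Help code: the sample data, column names and every function name are hypothetical, and the visualise and publish steps are left out to keep it short. It uses only the standard library.

```python
import csv
import io
import tempfile
from pathlib import Path

# Hypothetical raw extracts standing in for two source systems.
SOURCE_A = "id,area,visits\n1,North,3\n2,South,\n"
SOURCE_B = "id,area,visits\n3,north,5\n4,South,2\n"


def ingest(text: str) -> list[dict]:
    """Read one CSV extract into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def combine(*tables: list[dict]) -> list[dict]:
    """Stack the extracts into one table."""
    return [row for table in tables for row in table]


def clean(rows: list[dict]) -> list[dict]:
    """Drop rows with missing visits and standardise area names."""
    return [
        {"id": int(r["id"]), "area": r["area"].title(), "visits": int(r["visits"])}
        for r in rows
        if r["visits"] != ""
    ]


def process(rows: list[dict]) -> dict[str, int]:
    """Total visits per area."""
    totals: dict[str, int] = {}
    for r in rows:
        totals[r["area"]] = totals.get(r["area"], 0) + r["visits"]
    return totals


def save(totals: dict[str, int], path: Path) -> None:
    """Write the data model to disk for the analysis script to load."""
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["area", "visits"])
        writer.writerows(sorted(totals.items()))


def load(path: Path) -> list[dict]:
    """Read the saved data model back in."""
    with path.open(newline="") as f:
        return list(csv.DictReader(f))


def analyse(rows: list[dict]) -> str:
    """Return the area with the most visits."""
    return max(rows, key=lambda r: int(r["visits"]))["area"]


def run_pipeline(path: Path) -> str:
    # Data model script: ingest -> combine -> clean -> process -> save
    save(process(clean(combine(ingest(SOURCE_A), ingest(SOURCE_B)))), path)
    # Analysis script: load -> analyse
    return analyse(load(path))


with tempfile.TemporaryDirectory() as tmp:
    print(run_pipeline(Path(tmp) / "data_model.csv"))
```

Keeping each step as its own small function means any single stage can be tested, replaced or re-run without touching the others.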


Modular functions


GitHub benefits:

Hosting
Collaboration
Track changes
Rolling back
Branching to develop & test distinct features

Children’s Early Help – lessons learned?


Too many cooks?

we ended up with something too big & too complicated
it’s easier to add code than take it out
solutions: annotate, peer review, be more ruthless


Debugging is hard

Children’s Early Help - RAP assessment

d. RAP and PAS

Pipeline evolution in PAS


Historically:

  • Large & complex databases
  • Varied data sources/agreements across government and services
  • Significant reliance upon specialist individual knowledge sets
  • Large numbers of specialised work programmes
  • Many manual processes
  • Narrow & valuable interpersonal networks
  • Use of a wide range of analytical tools – some becoming ‘legacy’

Sheffield Inclusion Centre report example

A half-termly multi-page report developed by a former colleague:

  • Produced & distributed ‘manually’ as a static pdf document using Power BI
  • Some data direct from the OSCAR database
  • Some source tables built using STATA code
  • Time-sensitive data sourcing
  • Contains extensive & detailed break-downs of pupil information for monitoring & support purposes

Challenges:

  • Agreements over key figures
  • Definitions
  • Sources
  • Data quality
  • Coding variations
  • Report structure

Ongoing reproducible pipeline work

Existing products:

  • Always building with someone else’s work – either a complete product, a draft or components
  • Reliance upon historic knowledge & documentation

Disassembly & understanding:

  • Staging
  • Recording
  • Chunking
  • Consultations

Blueprinting – examination & analysis of components:

  • Measurement, tolerances, optimization

Rebuild & New Build:

  • New Power BI report with overall figures
    • Pages containing guidance, database table structures, sources, code used
  • Peer review
  • Automation
  • New coding limited to SQL & DAX
  • Transparent
  • Old report modifications, increased transparency & documentation

e. RAP and Sheffield City Council

Data Platform

  • Discrete steps in our analytical pipeline
  • Shared data, data processing, & development environment
  • Improve reproducibility

Low code?

Data flow (Synapse Analytics)


Power BI doesn’t qualify as RAP

Early opportunities @SCC


  1. product - pipeline - reproducibility

  2. Use R or Python instead of Stata or SPSS

  3. Version control & SQL

  4. Make use of public RAP resources

  5. Pilot/try a RAP

We don’t have to do it all at once


The building blocks of a RAP:

  • Using open-source tools
  • Create reproducible code
  • Version control


... are useful in their own right; each will improve the auditability, speed and quality of your work.





f. Questions?


Any interest in an SCC RAP User Group, to cover R, Python, SQL, version control etc.?

Appendix

These slides

Made with revealjs
revealjs.com

Using quarto
quarto.org

Source shared on GitHub
at github.com/scc-pi/rapsheff

Published on GitHub Pages
at scc-pi.github.io/rapsheff

What are the benefits?


  • Easy for others to use
  • Others can change and adapt
  • All steps are recorded
    • Including whilst it is built
  • Automated and fast
  • Open and promotes trust

Why open-source instead of proprietary?


Open-source tools are:

  • Used by millions - huge supportive online community
  • Flexible to all data sources
  • Free for anyone to use - it is easier to share
  • Flexible to all output types

What is version control?


Tracking the three Ws:

    Who made Which change and Why?
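
As a rough illustration of the three Ws, the Python sketch below maps them onto fields that Git records for every commit: the author, the commit hash and the commit message. The `Change` class, the `parse_log_line` helper and the sample log line are all hypothetical. The line is assumed to come from `git log --pretty=format:"%an|%h|%s"`, and the sketch assumes author names do not contain a `|` character.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    who: str    # author of the commit (%an)
    which: str  # abbreviated commit hash identifying the change (%h)
    why: str    # subject line of the commit message (%s)


def parse_log_line(line: str) -> Change:
    # Split on the first two separators only, so a "|" in the message is kept.
    who, which, why = line.split("|", 2)
    return Change(who, which, why)


# Hypothetical log line, not taken from a real repository.
change = parse_log_line("A. Analyst|3f2c1ab|Fix date filter in attendance query")
print(change.who, change.which, change.why, sep="\n")
```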


Why use version control?


  • One place to store your code
  • You and collaborators are free to write and develop locally
  • Complete documented history of all changes made
  • Easy to share
  • Your future self will thank you!

What does a RAP look like?

What do we need?


  • Open-source tools
  • Version control with git
  • To consider reproducibility
  • Time to learn

“DevOps is a methodology in the software development and IT industry. Used as a set of practices and tools, DevOps integrates and automates the work of software development (Dev) and IT operations (Ops) as a means for improving and shortening the systems development life cycle.”1


Azure DevOps is a Microsoft product, a suite of tools that enable DevOps.

MLOps is the application of DevOps to Machine Learning. It is about automating the building, training, deployment, maintenance, and further development of models. RAP covers a broader range of data analysis output.

dplyr vs SQL

Mapping

Wishes

If we wanted to extend part of the Government RAP strategy vision so that:

“Analytical teams in Sheffield City Council choose to deliver their analysis using the RAP principles by default.”


We would need:

  1. Tools
  2. Senior leadership commitment
  3. Skills, experience, & culture shift
  4. Local Government examples i.e. evidence