9 RAP

9.1 Principles

Reproducible Analytical Pipelines (RAPs) have evolved from the UK Government’s Analysis Function, and its 2022 Reproducible Analytical Pipelines (RAP) strategy sets out the following principles.

"RAPs are automated statistical and analytical processes. They incorporate elements of software engineering best practice to ensure that the pipelines are reproducible, auditable, efficient, and high quality.

RAPs increase the efficiency of statistical and analytical processes, delivering value. Reproducibility and auditability increase trust in the statistics. The pipelines are easier to quality assure than manual processes, leading to higher quality.

A RAP will:

  • improve the quality of the analysis

  • increase trust in the analysis by producers, their managers and users

  • create a more efficient process

  • improve business continuity and knowledge management

To achieve these benefits, at a minimum a RAP must:

  • minimise manual steps, for example copy-paste, point-click, or drag-drop operation; where it is necessary to include a manual step in the process this must be documented as described in the following bullet points

  • be built using open-source software, which is available to anyone, preferably R or Python

  • deepen technical and quality assurance processes with peer review to ensure that the process is reproducible and that the requirements described in the following bullet points have been met

  • guarantee an audit trail using version control software, preferably Git

  • be open to anyone – this can be allowed most easily using file and code sharing platforms

  • follow existing good practice for quality assurance

  • contain well-commented code and have documentation embedded and version controlled within the product, rather than saved elsewhere

Notes:

  1. There may be restrictions, such as access to databases, which stop analysis producers building a RAP for their full end-to-end process. In this case, the previously described requirements apply to the selected part of the process.

  2. There may be restrictions, such as sensitive or confidential content, which stop analysis producers from sharing their RAP publicly. In this case, it may be possible to share the RAP within a department or organisation instead.

  3. It is recommended that where possible a RAP should be built collaboratively – this will improve the quality of the final product and helps to allow knowledge sharing.

There is no specific tool that is required to build a RAP, but both R and Python provide the power and flexibility to carry out end-to-end analytical processes, from data source to final presentation.

Once the minimum RAP has been implemented, statisticians and analysts should attempt to further develop their pipeline using:

  • functions or code modularity

  • unit testing of functions

  • error handling for functions

  • documentation of functions

  • packaging

  • code style

  • input data validation

  • logging of data and the analysis

  • continuous integration

  • dependency management"
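
Several of the further-development items listed above (modular functions, documentation of functions, input data validation, error handling, logging and unit testing) can be illustrated together. The following is a minimal Python sketch, not part of the strategy: the function names, the `group` and `value` column names, and the logger name are all hypothetical.

```python
import logging
from statistics import mean

# Hypothetical logger name, used only for this illustration.
logger = logging.getLogger("rap_example")


def validate_rows(rows, required_columns):
    """Input data validation: fail early with a clear message."""
    if not rows:
        raise ValueError("input data is empty")
    for i, row in enumerate(rows):
        missing = [c for c in required_columns if c not in row]
        if missing:
            raise ValueError(f"row {i} is missing columns: {missing}")
        value = row["value"]
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or value < 0:
            raise ValueError(f"row {i} has an invalid value: {value!r}")


def mean_by_group(rows):
    """Return the mean of 'value' for each 'group', after validating the input."""
    validate_rows(rows, required_columns=("group", "value"))
    logger.info("validated %d input rows", len(rows))
    grouped = {}
    for row in rows:
        grouped.setdefault(row["group"], []).append(row["value"])
    result = {group: mean(values) for group, values in grouped.items()}
    logger.info("computed means for %d groups", len(result))
    return result


def test_mean_by_group():
    """Unit test: a small known input gives a known output."""
    rows = [
        {"group": "a", "value": 2},
        {"group": "a", "value": 4},
        {"group": "b", "value": 10},
    ]
    assert mean_by_group(rows) == {"a": 3, "b": 10}


def test_rejects_missing_column():
    """Unit test: invalid input raises an error rather than passing silently."""
    try:
        mean_by_group([{"group": "a"}])
    except ValueError:
        return
    raise AssertionError("expected ValueError for a missing column")
```

The two `test_` functions follow the naming convention that test runners such as pytest discover automatically, so the same checks could later be run under continuous integration.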

9.2 Strategy goals

The Government’s RAP strategy includes three goals:

  1. tools – ensure that analysts have the tools they need to implement RAP principles.

  2. capability – give analysts the guidance, support and learning to be confident implementing RAPs.

  3. culture – create a culture of robust analysis where the RAP principles are the default for analysis, leaders engage with managing analysis as software, and users of analysis understand why this is important.

9.3 Platforms for Reproducible Analysis

The two lists below are from the Government’s RAP strategy, to which we’ve added notes on current SCC status.

9.3.1 For RAPs that meet the minimum criteria

  1. version control software, that is, Git - BI Team laptops & OSCAR desktop

  2. open-source programming languages and flexibility to add more (Python, R, Julia, JavaScript, C++, Java/Scala etc.) - R & Python on BI Team laptops, R on OSCAR desktop

  3. package and environment managers for each of the available languages - we should have tested renv for R by now; if extending Council GIS, the version of ArcGIS Pro we use restricts Conda use, e.g. no virtual environments

  4. packages and libraries for open-source programming languages, either through direct access to well-known libraries, for example, npm, PyPI, CRAN, or through a proxy repository system, for example, Artifactory - BI Team laptops & OSCAR desktop

  5. individual storage, for example, home directory - SCC OneDrive

  6. shared storage, for example, S3, cloud storage, with fine-grained access control, accessible programmatically - don’t have; should we look at Azure Blob Storage?

  7. integrated development environments suitable for the available languages – RStudio for R, Visual Studio Code for Python and so on - BI Team laptops & OSCAR desktop

In summary, some SCC data analysts have access to all of the platforms needed for a minimum RAP, with the exception of shared storage (e.g. Azure Blob Storage).

9.3.2 For further development of RAPs

  1. source control platforms, for example, GitHub, GitLab or BitBucket - trialling a GitHub organisation (github.com/scc-pi)

  2. continuous integration tools, for example, GitHub Actions, GitLab CI, Travis CI, Jenkins, Concourse - trialling GitHub Actions, but CI is limited by the current GitHub security considerations (GitHub security & data protection)

  3. make-like tools for reproducible workflows, for example, make - identified the potential value of the targets R package, but not tested yet

  4. relational database management software, for example, PostgreSQL, that is available to users - corporate solution being investigated and there’s some DBMS access available via OSCAR desktop

  5. orchestration systems for pipelines and workflows, for example, airflow, NiFi - don’t have

  6. internal-facing servers to host html-rendered documentation - request denied

  7. external-facing servers with authentication to host end-products such as web applications or APIs - don’t have

  8. big data tools, for example, Presto or Athena, Spark, dask and so on, or access to large memory capability - don’t have

  9. reproducible infrastructure and containers, for example, docker - docker on BI Team laptops but not tested yet

In summary, SCC data analysts don’t have most of the platforms needed for further development of RAPs. Docker and targets are accessible and could be trialled. GitHub is being trialled; purchasing it would ease some current security restrictions and could cover the CI and server platform items. Orchestration systems and big data tools have not yet been considered.
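
To show what a make-like tool offers, the sketch below runs pipeline steps in dependency order and skips any step whose inputs have not changed since the last run. It is a simplified illustration in plain Python, assuming JSON-serialisable data; it is not the API of make or of the targets R package, and every name in it (`run_pipeline`, `fingerprint`, `EXAMPLE_STEPS` and the step functions) is hypothetical.

```python
import hashlib
import json
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def fingerprint(obj):
    """Hash a JSON-serialisable object so that changes can be detected."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_pipeline(steps, sources, cache):
    """Run steps in dependency order, skipping those with unchanged inputs.

    steps   maps a step name to (function, list of upstream names).
    sources maps a source name to raw input data.
    cache   maps a step name to (input fingerprint, stored result); it is
            updated in place (a real tool would persist it to disk).
    Returns (results, names of the steps that were actually re-run).
    """
    graph = {name: deps for name, (_, deps) in steps.items()}
    results, rerun = dict(sources), []
    for name in TopologicalSorter(graph).static_order():
        if name in sources:
            continue  # raw inputs are supplied, not computed
        func, deps = steps[name]
        inputs = [results[d] for d in deps]
        # Simplification: only the function name is hashed, so a change to
        # the function body would go undetected here.
        key = fingerprint([func.__name__, inputs])
        if name in cache and cache[name][0] == key:
            results[name] = cache[name][1]  # inputs unchanged: reuse result
        else:
            results[name] = func(*inputs)
            cache[name] = (key, results[name])
            rerun.append(name)
    return results, rerun


def clean(raw):
    """Drop missing values."""
    return [x for x in raw if x is not None]


def total(cleaned):
    return sum(cleaned)


def count(cleaned):
    return len(cleaned)


def average(total_value, count_value):
    return total_value / count_value


# Hypothetical four-step pipeline: raw -> clean -> (total, count) -> average
EXAMPLE_STEPS = {
    "clean": (clean, ["raw"]),
    "total": (total, ["clean"]),
    "count": (count, ["clean"]),
    "average": (average, ["total", "count"]),
}
```

Calling `run_pipeline(EXAMPLE_STEPS, {"raw": [1, None, 2, 3]}, cache)` twice with the same `cache` dictionary re-runs nothing the second time. If the raw data changes, only the steps whose inputs actually differ are recomputed.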

9.4 How could RAP sit within SCC data analysis?

NB This sub-section is currently from the perspective of one individual. It needs a broader perspective, particularly from more experienced colleagues in PAS and data analysis team managers.

Sheffield City Council has aspirations to be data-driven and to make better evidence-based decisions. Data analysis can support organisational change decisions on, for example, Planning Policy, Corporate Strategy, and changes to the operating model for Early Help in Children’s Services. Data analysis can also support continuous improvement under BAU (Business As Usual), helping managers understand how best to deploy their resources over the next month, highlighting issues that arose last week, and reporting performance metrics to senior managers.

The efficiencies from the automation aspect of RAPs may suit more repetitive (e.g. monthly) data analysis outputs that support BAU. The quality assurance aspect of RAPs may be more important for complex data analysis that supports major organisational change decisions. In practice, team culture and individual capability may have more to do with adopting RAP than the type or purpose of data analysis output. For example, an individual may be more likely to pursue RAP if they have a software development background and work in the BI Team, which was initially experimental in nature and doesn’t have a large catalogue of BAU output or well-established quality assurance procedures.

Elements of RAP have been used by SCC data analysts, but a full RAP has not yet been developed. The Council could pilot one or two RAPs. These should be done collaboratively, i.e. with members of different data analysis teams. One could re-engineer an existing process; another could be a new piece of data analysis.

9.5 MLOps

Machine Learning Operations (MLOps) is more specifically about automating the building, training, deployment, maintenance, and further development of models. RAPs cover a broader range of data analysis output.

9.6 Further resources

9.6.2 Training

  • Introduction to Reproducible Analytical Pipelines (RAP), free online ONS Data Science Campus Learning Hub course (hub now available to local government analysts)

  • Reproducible Analytical Pipelines (RAP) using R, 7-hour free online Udemy course by Matthew Gregory

  • Reproducible Analytical Pipeline learning materials, Data Science Campus