8 Version control & GitHub

8.1 Why version control?

When scripting you might make a backup copy of a file, so if your new changes break what was previously working you can revert to the backup copy. If you’re collating contributions to a document from different people you might use their initials in the file names to indicate and keep track of who’s made contributions in what version e.g. “BI_Principles-GR.docx” and “BI_Principles-NM.docx”. These are both examples of manual version control.

When we talk about version control and scripting we typically mean a version control system. A tool that assists with tracking changes to files over time and by different people. Such systems also usually include reasons for the changes, and file, or rather version, comparison functionality. So a version control system is a backup/audit-trail/collaboration thing.

This is a much more engaging explanation:

GITHUB FOR SUPPORTING, REUSING, CONTRIBUTING, AND FAILING SAFELY

8.2 Git

Git is an open-source distributed version control system for tracking changes in source code during software development. There are other version control tools, such as SVN, but Git is the most popular.

GitHub is a product that offers Git-based source control repositories. Other products also offer Git-based repositories, such as GitLab, BitBucket, and Azure DevOps.

8.3 Transparency

The benefits of publishing open data are well understood and accepted as good practice. Less well known is the similar best practice of publishing source code that generates data analysis output. There are circumstances when this is not possible, usually because of confidentiality reasons. For example, with literate programming where code chunks, narrative text, and analysis outputs are combined, there may be a risk of including PID (Person Identifiable Data).

One option for balancing transparency and security in a version control implementation is:

  1. Initially host a new repository in a secure, private, enterprise platform.
  2. Review whether repository is fit-for-publishing.
  3. Publish repository if fit.

A fuller explanation of the benefits of source code transparency and how it can be implemented is available from How to Publish your Code in the Open, NHS RAP Community of Practice. There’s similar guidance on open sourcing analytical code, from the Government Analysis Function.

For public repositories, GitHub is undoubtedly the market leader. For private repositories it’s not as clear cut. There’s a GitHub option for private repositories, but it’s more expensive than for example BitBucket. Azure DevOps obviously has the advantage of being part of the Azure infrastructure that the Council is investing in.

Adopting data analysis version control and transparency practices within the Council is work-in-progress. In terms of the implementation listed above, this is our current status:

  1. Trialing 2 free licenses of Azure DevOps for private repository hosting.
  2. No checklist or process to support reviewing whether repository is fit-for-publishing.
  3. Repositories published on GitHub.

8.4 GitHub

Our scc-pi GitHub organisation is a free account.

8.4.1 GitHub security & data protection

GitHub.com is hosted in the US, which is an issue in terms of our data protection obligations. Storing PID (Person Identifiable Data), or anything commercially confidential or politically sensitive, on GitHub.com is a no-no.

The .gitignore file specifies the folders and files you want to exclude from the repository i.e. it lists what you want to ignore. Holding any file based data (e.g. CSV files), whether they’re original or transformed, in a /data sub-folder, makes it simple to exclude data from the repository by adding the following to the .gitignore file:

# data directory
/data/

As an additional precaution, any projects involving confidential information should be held in a private repository and .Rdata should also be included in the .gitignore file (see the section on RStudio security & data protection):

# Session Data files
.RData

Microsoft own GitHub and GitHub Enterprise Server is the on-premises deployment of GitHub.com that could be hosted on our own Azure environment.

8.4.2 Forking & pull requests

GitHub forking and pull requests are common workflows when collaborating on a repository.

TODO - we’ll have more to add on this once we’ve collaborated on a repository

8.4.3 Other features

GitHub include private repositories on free organisation accounts, but not all GitHub features are available.

The evolution of a repository is captured by commits i.e. the changes to files in the repository. However, the original issue, ideas or discussion surrounding a commit is captured in a GitHub issue. You can link a commit to an issue by including the issue number prefixed by # in the commit message.

We’ve used a GitHub project as a kanban board to organise our work on the community-knowledge-profiles repository. However, a project is not restricted to issues or a single repository.

GitHub pages and actions have been used to publish these notes.

There are also other GitHub features to explore.

8.4.4 GitHub & RStudio

Introduction to GitHub and the GitHub Learning Lab are good places to get started and become familiar with some initial concepts.

To use GitHub with RStudio there’s a cheatsheet and a thorough step-by-step guide, Happy Git with R by Jenny Bryan.

Over the last few years we’ve followed Jenny’s recommendations:

  • Initially to use HTTPS.
  • Then to use SSH whilst GitHub was introducing their PAT requirement.
  • Now back to HTTPS, but with PAT.

Our recommendation is to follow these chapters in Happy Git with R:

Chapters 15, 16 and 17 of Happy Git with R cover different ways of setting up a local RStudio project connected to GitHub. So far, we’ve stuck with Chapter 15 New project, GitHub first.

Once you’ve setup your PAT for HTTPS, to change an existing local repo from SSH to HTTPS:

  • Make sure all changes in your repo are committed and pushed to GitHub.
  • Delete the local repo folder.
  • Create the local repo anew via RStudio > New Project > Version Control. However, this time use the HTTPS URL from the GitHub repo.

If you know a better way than this please let us know!

8.5 Azure DevOps

We are hopeful that Azure DevOps will become the new default for the Council’s data analysis source code repositories. A first trial is underway.

Azure DevOps costs £5 per user per month.

8.6 BitBucket

BitBucket was used by Digital Services but they now piggy-back on a digital partner’s version control implementation. BitBucket has an EU hosting option and is cheaper than GitHub. It is more popular with organisations needing private repositories. BitBucket has a narrower scope than GitHub. Instead, BitBucket focuses on stronger integration with other software, in particular other Atlassian products such as Trello, Jira and Confluence.

8.7 Proxy server

When out of the office and connected via AlwaysOn VPN, both the HTTPS and SSH connection methods for GitHub and Azure DevOps should work. However, SSH will never work when in the office because the Council’s proxy server doesn’t support SSH. It works when out of the office because when connected via AlwaysOn VPN, not all traffic is routed via the Council and its proxy server.

Git needs to be configured to use the Council’s proxy server.

From the Git CMD app you need to run the Git command immediately below. First though, you need to substitute in the command your Council (not your GitHub) username and password, plus the proxy server IP address.

git config --global http.proxy http://SCC\username:password@proxyip:8080

NB: When you update your Council password you’ll need to rerun the command with your new password.