3 Why script?

Sheffield City Council has made some fantastic progress in recent years with data analysis tools, such as the new GIS infrastructure and Power BI. These are powerful and intuitive, great for both full time data analysts and those for whom data analysis is only part of their responsibilities. Scripting is another tool that is becoming increasingly available to us now with the Council’s rapidly evolving IT improvements.

3.1 What is scripting?

Scripts are a series of typed commands in a file. When you “run” a script, you are asking a programme to interpret and fulfil the commands in the script to accomplish a given task.

The terms “scripting”, “coding” and “programming” are often used synonymously. More accurately though, coding and programming are umbrella terms that scripting is just a part of. Compared with coding and programming, scripting generally covers less complex tasks, with an easier to understand language, and with fewer commands.

3.2 Scripting for data analysis

Scripts can be used to accomplish a variety of tasks. For example, BCIS in Sheffield City Council use scripts to update and maintain laptops and servers.

For the purposes of these notes, the scope of scripting is limited to scripting for data analysis. So, scripting to accomplish data analysis tasks, including data: extraction, transformation, loading, exploration, analysis, modelling, visualisation, and narration.

Scripting for data capture, data entry, is outside the scope of these notes.

3.3 Scripting pros & cons?

There’s an element of self-documentation with scripting, so it’s easier to return to a scripted task after a bit of time away from it, or pick up a scripted task from a colleague. However, in-line comments and adherence to a style guide make a big difference to just how easy it is. Version control and collaboration are also generally easier with scripting than GUI (point-and-click) tools.

Scripting is not always better than using GUI tools. There’s a good reason why more people use Mac OS and Windows than Unix Bash. Scripting or GUI tools is a horses for courses question, and some GUI workflow tools that the Council has licenses for (e.g. FME, Simul8 and ModelBuilder in ArcGIS) sit somewhere in the middle.

Posit makes the (slightly biased?) case for scripting in an hour long webinar on Why Your Enterprise Needs Code-First Data Science.

Probably the single main advantage of scripting, is what was touched upon at the start of this sub-heading i.e. reproducibility.

3.4 Reproducibility

Scripting is a key component of RAP (Reproducible Analytical Pipelines), which is covered in a separate chapter on RAP.

At the heart of RAP is QA (Quality Assurance) and reducing manual steps in an analytical process. Such “automation” of a process is commonly motivated by time savings. However, with RAP the motivation is because documenting manual steps is difficult and they are more prone to error. The Reinhart-Rogoff error is a good example.

As projects get more complicated, so do their processes. Data may come from several different sources. It may be stored and processed in different locations using different software. You might disseminate the results in multiple formats.

3.5 Scripting languages

Python and R are two of the most popular scripting languages for data analysis. These are capable of covering the whole data analysis life-cycle of data extraction, transformation, analysis, and visualisation.

Some scripting languages are geared more to just one or two aspects of the data analysis life-cycle. SQL is for working with databases, whilst JavaScript is primarily for making web pages more interactive and adds value to data visualisation.

Other scripting languages extend the functionality of our point-and-click data analysis tools. VBA and DAX extend the functionality of Excel and Power BI respectively. Arcade is used in tandem with ArcGIS, and Mata is used with Stata.

3.5.1 R & Python

Python and R are open-source scripting languages popular with data analysts. Python is a general-purpose language with an easy-to-understand syntax. R’s functionality was developed with statisticians in mind and is popular in academia and research.

3.5.1.1 Notebooks

TODO: Jupyter & R Markdown notebooks

3.5.2 SQL

SQL (Structured Query Language) is designed for storing, manipulating and retrieving data in a database. SQL commands are simpler and their range narrower than R and Python commands. For example, time series or spatial analysis isn’t within scope of SQL. Also, whilst much of our data is still held in databases, data is now commonly from CSV files and other formats.

SQL is run on the database server, which even with today’s powerful laptops, can have time saving benefits. The type of database can mean the SQL extensions have different syntax, but the core elements such as SELECT and JOIN are generic.

Often it’s not a case of choosing to use SQL or say R, but using both. SQL to initially extract data from a corporate database. Then R to combine with data from other sources, e.g. ONS, and then analyse and visualise it.

3.5.3 VBA

VBA (Visual Basic for Applications) is Microsoft’s scripting language for, and built into, its Office programmes like Word and PowerPoint. However, for data analysis purposes, it is how it extends and automates Excel that is of interest.