Transparent research – why sharing code is important
Dr Roxanne Connelly, Professor Vernon Gayle, Dr Chris Playford
There are growing concerns about research transparency across a range of academic fields
There is increasing anxiety about difficulties in verifying the results presented in many research papers. We believe that a conventional paper-based journal article should be considered as the tip of the iceberg of the research process.
In a similar vein, nearly twenty-five years ago, Jon Claerbout of Stanford University stated that in engineering a published paper should be considered as an advertisement for the scholarship.
Conventional paper-based academic journals tend to have a word limit that precludes the provision of detailed information on how the reported results were produced. It is a long-standing practice for authors to make the statement, usually in a footnote, that further information is available on request. We argue that this is not sufficient to ensure transparency and we note that Cristobal Young reported getting a dismal number of responses when he undertook a field experiment requesting additional information from a sample of sociologists.
The seemingly straightforward act of duplicating the results from social science studies that use household panel data, or other large-scale multipurpose data sources, is in fact exceptionally difficult. Every researcher that we have spoken to who has attempted to duplicate published results has at least one story about a failure.
In principle, it should not be difficult to duplicate results from the analysis of survey data, since researchers usually have access to the ‘raw dataset’ (and can read the published results). In practice, there is a black box within the research process that other researchers cannot see into, and this prevents the duplication of the published results.
Expressed simply, the published analyses are undertaken using an ‘analytical dataset’ which has been developed from the ‘raw dataset’. In reality a large amount of work has to be undertaken to produce the ‘analytical dataset’, and this work is hidden within the black box.
What is involved?
Data resources such as Understanding Society and the British Household Panel Study are large-scale and complex in structure. Because they are multipurpose surveys, they are not delivered as a single file, or as a simple variable-by-case matrix. On the contrary, a range of files is provided with data on the household, individual adult respondents, young people and other individuals. This allows researchers to construct a wide range of analytical datasets, for example on all adults, adults within a household, adults in a specific age range, spouses, parents, parents and their children, siblings and so on. Building even simple datasets requires records in multiple files to be matched. The panel design presents a further challenge because data are collected on survey members at each wave. The data are released in wave-specific files, which must be linked for longitudinal analyses.
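The matching and linking steps described above can be sketched in a few lines of Python. This is a minimal illustration using the standard library only, not the procedure for any particular survey: the identifiers `hid` and `pid`, the variable names, the `_w1` and `_w2` suffixes and all values are invented for the example, and real household panel files use their own naming conventions.

```python
# Toy records standing in for survey files; all names and values are invented.
household_file = [
    {"hid": 1, "region": "North"},
    {"hid": 2, "region": "South"},
]
individual_wave1 = [
    {"pid": 101, "hid": 1, "age": 34},
    {"pid": 102, "hid": 1, "age": 36},
    {"pid": 201, "hid": 2, "age": 58},
]
individual_wave2 = [
    {"pid": 101, "age": 35},
    {"pid": 201, "age": 59},
    {"pid": 301, "age": 22},  # joined the study at wave 2
]


def attach_household(individuals, households):
    """Match each individual record to its household record on 'hid'."""
    by_hid = {h["hid"]: h for h in households}
    merged = []
    for person in individuals:
        household = by_hid.get(person["hid"], {})
        merged.append({**household, **person})
    return merged


def link_waves(wave_a, wave_b, suffix_a="_w1", suffix_b="_w2"):
    """Link two wave files on 'pid', keeping only people present in both."""
    b_by_pid = {p["pid"]: p for p in wave_b}
    linked = []
    for person in wave_a:
        match = b_by_pid.get(person["pid"])
        if match is None:
            continue  # not observed in the second wave
        row = {"pid": person["pid"]}
        for key, value in person.items():
            if key != "pid":
                row[key + suffix_a] = value
        for key, value in match.items():
            if key != "pid":
                row[key + suffix_b] = value
        linked.append(row)
    return linked


wave1_with_household = attach_household(individual_wave1, household_file)
balanced_panel = link_waves(wave1_with_household, individual_wave2)
```

Even in this toy case the code embeds analytical decisions that a reader of the final paper could not recover, such as dropping person 102 (absent at the second wave) and person 301 (absent at the first).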
In addition to the data file construction activities required to use household panel datasets, the routine data enabling work (or data wrangling) that is commonplace for any social survey data analysis will also be required. This will include necessary operations such as recoding variables into a suitable format for the analyses, and appropriately coding missing values (which are pervasive in social surveys).
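As a rough sketch of this kind of data enabling work, the Python below recodes a detailed education variable into three categories and converts missing-value codes to `None`. The set of negative missing codes, the variable names (`educ`, `educ3`, `income`) and the recode mapping are assumptions made for illustration only; the documentation for the survey in question defines the actual codes.

```python
# Assumed convention for this sketch: these negative codes mark missing data.
MISSING_CODES = {-9, -8, -2, -1}

# Hypothetical mapping from a detailed qualification code to three categories.
EDUCATION_RECODE = {1: "degree", 2: "degree", 3: "school", 4: "school", 5: "none"}


def clean_value(value):
    """Return None for a missing-value code, otherwise the value unchanged."""
    return None if value in MISSING_CODES else value


def recode_education(code):
    """Collapse a detailed education code; missing or unmapped codes become None."""
    code = clean_value(code)
    if code is None:
        return None
    return EDUCATION_RECODE.get(code)


# Invented raw records: one has a missing education code, one a missing income.
raw_records = [
    {"pid": 101, "educ": 1, "income": 2100},
    {"pid": 102, "educ": -9, "income": 1500},
    {"pid": 201, "educ": 4, "income": -1},
]

analytical_records = [
    {
        "pid": r["pid"],
        "educ3": recode_education(r["educ"]),
        "income": clean_value(r["income"]),
    }
    for r in raw_records
]

# Summary statistics must exclude missing values rather than treat -1 as an income.
incomes = [r["income"] for r in analytical_records if r["income"] is not None]
mean_income = sum(incomes) / len(incomes)
```

If the missing codes were left in place, the negative value would be averaged in as though it were a real income, which is exactly the kind of silent difference that makes published results hard to duplicate.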
Multipurpose studies such as Understanding Society collect an assortment of measures in order to support a wide range of analyses. For example, there is a suite of measures relating to occupations, social class, education, income and health. Selecting appropriate variables is not a trivial activity, and it will be guided by both theoretical and operational considerations.
As any experienced survey data analyst will confirm, a large amount of work goes into making large-scale datasets ‘analysis-ready’. Typically, many lines of code (e.g. in Stata, SPSS or R) will be required to bring together relevant data files, and hundreds of lines of code will be written to construct an analytical dataset.
The descriptive information reported in a conventional paper-based academic journal article which analyses large-scale and complex data is often good for communicating information about the ‘analytical variables’ (e.g. proportions, means and standard deviations). However, it usually tells us little or nothing about how the ‘raw variables’ were transformed into the ‘analytical variables’. Similarly, outputs such as multiway tables and graphs provide few clues on the construction of the ‘analytical variables’. Tables of regression results are impossible to reverse engineer, and they typically provide no detailed information on the specific ‘analytical variables’.
Opening the ‘black box’
There is growing interest in researchers providing extra materials alongside conventional publications. The goal is to open the black box containing the ‘research code’ used to transform the ‘raw dataset’ into the ‘analytical dataset’ and the code used for analyses. First, the additional materials should allow other researchers to ‘duplicate’ results (i.e. be able to produce identical results). Second, they should enable researchers to undertake replications, for example with additional data, other measures, or alternative data analysis techniques (or any combination of these extensions). Sharing ‘research code’ will help other researchers to better understand, evaluate and build upon published work.
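One simple way to check a duplication, once the shared code has been rerun, is to compare the rerun estimates with the published ones at the precision reported in the paper. The sketch below is only an illustration of that idea: the function name `find_mismatches`, the coefficient names and all values are invented.

```python
def find_mismatches(published, reproduced, decimals=2):
    """Return the names of estimates that differ at the published precision."""
    mismatches = []
    for name, value in published.items():
        rerun = reproduced.get(name)
        if rerun is None or round(rerun, decimals) != round(value, decimals):
            mismatches.append(name)
    return mismatches


# Invented example: coefficients as printed in a paper (two decimal places)
# and the unrounded values produced by rerunning the shared code.
published = {"age": 0.03, "female": -0.21, "degree": 0.45}
reproduced = {"age": 0.0312, "female": -0.2051, "degree": 0.4689}

mismatched = find_mismatches(published, reproduced)
```

Here `age` and `female` agree with the published table once rounded, while `degree` does not, which would prompt a closer look at how that variable was constructed.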
How to share
As a first step, we would recommend that authors provide a link to a location where their research code is openly available, such as the Open Science Framework. Code can also be hosted on the Understanding Society website. The shared code may simply be a very well annotated Stata .do file, an SPSS .sps file, or an R script.
A more sophisticated approach would be to provide a ‘duplication package’. An important development in the natural sciences and e-research is the use of Jupyter notebooks. These enable researchers to share live code (e.g. Stata or R code) alongside data analysis outputs (e.g. modelling results and plots), and both code and outputs can be accompanied by text describing and detailing the entire analytical research process. Jupyter notebooks provide an excellent platform for undertaking transparent research and sharing research code when analysing large-scale and complex datasets such as Understanding Society.
Roxanne is a Senior Lecturer in Sociology at the University of York
Vernon is Professor of Sociology and Social Statistics at the University of Edinburgh
Chris is a Lecturer in Sociology at the University of Exeter