Prizes for Reproducibility in Audio and Music Research: How we evaluated the entries
We recently announced the results of our first Prizes for Reproducibility in Audio and Music Research. Here we describe how the evaluation was carried out and how the winners were chosen.
Submissions were assessed against three separate criteria:
- Ease of reproducibility of the results
- Quality of sustainability planning
- Potential to enable high-quality research in the UK audio and music research community.
The first criterion was applied only if the paper included something to reproduce, such as figures generated using a software program. Some submissions consisted of resources, such as datasets, intended to facilitate reproducible work by later researchers; these were assessed on the latter two criteria only. In future calls we will probably use separate categories for works that provide infrastructure for reproducible and sustainable work by others, rather than aiming at reproducibility themselves, but here the two were assessed together.
Each of these criteria was assessed by a separate panel, as described below. We then made a shortlist from those submissions which had scored 3 or better on every criterion. (The assessments used different scales, but in each case lower numbers scored better than higher ones.) Shortlisted submissions were assigned to categories according to the type of publication they contained, as listed in the call, and the winner in each category was the submission with the best average (mean) score.
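The selection procedure above can be sketched in a few lines of code. This is a minimal illustration, not the scoring tool we actually used: the submission names, categories, and scores below are invented, and the criterion labels are shorthand for the three criteria listed earlier.

```python
# Sketch of the shortlisting and winner-selection procedure.
# All data here is hypothetical, for illustration only.

SHORTLIST_THRESHOLD = 3  # lower scores are better on every scale

submissions = [
    # (name, category, {criterion: score}) -- invented values
    ("A", "journal paper",    {"reproducibility": 1, "sustainability": 2, "enabling": 2}),
    ("B", "journal paper",    {"reproducibility": 4, "sustainability": 1, "enabling": 1}),
    ("C", "conference paper", {"reproducibility": 2, "sustainability": 3, "enabling": 2}),
]

def shortlist(subs):
    """Keep submissions scoring 3 or better (i.e. <= 3) on every criterion."""
    return [s for s in subs
            if all(score <= SHORTLIST_THRESHOLD for score in s[2].values())]

def winners(subs):
    """Within each category, pick the shortlisted submission with the
    best (lowest) mean score across its criteria."""
    best = {}
    for name, category, scores in shortlist(subs):
        mean = sum(scores.values()) / len(scores)
        if category not in best or mean < best[category][1]:
            best[category] = (name, mean)
    return best
```

Note that a submission assessed on only two criteria (such as a dataset) simply has its mean taken over those two scores; the threshold test applies to whichever criteria it was scored on.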
Here is how the individual criteria were assessed:
Ease of reproducibility
To assess this criterion, we at SoundSoftware attempted to obtain and run the software associated with the paper and regenerate the results shown. This is a straightforward baseline replicability test.
The scale used for this criterion was:
- Excellent. A single command or script reproduced the figures in the paper.
- Good. It was possible to generate figures like those in the paper, perhaps incomplete or with some adjustment to parameters, but without code changes or author intervention.
- Passable. Results were generated but not without effort, for example modifying the code or reverse-engineering how to call it.
- Modest. Although we were able to run the code, no means was provided to reproduce the figures in the paper.
- Nil. We could not get the code to work.
Quality of sustainability planning

This criterion was assessed by a team at the Software Sustainability Institute: Tim Parkinson (Principal Software Consultant, Software Sustainability Institute and University of Southampton); Arno Proeme (Software Sustainability Institute and EPCC, University of Edinburgh); and Neil Chue Hong (Director of the Software Sustainability Institute). Many thanks to the SSI for their involvement in this work.
The sustainability assessment took into account factors such as whether the code and/or data were stored in a suitable repository, whether version control was used, whether tools for community involvement such as issue trackers and support mechanisms were available, and whether the work was properly licensed. These assessments included commentary and a score on a four-point scale.
See the Institute's sustainability evaluation pages for more information about their approach.
Potential to enable high-quality research
To assess this criterion, each submission was sent to two external reviewers. Many thanks to the willing reviewers: Tim Crawford; Dan Ellis; Fabien Gouyon; Panos Kudumakis; Piotr Majdak; Alan Marsden; Mark Plumbley; Bob Sturm; and Tillman Weyde. Their reviews included both commentary and a score on a five-point scale, ranging from Very good to Very weak.
When scoring this criterion, we took the average of the two reviewers' scores.