Use of ReproZip
Migration is an approach to software preservation that makes it possible to rerun results by capturing information about the purpose and use of the software.
Migration can be seen as enabling software reuse and is tied to the broader notion of research reproducibility. By capturing information about the purpose and use of software (rather than just the code itself), the migration process becomes more tractable, since intent is made explicit rather than having to be deciphered from potentially incorrect code. Together, this information constitutes the provenance metadata of the experiment, and the preservation platform should provide mechanisms to capture it automatically. Apart from encouraging software reuse, this approach is especially appropriate for achieving legal compliance and accountability and for enabling continued access to data and services.
Although there are several established scientific workflow management tools such as Kepler (https://kepler-project.org), there has been a recent trend towards tools that automate the capture of the parameters and dependencies associated with software experiments, so that those experiments can be reproduced in environments different from the original. In particular, ReproZip is notable because it provides options to run experiments within virtual machines or containers, so they can be rerun within a repository as well as on other suitable execution platforms.
The ReproZip architecture involves two main steps: packing, which captures the dependencies for an experiment and creates a self-contained package, and unpacking, which supports the extraction of the package’s content and the reproduction of the original experiment.
The packing step happens in the original environment (currently, only Linux), and generates a compendium of the experiment. reprozip is the command-line tool responsible for this step. ReproZip tracks operating system calls while a project is executing, and creates a package (an .rpz file) that contains all the binaries, files, dependencies, and all other necessary information and components for reproduction. By tracing system calls, ReproZip can transparently capture all the provenance of the experiment. The provenance data is analysed to detect the required components of the experiment. For instance, given the files that were read and using the package manager of the OS, ReproZip can identify the software packages on which the experiment depends. These .rpz files are much smaller than a virtual machine, and quite easy to share.
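The packing step can be sketched as two invocations of the reprozip tool: one that traces the experiment while it runs, and one that writes the .rpz package. The Python helper below is only an illustration; the function names, the experiment command (python analysis.py input.csv), and the package name experiment.rpz are assumptions, not part of the source. By default it prints the commands rather than running them, since tracing requires a Linux machine with reprozip installed.

```python
import shlex
import subprocess


def packing_commands(experiment_cmd, package_path):
    """Build the two reprozip invocations for the packing step.

    experiment_cmd: the experiment's own command line, as a list of strings.
    package_path:   where the .rpz package should be written.
    """
    # Step 1: run the experiment under tracing so system calls are recorded.
    trace = ["reprozip", "trace", *experiment_cmd]
    # Step 2: bundle the traced files and metadata into a single .rpz file.
    pack = ["reprozip", "pack", package_path]
    return [trace, pack]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Hypothetical experiment and package name, for illustration only.
    run_all(packing_commands(["python", "analysis.py", "input.csv"],
                             "experiment.rpz"))
```

Between the two steps, reprozip writes a configuration file describing what it detected, which can be reviewed and edited before the package is created.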
The unpacking step reproduces the experiment from the .rpz file. Given a package, the experiment can be reproduced with the command-line tool reprounzip. ReproUnzip offers different unpacking methods, from simply decompressing the files into a directory to starting a full virtual machine, and they can be used interchangeably on the same packed experiment. It is also possible to automatically replace input files and command-line arguments. Reviewers can unpack .rpz files on Linux, Windows, and Mac OS X, since ReproUnzip can unpack the experiment in a container (Docker) or a virtual machine (Vagrant). This step also has a graphical user interface for users unfamiliar with the command line.
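As a rough illustration of how the unpacking methods can be swapped for the same package, the sketch below builds the reprounzip setup, upload, and run commands for a chosen unpacker. The helper name, the target directory, and the input identifier "data" are hypothetical; the input identifiers that a real package exposes are listed by reprounzip showfiles. The commands are only constructed, not executed.

```python
# Unpackers named in the ReproZip documentation; docker and vagrant
# are provided by the reprounzip-docker and reprounzip-vagrant plugins.
UNPACKERS = ("directory", "chroot", "docker", "vagrant")


def unpacking_commands(package_path, target_dir, unpacker="docker",
                       replace_inputs=None):
    """Build the reprounzip commands that reproduce a packed experiment.

    replace_inputs maps an input identifier (hypothetical here) to a
    local file that should replace the original input before the run.
    """
    if unpacker not in UNPACKERS:
        raise ValueError(f"unknown unpacker: {unpacker!r}")

    # Unpack the .rpz file into target_dir using the chosen method.
    commands = [["reprounzip", unpacker, "setup", package_path, target_dir]]

    # Optionally swap input files: upload takes <local path>:<input id>.
    for input_id, local_path in (replace_inputs or {}).items():
        commands.append(["reprounzip", unpacker, "upload", target_dir,
                         f"{local_path}:{input_id}"])

    # Re-run the experiment inside the unpacked environment.
    commands.append(["reprounzip", unpacker, "run", target_dir])
    return commands


if __name__ == "__main__":
    for argv in unpacking_commands("experiment.rpz", "experiment_dir",
                                   "vagrant", {"data": "new_input.csv"}):
        print(" ".join(argv))
```

Only the unpacker argument changes between methods, which reflects the interchangeability described above.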
A ReproZip package can be shared with others, who can unpack, inspect, and reproduce the computational sequence either in the repository where the package was generated or in their own environment. To this end, third-party researchers need:
The reprounzip component and the unpacker plugin of their choice: reprounzip-docker to use Docker, or reprounzip-vagrant to use Vagrant.
The software used by that unpacker, i.e., Docker, or Vagrant with VirtualBox.
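A reviewer's machine can be checked for these prerequisites before attempting to unpack. The following sketch looks for the relevant executables on the PATH and reports which unpackers appear usable. The tool-to-unpacker mapping is an assumption based on the requirements listed above (VBoxManage is the VirtualBox command-line executable), and finding an executable does not guarantee that the matching reprounzip plugin is installed.

```python
import shutil

# Executables each unpacker is assumed to need, per the list above.
REQUIRED_TOOLS = {
    "docker": ["docker"],
    "vagrant": ["vagrant", "VBoxManage"],
}


def available_unpackers(which=shutil.which):
    """Return the unpackers whose external tools are found on the PATH.

    The lookup function is injectable so the check can be tested
    without Docker, Vagrant, or VirtualBox being installed.
    """
    # Without reprounzip itself, no unpacker can be used.
    if which("reprounzip") is None:
        return []
    return [name for name, tools in REQUIRED_TOOLS.items()
            if all(which(tool) is not None for tool in tools)]


if __name__ == "__main__":
    found = available_unpackers()
    if found:
        print("Usable unpackers:", ", ".join(found))
    else:
        print("No container or VM unpacker prerequisites found.")
```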
Even if this is only required once, downloading and setting up these tools can be a heavy and intrusive burden, especially in the author-reviewer scenario, where reviewers have a short turnaround time. A researcher may therefore prefer to reproduce an experiment on the repository that hosts it. In this case, the repository must provide the researcher with the tools needed to unpack and rerun the ReproZip package, as well as the software required by the corresponding unpacker.
A third option is ReproServer (https://github.com/ViDA-NYU/reproserver), an open-source web application that allows users to reproduce experiments from their web browser.
Sources:
Chirigati, F., Rampin, R., Shasha, D., and Freire, J. (2016). ReproZip: Computational Reproducibility With Ease. SIGMOD’16, June 26-July 01, 2016, San Francisco, CA, USA, 2085–2088. https://doi.org/10.1145/2882903.2899401
Rampin, R., Chirigati, F., Shasha, D., Freire, J., and Steeves, V. (2016). ReproZip: The Reproducibility Packer. Journal of Open Source Software, 1(8), 107. https://joss.theoj.org/papers/10.21105/joss.00107
Vendor and technology independence – To make migration from this platform to another preservation platform as easy as possible in the future, the design should make every content property and attribute (such as integrity information and events) accessible through the API, so that extracting all content and its associated metadata is both possible and simple.
Exit and Migration Strategies – The ability to plan for, and ensure, that components of the archiving and preservation services can be replaced efficiently with minimal disruption. The platform should demonstrate flexibility in the deployment models available, so that solutions can run on public clouds, private clouds, or a hybrid on-premises/public cloud model, as an essential part of business continuity and disaster recovery plans and of an exit strategy that avoids vendor lock-in. Full support for the OAIS Information Model, including the ability to create complete OAIS Archival Information Packages, ensures that the fundamental exit strategy is built into the solution.
More specifically, Other Representation Information can capture the virtual machines, compute containers, or software required by a specific dataset. This Other Representation Information will be integrated within the Representation Information Network as part of a whole Archival Information Package.
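One way to picture this is a small metadata record that ties an .rpz package to the dataset it supports and carries a checksum for integrity checking. The sketch below is purely illustrative: the record fields, the kind label, and the identifiers are hypothetical and do not correspond to any defined OAIS schema or to the platform's actual data model.

```python
import hashlib
import json
import os
from dataclasses import asdict, dataclass


@dataclass
class OtherRepresentationInfo:
    """Hypothetical record linking a software package to a dataset."""
    identifier: str         # identifier of this record (assumed scheme)
    kind: str               # e.g. "reprozip-package" (illustrative label)
    file_name: str          # name of the packaged file
    sha256: str             # fixity value for integrity checks
    describes_dataset: str  # identifier of the dataset that needs it


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def describe_package(path, identifier, dataset_id):
    """Build a record for an .rpz file that a dataset depends on."""
    return OtherRepresentationInfo(
        identifier=identifier,
        kind="reprozip-package",
        file_name=os.path.basename(path),
        sha256=sha256_of(path),
        describes_dataset=dataset_id,
    )


def to_json(record):
    """Serialise the record so it can be exported with the content."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)
```

A record of this kind could be exported alongside the package itself, so that a future platform receives both the software and the information needed to verify it.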
ReproZip is itself software and must itself be preserved, for example using emulation/virtualisation or by rebuilding from source code, which may require changes to keep pace with updates to the build environment.