Use of ReproZip

Last updated 2 years ago

Reproducibility by means of migration

Migration is an approach to software preservation in which the ability to rerun software and reproduce its results rests on capturing information about the purpose and use of that software.

Migration can be seen as enabling software reuse and is tied to the broader notion of research reproducibility. By capturing information surrounding the purpose and use of software (rather than just the code itself), the migration process becomes more tractable, since intent is made explicit rather than having to be deciphered from potentially incorrect code. Together, this information forms the provenance metadata of the experiment, and the preservation platform should provide mechanisms to capture it automatically. Apart from encouraging software reuse, this approach is especially appropriate for achieving legal compliance and accountability and for enabling continued access to data and services.

Although there are several established scientific workflow management tools such as Kepler (https://kepler-project.org), there has been a recent trend towards tools that automate the capture of the parameters and dependencies associated with software experiments, so that those experiments can be reproduced in environments different from the original. In particular, ReproZip is notable because it provides options to run experiments within virtual machines or containers, and in a repository as well as on other suitable execution platforms.

Using ReproZip for the reproducibility of experiments

ReproZip's architecture involves two main steps: packing, which captures the dependencies for an experiment and creates a self-contained package, and unpacking, which supports the extraction of the package’s content and the reproduction of the original experiment.

The packing step happens in the original environment (currently, only Linux), and generates a compendium of the experiment. reprozip is the command-line tool responsible for this step. ReproZip tracks operating system calls while a project is executing, and creates a package (an .rpz file) that contains all the binaries, files, dependencies, and all other necessary information and components for reproduction. By tracing system calls, ReproZip can transparently capture all the provenance of the experiment. The provenance data is analysed to detect the required components of the experiment. For instance, given the files that were read and using the package manager of the OS, ReproZip can identify the software packages on which the experiment depends. These .rpz files are much smaller than a virtual machine, and quite easy to share.
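The packing step described above can be sketched as two reprozip invocations: `reprozip trace` to run the experiment under tracing, followed by `reprozip pack` to write the .rpz file. The minimal Python sketch below only builds the command lines and does not execute them. The helper name `packing_commands` and the example script and file names are assumptions made for illustration, not part of ReproZip or LABDRIVE.

```python
import shlex
from typing import List


def packing_commands(experiment_cmd: List[str], package_path: str) -> List[List[str]]:
    """Return the two reprozip invocations for the packing step.

    Nothing is executed here; the caller decides how to run the commands
    (for example with subprocess.run on a Linux host with reprozip installed).
    """
    if not experiment_cmd:
        raise ValueError("experiment_cmd must contain at least the program name")
    if not package_path.endswith(".rpz"):
        raise ValueError("ReproZip packages use the .rpz extension")
    return [
        # 1. Run the experiment under tracing so its system calls are recorded.
        ["reprozip", "trace", *experiment_cmd],
        # 2. Bundle the traced files and metadata into a single package.
        ["reprozip", "pack", package_path],
    ]


if __name__ == "__main__":
    # Hypothetical experiment: script and input names are invented for illustration.
    for cmd in packing_commands(["python", "analysis.py", "input.csv"], "experiment.rpz"):
        print(shlex.join(cmd))
```

Building the commands as argument lists rather than shell strings avoids quoting problems when the experiment command contains spaces or special characters.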

The unpacking step reproduces the experiment from the .rpz file, using the command-line tool reprounzip. ReproUnzip offers different unpacking methods, from simply decompressing the files in a directory to starting a full virtual machine, and they can be used interchangeably on the same packed experiment. It is also possible to automatically replace input files and command-line arguments. Reviewers can unpack .rpz files on Linux, Windows, and Mac OS X, since ReproUnzip can unpack the experiment in a container (Docker) or virtual machine (Vagrant). This step also has a graphical user interface option for users unfamiliar with the command line.
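As a rough illustration of how the same package can be handed to different unpackers, the sketch below builds the `reprounzip <unpacker> setup` and `reprounzip <unpacker> run` command lines for a chosen unpacker. It does not run anything. The helper name `unpacking_commands` and the package and directory names are hypothetical. The `docker` and `vagrant` unpackers assume the corresponding reprounzip plugins are installed.

```python
import shlex
from typing import List

# Unpackers offered by reprounzip; "docker" and "vagrant" come from plugins.
UNPACKERS = ("directory", "chroot", "docker", "vagrant")


def unpacking_commands(package_path: str, target_dir: str,
                       unpacker: str = "docker") -> List[List[str]]:
    """Return the reprounzip invocations that set up and re-run an experiment."""
    if unpacker not in UNPACKERS:
        raise ValueError(f"unknown unpacker {unpacker!r}; expected one of {UNPACKERS}")
    return [
        # 1. Extract the package into target_dir using the chosen method.
        ["reprounzip", unpacker, "setup", package_path, target_dir],
        # 2. Re-run the packed experiment from the prepared directory.
        ["reprounzip", unpacker, "run", target_dir],
    ]


if __name__ == "__main__":
    # Hypothetical package and directory names, for illustration only.
    for name in ("directory", "docker"):
        for cmd in unpacking_commands("experiment.rpz", f"unpacked-{name}", name):
            print(shlex.join(cmd))
```

Only the unpacker name changes between methods, which is what lets one packed experiment be reproduced in a plain directory, a container, or a virtual machine.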

A ReproZip package can be shared with others, who can unpack, inspect, and reproduce the computational sequence either in the repository where the package was generated or in their own environment. To do so, third-party researchers need:

  • The reprounzip component and the unpacker of their choice: reprounzip-docker to use Docker, or reprounzip-vagrant to use Vagrant.

  • The software used by the unpacker, i.e., Docker, Vagrant, or VirtualBox.
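The prerequisites listed above can be checked before an unpack is attempted. The sketch below is a minimal example that looks for the required executables on the PATH. The mapping of unpackers to executables is an assumption made for illustration (in particular, `VBoxManage` is used here as the indicator that VirtualBox is installed). The check does not verify that the reprounzip-docker or reprounzip-vagrant plugins themselves are installed.

```python
import shutil
from typing import Callable, Dict, List, Optional

# Assumed mapping from unpacker to the executables it needs on the PATH.
REQUIRED_TOOLS: Dict[str, List[str]] = {
    "docker": ["reprounzip", "docker"],
    "vagrant": ["reprounzip", "vagrant", "VBoxManage"],
}


def missing_tools(unpacker: str,
                  which: Callable[[str], Optional[str]] = shutil.which) -> List[str]:
    """Return the executables required by `unpacker` that cannot be found."""
    try:
        required = REQUIRED_TOOLS[unpacker]
    except KeyError:
        raise ValueError(f"no prerequisites known for unpacker {unpacker!r}") from None
    # `which` returns None when an executable is not on the PATH.
    return [tool for tool in required if which(tool) is None]


if __name__ == "__main__":
    for name in REQUIRED_TOOLS:
        missing = missing_tools(name)
        status = "ready" if not missing else "missing: " + ", ".join(missing)
        print(f"{name}: {status}")
```

The `which` parameter is injectable so the check can be exercised without the tools being present on the machine.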

Even though this setup is only required once, having to download and install these tools can be a heavy and intrusive burden, especially in the author-reviewer scenario, where reviewers have a short turnaround time. A researcher may therefore prefer to reproduce an experiment on the repository that hosts it. In that case, the repository must provide the means to unpack and re-run the ReproZip package, along with the software required by the corresponding unpacker.

How LABDRIVE supports reproducibility by means of migration

Vendor and technology independence – To make any future migration from the platform to another preservation platform as easy as possible, the design provides for making every content property and attribute accessible through the API, so that extracting all content and its associated metadata (integrity information, events, etc.) is possible and simple.

Exit and Migration Strategies – The ability to plan and ensure that archiving and preservation service components can be replaced efficiently with minimal disruption. Flexibility in the available deployment models lets solutions run on public clouds, private clouds, or in a hybrid on-premises/public cloud model. This is an essential part of business continuity and disaster recovery plans, and of establishing an exit strategy that avoids vendor lock-in. Full support for the OAIS Information Model, including the ability to create complete OAIS Archival Information Packages, ensures that the fundamental exit strategy is built into the solution.

More specifically, Other Representation Information can capture the virtual machines, compute containers, or software required by a specific dataset. This Other Representation Information is integrated into the Representation Information Network as part of a complete Archival Information Package.

Preservation Challenges of ReproZip

ReproZip is itself software and must itself be preserved, for example using Emulation/Virtualisation or by rebuilding it from source code, which may require updating it to fit changes in the build environment.

A third option is ReproServer (https://github.com/ViDA-NYU/reproserver), an open source web application that allows users to reproduce experiments from the comfort of their web browser.

Sources:

Chirigati, F., Rampin, R., Shasha, D., and Freire, J. (2016). ReproZip: Computational Reproducibility With Ease. SIGMOD’16, June 26-July 01, 2016, San Francisco, CA, USA, 2085–2088. https://doi.org/10.1145/2882903.2899401

Rampin, R., Chirigati, F., Shasha, D., Freire, J., and Steeves, V. (2016). ReproZip: The Reproducibility Packer. Journal of Open Source Software, 1(8), 107. https://joss.theoj.org/papers/10.21105/joss.00107

Kepler: https://kepler-project.org
ReproServer: https://github.com/ViDA-NYU/reproserver
Figure: Step 1 in ReproZip, packing (Source: Rampin, Chirigati, Shasha, Freire and Steeves, 2016)
Figure: Step 2 in ReproZip, unpacking (Source: Rampin, Chirigati, Shasha, Freire and Steeves, 2016)