Virtual machines as Other RepInfo

Virtual machines (VMs) can provide independence from specific hardware and software, but the VMs must themselves be preserved.

Reproducibility by means of virtualisation

Virtualisation belongs to Emulation, one of the approaches for preserving computational artifacts. Emulation consists of capturing all or part of a hardware or software system so that it can be used in a different environment. Containerisation is also a form of Emulation. Both Virtualisation and Containerisation can encapsulate all the digital artifacts in a single package: operating system components, scripts, code and even data.

A key aspect of virtualisation is the use of virtual machines (VMs). VMs encapsulate analytical software and its dependencies within a “guest” operating system, which may differ from the main (“host”) operating system. In addition to the VM, scripts, code, and data are necessary to execute a computational analysis.

VMs can be built and executed in two different ways:

Using hypervisor software or virtual machine applications on a local computer, or on hosted on-prem infrastructure within a repository, such as those created within LABDRIVE. These tools rely on interaction with the guest operating system via a graphical user interface, and they can also be combined with automation systems.
Using tools and hosting services provided by public scientific clouds or by commercial clouds.

Hypervisors support the creation and management of VMs by abstracting a computer’s software from its hardware. They make virtualisation possible by translating requests between the physical and virtual resources. Examples of this software are VirtualBox (https://www.virtualbox.org) and Xen Project (https://xenproject.org), both open source, and VMware (https://www.vmware.com), which is partially open source. Once built, a VM can be executed on practically any desktop, laptop, or server, irrespective of the main (“host”) operating system on the computer. For example, even though a scientist’s computer may be running a Windows operating system, they may perform an analysis on a Linux operating system that is running concurrently, within a VM, on the same computer.

The original researcher (producers class) has full control over the virtual (“guest”) operating system, and thus can install software and modify configuration settings as necessary. In addition, a VM can be constrained to use specific amounts of computational resources (e.g., computer memory, processing power), thus enabling system administrators to ensure that multiple VMs can be executed simultaneously on the same computer without impacting each other’s performance. After executing an analysis, the researcher can export the entire VM to a single, binary file.
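As an illustration of constraining resources and exporting a VM to a single file, the following minimal sketch builds the corresponding VirtualBox command lines without running them. It assumes VirtualBox is the hypervisor; the VM name `analysis-vm`, the resource values, and the output file name are hypothetical placeholders, and other hypervisors use different commands.

```python
import shlex
from typing import List


def resource_limit_command(vm_name: str, memory_mb: int, cpus: int) -> List[str]:
    """Build a VBoxManage call that sets the VM's memory (in MB) and CPU count.

    VirtualBox applies `modifyvm` settings only while the VM is powered off.
    """
    if memory_mb <= 0 or cpus <= 0:
        raise ValueError("memory_mb and cpus must be positive")
    return ["VBoxManage", "modifyvm", vm_name,
            "--memory", str(memory_mb), "--cpus", str(cpus)]


def export_command(vm_name: str, output_path: str) -> List[str]:
    """Build a VBoxManage call that exports the VM to one appliance file."""
    return ["VBoxManage", "export", vm_name, "--output", output_path]


if __name__ == "__main__":
    # Hypothetical VM name and limits, printed as shell commands for review.
    for cmd in (resource_limit_command("analysis-vm", 4096, 2),
                export_command("analysis-vm", "analysis-vm.ova")):
        print(shlex.join(cmd))
```

Printing the commands instead of executing them keeps the sketch safe to run on a machine without VirtualBox; passing each list to `subprocess.run` would perform the actual operation.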

When using VMs to support reproducibility, it is important that other scientists can not only re-execute the analysis, but also examine the scripts and code used within the VM. Although it is possible for others to examine the contents of a VM directly, it is preferable to store the scripts and code in public repositories —separately from the VM— so others can examine and extend the analysis more easily.

The original researcher can automate the process of building and configuring VMs using tools such as Vagrant (https://www.vagrantup.com) or Vortex (https://github.com/websecurify/node-vortex). With either tool, users write text-based configuration files that provide instructions for building VMs and allocating computational resources to them. These configuration files can also specify analysis steps. Because the files are text-based and relatively small (usually a few kilobytes), scientists can share them easily and track different versions via source control repositories. This approach also mitigates problems that might arise during the analysis stage. For example, even when a computer’s host operating system must be reinstalled because of a hardware failure, the VM can be recreated with relative ease.
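To make the idea of a small, text-based configuration file concrete, the sketch below generates a Vagrantfile that names a base box, allocates memory and CPUs through the VirtualBox provider, and runs one analysis step as a shell provisioner. The function `render_vagrantfile`, the box name, the resource values, and the script path are illustrative assumptions, not settings taken from any particular project.

```python
def render_vagrantfile(box: str, memory_mb: int, cpus: int,
                       analysis_command: str) -> str:
    """Return the text of a minimal Vagrantfile for a VirtualBox-backed VM."""
    # Keep the sketch simple: refuse values that would need Ruby string escaping.
    for value in (box, analysis_command):
        if '"' in value or "\\" in value or "#{" in value:
            raise ValueError("value contains characters this sketch does not escape")
    lines = [
        'Vagrant.configure("2") do |config|',
        f'  config.vm.box = "{box}"',
        '  config.vm.provider "virtualbox" do |vb|',
        f'    vb.memory = {memory_mb}',
        f'    vb.cpus = {cpus}',
        '  end',
        # The analysis step runs inside the guest when the VM is provisioned.
        f'  config.vm.provision "shell", inline: "{analysis_command}"',
        'end',
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical box and script; /vagrant is Vagrant's default synced folder.
    print(render_vagrantfile("ubuntu/focal64", 2048, 2,
                             "Rscript /vagrant/analysis.R"))
```

The resulting file is only a few hundred bytes, so it can be committed to a source control repository alongside the analysis scripts, and `vagrant up` can rebuild the VM from it.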

As noted earlier, original researchers can use a VM that has been pre-packaged for a particular research discipline. In these cases, the scripts for building this VM are stored in a public repository. Some prominent examples are as follows.

CloudBioLinux (http://cloudbiolinux.org) contains a variety of bioinformatics tools commonly used by genomics researchers.
Chameleon (https://www.chameleoncloud.org) and Jetstream (http://jetstream-cloud.org) are scientific cloud services that can package up environments for sharing, aimed at computer scientists and the broader long-tail research communities respectively.
The DASPOS project (https://daspos.crc.nd.edu) and CERN’s Open Data portal (http://opendata.cern.ch/?ln=en) are cloud-based services for capturing and preserving research environments, with solutions specific to some domains such as the high-energy physics community.

How LABDRIVE supports reproducibility by means of emulation

LABDRIVE can add emulation services as plug-ins, without changing its architecture. More specifically, Other Representation Information can capture the VMs, compute containers, or software required by a specific dataset. This Other Representation Information is integrated into the Representation Information Network to form a complete Archival Information Package. To create Representation Information Networks for reproducibility of research, see Method 1: Representation Information Networks within the OAIS model.
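The relationship between a dataset, its VM, and the software the VM itself depends on can be pictured as a small dependency graph. The sketch below is purely illustrative and is not LABDRIVE's data model or API: the class names, field names, identifiers, and locations are all hypothetical. It shows how an Other Representation Information entry for a VM image might be recorded with a checksum and linked to the further items needed to run it.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class OtherRepInfo:
    """Hypothetical record for one piece of Other Representation Information."""
    identifier: str
    kind: str          # e.g. "virtual-machine", "container", "software"
    location: str      # where the preserved file is held
    sha256: str        # fixity value for the preserved file
    depends_on: List[str] = field(default_factory=list)


@dataclass
class RepInfoNetwork:
    """Hypothetical network of RepInfo entries attached to one dataset."""
    dataset_id: str
    nodes: Dict[str, OtherRepInfo] = field(default_factory=dict)

    def add(self, node: OtherRepInfo) -> None:
        self.nodes[node.identifier] = node

    def closure(self, identifier: str) -> List[str]:
        """List an entry and everything it needs, dependencies first."""
        ordered: List[str] = []
        seen = set()

        def visit(node_id: str) -> None:
            if node_id in seen:
                return
            seen.add(node_id)
            for dep in self.nodes[node_id].depends_on:
                visit(dep)
            ordered.append(node_id)

        visit(identifier)
        return ordered


def sha256_of(data: bytes) -> str:
    """Compute the SHA-256 hex digest used as the fixity value."""
    return hashlib.sha256(data).hexdigest()


if __name__ == "__main__":
    network = RepInfoNetwork(dataset_id="dataset-001")
    # Placeholder byte strings stand in for the real preserved files.
    network.add(OtherRepInfo("hypervisor", "software", "repo/hypervisor.tar",
                             sha256_of(b"hypervisor bytes")))
    network.add(OtherRepInfo("analysis-vm", "virtual-machine", "repo/analysis-vm.ova",
                             sha256_of(b"vm image bytes"), depends_on=["hypervisor"]))
    print(network.closure("analysis-vm"))
```

Listing dependencies first reflects the point that the VM must itself be preserved together with whatever is needed to run it.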

In addition, because LABDRIVE allows all components to be containerised with Kubernetes, the artifacts held within these technologies can be run independently of on-prem infrastructure.


