Docker and other containers as Other RepInfo

Reproducibility by means of containerisation

Containerisation is part of one of the approaches for preserving computational artifacts: Emulation. Emulation consists of capturing all or part of a hardware or software environment so that it can be used in a different environment; Virtualisation is also involved in Emulation. Both Containerisation and Virtualisation are able to encapsulate all the digital artifacts into a single package: operating system components, scripts, code and even data.

A container package contains both the code and all its dependencies, so the application runs quickly and reliably from one computing environment to another. The difference with respect to virtualisation solutions is that containers interact directly with the operating system of the receiving computer and only add the necessary extensions. Using ‘recipes’, container images can be specified and built to include a snapshot of all software dependencies readily installed in a convenient image file. Since all required software is installed into the container image, the computing environment is independent of the future availability of external sources and completely self-contained. At execution time of a workflow, commands can be run within a container instance using the exact same image file as during the original analysis. Since the container image is just a single file, it can easily be stored and shared.

Researchers can build container images that work similarly to virtual images, but instead of bundling all the data and software dependencies in a single file, the container image is built from stackable pieces. Being a more lightweight solution, containers are smaller and have less overhead than virtual images.

Two different instances of the containerisation approach can be distinguished: Docker and Singularity containers.

Docker

One of the most useful instances of the containerisation approach is Docker (https://www.docker.com/), available for both Linux and Windows-based applications.

Within the Docker architecture, there are three key components: Dockerfiles, Images and Containers.

  • A Dockerfile is a machine- and human-readable recipe for building the computational environment (analogous to source code in a compiled programming language). It is used to build an image with the docker build command, analogous to compiling the source code into an executable (binary) file. A minimal Dockerfile sketch follows this list.

  • An Image is an executable file that includes the application, e.g., the programming language interpreter needed to run a workflow, and the system libraries required by the application to run. It is structured in stacked layers, where each layer includes software components to address a particular need.

  • A Container is simply another process on the host machine that has been isolated from all other processes running on it.
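
As an illustration of how these components relate, the following is a minimal, hypothetical Dockerfile for an R-based workflow; the base image, package and script names are assumptions for the example only.

    # Hypothetical Dockerfile for an R-based analysis.
    # Start from an existing base image that already contains R:
    FROM rocker/r-ver:4.2.0
    # Install the R packages the workflow depends on (adds a layer):
    RUN install2.r --error dplyr
    # Copy the analysis script into the image (adds another layer):
    COPY analysis.R /home/analysis.R
    # Main executable exposed as the entry point of the running container:
    ENTRYPOINT ["Rscript", "/home/analysis.R"]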

The process of building a Docker container is as follows.

  1. The original researcher writes a text file that follows a particular format, called a Dockerfile. A Dockerfile consists of a sequence of instructions to copy files and install software.

  2. Each instruction adds a layer to the image; layers can be cached across image builds to minimise build and download times.

  3. Once an image is built or downloaded, it is launched as a running instance known as a container. Images have a main executable exposed as an “entry point” that is started when they are run as stateful containers. Further, containers can be modified, stopped, restarted and purged, as sketched in the commands below.
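
Under those assumptions, the build-and-run cycle could look as follows; the image tag mylab/analysis:1.0 and the container name are hypothetical, and these commands are a sketch rather than a prescribed procedure.

    # Build an image from the Dockerfile in the current directory
    docker build -t mylab/analysis:1.0 .
    # Launch a container (a running instance of the image); the entry
    # point defined in the Dockerfile is executed
    docker run --name analysis-run mylab/analysis:1.0
    # Containers can later be inspected, restarted, stopped and purged
    docker ps -a
    docker restart analysis-run
    docker stop analysis-run
    docker rm analysis-run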

Dockerfiles are text-based, so they can be shared easily and can be tracked and versioned in source control repositories. Once a Docker container has been built, its contents can be exported to a binary file; these files are generally smaller than virtual machine files, so they can be shared more easily via dedicated platforms, such as Docker Hub (http://hub.Docker.com), or via a specific repository —for example, a repository built with LABDRIVE software.
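
For example, assuming the hypothetical image from the previous sketches, its contents could be exported to a single binary file or pushed to a registry roughly as follows.

    # Export the image to a single binary (tar) file that can be archived
    docker save -o analysis-1.0.tar mylab/analysis:1.0
    # Re-import the exported file on another machine
    docker load -i analysis-1.0.tar
    # Alternatively, push the image to a registry such as Docker Hub
    # (requires docker login and an appropriately named repository)
    docker push mylab/analysis:1.0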

Within a given research lab, scientists might create general-purpose images to support functionality for multiple projects, and specialized images to address the needs of specific projects. An advantage of Docker’s modular design is that when components within an image are updated, Docker only needs to track the specific layers that have changed; users who wish to update to a newer version only have to download a relatively small update.
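
A sketch of this modular pattern, with all image names hypothetical: a lab-wide base image provides shared tooling, and a project-specific image builds on top of it, so a rebuild or update only affects the project-specific layers.

    # Dockerfile for a hypothetical lab-wide base image, built and
    # pushed as mylab/base:1.0
    FROM ubuntu:22.04
    RUN apt-get update && \
        apt-get install -y --no-install-recommends python3 python3-pip && \
        rm -rf /var/lib/apt/lists/*

    # Dockerfile for a project-specific image that reuses the shared base;
    # only the layers added after the FROM line change between updates
    FROM mylab/base:1.0
    COPY requirements.txt /tmp/requirements.txt
    RUN pip3 install --no-cache-dir -r /tmp/requirements.txt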

Docker has a significant presence in many science fields, such as genomics, phylogenomics, bioinformatics, deep learning and high-energy physics. The Docker software has had contributors from many organizations, including Google, IBM, and Microsoft, as well as the team at Docker, Inc. It has been widely adopted in industry, which has led to fast innovation and sustainability from which researchers can directly benefit.

The Docker architecture is really simple, but it is important to note that to leverage the full potential of Docker, the existence of a scripted scientific workflow is assumed, i.e., it must be possible, at least at a given point in time, to execute the full process with a fixed set of commands. For example, make prepare_data followed by Rscript analysis.R (when using the R programming language) or python3 my-workflow.py (when using the Python programming language). A workflow that does not support scripted execution does not fit well with containerisation and is also out of scope for reproducible research.
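
As an illustration, a fully scripted workflow could be reproduced inside a container with a fixed sequence of commands such as the following; the image name mylab/r-env:1.0 is hypothetical and is assumed to contain make, R and the required packages.

    # Step 1: prepare the data (assumes a Makefile target named prepare_data);
    # the project directory is mounted into the container at /work
    docker run --rm -v "$PWD":/work -w /work mylab/r-env:1.0 make prepare_data
    # Step 2: run the analysis script in exactly the same environment
    docker run --rm -v "$PWD":/work -w /work mylab/r-env:1.0 Rscript analysis.R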

In addition, workflows interacting with many petabytes of data and executed on high-performance computing (HPC) infrastructures are out of scope for Docker, as it is not ideal for isolating the software dependencies of analysis pipelines: it requires root access during execution. Docker is fine for workflows running on a single machine —for example, a researcher’s own laptop computer or a virtual server— but this requirement poses a severe limitation on shared resources like cloud or HPC systems, where users typically do not have root access for privacy and security reasons. For these scenarios, an alternative to Docker is Singularity containers.

Singularity containers

Singularity containers (https://sylabs.io/singularity/) overcome the limitations of Docker and allow containers to run in user space. This provides two key strengths:

  • Singularity container images can simply be uploaded to an HPC cluster and run without any elevated permissions. The flexibility to define a computing environment that can be instantiated exactly the same way on laptops and on HPC systems means that users only need to learn one technology, instead of hitting a barrier when moving to more computationally intensive tasks.

  • With Singularity container images, it is easier to test code locally before issuing long-running tasks to the HPC system, since the computing environment (container image) is exactly the same. In fact, once Singularity is installed on the respective HPC system, no other software needs to be installed there: users can build their container images on their local laptops and simply upload them to the HPC system (a minimal sketch follows this list).
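
A minimal sketch of this laptop-to-cluster workflow, with the image name, analysis script and cluster address assumed for the example:

    # On the laptop: pull (or build) the container image as a single .sif file
    singularity pull analysis.sif docker://mylab/analysis:1.0
    # Copy the image file to the HPC system
    scp analysis.sif user@hpc.example.org:
    # On the HPC system: run the workflow inside the container, in user
    # space and without any elevated permissions
    singularity exec analysis.sif Rscript analysis.R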

The container software Singularity uses its own format, called the Singularity recipe, but it can also import and run Docker images. Because Singularity supports Docker integration, the ample amount of work that has gone into developing Docker images can be reused without relying on the user to install the Docker engine. This is done by harnessing the Docker Registry API, a RESTful interface that gives access to image manifests, each of which contains information about the image layers. Docker images hosted in a LABDRIVE repository can also be imported and run by Singularity through the API mechanisms provided, for example, by LABDRIVE.
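
As an illustration, a Singularity recipe can bootstrap directly from a Docker image; the base image, package and script names below are assumptions for the example.

    # analysis.def -- hypothetical Singularity recipe built on a Docker image
    Bootstrap: docker
    From: rocker/r-ver:4.2.0

    # %files copies the analysis script from the host into the image
    %files
        analysis.R /home/analysis.R

    # %post runs once at build time to add software to the image
    %post
        install2.r --error dplyr

    # %runscript defines the command executed when the container is run
    %runscript
        Rscript /home/analysis.R

The recipe can be built into a single image file with singularity build analysis.sif analysis.def (typically on a machine where the user has administrative rights) and then executed wherever Singularity is installed.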

Good practices for the reproducibility of Dockerfiles

A set of good practices for the development of Dockerfiles, aimed at underpinning their future reproducibility, has been identified by Nüst, Sochat, Marwick, Eglen, Head, Hirst and Evans*, based on a deep and extensive study. A Dockerfile sketch illustrating several of these rules follows the list.

  • Rule 1: Use available tools.

  • Rule 2: Build upon existing images.

  • Rule 3: Format for clarity.

  • Rule 4: Document within the Dockerfile.

  • Rule 5: Specify software versions.

  • Rule 6: Use version control.

  • Rule 7: Mount datasets at run time.

  • Rule 8: Make the image one-click runnable.

  • Rule 9: Order the instructions.

  • Rule 10: Regularly use and rebuild containers.
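
To make a few of these rules concrete, the following hypothetical Dockerfile builds upon an existing image with a pinned version (Rules 2 and 5), documents itself (Rule 4), leaves datasets out of the image so they can be mounted at run time (Rule 7) and defines a default command so the image is one-click runnable (Rule 8); all names are illustrative.

    # Dockerfile for the hypothetical "my-analysis" project (Rule 4:
    # document within the Dockerfile).
    # Rules 2 and 5: build upon an existing image and pin its version;
    # the tag also fixes the R version used.
    FROM rocker/r-ver:4.2.0
    LABEL maintainer="researcher@example.org"
    # Install the required R packages (versions can be pinned further via
    # a snapshot repository).
    RUN install2.r --error dplyr ggplot2
    COPY analysis.R /home/analysis.R
    # Rule 7: data are not copied into the image; mount them at run time,
    # e.g. docker run --rm -v "$PWD/data":/data my-analysis
    # Rule 8: a default command makes the image one-click runnable.
    CMD ["Rscript", "/home/analysis.R"]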

Because the Singularity container software can also import and run Docker images, these rules are, to some extent, transferable to Singularity recipes.

*Nüst, D., Sochat, V., Marwick, B., Eglen, S. J., Head, T., Hirst, T., and Evans, B. D. (2020). Ten simple rules for writing Dockerfiles for reproducible data science. PLoS Computational Biology, 16(11), e1008316. https://doi.org/10.1371/journal.pcbi.1008316.

How LABDRIVE supports reproducibility by means of emulation

LABDRIVE is able to add emulation services in a plug-in mode, without changing the architecture. More specifically, Other Representation Information can capture the virtual machines, compute containers or software that are required for a specific dataset. This Other Representation Information will be integrated within the Representation Information Network to form a complete Archival Information Package. For developing Representation Information Networks within LABDRIVE, see Preservation actions for software.

In addition, as LABDRIVE allows all of its components to be containerised and run under Kubernetes, the artifacts packaged with these technologies can be executed independently of the on-premises infrastructure.
