Architecture and overview

Data containers

Every object in LABDRIVE is preserved in a Data Container. Data containers define the policies, functions, permissions and underlying storage for the files they contain. They have many similarities with Amazon S3 buckets or Azure containers, as they can hold files, folders, metadata, etc.

Metadata

LABDRIVE allows users to add metadata to any file, folder or data container preserved in the system. Metadata can be generated automatically on ingest by Functions, added manually, or both. It can be searched (even using complex queries), exported or consumed by other systems.

Combining object-level metadata with a powerful search engine makes complex actions easy: for instance, downloading all datasets in the repository for a given experiment and date range, or retrieving all datasets for a particular region.
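
As a rough illustration, such a search could be driven from a script. The endpoint, query syntax and field names below are hypothetical, shown only to convey the idea; consult the LABDRIVE API documentation for the real interface:

    import requests

    # Hypothetical endpoint and query fields, for illustration only.
    API = "https://your-instance.example.com/api"
    HEADERS = {"Authorization": "Bearer <your-api-token>"}

    # e.g. "all datasets for experiment ALPHA captured during 2021"
    query = {
        "metadata.experiment": "ALPHA",
        "metadata.capture_date": {"from": "2021-01-01", "to": "2021-12-31"},
    }

    response = requests.post(f"{API}/search", json=query, headers=HEADERS)
    for dataset in response.json()["results"]:
        print(dataset["name"])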

On top of that, organizations can use LABDRIVE Functions to extract or create metadata automatically upon upload. For instance, if you upload images containing EXIF metadata, you can then filter your content based on it.
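
For example, EXIF fields can be read from an image with standard tooling, which is the kind of extraction a Function could perform on ingest. This sketch uses the Pillow library and stops at extraction; attaching the result as metadata is platform-specific:

    from PIL import Image, ExifTags

    # Read EXIF tags from an image, as an ingest Function might do.
    img = Image.open("photo.jpg")
    exif = {ExifTags.TAGS.get(tag_id, tag_id): value
            for tag_id, value in img.getexif().items()}

    print(exif.get("DateTime"), exif.get("Model"))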

Several types of metadata exist in LABDRIVE. Some are generated by the platform (like the events users generate, or the characterization of the preserved files), while others are usually introduced by the user, such as descriptive metadata. For the latter, the user can choose from multiple field types when creating a metadata schema, such as strings, dates, enumerated values, links to other objects, etc.

Creating OAIS Representation Information Networks is also possible, as LABDRIVE is fully aligned with OAIS.

Further details are available in OAIS and ISO 16363 and, in particular, in LABDRIVE support for OAIS Conformance.

Reproducibility and digital notebooks

LABDRIVE is integrated with Jupyter Notebooks.

Jupyter notebooks are documents containing an ordered list of input/output cells, which can hold code (usually Python, though other languages can be used), text (using Markdown), mathematics, plots and rich media. They can be executed step by step or in full, in an easy-to-use computational environment integrated with LABDRIVE.

The source code used to create, read and analyze scientific and research data is usually written by researchers as Jupyter notebooks, and it must be preserved along with the datasets: it is often the best existing Provenance and Structure metadata for a dataset.

LABDRIVE allows users to keep the Jupyter notebooks containing the code that reads and "understands" their data as part of the dataset they are creating.

The possibilities are endless: you can not only keep the original source code preserved, but also execute it without leaving LABDRIVE.
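
A typical notebook cell preserved alongside a dataset might look like the following; the file path and column names are illustrative and stand for a file kept in the same data container:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Read a data file that lives alongside this notebook
    # (illustrative path and columns).
    df = pd.read_csv("measurements_2021.csv")

    # The notebook documents how the data is structured
    # and how to interpret it.
    df.groupby("sensor_id")["temperature"].mean().plot(kind="bar")
    plt.ylabel("Mean temperature (°C)")
    plt.show()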

Storage

When using LABDRIVE to manage and preserve your content, you may have some content that you want to be immediately accessible to your users, while for other content you may want to benefit from a lower storage cost if you do not need to access it frequently. This is usually referred to as hot or cold storage.

It has been common to use temperature terminology, specifically a range from cold to hot, to describe the levels of tiered data storage. The levels are differentiated according to users' needs and how frequently the data needs to be accessed. These terms likely originated from where the data was historically stored: hot data was close to the heat of the spinning drives and the CPUs, while cold data was on tape or a drive far away from the data center floor.

There are no standard industry definitions of what hot and cold mean when applied to data storage, so they are used in different ways by different providers. Generally, though, hot data requires the fastest and most expensive storage because it is accessed more frequently, and cold (or cooler) data that is accessed less frequently can be stored on slower, and consequently, less expensive media.

LABDRIVE offers a great degree of flexibility when managing storage: data in containers can be preserved using multiple storage providers, can be migrated from one storage provider to another, and can be distributed across more than one storage provider.

Files inside a container can be in different storage classes (cold, hot, etc.), and LABDRIVE supports changing them using its API, the Management Interface or LABDRIVE Functions, which enable certain advanced uses (such as moving your files to colder storage when they are not being used).
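
As a minimal sketch of the API-driven approach (the endpoint, file identifier and payload fields below are hypothetical; check the API reference for the actual calls):

    import requests

    API = "https://your-instance.example.com/api"
    HEADERS = {"Authorization": "Bearer <your-api-token>"}

    # Ask the platform to move a file to a colder storage class
    # (illustrative endpoint and field names).
    requests.patch(f"{API}/files/12345/storage",
                   json={"storage_class": "cold"}, headers=HEADERS)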

LABDRIVE uses a storage-provider-agnostic approach: the platform is able to work with multiple storage backends.

Since LABDRIVE version 2021.05.20, Amazon AWS is used as the storage provider when using LABDRIVE Cloud (as a service), and additional storage providers will be supported in the future.

You can see how to transition from one storage class to another in Storage mode transitions.

Storage types and storage classes

Every object preserved in LABDRIVE is assigned to a storage that has a type and a class. The storage type refers to the provider and geographic location of the underlying storage (AWS in Europe, for instance), while the storage class defines the tier in which the file is stored, from the range offered by the provider in that region (S3 standard, S3 cold).

The main concepts are:

  • Ingestion storage class: the storage class used to upload content to the platform. LABDRIVE keeps files in this class for a pre-defined period of time (to allow their processing) and then initiates a transfer to the Default Storage Class.

  • Default Storage Class: the storage class to which the user wants newly uploaded content to be moved.

  • Current Storage Class: the storage class in which a file is at a given point in time.

Storage when using Amazon AWS cloud storage

Described top-down, your LABDRIVE instance is linked to one or more AWS S3 buckets. Each bucket contains one or more Data Containers, and each Data Container contains one or more files.

Each S3 bucket is located in a data region, making it possible to distribute your data across multiple regions. The following regions are available:

  • US East (Ohio)

  • US East (N. Virginia)

  • US West (N. California)

  • US West (Oregon)

  • Africa (Cape Town)

  • Asia Pacific (Hong Kong)

  • Asia Pacific (Mumbai)

  • Asia Pacific (Osaka)

  • Asia Pacific (Seoul)

  • Asia Pacific (Singapore)

  • Asia Pacific (Sydney)

  • Asia Pacific (Tokyo)

  • Canada (Central)

  • China (Beijing)

  • China (Ningxia)

  • Europe (Frankfurt)

  • Europe (Ireland)

  • Europe (London)

  • Europe (Milan)

  • Europe (Paris)

  • Europe (Stockholm)

  • South America (São Paulo)

  • Middle East (Bahrain)

[NOTE: more regions will be available in the near future]

File version history

LABDRIVE automatically keeps your content versioned, retaining all previous versions of the files and allowing users to recover lost files if needed.
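
As an illustration of recovering an earlier version through the API (the endpoints and fields below are hypothetical; the actual calls are documented in the API reference):

    import requests

    API = "https://your-instance.example.com/api"
    HEADERS = {"Authorization": "Bearer <your-api-token>"}

    # List the versions kept for a file, then restore one of them
    # (illustrative endpoints and payload fields).
    versions = requests.get(f"{API}/files/12345/versions",
                            headers=HEADERS).json()
    oldest = versions[-1]["version_id"]
    requests.post(f"{API}/files/12345/restore",
                  json={"version_id": oldest}, headers=HEADERS)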

Other measures are in place to protect the content when managing multiple versions, like warnings on potential data loss.

Data integrity

There are two relevant areas when thinking about data integrity:

  • When the content is ingested into or retrieved from LABDRIVE

  • While the content is in LABDRIVE

When the content is ingested into or retrieved from LABDRIVE

It is important for the user to maintain the custody/integrity chain when moving data across platforms. Fixity information (hashes) should be extracted at the source and verified again once the content has been ingested into LABDRIVE, making sure that everything has been transferred correctly.

LABDRIVE offers plenty of options to do this, ranging from md5/sha1 manifest verification to BagIt package support (versions 0.97 and 1.0), API queries, CSV files, and several additional options.
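
For instance, on the client side you can verify a BagIt package before and after transfer using the Library of Congress bagit library; how you compare the result against LABDRIVE (manifests, CSV files or API queries) is up to your workflow:

    import bagit

    # Validate a BagIt package: recomputes payload hashes and checks
    # them against the bag's manifest files.
    bag = bagit.Bag("/data/my_dataset_bag")
    try:
        bag.validate()
        print("Bag is complete and valid")
    except bagit.BagValidationError as err:
        print("Fixity problem:", err)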

While the content is in LABDRIVE

LABDRIVE takes care of generating and verifying the integrity of your content, both periodically and in response to certain events, like migrations across storage providers, retrievals, etc.

LABDRIVE automatically generates file hashes on file upload using multiple algorithms (17 out of the box, including CRC, Adler32, MD5, SHA1, SHA256, SHA512, etc.), and even allows users to define their own hashing algorithms.
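
As a point of comparison, computing several hashes in one pass over a file looks like this with Python's standard library (the algorithm list is just a subset of what the platform computes):

    import hashlib

    algorithms = ["md5", "sha1", "sha256", "sha512"]
    hashers = {name: hashlib.new(name) for name in algorithms}

    # Stream the file once, feeding every hasher, so large files
    # are never loaded fully into memory.
    with open("dataset.dat", "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            for h in hashers.values():
                h.update(chunk)

    for name, h in hashers.items():
        print(name, h.hexdigest())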

Functions

LABDRIVE Functions allow organizations to define code that the platform will execute on certain platform events, or that users can run on demand.

Functions are useful when you want the platform to behave in a specific way in response to external events, or when you want to add your own code to be executed on demand. You could perform these actions externally using the API, but Functions offer the following advantages:

  • Additional integration. Set it and forget it: with an API-based approach, only power users would be able to perform actions on the data. With Functions, you define the code you would like to execute when any user uploads content, and the platform will always execute it, without anyone needing to remember it or even know about it.

  • Better performance: the computing runs close to the data.

  • Easier implementation: the logic lives inside the platform, with no need for external API calls, scripts, etc.

There are many cases in which processing of the uploaded content is required. For instance, the user may want LABDRIVE to automatically uncompress every compressed file uploaded to the platform, to validate every BagIt package automatically, to be notified every time a user uploads a large dataset, or to check every uploaded file for corruption. All of these use cases can easily be addressed with LABDRIVE Functions.
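
As a sketch of the "uncompress on upload" use case (the handler signature and event payload below are hypothetical; the actual LABDRIVE Functions interface may differ):

    import zipfile

    # Hypothetical handler wired to an "on file uploaded" trigger.
    def on_file_uploaded(event):
        path = event["file_path"]  # assumed field in the trigger payload
        if zipfile.is_zipfile(path):
            with zipfile.ZipFile(path) as zf:
                zf.extractall(path + "_extracted")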

Functions are launched by triggers, which can also be defined by the users.

Finally, Functions can also be called on demand by the user, using either the web interface or the API, to perform complex operations easily and in a highly scalable way.
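
Invoking a Function on demand through the API might look like the following (illustrative endpoint, function name and parameters):

    import requests

    API = "https://your-instance.example.com/api"
    HEADERS = {"Authorization": "Bearer <your-api-token>"}

    # Run a previously defined Function against a data container
    # (hypothetical endpoint and payload).
    requests.post(f"{API}/functions/validate-bags/run",
                  json={"container_id": 42}, headers=HEADERS)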

Reports

LABDRIVE includes many out-of-the-box reports that can be used immediately.

Reports can also be scheduled to be generated at a later time.

Additionally, using Jupyter Notebooks and several integrated packages such as numpy, matplotlib, seaborn and bokeh, users can create their own dynamic reports and data analysis tools. Reports can be about the content stored in the platform, or even analyze the content of the datasets themselves!
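
A minimal example of the kind of custom report a notebook can produce, using numpy and matplotlib; the figures here are toy data standing in for values pulled from the platform:

    import numpy as np
    import matplotlib.pyplot as plt

    # Toy ingest-volume figures; in a real report these would come
    # from the platform's API or report data.
    months = np.arange(1, 13)
    gb_ingested = np.random.default_rng(0).integers(50, 500, size=12)

    plt.bar(months, gb_ingested)
    plt.xlabel("Month")
    plt.ylabel("GB ingested")
    plt.title("Ingest volume per month")
    plt.show()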

File transfer and sharing methods

LABDRIVE provides a scalable upload and download platform, tested in real-world use cases to be capable of sending or receiving more than 1 PB of content in less than 24 hours.

Multiple upload/download protocols are supported, including S3.
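
For example, an S3-compatible upload can be scripted with boto3; the endpoint URL, bucket/container name and credentials below are placeholders for the values your instance provides:

    import boto3

    # Upload through the S3-compatible interface (placeholder endpoint
    # and credentials; use the values provided for your instance).
    s3 = boto3.client(
        "s3",
        endpoint_url="https://s3.your-instance.example.com",
        aws_access_key_id="<access-key>",
        aws_secret_access_key="<secret-key>",
    )
    s3.upload_file("local/dataset.tar", "my-container", "dataset.tar")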

Drag and drop using the web browser is also available for users who prefer a simpler approach.

Preserved files can be publicly shared using the platform, allowing external users to access the content without needing authentication.

Users, permissions and federated access

LABDRIVE includes a highly granular permissions schema, including read or read/write permissions down to the container level, group management, etc.

Users can log in using built-in accounts, or using SAML-based authentication schemas involving federated organizations with complex access needs. LABDRIVE includes support for automatic account creation and automatic permissions assignment based on the attributes delivered by the Identity Provider.

As LIBNOVA is part of the UK Access Federation, almost every university in the world would be ready to start using it with a very simple configuration.

System architecture

LABDRIVE Cloud runs on a Kubernetes cluster on top of Amazon AWS infrastructure, offering multiple regions across the globe and impressive scalability and reliability.

LABDRIVE has been tested receiving 50 million files and 1 PB of data in less than 24 hours, scaling itself to more than 6,500 Kubernetes pods to process the workload.
