Platform architecture

LIBSAFE Go Cloud runs on the Amazon Web Services (AWS) cloud platform and uses AWS S3 storage as its primary storage technology.

The platform is Linux-based, using Kubernetes and other cloud native components at its core, with the following architecture:

Section I comprises the core platform modules, namely:

  • Group 1: User-facing services, including:

    • A: the user web interface and the API server

    • B: Any additional protocol support modules that are installed (to provide compatibility with non-S3 file transfer protocols, such as XRootD, SFTP, etc.).

    • C: The module that runs the Jupyter Notebook interfaces (and their corresponding VMs), which run in an isolated environment.

  • Group 2: Processing agents' cluster. Under the hood, the platform processes data in many ways. Agents are in charge of this processing, and all agents are grouped here. Agents cannot interact with the user directly and run as services in Kubernetes pods. They include:

    • D: File validation agents, which characterize and validate uploaded files, including scanning them for malware.

    • E: Property extractors, which extract metadata from ingested content.

    • F: Actions, which run user and system Functions.

    • G: Reports, which execute long-running functions that generate Reports and data analyses.

    • H: Infrastructure managers, which create content replicas, manage storage type transitions, and perform other infrastructure-related actions.

  • Group 3: Manages SAML and SSO interactions, automated account creation, and the user authentication process.

  • Group 4: Includes the platform security modules and the message queue that coordinates the platform's events (RabbitMQ).
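To illustrate how a message queue can coordinate events between the processing agents, here is a minimal sketch. It is not the platform's implementation: Python's in-memory `queue.Queue` stands in for RabbitMQ, and the event name `file.uploaded`, the handler functions, and the routing table are all hypothetical.

```python
import queue
from typing import Callable, Dict, List, Tuple

# Hypothetical agent handlers; the real agents run as services in Kubernetes pods.
def validate_file(path: str) -> str:
    return f"validated {path}"

def extract_properties(path: str) -> str:
    return f"extracted metadata from {path}"

# Hypothetical routing table: event type -> agents that react to it, in order.
HANDLERS: Dict[str, List[Callable[[str], str]]] = {
    "file.uploaded": [validate_file, extract_properties],
}

def drain(events: "queue.Queue[Tuple[str, str]]") -> List[str]:
    """Consume every queued event and run the agents registered for its type."""
    results: List[str] = []
    while not events.empty():
        event_type, path = events.get()
        for handler in HANDLERS.get(event_type, []):
            results.append(handler(path))
    return results

# An in-memory queue used here only as a stand-in for the message broker.
events: "queue.Queue[Tuple[str, str]]" = queue.Queue()
events.put(("file.uploaded", "reports/2023.pdf"))
log = drain(events)
```

The point of the design is that the service receiving the upload only publishes an event. It does not need to know which agents exist or in which order they process the file.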

Section II comprises the auxiliary elements that are needed for the platform to work:

  • Managed Kubernetes platform

  • Database service

  • Elasticsearch cluster

  • AWS S3 storage platform

Storage

When using the platform to manage and preserve your content, you may want some content to be immediately accessible to your users, while for other content you may prefer a lower storage cost because you do not need to access it frequently. This distinction is usually referred to as hot and cold storage.

Temperature terminology, specifically a range from cold to hot, is commonly used to describe the levels of tiered data storage. The levels are differentiated according to the users' needs and how frequently the data needs to be accessed. These terms likely originated from where the data was historically stored: hot data was close to the heat of the spinning drives and the CPUs, and cold data was on tape or on a drive far away from the data center floor.

There are no standard industry definitions of what hot and cold mean when applied to data storage, so different providers use the terms in different ways. Generally, though, hot data requires the fastest and most expensive storage because it is accessed more frequently, while cold (or cooler) data that is accessed less frequently can be stored on slower and, consequently, less expensive media.

The platform offers a great degree of flexibility when managing storage: data in containers can be preserved using multiple storage providers, migrated from one storage provider to another, and distributed across more than one storage provider.

Files inside a container can be in different storage classes (cold, hot, etc.), and the platform supports changing them using its API, the Management Interface, or platform Functions, which allow certain advanced uses (such as moving your files to colder storage when they are not used).
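As an illustration of the "move files to colder storage when they are not used" case, the sketch below shows only the selection logic such a Function might apply. It does not call the platform API. The `FileRecord` type, the `hot`/`cold` labels, and the 90-day threshold are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

@dataclass
class FileRecord:
    """Hypothetical view of a file: name, current class, and last access time."""
    name: str
    storage_class: str          # illustrative labels: "hot" or "cold"
    last_accessed: datetime

def select_for_cold(files: List[FileRecord], now: datetime,
                    max_idle_days: int = 90) -> List[FileRecord]:
    """Return hot files that have not been accessed for more than max_idle_days."""
    cutoff = now - timedelta(days=max_idle_days)
    return [f for f in files
            if f.storage_class == "hot" and f.last_accessed < cutoff]

now = datetime(2024, 1, 1)
files = [
    FileRecord("a.tif", "hot", datetime(2023, 6, 1)),    # idle for months
    FileRecord("b.tif", "hot", datetime(2023, 12, 20)),  # recently accessed
    FileRecord("c.tif", "cold", datetime(2022, 1, 1)),   # already cold
]
candidates = select_for_cold(files, now)
```

With this sample data, only `a.tif` is selected. The actual change of storage class would then be requested through the API, the Management Interface, or a Function.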

Storage types and storage classes

Every object preserved in the platform is assigned to a storage that has a type and a class. The storage type refers to the provider and geographic location of the underlying storage (AWS in Europe, for instance), while the storage class defines the mode in which the file is stored, chosen from the range offered by the provider in that region (S3 standard or S3 cold, for instance).

The main concepts are:

  • Ingestion storage class: the storage class used to upload content to the platform. The platform keeps files in this class for a predefined period of time (to allow their processing) and then initiates a transfer process to the Default Storage Class.

  • Default Storage Class: the storage class in which the user wants newly uploaded content to be stored.

  • Current Storage Class: the storage class in which a file is at a given point in time.
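The relationship between the three concepts above can be sketched as a small state change. This is a simplified model, not the platform's transfer process: the `StoredFile` type, the class names, and the 7-day retention period are assumptions chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class StoredFile:
    """Hypothetical record: current_class models the Current Storage Class."""
    name: str
    uploaded_at: datetime
    current_class: str

def due_for_transfer(f: StoredFile, ingestion_class: str,
                     now: datetime, retention: timedelta) -> bool:
    """True once the file has spent the predefined period in the ingestion class."""
    return f.current_class == ingestion_class and now - f.uploaded_at >= retention

def apply_transition(f: StoredFile, ingestion_class: str, default_class: str,
                     now: datetime, retention: timedelta) -> str:
    """Move the file to the default class when it is due; return its current class."""
    if due_for_transfer(f, ingestion_class, now, retention):
        f.current_class = default_class
    return f.current_class

INGESTION = "S3 standard"       # assumed ingestion storage class
DEFAULT = "S3 cold"             # assumed default storage class
RETENTION = timedelta(days=7)   # assumed predefined period

doc = StoredFile("scan.pdf", datetime(2024, 1, 1), INGESTION)
after_two_days = apply_transition(doc, INGESTION, DEFAULT, datetime(2024, 1, 3), RETENTION)
after_ten_days = apply_transition(doc, INGESTION, DEFAULT, datetime(2024, 1, 11), RETENTION)
```

In this model the file stays in the ingestion class two days after upload and has moved to the default class by day ten.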

Storage when using Amazon AWS cloud storage

From top to bottom, your instance is linked to one or more AWS S3 buckets. Each bucket contains one or more Data Containers, and each Data Container contains one or more files:

Each S3 Bucket is in a data region, making it possible to distribute your data into multiple regions. The following regions are available:

  • US East (Ohio)

  • US East (N. Virginia)

  • US West (N. California)

  • US West (Oregon)

  • Africa (Cape Town)

  • Asia Pacific (Hong Kong)

  • Asia Pacific (Mumbai)

  • Asia Pacific (Osaka)

  • Asia Pacific (Seoul)

  • Asia Pacific (Singapore)

  • Asia Pacific (Sydney)

  • Asia Pacific (Tokyo)

  • Canada (Central)

  • China (Beijing)

  • China (Ningxia)

  • Europe (Frankfurt)

  • Europe (Ireland)

  • Europe (London)

  • Europe (Milan)

  • Europe (Paris)

  • Europe (Stockholm)

  • South America (São Paulo)

  • Middle East (Bahrain)

[NOTE: more regions will be available in the immediate future]
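The instance, bucket, Data Container, and file hierarchy described above, with each bucket tied to one region, can be modeled as nested records. The sketch below is illustrative only: the class names, field names, and sample bucket names are assumptions and do not reflect the platform's data model or API.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class DataContainer:
    name: str
    files: List[str] = field(default_factory=list)

@dataclass
class Bucket:
    name: str
    region: str   # each bucket lives in exactly one region
    containers: List[DataContainer] = field(default_factory=list)

@dataclass
class Instance:
    buckets: List[Bucket] = field(default_factory=list)

    def files_by_region(self) -> Dict[str, int]:
        """Count the files stored in each region across all buckets."""
        counts: Dict[str, int] = {}
        for bucket in self.buckets:
            n = sum(len(c.files) for c in bucket.containers)
            counts[bucket.region] = counts.get(bucket.region, 0) + n
        return counts

# Hypothetical instance with data distributed across two regions.
instance = Instance(buckets=[
    Bucket("eu-archive", "Europe (Ireland)", [
        DataContainer("maps", ["m1.tif", "m2.tif"]),
        DataContainer("letters", ["l1.pdf"]),
    ]),
    Bucket("us-archive", "US East (Ohio)", [
        DataContainer("audio", ["a1.wav"]),
    ]),
])
distribution = instance.files_by_region()
```

Because the region belongs to the bucket rather than to the container or file, distributing data across regions in this model means linking the instance to buckets in more than one region.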

Storage/Content replication

When users ingest content, it is stored in the primary storage, either through the web interface (1) or by directly accessing the S3 storage (2). The platform's replication and integrity services (3) are then in charge of creating an additional copy in the Active escrow storage (4) and, optionally, in the S3-compatible customer-provided storage (5).
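As a rough illustration of the replication step described above, the sketch below copies an object from a primary store to an escrow store and, optionally, to a customer-provided store, then checks each copy against a SHA-256 digest of the original. Plain dictionaries stand in for the storage back ends, and the function names and the choice of SHA-256 are assumptions. The platform's actual replication and integrity services are not documented here.

```python
import hashlib
from typing import Dict, Optional

Store = Dict[str, bytes]   # stand-in for a storage back end: key -> content

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def replicate(key: str, primary: Store, escrow: Store,
              customer: Optional[Store] = None) -> Dict[str, bool]:
    """Copy primary[key] to each target and report whether each copy verifies."""
    data = primary[key]
    expected = sha256(data)
    targets: Dict[str, Store] = {"escrow": escrow}
    if customer is not None:          # the customer-provided copy is optional
        targets["customer"] = customer
    results: Dict[str, bool] = {}
    for label, store in targets.items():
        store[key] = data
        # Integrity check: re-read the copy and compare digests.
        results[label] = sha256(store[key]) == expected
    return results

primary: Store = {"img.tif": b"image bytes"}
escrow: Store = {}
customer: Store = {}
escrow_only = replicate("img.tif", primary, escrow)
with_customer = replicate("img.tif", primary, escrow, customer)
```

The primary copy is never modified. Each replica is written and then verified independently, so a failed check on one target does not affect the others.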
