Overview

Data containers

Every object in Flexible Intake is preserved in a Data Container. Data containers define the policies, functions, permissions and underlying storage for the files they contain. They are similar to Amazon S3 buckets or Azure containers in that they can hold files, folders, metadata, etc.

Metadata

Flexible Intake allows users to add metadata (text, dates, numbers, etc.) to any file, folder or data container preserved in the system. Metadata can be generated automatically on ingest by Functions, added manually, or both. It can be used for searching (including complex queries), exported, or consumed by other systems.

Combining object-level metadata with a powerful search engine makes complex actions easy to perform, for instance: download all objects in the repository with a given title or for a certain date.
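The kind of metadata query described above can be pictured as follows. This is a minimal in-memory sketch, not the Flexible Intake API; the record fields and the `search` helper are illustrative assumptions.

```python
# Illustrative sketch: matching objects by metadata criteria.
# The records and query helper are stand-ins, not the real API.
from datetime import date

objects = [
    {"id": 1, "title": "Annual Report", "date": date(2023, 5, 1)},
    {"id": 2, "title": "Field Notes",   "date": date(2023, 5, 1)},
    {"id": 3, "title": "Annual Report", "date": date(2024, 2, 9)},
]

def search(records, **criteria):
    """Return all records whose metadata matches every criterion."""
    return [r for r in records
            if all(r.get(k) == v for k, v in criteria.items())]

# "All objects with a given title or for a certain date":
by_title = search(objects, title="Annual Report")
by_date = search(objects, date=date(2023, 5, 1))
print([r["id"] for r in by_title])  # [1, 3]
print([r["id"] for r in by_date])   # [1, 2]
```

In the real platform the matching would happen server-side in the search engine; the point is that metadata criteria compose into a single query.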

On top of this, organizations can use Flexible Intake Functions to extract or create metadata automatically upon upload. For instance, if your object is described in ArchivesSpace, you can retrieve the ArchivesSpace record metadata and associate it with the Flexible Intake object.

Several types of metadata exist in Flexible Intake. Some are generated by the platform (such as the events users generate, or the characterization of the preserved files), while others are usually introduced by the user, such as descriptive metadata. When creating a metadata schema, the user can choose from multiple field types: strings, dates, enumerated values, links to other objects, etc.

Creating OAIS Representation Information Networks is also possible, as Flexible Intake is fully aligned with OAIS.

Automation and digital notebooks

Flexible Intake comes with integrated Jupyter Notebooks.

Jupyter notebooks are documents containing an organized list of input/output cells, which can hold code (usually Python, but other languages can be used), text (using Markdown), mathematics, plots and rich media. Cells can be executed step by step or all at once, in an easy-to-use computational environment integrated into Flexible Intake.

Do you want to create a document describing your collection or structure? You can create a notebook and edit it inside Flexible Intake.

Do you want to create an index page with thumbnails of all the images in a folder? Create a notebook and add your code. Do you want to rename your objects or execute a complex process? If it can be scripted, you can create a Jupyter Notebook, place your script in it and execute it.
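The image-index example above could look something like this in a notebook cell. Everything here is a sketch under assumptions: the folder is a throwaway temp directory, and the `<img>` markup is a placeholder for whatever thumbnail rendering you prefer.

```python
# Sketch of a notebook cell that builds an HTML index page for the
# images in a folder (paths and markup are illustrative).
from pathlib import Path
import tempfile

folder = Path(tempfile.mkdtemp())
for name in ("scan-001.jpg", "scan-002.jpg", "notes.txt"):
    (folder / name).touch()  # stand-in files for the demo

rows = [
    f'<figure><img src="{p.name}" width="120">'
    f'<figcaption>{p.name}</figcaption></figure>'
    for p in sorted(folder.glob("*.jpg"))
]
index_html = "<html><body>\n" + "\n".join(rows) + "\n</body></html>"
(folder / "index.html").write_text(index_html)
print(f"indexed {len(rows)} images")  # indexed 2 images
```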

Storage

When using Flexible Intake to manage and preserve your content, you may have some content that you want to be immediately accessible to your users, while for other content you may want to benefit from a lower storage cost if you do not need to access it frequently. This is usually referred to as hot or cold storage.

It has been common to use temperature terminology, specifically a range from cold to hot, to describe the levels of tiered data storage. The levels have been differentiated according to the users' needs and how frequently data needs to be accessed. These terms likely originated according to where the data was historically stored: hot data was close to the heat of the spinning drives and the CPUs, and cold data was on tape or a drive far away from the data center floor.

There are no standard industry definitions of what hot and cold mean when applied to data storage, so they are used in different ways by different providers. Generally, though, hot data requires the fastest and most expensive storage because it is accessed more frequently, and cold (or cooler) data that is accessed less frequently can be stored on slower and consequently less expensive media.

Flexible Intake offers a great degree of flexibility when managing storage: Data in containers can be preserved using multiple storage providers, can be migrated from one storage provider to another, and data can be distributed across more than one storage provider.

Files inside a container can be in different storage classes (cold, hot, etc.), and Flexible Intake supports changing them via its API, the Management Interface, or Flexible Intake Functions, which enable advanced uses (such as moving files to colder storage when they are not being used).
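The "move to colder storage when unused" idea can be sketched as below. The file records, the 90-day threshold, and the demotion function are all illustrative assumptions, not the Functions API; in practice the last line of the loop would be an API call to change the storage class.

```python
# Illustrative sketch: demote files to a colder storage class when
# they have not been accessed for a threshold period (assumed 90 days).
from datetime import datetime, timedelta

files = [
    {"name": "a.tif", "storage_class": "hot", "last_access": datetime(2024, 1, 5)},
    {"name": "b.tif", "storage_class": "hot", "last_access": datetime(2024, 6, 20)},
]

def demote_unused(files, now, threshold=timedelta(days=90)):
    for f in files:
        if f["storage_class"] == "hot" and now - f["last_access"] > threshold:
            f["storage_class"] = "cold"  # in practice: an API call
    return files

demote_unused(files, now=datetime(2024, 7, 1))
print([(f["name"], f["storage_class"]) for f in files])
# [('a.tif', 'cold'), ('b.tif', 'hot')]
```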

Storage types and storage classes

Every object preserved in Flexible Intake is assigned to a storage that has a type and a class. Storage type refers to the provider and geographic location of the underlying storage (AWS S3 in Frankfurt or AWS S3 in Virginia, for instance), while the storage class defines the tier in which the file is stored, from the range offered by the provider in that region (S3 standard or a colder S3 class, for instance).

The main concepts for mastering Flexible Intake storage are:

  • Ingestion storage class: the storage class used to upload content to the platform. Flexible Intake keeps files in this class for a predefined period of time (to allow their processing) and then initiates a transfer to the Default Storage Class (if it differs from the ingestion storage class).

  • Default Storage Class: the storage class to which newly uploaded content is moved.

  • Current Storage Class: the storage class in which a file resides at a given point in time.
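The three concepts above fit together as a simple lifecycle: a file enters in the ingestion class, and after processing it moves to the default class, with the current class tracking wherever it is right now. The class names and transition logic below are illustrative, not the real platform behavior.

```python
# Minimal state sketch of the ingestion/default/current storage-class
# lifecycle (class names are made-up examples).

def ingest(filename, ingestion_class="S3 standard"):
    # New uploads land in the ingestion storage class.
    return {"name": filename, "current_class": ingestion_class}

def finish_processing(obj, default_class="S3 cold"):
    # After the processing window, transfer to the default class
    # (only if it differs from the ingestion class).
    if obj["current_class"] != default_class:
        obj["current_class"] = default_class
    return obj

obj = ingest("report.pdf")
print(obj["current_class"])   # S3 standard  (ingestion class)
finish_processing(obj)
print(obj["current_class"])   # S3 cold      (default class)
```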

Storage using Amazon AWS cloud storage

Described top-down, your Flexible Intake instance is linked to one or more AWS S3 buckets. Each bucket contains one or more Data Containers, and each Data Container contains one or more files.

Each S3 Bucket is in a data region, making it possible to distribute your data into multiple regions.
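The instance → bucket → data container → file hierarchy can be pictured as nested mappings. The bucket names, regions and files below are made up for illustration.

```python
# Illustrative nesting: instance -> S3 buckets -> data containers -> files.
instance = {
    "s3-frankfurt-bucket": {           # bucket in eu-central-1 (example)
        "container-photos": ["img1.tif", "img2.tif"],
        "container-audio": ["tape1.wav"],
    },
    "s3-virginia-bucket": {            # bucket in us-east-1 (example)
        "container-records": ["ledger.pdf"],
    },
}

total_files = sum(len(files)
                  for bucket in instance.values()
                  for files in bucket.values())
print(total_files)  # 4
```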

File version history

Flexible Intake automatically keeps your content versioned, retaining all previous versions of files and allowing users to recover lost files if needed.

Other measures are in place to protect the content when managing multiple versions, such as warnings on potential data loss.

Data integrity

There are two relevant areas when thinking about data integrity:

  • When the content is ingested into or retrieved from Flexible Intake

  • While the content is in Flexible Intake

When the content is ingested into or retrieved from Flexible Intake

It is important for the user to maintain the chain of custody and integrity when moving data across platforms. Fixity information (hashes) should be extracted at the source and verified again once the content has been ingested into Flexible Intake, making sure that everything was transferred correctly.

Flexible Intake offers plenty of options for this, including MD5/SHA-1 manifest verification, BagIt package support (versions 0.97 and 1.0), API queries, CSV files and several additional options.
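Manifest-style fixity verification boils down to hashing each file after transfer and comparing against the digests recorded at the source. A minimal sketch (the file, folder and manifest here are throwaway demo data):

```python
# Hedged sketch of manifest-based fixity verification: hash each file
# and compare against an MD5 manifest produced at the source.
import hashlib
import tempfile
from pathlib import Path

folder = Path(tempfile.mkdtemp())
(folder / "letter.txt").write_bytes(b"dear colleague")

def md5_of(path):
    return hashlib.md5(path.read_bytes()).hexdigest()

# Manifest computed at the source before transfer:
manifest = {"letter.txt": md5_of(folder / "letter.txt")}

# Verification after ingest:
ok = all(md5_of(folder / name) == digest
         for name, digest in manifest.items())
print("fixity verified" if ok else "FIXITY MISMATCH")
```

A BagIt `manifest-md5.txt` works the same way, with one `digest  path` pair per line.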

While the content is in Flexible Intake

Flexible Intake takes care of generating and verifying the integrity of your content periodically and in response to certain events, like migrations across storage providers, retrievals, etc.

Flexible Intake automatically generates file hashes on upload using multiple algorithms, and even lets users define which ones they would like to use (17 are available out of the box, including CRC, Adler-32, MD5, SHA-1, SHA-256, SHA-512, etc.).
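Several of the algorithms mentioned above are available in Python's standard library, which is handy for verifying platform-generated hashes independently. `zlib` provides the CRC32 and Adler-32 checksums; `hashlib` provides the cryptographic digests.

```python
# Computing several fixity values for the same payload with the
# standard library (payload is sample data).
import hashlib
import zlib

payload = b"preserved content"
digests = {
    "crc32":   format(zlib.crc32(payload), "08x"),
    "adler32": format(zlib.adler32(payload), "08x"),
    "md5":     hashlib.md5(payload).hexdigest(),
    "sha1":    hashlib.sha1(payload).hexdigest(),
    "sha256":  hashlib.sha256(payload).hexdigest(),
    "sha512":  hashlib.sha512(payload).hexdigest(),
}
for name, value in digests.items():
    print(f"{name}: {value}")
```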

Functions

Flexible Intake Functions allow organizations to define code that the platform executes on certain platform events, or that is available for users to execute on demand.

Functions are useful when you want the platform to behave in a specific way in response to events, or when you want your own code to be available for on-demand execution. You could perform these actions externally using the API and your own scripts, but Functions offer the following advantages:

  • Additional integration. Set it and forget it: with an API-based approach, only power users would be able to perform actions on the data. With Functions, you define the code to execute when any user uploads content, and the platform will always run it, without anyone needing to remember or know about it.

  • Better performance: The computing is running close to the data.

  • Easier implementation: The logic is included in the platform, without needing external API calls, scripts, etc.

There are many cases in which the uploaded content requires certain processing. For instance, the user may want Flexible Intake to automatically uncompress every compressed file uploaded to the platform, to validate every uploaded BagIt package, to be notified every time a user uploads a large file, or to check every uploaded file for corruption. All these use cases can be easily solved with Flexible Intake Functions.
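The trigger pattern behind these use cases can be sketched as an event-to-handler registry. The registration decorator, event name and handler below are assumptions for illustration, not the real Flexible Intake Functions interface.

```python
# Illustrative event -> function dispatch (not the real Functions API).
triggers = {}

def on(event):
    """Register a handler for an event name."""
    def register(fn):
        triggers.setdefault(event, []).append(fn)
        return fn
    return register

@on("file.uploaded")
def uncompress_archives(file):
    # Example handler: react only to compressed files.
    if file["name"].endswith(".zip"):
        return f"uncompressing {file['name']}"
    return None

def fire(event, payload):
    """Run every handler registered for this event."""
    return [fn(payload) for fn in triggers.get(event, [])]

results = fire("file.uploaded", {"name": "batch.zip"})
print(results)  # ['uncompressing batch.zip']
```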

Functions are launched by triggers (when this happens, do that).

Functions can also be called on demand by the user, using the web interface or the API, to perform complex operations easily and in a highly scalable way.

Reports

Flexible Intake includes many out-of-the-box reports that can be used immediately.

Reports can be scheduled to be generated at a later time.

Additionally, using Jupyter Notebooks and several integrated packages such as NumPy, Matplotlib, Seaborn and Bokeh, users can create their own dynamic reports and data analysis tools. Reports can cover the content stored in the platform, or even be used to analyze your collections!
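A custom report can be as simple as a notebook cell that aggregates file inventory data. The file list below is made-up sample data; a real report would pull the inventory from the platform.

```python
# Sketch of a notebook-style report cell: count preserved files per
# extension (sample data, not a real inventory).
from collections import Counter

files = ["a.tif", "b.tif", "c.pdf", "d.wav", "e.tif"]
by_type = Counter(name.rsplit(".", 1)[-1] for name in files)
for ext, count in by_type.most_common():
    print(f"{ext}: {count}")
# tif: 3
# pdf: 1
# wav: 1
```

From here, feeding `by_type` into Matplotlib or Bokeh gives a chart instead of a text table.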

File transfer and sharing methods

Moving content in and out of the platform is easy and fast. Flexible Intake has been tested in real-world use cases and is capable of sending/receiving more than 1 PB of content in less than 24 hours.
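As a back-of-envelope check, the 1 PB / 24 h figure implies a sustained throughput on the order of tens of gigabits per second (decimal units assumed here):

```python
# What sustained throughput does 1 PB in 24 hours imply?
petabyte = 10**15            # bytes (decimal PB assumed)
seconds = 24 * 3600
bytes_per_second = petabyte / seconds
gigabits_per_second = bytes_per_second * 8 / 10**9
print(f"{bytes_per_second / 10**9:.1f} GB/s = {gigabits_per_second:.0f} Gbit/s")
# 11.6 GB/s = 93 Gbit/s
```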

Multiple upload/download protocols are supported, including S3.

Drag and drop via the web browser is also available for users who prefer a simpler approach.

Preserved files can be shared publicly using the platform, allowing external users to access the content without needing authentication.

Users, permissions and federated access

Flexible Intake includes a highly granular permissions schema, including read or read/write permissions down to the container level, group management, etc.

Users can log in using built-in accounts, or via SAML-based authentication schemas involving federated organizations with complex access needs. Flexible Intake supports automatic account creation and automatic permission assignment based on the attributes delivered by the Identity Provider.
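Attribute-based provisioning can be pictured as a mapping from IdP assertion attributes to local groups. The attribute names (`eduPersonPrincipalName`, `eduPersonAffiliation`) are common eduPerson attributes, but the mapping and the `provision` helper below are illustrative assumptions, not the platform's actual configuration.

```python
# Hedged sketch: automatic account creation + permission assignment
# from SAML attributes (mapping and function are illustrative).
ATTRIBUTE_TO_GROUPS = {
    "staff":   ["readers", "writers"],
    "student": ["readers"],
}

def provision(saml_attributes):
    """Create-or-update a local account from a SAML assertion."""
    affiliation = saml_attributes.get("eduPersonAffiliation", "student")
    return {
        "username": saml_attributes["eduPersonPrincipalName"],
        "groups": ATTRIBUTE_TO_GROUPS.get(affiliation, []),
    }

account = provision({
    "eduPersonPrincipalName": "jdoe@example.edu",
    "eduPersonAffiliation": "staff",
})
print(account["groups"])  # ['readers', 'writers']
```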

As LIBNOVA is part of the UK Access Federation, almost every university in the world is ready to start using it with a very simple configuration.
