LIBSAFE Advanced support for OAIS Conformance
Last updated
This section has been created by David Giaretta (head of the OAIS and ISO 16363 working group)
Any object, in particular any file, in LIBSAFE Advanced can be associated with a metadata schema that allows all the required components identified by the OAIS Information Model to be associated with that object, which is then the Data Object. The following image shows the elements in that schema, including "Structure Representation Information", "Provenance", etc.
A simple example of a completed schema is the one for a MAGIC FITS file:
Note that most of the entries here are simple text files, but could be any file. Moreover, these are links, which allows the same file, for example the Semantic Representation Information, to be applied to millions of MAGIC FITS files.
The Package Description is an optional simple description of the object, but also includes the indexing information kept by LIBSAFE Advanced.
Using CURL to obtain the metadata associated with the file, one gets:
The JAVA code which extracts the details for each element of the OAIS IM Schema is outlined here:
Each of the objects themselves can also be associated with all the OAIS Information Model elements, which may be expanded as follows:
The text is hard to read in this image, but one can see that the various components themselves can have Representation Information and PDI. Also, because many of the files are simple ASCII text files written in English, many have the same Structure Representation Information, namely ASCII__Text__Definition, only one copy of which is needed. A larger image may be downloaded by clicking the following link.
The amount of Representation Information required for the AIP depends upon the definition of the Designated Community. Clearly, if the Designated Community consists of the current users, then they will not need any extra information to understand/use the data.
However, as time passes, more Representation Information will need to be added; for example, the software required may not be available on the Internet. Therefore, we need to be sure that the archive system will allow that to be done.
The same applies to the Provenance Information, and the other objects required for preservation. Current users will probably know everything about these pieces of information, but over time additional information, in particular Representation Information, will be needed.
It may also be that all the objects in a specific container require the same Representation Information, and perhaps the same Provenance, in which case these may be applied to the container as a whole.
In order to support Interoperability and Re-usability, Representation Information is essential. Therefore, the archive system must allow the addition of more Representation Information than the current Designated Community requires.
For example, if the description of the way in which FITS is used is not made available, then the meaning of the keyword EFFICIEN in the FITS header may not be understood by someone from a different community who wishes to Re-use that data or combine that data with other data (Interoperate).
Besides preserving information, such as scientific information, LIBSAFE Advanced allows one to preserve software systems, whether source code or complete virtual machines, each with the Representation Information needed to be able to use them.
In more detail, one can look at each of the FAIR requirements in turn.
FAIR Principles state that: The first step in (re)using data is to find them. Metadata and data should be easy to find for both humans and computers. Machine-readable metadata are essential for automatic discovery of datasets and services.
F1. (Meta)data are assigned a globally unique and persistent identifier
Objects within LIBSAFE Advanced are assigned unique identifiers.
F2. Data are described with rich metadata (defined by R1 below)
LIBSAFE Advanced allows very flexible metadata schemas, including the OAIS AIP Schema.
F3. Metadata clearly and explicitly include the identifier of the data they describe
Queries can be used to identify which data objects refer to which piece of metadata, but in general this relationship is normally many to one, i.e. the FITS Structure Representation Information may be referred to by billions of FITS data objects. Instead, the data object has a metadata schema which refers to its relevant metadata.
F4. (Meta)data are registered or indexed in a searchable resource
All objects have identifiers and are indexed by LIBSAFE Advanced.
FAIR Principles state that: Once the user finds the required data, she/he/they need to know how they can be accessed, possibly including authentication and authorisation.
A1. (Meta)data are retrievable by their identifier using a standardised communications protocol
LIBSAFE Advanced supports multiple communications protocols.
A1.1 The protocol is open, free, and universally implementable
LIBSAFE Advanced's default configuration is to support open, free and universally implementable protocols such as HTTP.
A1.2 The protocol allows for an authentication and authorisation procedure, where necessary
LIBSAFE Advanced can be configured for each object to have sophisticated authorization schemes and multiple authentication capabilities.
A2. Metadata are accessible, even when the data are no longer available
LIBSAFE Advanced can be configured to preserve any object including metadata.
FAIR Principles state that: The data usually need to be integrated with other data. In addition, the data need to interoperate with applications or workflows for analysis, storage, and processing.
I1. (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
LIBSAFE Advanced allows any type of metadata, including those using formal knowledge representation languages, such as ontologies for semantics (e.g. Semantic Representation Information) and detailed structure descriptions (e.g. Structure Representation Information), such as DRB, ASN.1 and EAST. Each of these metadata objects can also have its own metadata.
I2. (Meta)data use vocabularies that follow FAIR principles
LIBSAFE Advanced can be configured to use any vocabulary.
I3. (Meta)data include qualified references to other (meta)data
LIBSAFE Advanced schemas allow references to other (meta)data objects.
FAIR Principles state that: The ultimate goal of FAIR is to optimise the reuse of data. To achieve this, metadata and data should be well-described so that they can be replicated and/or combined in different settings.
R1. (Meta)data are richly described with a plurality of accurate and relevant attributes
LIBSAFE Advanced allows the repository staff to add extensive attributes through the schema.
R1.1. (Meta)data are released with a clear and accessible data usage license
Access to (meta)data is controlled through a comprehensive set of authorization systems.
R1.2. (Meta)data are associated with detailed provenance
LIBSAFE Advanced automatically keeps a detailed record of events which happen to any object and further detailed provenance can be added.
R1.3. (Meta)data meet domain-relevant community standards
LIBSAFE Advanced can be configured to support any domain standards.
Information may be sent from the archive to users in a variety of ways. OAIS uses the general term "Information Package", but does not specify much detail for Submission Information Packages (SIPs) and Dissemination Information Packages (DIPs). The package specified in the greatest detail, and the one which presents the greatest challenge to create and send, is the Archival Information Package (AIP). An AIP should contain all the information that is required to preserve the target of preservation, called the Content Information. It plays an important role in digital preservation because at some point in the future either the archive system used by the organisation will change, or the archive itself may cease to exist or at least give up responsibility for preserving a sub-set of its collection.
The following sub-sections discuss some ways of creating AIPs.
The organisation may specify that BAGIT should be used as AIPs. The BAGIT definition does not specify what its content should be other than those elements which are to do with validating the hashes.
In order to construct something which contains all the elements required for an OAIS AIP, two files were constructed for the bag.
The file "fetch.txt" is specified in the BAGIT definition and allows the contents to be referred to rather than included in the bag itself. The file contains:
"fetch.txt" identifies all the component files (one must have the correct permissions to allow the links to be downloaded). A flat file structure could be adopted, but in this case a hint is provided by the subdirectory names, e.g. "data/semanticRepInfo/FITS format for MAGIC data.pdf", which indicates that this is Semantic Representation Information.
The file "oais-aip-manifest.txt" defines the structure of the AIP, specifically identifying which file is the Content Data Object, which is its Context, etc. Note that the Context is a simple file i.e. a Data Object, which itself has its own Representation Information, in order to be an Information Object i.e. Context Information. Only Structure Representation Information is provided, but in principle each could have all the elements required for preservation i.e. all the components of an AIP.
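As a rough illustration of how the "fetch.txt" entries and their subdirectory hints could be read programmatically, the following Python sketch parses lines in the URL, LENGTH, FILEPATH layout used by the BagIt fetch file. The URLs, the "provenanceOAIS" directory name and the helper names are assumptions made for this example only; they are not taken from LIBSAFE Advanced.

```python
from pathlib import PurePosixPath
from typing import NamedTuple, Optional


class FetchEntry(NamedTuple):
    url: str
    length: Optional[int]  # None when the length is given as "-"
    path: str


def parse_fetch_txt(text: str) -> list[FetchEntry]:
    """Parse fetch.txt lines of the form: URL LENGTH FILEPATH."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # The file path is the third field and may itself contain spaces.
        url, length, path = line.split(None, 2)
        size = None if length == "-" else int(length)
        entries.append(FetchEntry(url, size, path.strip()))
    return entries


def oais_hint(path: str) -> Optional[str]:
    """Return the subdirectory directly under data/, used here as a hint."""
    parts = PurePosixPath(path).parts
    return parts[1] if len(parts) > 2 and parts[0] == "data" else None


# Hypothetical fetch.txt content; the URLs are placeholders.
sample = (
    "https://archive.example.org/files/1 - "
    "data/semanticRepInfo/FITS format for MAGIC data.pdf\n"
    "https://archive.example.org/files/2 2048 "
    "data/provenanceOAIS/provenance.txt\n"
)
entries = parse_fetch_txt(sample)
for entry in entries:
    print(oais_hint(entry.path), "<-", entry.path)
```

The subdirectory name is only a hint; in the approach described here the authoritative assignment of roles comes from "oais-aip-manifest.txt".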
The BAGIT file defined here is a complete OAIS AIP, as may be seen by verifying that the Content Data Object, which is the object of preservation in this example, has among its immediate elements all the additional pieces of Information required by the OAIS definition of an AIP.
Each of the elements pointed to is itself a Data Object, and so to make it into a piece of Information, some Representation Information should be added, for an incomplete example:
Versioning for objects, for example to get the latest version of an object, may be accommodated by keeping all versions of that object in a directory and then referring to that directory instead of any individual object. The object with the latest date of update should be used in order to get the latest version.
Of course, if a specific object is named, then that would be used instead of the latest version.
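The versioning rule above can be sketched in a few lines of Python. This is a minimal illustration which assumes that the "date of update" is the file system modification time; a real archive would more likely use the date recorded in its own metadata. The function name and file names are hypothetical.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def resolve_version(directory: Path, name: Optional[str] = None) -> Path:
    """Return the named object if given, otherwise the most recently updated one."""
    if name is not None:
        # A specifically named object takes precedence over "latest".
        return directory / name
    files = [p for p in directory.iterdir() if p.is_file()]
    if not files:
        raise FileNotFoundError(f"no versions found in {directory}")
    # Assumption: the modification time stands in for the date of update.
    return max(files, key=lambda p: p.stat().st_mtime)


with tempfile.TemporaryDirectory() as tmp:
    versions = Path(tmp)
    for filename, mtime in (("report_v1.txt", 1_000_000_000),
                            ("report_v2.txt", 1_500_000_000)):
        path = versions / filename
        path.write_text(filename)
        os.utime(path, (mtime, mtime))  # set explicit update times for the demo

    latest_name = resolve_version(versions).name
    named_name = resolve_version(versions, "report_v1.txt").name
    print(latest_name, named_name)
```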
The OAIS Information Model schema may be accommodated as described below, by assigning common elements to the directory and inheriting those values.
If one has a billion items of data, all with the same Representation Information, then one will naturally want to avoid unnecessarily duplicating that information for each object.
LIBSAFE Advanced allows one to add the schema to the folders and containers and therefore one may inherit the schema components from those parents using the following logic for each schema component, e.g. structureRepInfo:
If the file has that schema component, then
    use that
else
    if the folder has that component, then
        use that
    else
        if there are parent folders, then
            check those iteratively
        else
            if the container has the schema component, then
                use that
            else
                no such metadata
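The inheritance logic above amounts to walking from the file up through its folders to the container and taking the first value found. The following Python sketch shows one way this could be expressed. The Node class, the function name and the sample values are assumptions for illustration; only the schema element names structureRepInfo, semanticRepInfo and otherRepInfo come from the text.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """A file, folder or container with its own (possibly empty) schema."""
    name: str
    schema: dict[str, str] = field(default_factory=dict)
    parent: Optional["Node"] = None


def resolve_component(node: Node, component: str) -> Optional[str]:
    """Return the nearest value of a schema component, or None if absent."""
    current: Optional[Node] = node
    while current is not None:
        if component in current.schema:
            return current.schema[component]
        current = current.parent  # move up: folder, parent folders, container
    return None  # no such metadata


# Hypothetical hierarchy: container > folder > sub-folder > file.
container = Node("magic-container", {
    "structureRepInfo": "FITS_Standard",
    "semanticRepInfo": "MAGIC_FITS_Keywords",
})
year = Node("2019", parent=container)
night = Node("2019-03-01", {"semanticRepInfo": "Night_Specific_Keywords"}, parent=year)
fits_file = Node("observation.fits", parent=night)

print(resolve_component(fits_file, "structureRepInfo"))
print(resolve_component(fits_file, "semanticRepInfo"))
print(resolve_component(fits_file, "otherRepInfo"))
```

In this sketch the nearest ancestor wins, so a value set on a folder overrides the one set on the container.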
If all the data in a container has the same Representation Information, then the schema elements structureRepInfo, semanticRepInfo and otherRepInfo may be specified just once for the container.
Provenance (the provenanceOAIS schema element) is different in that the Provenance of the parent objects could be added, e.g. the container may tell us everything inside is part of the MAGIC project, with associated high-level Provenance. Depending on how the data is arranged, the folder may give us the date, location, observatory staff and associated weather conditions, and the file Provenance may give us some specific information about who was the observer. The file events can then give us more details about what happened inside LIBSAFE Advanced. The MAGIC project can arrange the information as they like, but this arrangement would have the advantage of avoiding duplication of information.
OAIS defines two sub-types of Archival Information Packages. The first is the Archival Information Unit (AIU) which contains essentially one object to be preserved. The example above was the specific type of AIP which has a single target of preservation (the Content Information); in other words, it is an AIU. The second type of AIP is the Archival Information Collection (AIC) which contains multiple individual AIPs. For example, an AIC could be made up of all the AIPs in a LIBSAFE Advanced container. The AIC has an additional piece of metadata which describes the whole collection. However the Representation Information may be the same for all the Content Information of each component AIP. This means that one only needs a single copy of the Representation Information which applies to all the components, thereby avoiding duplication. The same may also be true for Context, Access Rights and the general description of Fixity Information. Each component AIP will have its own hashes and Reference Information.
As noted previously, these responsibilities apply to the organisation and not the software. However, one can describe what the software solution should support in order to enable the archive to meet its responsibilities.
1. Negotiate for and accept appropriate information from information Producers.
1.1. LIBSAFE Advanced is able to check the SIPs to ensure that they are what is expected, having been defined to ensure the AIPs can be created, and that they have not been corrupted. It can also allow the archive staff to add additional metadata to the packages.
1.2. LIBSAFE Advanced has automated workflows including automated collection of “metadata” of the various types defined by OAIS. Additionally, if the SIPs include, for example, Provenance Information then there can be adequate Representation Information for the way it is encoded. The PREMIS standard is widely used in some domains; in this case, the Representation Information would be the PREMIS standard as well as the specific vocabulary used. Other domains use other Provenance encodings, even “home-grown” systems, all of which would require their own Representation Information. LIBSAFE Advanced can support all these.
2. Obtain sufficient control of the information provided to the level needed to ensure Long Term Preservation.
LIBSAFE Advanced can preserve the proof of control, e.g. to make copies. It also has extensive support for restrictions on access with configurable authentication and authorization capabilities.
3. Determine, either by itself or in conjunction with other parties, which communities should become the Designated Community and, therefore, should be able to understand the information provided, thereby defining its Knowledge Base.
LIBSAFE Advanced allows one to identify which Designated Community applies to which object being preserved by adding appropriate schema elements.
4. Ensure that the information to be preserved is Independently Understandable to the Designated Community. In particular, the Designated Community should be able to understand the information without needing special resources such as the assistance of the experts who produced the information.
LIBSAFE Advanced allows the repository to maintain as much of the Representation Information Network as required, including identifying types of Representation Information and links between them, and allows staff to add Representation Information as required. The platform offers functionality to create/collect and link the Representation Information required by the Designated Community, including that which is human-actionable as well as machine-actionable.
The objects being preserved may need to be Transformed to alternative formats. LIBSAFE Advanced allows processing schemes to be set up and applied to datasets. Appropriate Representation Information and Preservation Description Information can be added; Provenance events, Fixity checks and Reference Information will be created automatically.
5. Follow documented policies and procedures which ensure that the information is preserved against all reasonable contingencies, including the demise of the Archive, ensuring that it is never deleted unless allowed as part of an approved strategy. There should be no ad-hoc deletions.
LIBSAFE Advanced can be configured to keep as many backup copies, distributed geographically and over different technologies, as desired, with periodic fixity checks. Deletion policies can be configured, with multiple authorizations required.
LIBSAFE Advanced maintains all the types of metadata (Information Objects) defined by OAIS, with interfaces to add, edit, import, export or search them.
LIBSAFE Advanced supports the handover of all the information, in particular complete AIPs, to another repository in such a way that the other repository can extract the components of the AIPs as required.
Day-to-day administration is well supported, as is decision support for Management on the Preservation Strategies used to configure the Archive, taking into account costs (both financial resources and environmental burdens) and risks.
6. Make the preserved information available to the Designated Community and enable the information to be disseminated as copies of, or as traceable to, the original submitted Data Objects with evidence supporting its Authenticity.
LIBSAFE Advanced is able to construct DIPs in a flexible way to support changing demands of all types of consumers, as well as members of the Designated Community. In addition to specific interfaces such as Web GUIs, a general API to query and access holdings allows users to create their own applications.
Provenance Information, to support claims of Authenticity, can be provided, ranging from the origins and previous custodians of the preserved objects to detailed events within the repository.
OAIS identifies three basic preservation strategies for use as time passes and the Knowledge Base of the Designated Community (including hardware, software and tacit knowledge) changes. It is assumed that the bits of the digital objects will be kept safe, using the capabilities of LIBSAFE Advanced.
The three options may be identified as follows.
The Data Object of the Information being preserved may be:
1) kept by the Archive unchanged; or
2) kept by the Archive but may be changed; or
3) not kept by the Archive, but instead handed on to another Archive.
Each of these three implies the following:
In case 1), the archive may add Representation Information to ensure the Content Information is Independently Understandable.
In case 2), the archive may Transform the Data Object of the Information being preserved.
In case 3), the archive may hand over the AIP which contains the Object being preserved.
For each of these approaches there will be the need to ensure that the Information Object being preserved continues to be Independently Understandable by the Designated Community, and that the components of its AIP are not lost and are updated appropriately.
If members of the Designated Community (DC) are no longer able to understand the Information being preserved because, for example, they no longer have access to the software required, the Representation Information Network (RIN) can be extended by adding software including emulators, software containers, etc.
LIBSAFE Advanced allows one to add extra Representation Information of any type.
If it is not practical, for some reason, for the repository to add the required Representation Information, then the Data Object can be Transformed to another format. For example, if Word is no longer available, and the Word software cannot be provided as Representation Information, then the Word file can be Transformed to an Open Office format.
When one transforms the Data Object of the information to be preserved, an important consideration is how the repository can say with confidence that the new version may be regarded as an Authentic replacement for the original. The hash value of the new Data Object will of course be different from the original hashes and so this chain of evidence will be cut.
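To make the point about hashes concrete, the following Python sketch computes SHA-256 digests for an original and a transformed byte sequence and records both in a single event record, which is one possible way of documenting the link between the two objects. The byte strings, the event structure and its field names are hypothetical and are not the LIBSAFE Advanced event format.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


# Placeholder contents standing in for the original and transformed Data Objects.
original = b"SIMPLE  =                    T / hypothetical FITS-like bytes"
transformed = b"\xff\xd8\xff\xe0 hypothetical JPEG-like bytes"

# Hypothetical event record keeping both digests side by side, so that the
# relationship between the two objects is documented even though the hashes differ.
event = {
    "eventType": "transformation",
    "sourceSha256": sha256_hex(original),
    "resultSha256": sha256_hex(transformed),
}

print(event["sourceSha256"] == event["resultSha256"])
```

Such a record does not by itself show that the new object is an Authentic replacement; it only documents which object was derived from which.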
Transformations are likely to lose information, unless it is possible to exactly recover the original file. How can one check that the Transformation is adequate, i.e. has not lost important information? For example, Transforming a FITS file to a JPEG file is likely to lose the FITS headers. Bearing in mind that a FITS file may contain multiple images and tables, these will also be lost.
OAIS defines the term Transformational Information Property; the repository should capture these properties and check that they are unchanged after the Transformation. Examples could include the number of significant digits in the data, special flag values, headers, etc.
LIBSAFE Advanced allows one to Transform Data Objects using any methods wanted, and the Transformational Information Properties can be checked using scripts, supplemented by manual checks.
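A script of the kind mentioned above could, in its simplest form, compare a set of captured property values before and after the Transformation. The Python sketch below assumes that the properties have already been extracted into dictionaries; the property names and values are invented for illustration, and extracting them from real FITS or JPEG files is outside the scope of the example.

```python
from typing import Any, Iterable


def changed_properties(before: dict[str, Any],
                       after: dict[str, Any],
                       names: Iterable[str]) -> dict[str, tuple[Any, Any]]:
    """Return the properties whose values differ, as name -> (before, after)."""
    return {
        name: (before.get(name), after.get(name))
        for name in names
        if before.get(name) != after.get(name)
    }


# Hypothetical properties captured from a FITS file and from its JPEG rendering.
fits_properties = {"image_count": 3, "has_headers": True, "significant_digits": 7}
jpeg_properties = {"image_count": 1, "has_headers": False, "significant_digits": 3}

# The properties the repository has decided must survive the Transformation.
required = ["image_count", "has_headers", "significant_digits"]

lost = changed_properties(fits_properties, jpeg_properties, required)
unchanged = changed_properties(fits_properties, dict(fits_properties), required)
print(sorted(lost))
```

An empty result means that the selected properties are unchanged; any entry in the result would call for a manual check or a different Transformation.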
The Archival Information Package (AIP), for the Information to be preserved, must contain all the components needed for preservation.
LIBSAFE Advanced enables one to create complete AIPs, as discussed above, and export these as BAGIT files or other package formats.