Planning for preservation

By: David Giaretta (head of the OAIS and ISO 16363 working group)

To prepare for the preservation of any type of information one must keep the data object (the bits) safe. To ensure that the information remains understandable, with evidence of its authenticity, one must also collect enough “metadata”. More precisely, the repository must collect enough of each of the types of “meta-information” (information about information) which OAIS identifies, in order to be able to preserve the information of interest. The term meta-information is used, and is particularly relevant when preserving scientific information, because “metadata” such as source code or calibration files are not, by themselves, useful: one also needs to know how to use them. They must be turned into information by the addition of Representation Information, such as make files and documentation for the source code, and detailed algorithms and documentation for the calibration files.

OAIS discusses how to decide whether one has enough of each of the types of meta-information that is needed. A minimum amount of each is specified by OAIS, with suggestions for additional amounts in order to make preservation easier in future.

The following sections provide some specific suggestions for the minimum and for extended amounts of each type of meta-information.

There are two extreme situations to consider for the collection of meta-information. The first is where the information has already been created and the repository needs to preserve it. The second is where the information creators are just starting the process of creating information which the repository will eventually be asked to preserve and so are in a position to collect as much “metadata” as needed. Other situations will fall between these extremes.

The next sections discuss the extreme cases, providing suggestions for the amounts of each type of meta-information required by OAIS Archival Information Packages, in other words everything needed to preserve the object of interest. In addition OAIS identifies other pieces of information needed for discovery of the object and also to help guide and test preservation activities.

Planning to preserve pre-existing information

AIP components

The components of an AIP are the ones needed for preservation. Considering each of these in turn, we can identify the minimum information that has to be collected.

Representation Information

Minimum

The repository needs to define the Designated Community for this dataset or collection of data. This then allows it to determine the minimum amount of Representation Information required.

It may be that no Representation Information is required at this moment.

Extended

It would be advisable to draw out the Representation Information Network in order to capture as much as possible right now, because some of it may not be available in the future.

The Representation Information includes:

  • structure,

  • semantics including the relationship between data elements,

  • other Representation Information such as analysis and display software.

In some projects the Representation Information may be captured in a number of formal documents. In others, especially those which extend over many years or even decades, there are likely to be a number of pieces of Representation Information which are not formally captured. For example, there may be information which “everyone knows” such as:

  • modelling and designs;

  • annotation systems used with the data (if any);

  • the way in which software libraries are named or organized;

  • the meaning of comments e.g., “will run on Cray-like machines” – may actually mean the software must be built on machines which use double-precision floating point numbers by default;

  • compiler bugs which must be worked around;

  • the meaning of elements of the data header (if any);

  • the location of documentation for proprietary systems;

  • quality flags and magic values (care needed when transformed) or special values representing NULL or missing values.
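The last item in the list above can be illustrated with a short sketch. It assumes a hypothetical dataset in which -9999.0 marks a missing reading; the sentinel value and the function names are illustrative only. Unless the magic value is recorded as Representation Information, a later transformation or analysis may silently treat it as a measurement.

```python
from statistics import mean
from typing import Optional

# Assumption: -9999.0 is the documented "missing value" sentinel for this
# hypothetical dataset; real datasets define their own magic values.
MISSING = -9999.0


def decode(readings: list[float], missing: float = MISSING) -> list[Optional[float]]:
    """Replace the sentinel with None so it cannot be mistaken for data."""
    return [None if r == missing else r for r in readings]


def mean_of_valid(readings: list[Optional[float]]) -> float:
    """Average only the readings that are real measurements."""
    valid = [r for r in readings if r is not None]
    return mean(valid)


raw = [12.5, -9999.0, 13.5]
naive_mean = mean(raw)                      # wrongly includes the sentinel
correct_mean = mean_of_valid(decode(raw))   # uses only 12.5 and 13.5
```

Here the naive average is a large negative number, while the decoded average is 13.0. The difference comes entirely from knowing what -9999.0 means.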

The archive may start with the minimum amount of Representation Information but should try to capture as much as possible of the extended Representation Information, which is not required right now for the Designated Community but which may be needed in the future. The archive system must certainly be able to accommodate and preserve however much Representation Information will be needed in future; this is something which LABDRIVE is able to do.

Provenance

Minimum

The minimum Provenance Information is the simple fact of from whom the object came. This may be recorded in a simple text file, which itself has Structure Representation Information (for example, that the text encoding is ASCII) and Semantic Representation Information (that the document is simple English text).

LABDRIVE also automatically keeps a record of events which have affected the data object, and these should be added to the Provenance Information.

Extended

Extended Provenance Information could include whatever can be found out about where the information comes from, who has been responsible for it, and how it has been processed. The Provenance may be encoded using PREMIS, in which case the Structure RI is the definition of PREMIS and the Semantic RI would specify any special vocabulary describing or defining the events or agents.

Provenance Information includes:

  • specific aspects of the project origins and history,

    • Mission documentation including

      • Mission architecture documents describing purpose, scope and performances of the mission and of the on-board instruments, and information relevant to orbits, platform position, attitude, ground coverage (acquisition footprint), head-roll-pitch.

      • Documents describing data and products formats specification.

      • Documents describing measurement requirements and/or measurement performances (theoretical models). Documents drawing instruments characteristics, performances and instrument description (physical implementations).

      • Documents describing models and/or algorithms needed (used) to obtain mission data and products including specific/special cases, known errors and configuration necessities. In other words, all documents covering conceptual environment, its implementation and its operations.

      • Reports concerned with measurement trends, failures, changes of performances, and out of service for any reason.

      • Documents related to the process of data qualification: precision, numerical representations, formats, uncertainties, errors, adjustment/correction methods (e.g., Cal/Val procedures and documents).

    • from what it was derived i.e., previously collected data;

    • processing software;

    • what data is related;

  • data custody – who was in control of the data at various points in the project;

  • version control – what, if any, version control was used for the data;

  • calibration and test;

  • data products from which this information was derived, for example Level 0, Level 1 etc.;

  • processing hardware/software;

  • processing logs;

  • how the quality of the information may be checked;

  • Migration management;

  • Management of copies of the data;

  • Synchronisation policy of copies;

  • Defence against hacking;

  • Which anti-virus checks were performed;

  • Roles of people e.g., who can change/delete.

Provenance Information is information which should by default be preserved throughout the project and beyond because of its importance as evidence for Authenticity and its value for reproducibility.

Reference

Minimum

The minimum information could be a URL or DOI assigned to the object. The Reference Information is made up of a character string plus the explanation, for example DOI: 10.1080/15588742.2015.1017687 with the explanation of how to interpret and use this character string.

Extended

Extended Reference Information could include alternative identifiers, for example a URN and an ARK as well as the DOI, so that if one resolver system ceases to work then one of the others should continue. Additionally there are:

  • Identifiers used in publications

  • Naming conventions used in internal systems

    • How versions/editions are dealt with e.g., numerical or time tagged versions

  • Reasons for selecting a particular referencing convention
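As a minimal sketch of the idea that Reference Information is a character string plus an explanation of how to use it, the record below pairs each identifier with its explanation. The class and field names are hypothetical; the only identifier shown is the DOI quoted above, and the resolver prefix is the public doi.org proxy.

```python
from dataclasses import dataclass, field


@dataclass
class Identifier:
    scheme: str        # e.g. "DOI", "ARK", "URN"
    value: str         # the character string itself
    explanation: str   # how to interpret and use the string


@dataclass
class ReferenceInformation:
    # Extended Reference Information may hold several alternative identifiers.
    identifiers: list[Identifier] = field(default_factory=list)

    def schemes(self) -> list[str]:
        """List the identifier schemes recorded for this object."""
        return [i.scheme for i in self.identifiers]


def doi_url(doi: str) -> str:
    """Build a resolver URL for a DOI using the public doi.org proxy."""
    return "https://doi.org/" + doi


ref = ReferenceInformation([
    Identifier(
        scheme="DOI",
        value="10.1080/15588742.2015.1017687",
        explanation="Resolve by appending the string to https://doi.org/",
    ),
])
```

Alternative identifiers would be added as further entries in the same list, each with its own explanation, so that the loss of one resolver does not make the object unfindable.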

Access Rights

Minimum

Access Rights may be provided in the form of a simple text file, written in simple English, encoded in ASCII or PDF.

Extended

Extended Access Rights Information may be provided in the form of an access control language such as XACML (eXtensible Access Control Markup Language).

Access Rights Information could include:

  • Ownership;

  • copyright and licensing or access restrictions and documents authorizing use;

  • confidentiality/privacy/sensitivity/security constraints, including General Data Protection Regulation (GDPR) if applicable;

  • Embargoes on data publication;

  • Legal implications if data is released;

  • Licences used to create, use, distribute information;

  • Designated Community;

  • Legal framework(s);

  • Licensing offers;

  • Specifications for rights enforcement measures applied at Dissemination time

  • Pointers to Fixity and Provenance Information (e.g., digital signatures, and rights holders)

Context

Minimum

The Context Information may be a simple text document, for example as ASCII text or PDF, in English, which describes why the information is being preserved.

Extended

Extended Context Information might include why the information being preserved was created and how it relates to other Information Objects existing elsewhere. Other examples include:

  • Broader aspects of the project origins and history

  • The scope of the information collection and any changes in scope which may have occurred during the project

  • Funders

  • Current Research Information Systems (CRIS) information

  • Cultural heritage context

  • Research publications based on the data

  • Publications containing the data.

Fixity

Minimum

The minimum Fixity Information could be the hash or digest created by the Producer, to be compared with the hash created by the repository. The Fixity Information is made up of a character string such as "947c10fd" created with algorithm "adler32".

In addition details must be provided, for example as a text file, of the way in which the repository ensures that the objects being preserved are not altered in an undocumented way. This could include the schedule for re-calculating the hashes and the way in which the repository can ensure that the original hashes themselves have not changed.

LABDRIVE automatically calculates and records a number of hashes for the objects ingested into the system. These hashes are periodically checked to ensure that the objects are unchanged. These collections of hashes should be kept as part of the Fixity Information.

Extended

Multiple hash algorithms may be used, for example

  • "947c10fd" created with algorithm "adler32"

  • "df9fabe58a0b1515e622674fda12233c" created with algorithm "md5"

  • "3228a6441a7ef04d618d11ef96d6d04bc3aa46d6" created with algorithm "sha1"

It could also include:

  • digests and Checksums – how they were calculated and where they are kept;

  • description of how the digests are safeguarded - where they are kept and who can change them;

  • logs of fixity checks and any problems detected.
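A minimal sketch of computing and re-checking several digests for one object, using only the Python standard library (`zlib.adler32` and `hashlib`). The function names and the sample content are assumptions made for illustration; this is not a description of how LABDRIVE itself calculates or stores hashes.

```python
import hashlib
import zlib


def fixity(data: bytes) -> dict[str, str]:
    """Return hex digests of the data for several algorithms."""
    return {
        "adler32": format(zlib.adler32(data) & 0xFFFFFFFF, "08x"),
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
    }


def verify(data: bytes, recorded: dict[str, str]) -> list[str]:
    """Return the names of algorithms whose recomputed digest differs."""
    current = fixity(data)
    return [alg for alg, digest in recorded.items()
            if current.get(alg) != digest.lower()]


original = b"example object"                   # placeholder content (assumption)
recorded = fixity(original)                    # digests stored at ingest
unchanged = verify(original, recorded)         # empty list: nothing differs
tampered = verify(original + b"!", recorded)   # lists every algorithm
```

The list returned by `verify` is the kind of result that would be written to the log of fixity checks. The recorded digests themselves also need safeguarding, since a check is only as trustworthy as the values it compares against.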

AIP Supplementary Information

In order to help to make decisions about preservation activities the repository should record a number of key points.

Designated Community

The Designated Community for a particular piece of information, or collection of similar pieces of information, is specified by the repository. It is a group of potential Consumers who should be able to understand a particular set of information in ways exemplified by the Preservation Objectives (see below). The Designated Community may be composed of multiple user communities. A Designated Community may change over time.

The definition of the Designated Community determines how much Representation Information is needed, and how the amount needed may change over time.

The Designated Community for a collection of FITS files may be astronomers. At the moment an astronomer knows how to deal with FITS files, and has FITS software readily available, but in future FITS may be replaced by other formats for astronomy and the software may not be readily available.

Transformational Information Properties

Keeping the value of a Transformation Information Property unchanged is regarded as being necessary but not sufficient to verify that any Non-Reversible Transformation has adequately preserved information content. This could be important as contributing to evidence about Authenticity. Such an Information Property is dependent upon specific Representation Information, including Semantic Representation Information, to denote how it is encoded and what it means.
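As an illustrative sketch only, suppose a repository migrates a small table from CSV to JSON and chooses the number of records as the property to keep unchanged. The sample data, the choice of property and the function names are all assumptions. Each side is read with its own Representation Information (a CSV parser before, a JSON parser after).

```python
import csv
import io
import json


def row_count_csv(text: str) -> int:
    """Evaluate the property on the original, read as CSV."""
    return sum(1 for _ in csv.DictReader(io.StringIO(text)))


def row_count_json(text: str) -> int:
    """Evaluate the same property on the transformed object, read as JSON."""
    return len(json.loads(text))


def transform(csv_text: str) -> str:
    """Hypothetical migration: CSV rows become a JSON list of objects."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows)


original = "time,density\n0,1.5\n1,2.5\n"
migrated = transform(original)
property_preserved = row_count_csv(original) == row_count_json(migrated)
```

A matching count is necessary but not sufficient: a faulty migration that corrupted the density values while keeping two records would still pass this check, which is why several such properties are normally needed.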

Preservation Objectives

A Preservation Objective is defined by the repository and is a specific achievable aim which can be carried out using the Information Object. An example could be the ability to understand a dataset and use it in analysis tools to generate results, for example the density of electrons in the upper atmosphere or the structure of a molecule, given certain measurements. These could be compared with results generated earlier.

Information Discovery

Descriptive Information

OAIS does not provide any specific details or requirements for Descriptive Information. Its purpose is to allow Consumers to locate information of potential interest, analyze that information, and order desired information. In fact it may be no more than a descriptive title, or it may be a full set of attributes that are searchable in a catalog service.

Planning to preserve information being created

The creation of some information which someone may wish to preserve may be as simple as writing a single small document or spreadsheet, taking one person just a few minutes.

On the other hand, creating a large scientific dataset may involve many separate teams of hundreds of people using huge resources over many years.

At each stage, the various types of meta-information described above should be gathered before they are lost or forgotten.

For long-term preservation all the pieces of Representation Information that “everyone knows” should be captured in as much detail as possible.

Each piece of Representation Information may need its own Representation Information, and so on. OAIS describes this as a Representation Information Network (RIN).

The amount of Representation Information which the archive will eventually require will depend upon the Designated Community which the archive serves. It may be useful to work with the archive to draft the RIN as early and in as much detail as possible.
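One way to draft a RIN is as a simple graph in which each item points to the Representation Information it depends on. The sketch below is a hedged illustration: the node names are hypothetical labels for a FITS-like case, and the traversal simply stops wherever the Designated Community already has the knowledge.

```python
# Hypothetical RIN: each item maps to the Representation Information it needs.
RIN: dict[str, list[str]] = {
    "observation.fits": ["FITS standard document", "keyword dictionary"],
    "FITS standard document": ["PDF", "English"],
    "keyword dictionary": ["English"],
    "PDF": [],
    "English": [],
}


def required_rep_info(item: str, rin: dict[str, list[str]], known: set[str]) -> list[str]:
    """Walk the network from `item`, stopping at anything already known."""
    needed: list[str] = []
    seen: set[str] = set()
    stack = list(rin.get(item, []))
    while stack:
        node = stack.pop()
        if node in seen or node in known:
            continue  # the community's own knowledge ends this branch
        seen.add(node)
        needed.append(node)
        stack.extend(rin.get(node, []))
    return sorted(needed)


# A community that knows FITS today needs nothing extra ...
today = required_rep_info(
    "observation.fits", RIN,
    known={"FITS standard document", "keyword dictionary", "PDF", "English"},
)
# ... while a future community that only reads English and PDF needs more.
future = required_rep_info("observation.fits", RIN, known={"English", "PDF"})
```

Changing the `known` set models a change in the Designated Community: the same network yields a longer list of Representation Information to preserve as community knowledge shrinks.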

Reference Information, Provenance Information, Context Information, Fixity Information and Access Rights Information should also be captured.

Of these the Provenance is likely to be required over the entire life of the project, and beyond, being relevant to all subsequent outcomes of the project. Reference, Fixity, Context and Access Rights Information may be required through all successive stages, if relevant to the Provenance and if available.
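During a long project it can help to track which of these types of meta-information have been captured so far. The following is a minimal sketch of such a checklist; the class, the field names and the file name `provenance.txt` are hypothetical, and each field simply records where that information is kept.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class PreservationDescription:
    # Each field notes where the information is recorded, or None if not yet captured.
    reference: Optional[str] = None
    provenance: Optional[str] = None
    context: Optional[str] = None
    fixity: Optional[str] = None
    access_rights: Optional[str] = None


def missing(info: PreservationDescription) -> list[str]:
    """Return the names of the information types not yet captured."""
    return [f.name for f in fields(info) if not getattr(info, f.name)]


draft = PreservationDescription(
    reference="DOI: 10.1080/15588742.2015.1017687",
    provenance="provenance.txt",  # hypothetical file name
)
still_needed = missing(draft)
```

For this draft the checklist reports context, fixity and access_rights as still to be gathered. Representation Information is left out deliberately, because how much of it is needed depends on the Designated Community.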
