LABDRIVE support for OAIS Conformance
Last updated
By: David Giaretta (head of the OAIS and ISO 16363 working group)
Any object, in particular any file, in LABDRIVE can be associated with a metadata schema which allows all the required components identified by the OAIS Information Model to be associated with that object, which would therefore be the Data Object. The following image shows the elements in that schema, including "Structure Representation Information", "Provenance" etc.
A simple example which shows a completed schema is that for a MAGIC FITS file:
Note that most of the entries here are simple text files, but they could be any type of file. Moreover, these are links, which allows the same file, for example the Semantic Representation Information, to be applied to millions of MAGIC FITS files.
The Package Description is an optional simple description of the object but also includes the indexing information kept by LABDRIVE.
Using CURL to obtain the metadata associated with the file one gets:
The JAVA code which extracts the details for each element of the OAIS IM Schema is outlined here:
Each of the objects themselves can also be associated with all the OAIS Information Model elements, which may be expanded as follows:
The text is hard to read in this image but one can see that the various components themselves can have Representation Information and PDI. Also, because many of the files are simple ASCII text files written in English, many have the same Structure Representation Information, namely ASCII_Text_Definition, only one copy of which is needed. A larger image may be downloaded by clicking the following link.
The amount of Representation Information required for the AIP depends upon the definition of the Designated Community. Clearly if the Designated Community consists of the current users, then they will not need any extra information to understand/use the data.
However as time passes, more Representation Information will need to be added, for example the software required may not be available on the Internet. Therefore we need to be sure that the archive system will allow that to be done.
The same applies to the Provenance Information, and the other objects required for preservation. Current users will probably know everything about these pieces of information but over time additional information, in particular Representation Information, will be needed.
It may also be that all the objects in a specific container require the same amount of Representation Information, and perhaps provenance. In which case these may be applied to the container as a whole.
Information may be sent from the archive to users in a variety of ways. OAIS uses the general term "Information Package" but does not specify much detail for Submission Information Packages (SIPs) and Dissemination Information Packages (DIPs). The package specified in the greatest detail, and which presents the greatest challenge to create and send, is the Archival Information Package (AIP). An AIP should contain all the information that is required to preserve the target of preservation, called the Content Information. It plays an important role in digital preservation because at some point in the future either the archive system used by the organisation will change, or the archive itself may cease to exist, or at least may give up responsibility for preserving a subset of its collection.
The following sub-sections discuss some ways of creating AIPs.
The organisation may specify that BAGIT should be used as AIPs. The BAGIT definition does not specify what its content should be other than those elements which are to do with validating the hashes.
In order to construct something which contains all the elements required for an OAIS AIP two files were constructed for the bag.
The file "fetch.txt" is specified in the BAGIT definition and allows the contents to be referred to rather than included in BAG itself. The file contains:
"fetch.txt" identifies all the component files (one must have the correct permissions to allow the links to be downloaded). A flat file structure could be adopted but in this case a hint is provided by the subdirectory names e.g. "data/semanticRepInfo/FITS format for MAGIC data.pdf", which indicates that this is Semantic Representation Information.
The file "oais-aip-manifest.txt" defines the structure of the AIP, specifically identifying which file is the Content Data Object, which is its Context etc. Note that the Context is a simple file i.e. a Data Object, which itself has its own Representation Information, in order to be an Information Object i.e. Context Information. Only Structure Representation Information is provided but in principle each could have all the elements required for preservation i.e. all the components of an AIP.
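The "fetch.txt" entries described above can be read back mechanically. The following is a minimal sketch in Python (not the Java code mentioned earlier), assuming the usual BagIt line layout of URL, length and file path separated by whitespace, with "-" for an unknown length. The URL in the example is hypothetical, and treating the first directory under "data/" as a hint about the OAIS role is simply the convention described above, not something BagIt itself defines.

```python
from pathlib import PurePosixPath
from typing import NamedTuple, Optional


class FetchEntry(NamedTuple):
    url: str
    length: Optional[int]      # None when the length is given as "-"
    path: str
    role_hint: Optional[str]   # first directory under data/, e.g. "semanticRepInfo"


def parse_fetch_line(line: str) -> FetchEntry:
    """Parse one 'URL LENGTH FILEPATH' line of a fetch.txt file."""
    # Split on the first two runs of whitespace only, so spaces in the file path survive.
    url, length, path = line.strip().split(None, 2)
    parts = PurePosixPath(path).parts
    # data/<subdirectory>/<file> : the subdirectory name hints at the OAIS role.
    hint = parts[1] if len(parts) >= 3 and parts[0] == "data" else None
    return FetchEntry(url, None if length == "-" else int(length), path, hint)


# Hypothetical entry: the URL is made up for illustration only.
example = "https://example.org/files/128 - data/semanticRepInfo/FITS format for MAGIC data.pdf"
entry = parse_fetch_line(example)
print(entry.role_hint)
```

The hint is only advisory; the authoritative statement of which file plays which role is the "oais-aip-manifest.txt" file.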
The BAGIT file defined here is a complete OAIS AIP, as may be seen by verifying that the Content Data Object, which is the object of preservation in this example, has all the additional pieces of Information required by the OAIS definition of an AIP.
Each of the elements pointed to is itself a Data Object, and so to make it into a piece of Information, some Representation Information should be added, for an incomplete example:
Versioning for objects, for example to get the latest version of an object, may be accommodated by keeping all versions of that object in a directory, then referring to that directory instead of any individual object. The object with the latest date of update should be used in order to get the latest version.
Of course if a specific object is named then that would be used instead of the latest version.
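The rule above can be sketched as follows. This is an illustration only: it assumes the "date of update" is the file system modification time, whereas a real deployment would more likely take it from the metadata LABDRIVE holds. The function name resolve_version is hypothetical.

```python
import os
import tempfile
from pathlib import Path


def resolve_version(reference: Path) -> Path:
    """Return the named file, or the most recently updated file if a directory is given."""
    if reference.is_file():
        return reference  # a specific object was named, so use it as-is
    candidates = [p for p in reference.iterdir() if p.is_file()]
    if not candidates:
        raise FileNotFoundError(f"no versions found in {reference}")
    return max(candidates, key=lambda p: p.stat().st_mtime)


# Small demonstration with two versions whose update times are set explicitly.
with tempfile.TemporaryDirectory() as tmp:
    versions = Path(tmp)
    for name, mtime in [("v1.txt", 1_000_000_000), ("v2.txt", 1_600_000_000)]:
        f = versions / name
        f.write_text(name)
        os.utime(f, (mtime, mtime))
    latest_name = resolve_version(versions).name
    named_name = resolve_version(versions / "v1.txt").name

print(latest_name, named_name)
```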
The OAIS Information Model schema may be accommodated as described below, by assigning common elements to the directory and inheriting those values.
If one has billions of items of data, all with the same Representation Information, then one will naturally want to avoid unnecessarily duplicating that information for each object.
LABDRIVE allows one to add the schema to the folders and containers and therefore one may inherit the schema components from those parents using the following logic for each schema component e.g. structureRepInfo:
If the file has that schema component then
use that value
Note that, for the specific file, the Fixity Information should include the hashes which LABDRIVE calculates automatically, each recorded with its hash value and algorithm, e.g.:
"file_hash": [ { "id": "391", "file_id": "128", "hash": "a9ddcea04b67c4b635b9a0504e6fa3ff", "algo": "etag" }, { "id": "395", "file_id": "128", "hash": "a9ddcea04b67c4b635b9a0504e6fa3ff", "algo": "md5" }, { "id": "396", "file_id": "128", "hash": "a5597f04de027b7b029ea80c86c0302a2f90e3e6", "algo": "sha1" }, { "id": "397", "file_id": "128", "hash": "c580c878", "algo": "adler32" } ]
The Representation Information for this is that the Structure is JSON while the Semantics are provided in the LABDRIVE documentation.
For Provenance the list of events should be included, an example of which is:
[ { "container_id": 12, "file_id": 128, "user_id": 0, "module": "FILE.CREATE", "action": "event.init", "message": "", "level": "INFO", "timestamp": "2022-04-05T13:26:43.4396616Z" }, { "container_id": 12, "file_id": 128, "user_id": 0, "module": "Safebox", "action": "file.hash", "message": "File hashed: \n [
] File path: 12/OAIS test folder/MAGIC_2019_GRB190114C_mw.fits\n [
] File hash: a9ddcea04b67c4b635b9a0504e6fa3ff\n [
] Hash algo: md5", "level": "SUCCESS", "timestamp": "2022-04-05T13:26:43.5880973Z" }, { "container_id": 12, "file_id": 128, "user_id": 0, "module": "Safebox", "action": "file.hash", "message": "File hashed: \n [
] File path: 12/OAIS test folder/MAGIC_2019_GRB190114C_mw.fits\n [
] File hash: a5597f04de027b7b029ea80c86c0302a2f90e3e6\n [
] Hash algo: sha1", "level": "SUCCESS", "timestamp": "2022-04-05T13:26:43.6233936Z" }, { "container_id": 12, "file_id": 128, "user_id": 0, "module": "Safebox", "action": "file.hash", "message": "File hashed: \n [
] File path: 12/OAIS test folder/MAGIC_2019_GRB190114C_mw.fits\n [
] File hash: c580c878\n [
] Hash algo: adler32", "level": "SUCCESS", "timestamp": "2022-04-05T13:26:43.6685053Z" }, { "container_id": 12, "file_id": 128, "user_id": 0, "module": "Siegfried", "action": "file.identify", "message": "File identified:\n [
] File path: /OAIS test folder/MAGIC_2019_GRB190114C_mw.fits\n [
] Format: x-fmt/383\n [
] Mime: application/fits", "level": "SUCCESS", "timestamp": "2022-04-05T13:26:43.9631471Z" }, { "container_id": 12, "file_id": 128, "user_id": 0, "module": "Apache Tika", "action": "file.rip", "message":
"File not ripped: /OAIS test folder/MAGIC_2019_GRB190114C_mw.fits\nSystem.Net.WebException: Connection refused Connection refused\n ---> System.Net.Http.HttpRequestException: Connection refused\n ---> System.Net.Sockets.SocketException (111): Connection refused\n at System.Net.Http.ConnectHelper.ConnectAsync(String host, Int32 port, CancellationToken cancellationToken)\n --- End of inner exception stack trace ---\n at System.Net.Http.ConnectHelper.ConnectAsync(String host, Int32 port, CancellationToken cancellationToken)\n at System.Net.Http.HttpConnectionPool.ConnectAsync(HttpRequestMessage request, Boolean allowHttp2, CancellationToken cancellationToken)\n at System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(HttpRequestMessage request, CancellationToken cancellationToken)\n at System.Net.Http.HttpConnectionPool.GetHttpConnectionAsync(HttpRequestMessage request, CancellationToken cancellationToken)\n at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean
doRequestAuth, CancellationToken cancellationToken)\n at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)\n at System.Net.Http.HttpClient.FinishSendAsyncUnbuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)\n at System.Net.HttpWebRequest.SendRequest()\n at System.Net.HttpWebRequest.GetResponse()\n --- End of inner exception stack trace ---\n at System.Net.HttpWebRequest.GetResponse()\n at TikaSharp.APIEngine.__ProcessS3FileUrl(String inputFile, String Segment, String ResponseFormat, HttpMethod Method)\n at TikaSharp.APIEngine.TikaS3(String InputFile, Boolean Sanitize)\n at SafeboxWonderlord.Action.Rip(Storage DBStorage, Container DBContainer, File DBFile, String FullPath)", "level": "ERROR", "timestamp": "2022-04-05T13:26:44.2014849Z" }, { "container_id": 12, "file_id": 128, "user_id": 0, "module": "FILE.CREATE", "action": "event.complete", "message": "", "level": "INFO", "timestamp": "2022-04-05T13:26:44.2160709Z" }, { "container_id": 12, "file_id": 128,
"user_id": 0, "module": "FILE.CREATE", "action": "event.init", "message": "", "level": "INFO", "timestamp": "2022-04-05T13:26:44.2898057Z" }, { "container_id": 12, "file_id": 128, "user_id": 0, "module": "Safebox", "action": "file.hash", "message": "File hashed: \n [
] File path: 12/OAIS test folder/MAGIC_2019_GRB190114C_mw.fits\n [
] File hash: a9ddcea04b67c4b635b9a0504e6fa3ff\n [
] Hash algo: md5", "level": "SUCCESS", "timestamp": "2022-04-05T13:26:44.9420513Z" }, { "container_id": 12, "file_id": 128, "user_id": 0, "module": "Safebox", "action": "file.hash", "message": "File hashed: \n [
] File path: 12/OAIS test folder/MAGIC_2019_GRB190114C_mw.fits\n [
] File hash: a5597f04de027b7b029ea80c86c0302a2f90e3e6\n [
] Hash algo: sha1", "level": "SUCCESS", "timestamp": "2022-04-05T13:26:44.9684330Z" } ]
The Representation Information for these items is that the Structure is JSON while the Semantics are provided in the LABDRIVE documentation.
The AIP must include these additional Fixity and Provenance entries.
else
if the folder has that component then
use or add that
else
if there are parent folders then
check those iteratively and add the information
else
if the container has the schema component then
add that information
else
no such metadata
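The lookup order above (file, then folder, then parent folders, then container) can be sketched in Python as below. This is one possible reading, in which the nearest level that defines the component wins; the pseudocode's "use or add" could also be read as merging values. The function name and the dictionary representation of a schema are assumptions made for illustration, as are the example values other than ASCII_Text_Definition.

```python
from typing import Optional


def resolve_component(
    component: str,
    file_schema: dict,
    folder_schemas: list,   # ordered from the immediate folder outwards
    container_schema: dict,
) -> Optional[str]:
    """Return the value from the nearest level that defines the schema component."""
    for schema in [file_schema, *folder_schemas, container_schema]:
        value = schema.get(component)
        if value:
            return value
    return None  # no such metadata at any level


# Hypothetical schemas: the file and its immediate folder define nothing themselves.
file_schema = {}
folder_schemas = [{}, {"structureRepInfo": "FITS_Standard"}]
container_schema = {
    "structureRepInfo": "ASCII_Text_Definition",
    "semanticRepInfo": "MAGIC_dictionary",
}

print(resolve_component("structureRepInfo", file_schema, folder_schemas, container_schema))
print(resolve_component("semanticRepInfo", file_schema, folder_schemas, container_schema))
print(resolve_component("otherRepInfo", file_schema, folder_schemas, container_schema))
```

Here the parent folder supplies structureRepInfo, the container supplies semanticRepInfo, and otherRepInfo is not found anywhere.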
If all the data in a folder has the same Representation Information then the schema elements structureRepInfo, semanticRepInfo and otherRepInfo may be specified just once for the folder.
Provenance (the provenanceOAIS schema element) is different in that the Provenance of the parent objects could be added e.g. the container may tell us everything inside is part of the MAGIC project, with associated high level Provenance. Depending on how the data is arranged, the folder may give us the date, location, observatory staff and associated weather conditions, and the file Provenance may give us some specific information about who was the observer. The file events can then give us more details about what happened inside LABDRIVE. The MAGIC project can arrange the information as they like but this would have the advantage of avoiding duplication of information.
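Because Provenance accumulates rather than being overridden, it can be gathered from every level instead of stopping at the first match. The sketch below is only an illustration: the dictionary layout, the function name and the example text values are assumptions, while the provenanceOAIS element name and the event fields (module, action, level) are taken from the examples on this page.

```python
def collect_provenance(container: dict, folders: list, file: dict, events: list) -> list:
    """Gather Provenance from the container down to the file, then append the file events.

    `folders` is ordered from the outermost folder to the immediate parent.
    """
    levels = [("container", container)]
    levels += [("folder", folder) for folder in folders]
    levels.append(("file", file))

    chain = []
    for level, schema in levels:
        if schema.get("provenanceOAIS"):
            chain.append({"level": level, "provenance": schema["provenanceOAIS"]})
    chain.append({"level": "events", "provenance": events})
    return chain


# Hypothetical Provenance values at each level.
container = {"provenanceOAIS": "Part of the MAGIC project"}
folders = [{"provenanceOAIS": "Observation date, location and weather conditions"}]
file = {"provenanceOAIS": "Name of the observer"}
events = [
    {"module": "Safebox", "action": "file.hash", "level": "SUCCESS"},
    {"module": "Apache Tika", "action": "file.rip", "level": "ERROR"},
]

chain = collect_provenance(container, folders, file, events)
errors = [e for e in chain[-1]["provenance"] if e["level"] == "ERROR"]
print([item["level"] for item in chain])
print(errors)
```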
OAIS defines two sub-types of Archival Information Packages. The first is the Archival Information Unit (AIU), which contains essentially one object to be preserved. The example above was the specific type of AIP which has a single target of preservation (the Content Information); in other words it is an AIU. The second type of AIP is the Archival Information Collection (AIC), which contains multiple individual AIPs. For example an AIC could be made up of all the AIPs in a LABDRIVE container. The AIC has an additional piece of metadata which describes the whole collection. However, the Representation Information may be the same for all the Content Information of each component AIP. This means that one only needs a single copy of the Representation Information which applies to all the components, thereby avoiding duplication. The same may also be true for Context, Access Rights and the general description of Fixity Information. Each component AIP will have its own hashes and Reference Information.
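The per-file hashes can be rechecked by anyone who receives an AIP. The following is a minimal sketch which assumes the "file_hash" layout shown earlier (a list of entries with "hash" and "algo"), assumes that adler32 is written as eight lower-case hexadecimal digits, and skips the "etag" entry because its calculation depends on the storage system. The function names are hypothetical.

```python
import hashlib
import zlib


def compute_hash(data: bytes, algo: str) -> str:
    """Compute a hex digest for the algorithms named in the file_hash entries."""
    if algo == "adler32":
        return format(zlib.adler32(data) & 0xFFFFFFFF, "08x")
    return hashlib.new(algo, data).hexdigest()


def verify_fixity(data: bytes, file_hash: list) -> dict:
    """Return {algorithm: True/False} for every entry that can be recomputed."""
    results = {}
    for entry in file_hash:
        algo = entry["algo"]
        if algo == "etag":
            continue  # storage-specific value, not recomputed in this sketch
        results[algo] = compute_hash(data, algo) == entry["hash"].lower()
    return results


# Build a record for some sample content, then check intact and altered copies.
data = b"hello"
record = [{"algo": a, "hash": compute_hash(data, a)} for a in ("md5", "sha1", "adler32")]
record.append({"algo": "etag", "hash": "not-checked"})

intact = verify_fixity(data, record)
altered = verify_fixity(data + b"!", record)
print(intact)
print(altered)
```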
As noted previously, these responsibilities apply to the organisation and not the software. However, one can describe what the software solution should support in order to enable the archive to meet its responsibilities:
1. Negotiate for and accept appropriate information from information Producers.
1.1. LABDRIVE is able to check the SIPs to ensure that they are what is expected and have not been corrupted, having been defined to ensure the AIPs can be created. It can also allow the archive staff to add additional metadata to the packages.
1.2. LABDRIVE has automated workflows including automated collection of “metadata” of the various types defined by OAIS. Additionally, if the SIPs include, for example, Provenance Information then there can be adequate Representation Information for the way it is encoded. The PREMIS standard is widely used in some domains; in this case the Representation Information would be the PREMIS standard as well as the specific vocabulary used. Other domains use other Provenance encodings, even “home-grown” systems, all of which would require their own Representation Information. LABDRIVE can support all these.
2. Obtain sufficient control of the information provided to the level needed to ensure Long Term Preservation.
LABDRIVE can preserve the proof of control e.g. to make copies. It also has extensive support for restrictions on access with configurable authentication and authorization capabilities.
3. Determine, either by itself or in conjunction with other parties, which communities should become the Designated Community and, therefore, should be able to understand the information provided, thereby defining its Knowledge Base.
LABDRIVE allows one to identify which Designated Community applies to which object being preserved by adding appropriate schema elements.
4. Ensure that the information to be preserved is Independently Understandable to the Designated Community. In particular, the Designated Community should be able to understand the information without needing special resources such as the assistance of the experts who produced the information.
LABDRIVE allows the repository to maintain as much of the Representation Information Network as required, including identifying types of Representation Information and links between them, and allows staff to add Representation Information as required. The platform offers functionality to create/collect and link the Representation Information required by the Designated Community, including that which is human-actionable as well as machine-actionable.
The objects being preserved may need to be Transformed to alternative formats. LABDRIVE allows processing schemes to be set up and applied to datasets. Appropriate Representation Information and Preservation Description Information can be added; Provenance events, Fixity checks and Reference Information will be created automatically.
5. Follow documented policies and procedures which ensure that the information is preserved against all reasonable contingencies, including the demise of the Archive, ensuring that it is never deleted unless allowed as part of an approved strategy. There should be no ad-hoc deletions.
LABDRIVE can be configured to keep as many backup copies, distributed geographically and over different technologies, as desired, with periodic fixity checks. Deletion policies can be configured, with multiple authorizations required.
LABDRIVE maintains all the types of metadata (Information Objects) defined by OAIS, with interfaces to add, edit, import, export or search them.
LABDRIVE supports the handover of all the information, in particular complete AIPs, to another repository in such a way that the other repository can extract the components of the AIPs as required.
Day-to-day administration is well supported as well as decision support for Management in terms of Preservation Strategies to configure the Archive, taking into account costs, both in terms of financial resources as well as environmental burden, and risks.
6. Make the preserved information available to the Designated Community and enable the information to be disseminated as copies of, or as traceable to, the original submitted Data Objects with evidence supporting its Authenticity.
LABDRIVE is able to construct DIPs in a flexible way to support changing demands of all types of consumers, as well as members of the Designated Community. In addition to specific interfaces such as Web GUIs, a general API to query and access holdings allows users to create their own applications.
Provenance Information, to support claims of Authenticity, can be provided, ranging from the origins and previous custodians of the preserved objects to detailed events within the repository.
OAIS identifies three basic preservation strategies as time passes and the knowledge base of the Designated Community, including hardware, software, tacit knowledge, changes. It is assumed that the bits of the digital objects will be kept safe, using the capabilities of LABDRIVE.
The three options may be identified as follows. These are discussed in more detail in Preservation Activities, and so are described only briefly here.
The Data Object of the Information being preserved may be:
1) kept by the Archive unchanged; or
2) kept by the Archive but may be changed; or
3) not kept by the Archive, but instead handed on to another Archive.
Each of these three implies the following:
In case 1) the archive may add Representation Information to ensure the Content Information is Independently Understandable.
In case 2) the archive may Transform the Data Object of the Information being preserved.
In case 3) the archive may hand over the AIP which contains the Object being preserved.
For each of these approaches there will be the need to ensure that an Information Object being preserved continues to be Independently Understandable by the Designated Community, and that the components of its AIP are not lost and are updated appropriately.
If members of the Designated Community (DC) are no longer able to understand the Information being preserved because, for example, they no longer have access to the software required, the Representation Information Network (RIN) can be extended by adding software including emulators, software containers, etc. See Adding Representation Information.
LABDRIVE allows one to add extra Representation Information of any type.
If it is not practical for the repository to add the required Representation Information, then the Data Object can be Transformed to another format. For example, if Word is no longer available, and the Word software cannot be provided as Representation Information, then the Word file can be Transformed to an Open Office format.
When one transforms the Data Object of the information to be preserved an important consideration is how the repository can say with confidence that the new version may be regarded as an Authentic replacement for the original. The hash value of the new Data Object will of course be different from the original hashes and so this chain of evidence will be cut.
Transformations are likely to lose information, unless it is possible to exactly recover the original file. How can one check that the Transformation is adequate, i.e. has not lost important information? For example, Transforming a FITS file to a JPEG file is likely to lose the FITS headers. Bearing in mind that a FITS file may contain multiple images and tables, these will also be lost.
OAIS defines the term Transformational Information Properties, which the repository should capture and check are unchanged after the Transformation. Examples could include the number of significant digits in the data, special flag values, headers etc.
For further information see Transforming the Digital Object.
LABDRIVE allows one to Transform Data Objects using any methods necessary, and the Transformational Information Properties can be checked using scripts, supplemented by manual checks.
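A script of the kind mentioned above could compare the Transformational Information Properties captured before and after a Transformation. The sketch below is an assumption about how such a check might look, not a LABDRIVE facility: the property names, the values and the function name are all hypothetical, and extracting the properties from real files is left to format-specific tools.

```python
def check_properties(before: dict, after: dict, required: list) -> dict:
    """Return the required properties that were lost or changed by the Transformation."""
    problems = {}
    for name in required:
        if name not in after:
            problems[name] = "missing after Transformation"
        elif before.get(name) != after[name]:
            problems[name] = f"changed: {before.get(name)!r} -> {after[name]!r}"
    return problems


# Hypothetical properties captured from a FITS file and from its JPEG rendering.
fits_properties = {"header_keywords": 42, "image_count": 3, "table_count": 1}
jpeg_properties = {"image_count": 1}
required = ["header_keywords", "image_count", "table_count"]

problems = check_properties(fits_properties, jpeg_properties, required)
for name, reason in problems.items():
    print(f"{name}: {reason}")
```

An empty result would indicate that all the chosen properties survived; deciding which properties matter remains a judgement for the repository and its Designated Community.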
The Archival Information Package (AIP) for the Information to be preserved must contain all the components needed for preservation. See Handing over to another archive.
LABDRIVE enables one to create complete AIPs, as discussed above, and export these as BAGIT files or other package formats.