Planning and Using Additional Information in LABDRIVE
Understanding the way in which information is collected, and the groupings into which it naturally falls, means that the structure of the information in LABDRIVE can be planned. Such planning avoids repetition of “metadata”, which would become especially troublesome if the “metadata” is expanded or corrected.
Consider the case of a project which has a context, justification, budget and associated Provenance which apply to all information captured in that project. This project involves several instruments. Each instrument may have several modes of operation. In each mode there may be several calibrations over time, and several datasets. In addition, a dataset may be analysed in combination with other data.
Instead of repeating the Provenance, Representation Information, etc. for each Data Object, which wastes resources and risks inconsistencies when things are updated, one can organise the information as follows:
The top-level container provides the aspects of Provenance and Context which apply to all the data captured in that project.
A sub-container holds information from one of the instruments. Each instrument may have common Representation Information which applies to all information captured by that instrument. There may also be Provenance, such as instrument design and manufacture, which applies to all such information.
Within that there is a top-level folder which holds all the information captured with a certain calibration, perhaps after some instrument alterations. This may have specific Representation Information and Provenance which apply to all the information captured with these settings.
A sub-folder collects all the information captured in one campaign and has Representation Information and Provenance which apply to all information within that folder.
Within that sub-folder a specific file may have its own specific Representation Information and Provenance.
Another container may hold a folder with the results of an analysis which combined information from several sources. This is especially important for scientific data. A processing workflow can sometimes be linear, where dataset1 is analysed to produce dataset2, but very often the processing combines different datasets: for example, {dataset1, dataset2, dataset3, ..., datasetn} are combined and analysed to produce dataset-x. In order to reproduce the processing one would need access to the Provenance of all the components.
The Representation Information and Provenance for a specific file are the accumulation of those defined at each of the higher levels, together with any defined on the file itself.
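The accumulation described here can be sketched as a walk from a file up through its enclosing folders and containers. This is only an illustration under stated assumptions, not LABDRIVE's API: the `Node` class, its field names and all the example values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical container, folder or file (not a LABDRIVE type)."""
    name: str
    parent: Optional["Node"] = None
    provenance: List[str] = field(default_factory=list)
    rep_info: List[str] = field(default_factory=list)


def accumulated(node: Node, attr: str) -> List[str]:
    """Collect `attr` from the top-level container down to `node`."""
    chain = []
    current = node
    while current is not None:
        chain.append(current)
        current = current.parent
    result: List[str] = []
    for level in reversed(chain):  # outermost level first
        result.extend(getattr(level, attr))
    return result


# Illustrative hierarchy: project > instrument > calibration > campaign > file.
project = Node("project", provenance=["project context and budget"])
instrument = Node("instrument-A", project,
                  provenance=["instrument design and manufacture"],
                  rep_info=["instrument data format"])
calibration = Node("calibration-1", instrument, rep_info=["calibration table"])
campaign = Node("campaign-1", calibration, provenance=["campaign log"])
data_file = Node("file-001", campaign, rep_info=["file-specific notes"])

file_provenance = accumulated(data_file, "provenance")
file_rep_info = accumulated(data_file, "rep_info")
```

Each piece of information is stored once, at the level where it applies, so correcting the instrument-level entry changes what every file beneath it inherits.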
In the case of combined data, the Provenance would include the details of the processing, to support reproducibility of the results, as well as the Provenance of the contributing datasets. The precise inputs, operations and parameters used can be recorded in many ways, for example in data file headers, as with FITS files, or in separate scripts, or even hard-coded in software, using naming or location conventions which are specific to the original processing setup. If the processing is to be reproducible then these Provenance details must be captured, and must be preservable in the archive. LABDRIVE allows this to be done by allowing any object, whether scientific data or provenance, to be preserved with its associated Representation Information.
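As a rough sketch of what "the Provenance of the contributing datasets" implies, the following collects the provenance of a combined dataset together with that of every dataset it depends on, directly or indirectly. The record layout, the dataset names and the processing note are all hypothetical and are not taken from LABDRIVE.

```python
from typing import Dict, List

# Hypothetical records: each dataset lists its own provenance notes and the
# datasets it was derived from. None of these names come from LABDRIVE.
datasets: Dict[str, dict] = {
    "dataset1": {"provenance": ["captured by instrument A"], "inputs": []},
    "dataset2": {"provenance": ["captured by instrument B"], "inputs": []},
    "dataset3": {"provenance": ["calibrated copy of dataset1"],
                 "inputs": ["dataset1"]},
    "dataset-x": {
        "provenance": ["combined by an example script with example parameters"],
        "inputs": ["dataset2", "dataset3"],
    },
}


def full_provenance(name: str, records: Dict[str, dict]) -> Dict[str, List[str]]:
    """Return the provenance of `name` and of every dataset it depends on."""
    collected: Dict[str, List[str]] = {}
    pending = [name]
    while pending:
        current = pending.pop()
        if current in collected:
            continue  # already visited; shared inputs are recorded once
        collected[current] = records[current]["provenance"]
        pending.extend(records[current]["inputs"])
    return collected


needed = full_provenance("dataset-x", datasets)
```

Here `needed` contains entries for dataset-x, dataset2, dataset3 and dataset1, which is the set of Provenance records someone would have to retrieve in order to reproduce dataset-x.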
Organizing the data in this way allows Provenance and Representation Information to be shared in a way which minimizes duplication and the risk of inconsistencies.
Where the holdings cannot be structured as described above, for example because the structure is determined by other considerations, the same effect can be achieved by explicitly adding pointers to the metadata fields. For example, the Provenance metadata field may include a pointer to the "parent" provenance which is shared by other objects. The ingestion process could handle the automated insertion of these pointers.
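The pointer approach might look something like the sketch below, in which each metadata record carries an optional reference to a shared parent record. The `parent` key, the identifiers and the dictionary store are assumptions made for illustration; they are not documented LABDRIVE metadata fields.

```python
from typing import Dict, List, Optional

# Hypothetical metadata store keyed by object identifier. The "parent" key is
# an assumed pointer field, not a documented LABDRIVE metadata field.
metadata: Dict[str, dict] = {
    "prov-project": {"provenance": "project context and justification",
                     "parent": None},
    "prov-instrument": {"provenance": "instrument design",
                        "parent": "prov-project"},
    "file-001": {"provenance": "captured during campaign 1",
                 "parent": "prov-instrument"},
    "file-002": {"provenance": "captured during campaign 2",
                 "parent": "prov-instrument"},
}


def resolve_provenance(object_id: str, store: Dict[str, dict]) -> List[str]:
    """Follow "parent" pointers and return provenance, most general first."""
    entries: List[str] = []
    seen = set()
    current: Optional[str] = object_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in parent pointers at {current!r}")
        seen.add(current)
        entries.append(store[current]["provenance"])
        current = store[current]["parent"]
    return list(reversed(entries))


resolved = resolve_provenance("file-001", metadata)
```

Both files point at the same instrument record, so the shared provenance is held once regardless of where the files sit in the folder structure.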