Benefits of preserving research data

By: David Giaretta (head of the OAIS and ISO 16363 working group)

Research data takes many forms: numbers and characters in tables, images, and files ranging from XML to discipline-specific formats, as well as documents, publications, instrument designs, and many others. What needs to be preserved is the information encoded in that data.

The preservation of each type brings its own set of benefits for a variety of stakeholders.

Because research data is itself so diverse, it is sensible not to restrict the types of information being considered. Indeed, it has been said that one person’s digital trash is another person’s digital treasure.

The following sections discuss the benefits of preservation from different viewpoints, with relevant examples of types of information, benefits and motivations.

Stakeholders

  1. Governments

    1. Information of strategic value must be preserved, for example underground infrastructure such as pipes and communications conduits.

    2. Information of national pride must be preserved, for example data which has been uniquely difficult to collect, such as that from space missions.

    3. Information on which policies are based should be preserved, for example historical land use and pollution, demographic trends or results of previous policies.

  2. Multinational organisations

    1. Multilateral agreements, such as those fixing the precise locations of borders, must be preserved

  3. Citizens

    1. Information needed to hold the government to account must be preserved

    2. Information of long term interest to the public should be preserved

  4. Strategic Management and Funders

    1. Information is costly to create/gather; it should be preserved so that it does not have to be created/gathered again at similar cost.

    2. Information is valuable so preserving its usefulness allows more value to be extracted

    3. Some information, e.g. measurements of climate, cannot be re-created so must be preserved for longitudinal studies

    4. Some information is too costly to re-create so must be preserved

    5. Some information must be preserved for legal reasons

    6. Digitally signed contracts for long term agreements must be preserved

  5. Tactical Management

    1. Information is fragile so steps must be taken to preserve it even over relatively short timescales in order to allow time to make a decision as to whether to preserve it over the longer term, or potentially forever.

  6. As an investment, with increasing value

    1. Information is valuable as long as it is usable despite changes e.g. in technology

    2. Information can be combined to become more valuable, so preservation must enable this

Benefits of trustworthy preservation

Researchers and practitioners from any discipline are able to find, access and process the data they need. They can be confident in their ability to use and understand data and they can evaluate the degree to which the data can be trusted.

Producers of data benefit from opening it to broad access and prefer to deposit their data with confidence in reliable repositories.

Funding bodies have confidence that their investments in research are paying back extra dividends to society, through increased use and re-use of publicly generated data.

The innovative power of industry and enterprise can be harnessed by clear and efficient arrangements for exchange of data between private and public sectors allowing appropriate returns for both.

The public can access and make creative use of the huge amount of data available; it can also contribute to the data store and enrich it. All can be adequately educated and prepared to benefit from this abundance of information.

Policy makers can make decisions based on solid evidence, and can monitor the impacts of these decisions. Government can become more trustworthy.

Types of Information

It is useful to classify information in a limited number of broad types, although there will be grey areas.

Rendered Information

Some information is “normally” rendered, for example documents, images, sound recordings and videos. The specific limitation of rendered information is that normally there must be a human mind to understand, appreciate or otherwise gain benefit from the information. A human mind can take in a limited number of inputs at one time. Eventually a gestalt may be reached combining information from many sources. In the future one may expect developments of AI to blur this picture.

This type of information is not necessarily rendered. For example, textual analysis treats documents as data, and image analysis treats images as segments of colour; AI software presumably does this. Nevertheless, the distinction is useful when discussing the benefits and the challenges of preservation.

Preservation of such digital objects is often judged on the basis of whether they may be rendered in the future, for example whether digital documents may be printed and read in future. The benefits of preservation therefore have some overlaps with those of “born-physical” objects, with the difference that digital objects are much more easily duplicated/copied.

Cultural benefits

Digital objects may be created from born-physical information by some type of digitisation. Benefits arise when the original physical object:

1) is destroyed, such as the Bamiyan Buddha, leaving only a digital re-construction which must be preserved.

2) is degraded, for example historical sites may be worn down by visitors’ feet, in which case virtual visits help to reduce physical visits.

3) is fragile or inaccessible, such as unique documents or paintings which may be seen and examined in detail remotely.

Benefits from preserving born-digital objects arise in various ways including the following.

1) Websites change and often cease being available, yet they may be important as historical and cultural items. For example, websites and advertisements during election campaigns are important sources for researchers.

Global news: social media is becoming an important source of news ... but a study (see http://arxiv.org/abs/1209.3026) showed that 11 per cent of the social media content had disappeared within a year and 27 per cent within 2 years. Even the websites of major corporations that should know better, including Adobe, IBM, and Intel (http://linktiger.com/broken-link-stats.php), can be littered with broken links.

Avoidance of legal penalties

In many domains, including healthcare, construction, manufacturing, finance and human resources, there are legal requirements for information to be preserved (see https://www.project-consult.de/files/Iron%20Mountain%20Guide%202013%20European%20Retention%20Periods.pdf). The time period specified varies from country to country and ranges from a few years to, in the case of some medical information, the person’s lifetime plus 70 years.

Failure to preserve the information for at least the specified time could result in criminal prosecution.

The examples in the table below provide an idea about legal requirements on retention. Note that the information is not restricted to “rendered” digital objects.

Sector: Compliance

Cultural and creative sector

- 20-50 years for music, prototypes and designs

- +100 for long tail (e.g. film, cultural heritage)

Energy and Utilities

- Permanent retention

- 3-20 years: copies of waste management

- 30 years: documents containing audits on radioactivity and results measurement

- 10 years: data regarding chemicals or

- 10 years: metering database

Healthcare

- Patient lifetime

- Hospital safety records (i.e. incidents): 7

- X ray: 30 years

- Ultrasound records (e.g. vascular,

- Post mortem registers: 30 years

Manufacturing

- +50 years for design

- Automotive: +15 years for vehicles sold

- Aerospace: +50 years

- Discrete manufacturing: 15

- Chemical and process manufacturing: safety 20

- Food: 1-30 years (safety)
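As a rough illustration of what "at least the specified time" means in practice, the sketch below computes the earliest date on which a record could be disposed of. It is a minimal sketch under stated assumptions, not a statement of any country's rules: the dictionary keys and function names are hypothetical, the periods are copied from the examples above, and it assumes the retention period starts on the date the record was created, which real regulations may define differently.

```python
from datetime import date

# Illustrative retention periods in years, copied from the examples above.
# The keys are hypothetical labels; real requirements vary by country.
RETENTION_YEARS = {
    "x_ray": 30,
    "post_mortem_register": 30,
    "metering_database": 10,
}


def add_years(start: date, years: int) -> date:
    """Return the same calendar day `years` later.

    29 February falls back to 28 February when the target year
    is not a leap year.
    """
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        return start.replace(year=start.year + years, day=28)


def earliest_disposal_date(record_type: str, created: date) -> date:
    """Earliest date on which the record is no longer required to be kept.

    Assumes the period runs from the creation date (an assumption).
    """
    return add_years(created, RETENTION_YEARS[record_type])


def must_still_be_retained(record_type: str, created: date, today: date) -> bool:
    """True while the retention period has not yet elapsed."""
    return today < earliest_disposal_date(record_type, created)


if __name__ == "__main__":
    created = date(2020, 1, 1)
    print(earliest_disposal_date("metering_database", created))
    print(must_still_be_retained("metering_database", created, date.today()))
```

A check of this kind only says when disposal becomes legally possible; whether the information should be kept for longer is a separate preservation decision.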

Evidence to support legal claims

Evidence in the form of rendered objects including digitised documents and born-digital images can form important pieces of legal evidence.

US Supreme Court: between 1996 and 2013 justices cited materials found on the Internet 555 times ... but half of the links in all Supreme Court opinions no longer work (see http://blogs.law.harvard.edu/futureoftheinternet/2013/09/22/perma/).

Financial benefits

The financial benefits of preserving digital objects which are normally rendered are most obvious for copyrighted materials, including pay-per-use of:

· Movies

· Audio recordings

· Books

Non-Fungible Tokens (NFTs) are now also being used to monetise these objects.

Non-rendered Information

Non-rendered information, which includes most scientific data, is not normally printed/displayed but is instead processed by computer, either as single objects or in combination with other similar objects, to produce new information. Any of these objects may of course be rendered, for example as a graph or image, but any particular rendering only shows limited aspects of the data.

Non-rendered objects may be grouped in various ways. One such grouping is as follows.

Observations – non reproducible

Observations of natural events and objects in the universe yield data which depends on the time the observation is made. For example, air temperature or pressure at a particular place, meteor trails in the sky, the spectrum and brightness of certain stars, and wildlife migrations all depend on time, and repeated observations will produce different data.

The value arising from preserving such data comes from longitudinal studies, including those which reveal existential issues such as climate change by combining multiple sets of observations.

Non-reproducible studies also may be used as evidence of arrival times for modes of transport, vehicle speeds, distribution of specific strains of virus, and many other things, all of which may produce specific financial, personal or societal benefits.

Combining observations can create value, for example comparing demographic patterns with

Experiments – reproducible results

Data which is collected using controlled experimental setups is often expected to be reproducible, within statistical errors or systematic differences.

Such data may in principle simply be measured again but there are many reasons for preserving a set of measurements in order to, for example,

- compare to new sets of measurements in order to check that the measurement is reproducible

- use the measurement in future to avoid expending resources needed to repeat the measurement

- increase the value of data by combining with other data sets, for example pharmaceutical properties against diffusion rates or gene mutations with eye colour.

Scientific data: one of the foundations of the scientific method is the reproducibility of results ... but a survey (see http://journalistsresource.org/studies/society/internet/website-linking-best-practices-media-online-publishers) found the median lifespan of links in the scientific literature was 9.3 years, and just 62% were archived. A survey (see http://www.smithsonianmag.com/science-nature/the-vast-majority-of-raw-data-from-old-scientific-studies-may-now-be-missing-180948067/?no-ist) of 20-year-old studies shows that poor recordkeeping and inaccessible authors make 90 per cent of raw data impossible to find.
