Preservation actions


A researcher wishes to read a conventional research paper deposited 15 years ago into a repository in a file format that is no longer in common use, and for which she has no reader easily available. She is able to open it and read it without any trouble, alongside more contemporary papers. (Repository managers need to ensure that research papers held in repositories are usable into the future. The DCC digital lifecycle model gives a good overview of what this may entail, the specific focus here being 'preservation actions'. This is likely to require components at the local, national and international level.)

Possible components include:

 - file format registries

 - validation tools

 - representation information registries

 - etc

Feel free to edit or remove this, but here's the DCC lifecycle model in an editable form, which might be a useful structure around which to build the discussion - in particular to name some of the key components, development of which either is being, or might be, coordinated internationally:

---

The Curation Lifecycle

The DCC Curation Lifecycle Model provides a graphical high level overview of the stages required for successful curation and preservation of data from initial conceptualisation or receipt. The model can be used to plan activities within an organisation or consortium to ensure that all necessary stages are undertaken, each in the correct sequence. The model enables granular functionality to be mapped against it; to define roles and responsibilities, and build a framework of standards and technologies to implement. It can help with the process of identifying additional steps which may be required, or actions which are not required by certain situations or disciplines, and ensuring that processes and policies are adequately documented.

Data (Digital Objects or Databases)

Data, any information in binary digital form, is at the centre of the Curation Lifecycle. This includes:

Digital Objects:

- Simple Digital Objects are discrete digital items; such as textual files, images or sound files, along with their related identifiers and metadata.

- Complex Digital Objects are discrete digital objects, made by combining a number of other digital objects, such as websites.

Databases - Structured collections of records or data stored in a computer system.

Full Lifecycle Actions

Description and Representation Information: Assign administrative, descriptive, technical, structural and preservation metadata, using appropriate standards, to ensure adequate description and control over the long-term. Collect and assign representation information required to understand and render both the digital material and the associated metadata.

Preservation Planning: Plan for preservation throughout the curation lifecycle of digital material. This would include plans for management and administration of all curation lifecycle actions.

Community Watch and Participation: Maintain a watch on appropriate community activities, and participate in the development of shared standards, tools and suitable software.

Curate and Preserve: Be aware of, and undertake management and administrative actions planned to promote curation and preservation throughout the curation lifecycle.

Sequential Actions

Conceptualise: Conceive and plan the creation of data, including capture method and storage options.

Create or Receive: Create data including administrative, descriptive, structural and technical metadata. Preservation metadata may also be added at the time of creation.  Receive data, in accordance with documented collecting policies, from data creators, other archives, repositories or data centres, and if required assign appropriate metadata.

Appraise and Select: Evaluate data and select for long-term curation and preservation. Adhere to documented guidance, policies or legal requirements.

Ingest: Transfer data to an archive, repository, data centre or other custodian. Adhere to documented guidance, policies or legal requirements.

Preservation Action: Undertake actions to ensure long-term preservation and retention of the authoritative nature of data. Preservation actions should ensure that data remains authentic, reliable and usable while maintaining its integrity. Actions include data cleaning, validation, assigning preservation metadata, assigning representation information and ensuring acceptable data structures or file formats.
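As a rough illustration of what "ensuring acceptable data structures or file formats" could involve at the simplest level, the sketch below identifies a file by its leading bytes and checks the result against a local policy. The `SIGNATURES` table and the `ACCEPTABLE` set are hypothetical stand-ins for a real file format registry and a real repository policy; they are assumptions made for this example, not part of the DCC model.

```python
from typing import Optional

# Hypothetical mini "format registry": leading bytes (magic numbers) -> format name.
# A real registry holds far more signatures and far richer information.
SIGNATURES = {
    b"%PDF-": "PDF",
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"PK\x03\x04": "ZIP container",
    b"%!PS": "PostScript",
}

# Assumed local policy: formats this repository is content to hold as they are.
ACCEPTABLE = {"PDF", "PNG"}


def identify(data: bytes) -> Optional[str]:
    """Return the format name whose signature matches the start of the data."""
    for magic, name in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None  # unknown format


def needs_action(data: bytes) -> bool:
    """True if the format is unknown or not on the acceptable list."""
    return identify(data) not in ACCEPTABLE
```

A file flagged by `needs_action` would only be a candidate for further attention (for example validation or migration); the decision itself remains a matter of policy.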

Store: Store the data in a secure manner adhering to relevant standards.

Access, Use and Reuse: Ensure that data is accessible to both designated users and reusers, on a day-to-day basis. This may be in the form of publicly available published information.  Robust access controls and authentication procedures may be applicable.

Transform: Create new data from the original, for example:

- By migration into a different format.

- By creating a subset, by selection or query, to create newly derived results, perhaps for publication.

Occasional Actions

Dispose: Dispose of data which has not been selected for long-term curation and preservation, in accordance with documented policies, guidance or legal requirements. Typically data may be transferred to another archive, repository, data centre or other custodian. In some instances data is destroyed. The data's nature may, for legal reasons, necessitate secure destruction.

Reappraise: Return data which fails validation procedures for further appraisal and reselection.

Migrate: Migrate data to a different format. This may be done to accord with the storage environment or to ensure the data's immunity from hardware or software obsolescence.

 

Comments from previous instance of wiki

 

 

26 Dec

Alma Swan says:

Was the effort to provide 'preservation actions', i.e. whatever it took to enable the researcher to open and read the 15-year-old paper without any difficulty despite not having a reader available, most likely to have been made at local, consortial or national (or even international) level? Which is working in practice?

 

 

      11 Feb

      Michael Day says:

      In practice, it has tended to be the larger national organisations that have begun to take responsibility for developing the main components of preservation infrastructures, e.g. registries, tools, and services. This is often because such organisations (e.g. national libraries or archives, major research libraries) have either made significant investments in purchasing (or producing) digital content themselves, or they have legislative mandates that mean that they have to engage with longer-term preservation challenges. It remains unclear how smaller organisations might engage with this wider infrastructure. One model might be for national institutions to harvest or otherwise replicate smaller collections and integrate them within their own preservation frameworks. As pointed out elsewhere on this page, this is essentially the model used by KB for dealing with the content of Dutch repositories and the collections of certain journal publishers. The main advantages of this 'top-down' approach are the increased level of control that can be enforced over content (e.g. through migration to a limited number of 'preservation-friendly' formats) and the potential for the adoption of multiple preservation strategies. Disadvantages include the lack of a robust economic model - the national-level organisations take on all of the risks and expense of long-term preservation so that the content providers don't have to - and the likelihood that the system will not work in quite the same way for more complex kinds of content, e.g. research data. The challenge will be to build an infrastructure that will enable local institutions to engage directly with preservation needs themselves. In attempting this, the most significant problems are likely to be organisational rather than technical, e.g. the building of intra-organisational trust.
      So, for example, if I build a preservation service completely dependent on someone else's file-format registry or obsolescence indicator tool, this may import a significant level of risk into my own operations.

 

07 Jan

Chris Rusbridge says:

Last year the JISC RPAG group used Ideascale to assess a number of "ideas" relating to repositories, preparatory to some work that Rachel Heery was doing. You can still see the results (which may still be active, I'm not sure) at http://jiscrepository.ideascale.com/akira/panel.do?id=784. In general, I think these ideas and the comments attached are useful inputs to a process like this one.

 

I put forward a set of ideas based on blog posts I did mid last year on "negative click" research support repositories, and it was interesting to see how those ideas rose or fell. One of the ideas relates to this use case, although it was stated more strongly as "The repository should be a full OAIS preservation system". Well, they hated it! This is the lowest ranked idea of all, 16 votes against and only 3 votes for. It appeared that repository managers (most of the audience here) do not dare to think of themselves as being responsible for preservation.

 

However, I thought to myself that they couldn't possibly really mean that. So I tried another idea: "The Repository should aspire to make contents accessible and usable over the medium term". They liked this much better, and even though it was hindered by arriving late, it still garnered net 12 votes, with 13 voting for and only one against. It came 17th out of 28 ideas, of which only one attracted more than 20 net votes.

 

What I'm trying to say here is that repository managers don't see themselves as having a long term preservation role, although they do see a near term preservation role. I'm not sure how this plays against the use case, however, but I'm new here, so I guess I'll learn!

 

 

07 Jan

Chris Rusbridge says:

Oh I meant to say that a summary of reactions to the negative click repository ideas was on the blog at http://digitalcuration.blogspot.com/2008/08/comments-on-negative-click-research.html...

 

21 Jan

Jan Hagerlid says:

Will this article be accessed from an institutional repository or from a national library, when it is 15 or 20 years old?

 

To give an example from my own setting: Shouldn't articles deposited in a Swedish institutional repository also be delivered to the National Library of Sweden within a framework of legal deposit that includes e-publications? We are waiting for a law stipulating legal e-deposit to be passed this year or next. So we don't as yet have a policy defining which kinds of documents should be included. But it would seem quite reasonable that any article written by a Swedish author and made public, in whatever version, in a (Swedish) repository should be included.

 

One component that we have in place is a resolution service using a persistent identifier, in our case a URN. This means that a user gets directed to the repository now, but at a future date - when the repository is closed down or has put a time limit to its preservation responsibility - would be directed to the copy in the national library. Also, the national library would have the responsibility for any conversion to new formats.
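The redirection described above can be sketched as a lookup table that holds several locations per identifier and returns the first one still in service. This is only an illustration under stated assumptions: the `Resolver` class, the URN and the URLs are all hypothetical and do not describe the actual Swedish resolution service.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Location:
    url: str
    active: bool = True  # False once the holder has given up responsibility


class Resolver:
    """Hypothetical persistent-identifier resolver with ordered fallback."""

    def __init__(self) -> None:
        self._table: Dict[str, List[Location]] = {}

    def register(self, urn: str, url: str) -> None:
        """Add a location for a URN; earlier registrations are preferred."""
        self._table.setdefault(urn, []).append(Location(url))

    def retire(self, urn: str, url: str) -> None:
        """Mark a location as no longer serving the item."""
        for loc in self._table.get(urn, []):
            if loc.url == url:
                loc.active = False

    def resolve(self, urn: str) -> Optional[str]:
        """Return the first active location, or None if there is none."""
        for loc in self._table.get(urn, []):
            if loc.active:
                return loc.url
        return None


# Usage: the repository copy is preferred until it is retired.
resolver = Resolver()
urn = "urn:nbn:se:example-1234"  # made-up identifier
resolver.register(urn, "https://repository.example.org/paper/1234")
resolver.register(urn, "https://library.example.org/deposit/1234")
```

The point of the design is that the identifier the user holds never changes; only the table behind it does.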

 

Have discussions or actions along these lines progressed further somewhere else, especially where you have a law for legal e-deposit?

 

      10 Feb

      leo waaijers says:

      The Swedish situation is more or less similar to the Dutch one. The e-Depot of the national library harvests all the institutional repositories incrementally each quarter. Both the metadata and the full text are harvested. The yield has to pass an ingest procedure which checks for open access of the documents and the quality of the metadata (i.e. grosso modo compliance with the DRIVER Guidelines; I am not sure if they use the DRIVER Validator for this test), adds a persistent identifier (URN), adds the technical metadata (format, application) and makes corrections that can be done automatically. Records with errors that cannot be corrected automatically are rejected for deposit in the e-Depot; the relevant institute is notified of this rejection so that they can make the necessary corrections or additions. Upon request, institutional repositories can get a download of their records and full text at any moment. In the meantime the debate between the emulation and migration schools is still ongoing.
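The ingest steps described above (check, correct where possible, identify, otherwise reject and report) can be sketched roughly as follows. This is not the e-Depot's actual procedure: the `Record` type, the `REQUIRED_FIELDS` list, the whitespace-trimming "correction" and the `urn:example:` identifiers are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Assumed minimal metadata requirement; NOT the DRIVER Guidelines themselves.
REQUIRED_FIELDS = ("title", "creator", "date")


@dataclass
class Record:
    metadata: Dict[str, str]
    open_access: bool
    file_format: str
    urn: str = ""


def ingest(records: List[Record], next_id: int = 1):
    """Split harvested records into accepted and rejected (with reasons)."""
    accepted: List[Record] = []
    rejected: List[Tuple[Record, List[str]]] = []
    for rec in records:
        # Automatic correction (illustrative): trim stray whitespace.
        rec.metadata = {k: v.strip() for k, v in rec.metadata.items()}

        problems: List[str] = []
        if not rec.open_access:
            problems.append("not open access")
        missing = [f for f in REQUIRED_FIELDS if not rec.metadata.get(f)]
        if missing:
            problems.append("missing metadata: " + ", ".join(missing))

        if problems:
            # Reported back to the institution so it can correct and resubmit.
            rejected.append((rec, problems))
            continue

        rec.urn = f"urn:example:{next_id}"  # made-up identifier scheme
        next_id += 1
        rec.metadata["format"] = rec.file_format  # technical metadata
        accepted.append(rec)
    return accepted, rejected
```

Returning the reasons alongside each rejected record is what makes the notification step possible.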

 

      The e-Depot limits itself to text, web sites and PowerPoint presentations. Two years ago the DANS institute of the Netherlands Academy started a long-term access service for research data in the humanities. Later the three technical universities joined efforts to supply a similar service for research data, models and algorithms within their area. Both services are still in a state of flux. As enhanced publications are foreseen that are compositions of both text (which might be stored in the e-Depot) and research data (which may, for example, be stored in DANS), the potential of OAI ORE was recognized at an early stage and the first demonstrators have been built.

 

      13 Feb

      SUGITA Shigeki says:

      In Japan, too, a legal framework for comprehensive archiving of internet resources by the National Diet Library is being discussed, and it will be deliberated in the Diet in 2009. It seems to be assumed that this will be done by web crawling, not by DIDL-like technology. However, we in the Japanese repository community anticipate it anyway.

 

10 Feb

Keith G Jeffery says:

As well as potential media migration for preservation, there may be rights migration, either because (a) rights expire (date) or (b) legislation or best practice changes.

 

15 Feb

Tom Baker says:

Software and APIs come and go.  Given the speed of technological change, what are the chances that any of today's repository applications will still be in use 25 years from now?  When Chris reports that "repository managers don't see themselves as having a long term preservation role, although they do see a near term preservation role," maybe they are just being realistic. 

 

Whatever happens to repository applications per se, the data will (hopefully) remain, and we need to plan for that.  I think this means evaluating the usability of repository data independently of current repository services.  Do the resources being described have curated URIs?  Is the metadata expressed (or expressible) in triples?  Are the vocabularies of properties used in the metadata properly documented and are they being accessibly preserved?

 

15 Feb

Jeremy Frumkin says:

One area I am interested in, and haven't seen much discussion around, is how we decide when not to preserve a resource. I am not certain that the goal is to preserve everything in a repository (perhaps in the short term, yes, but long term?). Are there thoughts as to programmatic approaches to preservation selection? While technology may allow us to preserve items much more thoroughly, is this a better approach than to apply an intellectual selection process resembling the archival practices for traditional materials and collections?

 

 I realise that this is covered in appraise and select in the DCC Curation Lifecycle, but again, I am not aware that any of our current digital repository solutions inherently support workflows in support of this.

 

      08 Mar

      Kevin Ashley says:

      Jeremy, if we're thinking of repositories in the broad sense, then I agree with you that selection mechanisms and automated support for them are important. But there are two reasons why I don't believe that they are significant for this workshop. One is that, for the type of repositories we are considering (research papers), post-ingest deletion is a much less common event. The other is that it's the sort of thing that is local rather than global, and we want to focus on those issues that require joined-up international action or agreement.

 

      There is one area where this does become interesting, and that's in the linking of research papers to other items, such as data, which may be held in other repositories. There will be much stronger selection pressures on such data since in some fields it has a very short useful research life. But it's important to ensure that data which supports a publication is not subject to deletion in the same way that other research data might be. Or perhaps even that is something that's open to debate.

 

            11 Mar

            Richard Boulderstone says:

            Kevin, following on from your point - an area of interest for me is how to ensure that the items at the ends of links will be there when they are needed. Should we have a mechanism to test their integrity? Should we link only to items with persistent identifiers? Should there be agreements created across repositories to guarantee the continued existence of items? Are research papers still valid if these links no longer work?

 

10 Mar

Chris Rusbridge says:

The scenario is a bit odd: the paper is in "a file format that is no longer in common use, and for which she has no reader easily available. She is able to open it and read it without any trouble, alongside more contemporary papers." The implication here is that she has not had to do anything, that a set of actions has occurred elsewhere, perhaps automatically, to allow her to read this paper.

 

The classic options here are the variants of emulation and migration, managed by the repository. Emulation in this case looks out of the question, as she is able to "read it without any trouble, alongside more contemporary papers". This implies she has not had to enter an emulation environment; I can't see how this could occur as seamlessly as the scenario suggests.

 

Migration is potentially less problematic, for her, if not the repository manager. The implication is that the file has been (or perhaps, automatically is) converted from the obsolete format to a then-current format. Now she can "read it without any trouble, alongside more contemporary papers". Personally, I don't think this should happen completely transparently, as there will be some risk of loss.

 

There are of course subtle variants. One that I have suggested elsewhere is to support the community as a whole in adding support for more formats into OpenOffice.org, which could then increasingly act as a "migration on request" tool. If this were the case, then her situation would not arise in the form described, as she would be able to read these obsolete document formats with an available tool. This then could be an international action: support such work in the open source community.

 

10 Mar

Chris Rusbridge says:

I wanted to comment further that the scenario is specific about being a document format. If she were interested in obsolete data, the situation might be much more complex. More information would be needed about not only the format but the context of the data. OAIS lumps this together as Representation Information, not entirely helpfully in my view. The situation is more subtle than the current draft of OAIS describes, and maybe we should be pursuing a further dialogue to analyse it more closely. Or maybe it is so situation-dependent (or discipline-dependent, or whatever) that such a dialogue would be fruitless.

 

---