A warm thanks to Mogens and Andrew from the 'Deposit' - use cases for the excellent structure. We directly copied it here ...
Use cases
- Scientific Open Access Search Engine: A lecturer preparing materials for a seminar in Machine Learning wants to compile for the students a collection of relevant articles, images, primary data and software. Because national copyright regulations force him to report every single use of licensed material to his university, he decides to only use public domain (PD) and open access (OA) material for his seminar. Thus, in order to collect his materials, he wants to use a single search engine that only returns search results that are scientific as well as PD or OA.
- Embedded Data: An instructor teaching a university astronomy course wants her students to read some papers about latest observations of the Omega Nebula and manipulate some of the observational data. To make manipulation easier she wants the data represented in the VOTable format, developed by the National Virtual Observatory. She therefore wants to find and access papers about the Omega Nebula in repositories that have embedded data sets in this format, and also get access to that data.
- Text-Mining for Literature Review: A post-doc from an Ecosystems Biology department is responsible for compiling a list of the 20 most relevant new publications for the weekly literature discussion meeting in the department. The department performs research on the population dynamics of lemmings based on diverse research fields such as genetics, evolutionary algorithms, behavioural studies, radio signal tracking etc. New publications in all these areas amount to 5000 items each week, so she uses text-mining techniques to generate the list. Experience has shown her that only a mining method that combines a full-text corpus with classification algorithms based on controlled document metadata does the job well enough.
- Scientometrics: An Information Science Ph.D. student is investigating the nature of communities in different scientific disciplines (e.g., how co-authorship and citation patterns correlate with social links such as Facebook, academic links, etc.). To do this she wants to access papers in repositories related to the target disciplines, extract authorship metadata and citations, and then populate a semantic relationship graph using data from social networking sites, academic web pages, online CVs, etc.
Components
Which infrastructure components will be needed to make this work effectively? (NB- I've started to link to some of the briefing materials from here- NJ)
- The Web
- Repositories (software / support organisations )
- storage and access to multiple types of resources (text, primary and secondary data, images, video, software, etc.)
- availability of metadata (e.g., via OAI-PMH) that includes machine-readable rights
- representation of object structure and linkages (OAI-ORE, SWAP)
- plug-in service interface (ability to manipulate contents through multiple services), including common text mining API?
- Registries
- Semantic services
- representation of, manipulation of, and querying of cross-repository object linkages
- machine-understandable licences associated with each object, implying common agreement on semantics across jurisdictions, including the meaning of 'attribution' in a text-mining context
- Social network sites
- open access and APIs to embedded information
- Open citation data?
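To illustrate the "machine-readable rights" component listed above, here is a minimal sketch in Python of how a service might decide whether a harvested Dublin Core record is openly licensed. The sample record, the list of licence URI prefixes and the function name `is_open` are assumptions made for illustration only; real repositories fill dc:rights in many different ways, and which licences count as OA or PD is exactly the kind of common agreement this page calls for.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set used inside oai_dc records.
DC = "{http://purl.org/dc/elements/1.1/}"

# Hypothetical sample record; not taken from any real repository.
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Observations of the Omega Nebula</dc:title>
  <dc:rights>http://creativecommons.org/licenses/by/3.0/</dc:rights>
</oai_dc:dc>"""

# Assumption: a rights value that is a Creative Commons licence or
# public-domain URI is treated as "open" for the purposes of this sketch.
OPEN_PREFIXES = (
    "http://creativecommons.org/licenses/",
    "https://creativecommons.org/licenses/",
    "http://creativecommons.org/publicdomain/",
    "https://creativecommons.org/publicdomain/",
)


def is_open(record_xml):
    """Return True if any dc:rights value starts with a known open-licence URI."""
    root = ET.fromstring(record_xml)
    rights = [(el.text or "").strip() for el in root.iter(DC + "rights")]
    return any(value.startswith(OPEN_PREFIXES) for value in rights)
```

A free-text statement such as "All rights reserved" is not matched by this check, which is one reason why rights expressed as agreed URIs are easier to process automatically than rights expressed as prose.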
Technologies
- Repository platforms with open-access APIs
- Web architecture
- OAI-PMH, RSS, ATOM
- OAI-ORE, metadata profiles (eg, SWAP)
- Services (RESTful, XML-RPC, SOAP)
(specific software packages such as search engines, mining tools etc. are not considered here)
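As a small illustration of the OAI-PMH item in the list above, the following Python sketch builds ListRecords request URLs. The base URL and the set name are hypothetical placeholders, and the function name `list_records_url` is our own. The one protocol detail it relies on is that, in OAI-PMH 2.0, resumptionToken is an exclusive argument: a follow-up request carries only the verb and the token.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None,
                     resumption_token=None):
    """Build an OAI-PMH ListRecords request URL.

    When a resumption token is given, it is sent alone with the verb,
    because resumptionToken is an exclusive argument in OAI-PMH 2.0.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            # Optional selective harvesting by set.
            params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Hypothetical repository endpoint and set name, for illustration only.
first_page = list_records_url("http://repository.example.org/oai", set_spec="fulltext")
next_page = list_records_url("http://repository.example.org/oai", resumption_token="abc123")
```

A harvester would fetch the first URL, read the resumptionToken element from the response if one is present, and keep requesting follow-up pages until the token comes back empty or is absent.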
Action Plan Development
We would now like your input to start developing the Action Plan for this area ahead of the workshop. Specifically, we would like you to:
- See if you think these are the most helpful use cases
- Please bear in mind that we need to be focussed and can't just multiply use cases indefinitely - we need to work with use cases that relate to what our users want to do, and that will tease out the necessary technology components.
- Edit the use case descriptions to make sure they capture the core of the issue
- the temptation here is to either try for the too-general or the too-specific; try to tread a middle line!
- Start to identify the components that might be needed for each use case
- It would be useful if we could start to converge on a small set of such components, so it might be a good idea to have a look at the other themes as they develop to see if they have identified possible useful components
Carl Lagoze and Wolfram Horstmann.
-----------
Comments pasted from previous wiki instance:
26 Dec
Alma Swan says:
I am familiar with text-mining developments in the UK, and will be trying to put together the fuller picture showing text-mining initiatives around the world. Does anyone know anything about text-mining technologies in regions/countries with non-Latin alphabet-based languages - e.g. Cyrillic, Arabic, Japanese, Chinese, etc?
05 Feb
James Farnhill says:
Alma, whilst I can't point you to specific initiatives I know that the National Centre for Text Mining at Manchester has strong links with the Tsujii Labs based in the University of Tokyo so may be able to point you in the right direction. Sophia Ananiadou (sophia.ananiadou@manchester.ac.uk) would be a good first contact. I hope that is helpful.
06 Feb
Alma Swan says:
Excellent. Thanks, James.
11 Feb
Brian Rea says:
Alma, drop me a line, I'll arrange a call with Sophia and/or Prof Tsujii, who would further discuss some of the work on text mining in Japanese/Chinese.
10 Feb
Keith G Jeffery says:
Use case 1: here it is very important to have descriptive metadata, restrictive (rights) metadata and contextual metadata - the latter to assist in evaluating the potential use of the object(s) for the purpose. The metadata needs to have formal syntax and defined semantics for processing to be automated.
Use case 2: there is a question about data in papers and data stored as data. There has been much work in the chemistry domain in Fraunhofer over 'scraping' data from tables and graphs in papers. Personally I believe it is better to store the raw data in repositories (separate from the publication repositories because of different access patterns, rights, performance, security... and the much more extensive metadata (and access to software) required for handling research datasets). STFC has proposed an appropriate format and has demonstrated it (also linking datasets to papers in separate repositories). See http://epubs.cclrc.ac.uk/bitstream/744/05__Brian_Matthews___csmd_core_grid_poland.pdf
Use case 3: all experience says that it is necessary to combine full text corpus with classificatory information and metadata. However, the metadata must include supportive associative metadata, namely lexicons, dictionaries, thesauri, ontologies - and all multilingual - to allow search optimisation for recall and relevance. Mining for multimedia is MUCH more difficult (video where a Shakespearean actor exits stage left after killing someone), and similarly audio (phrases in Mozart which follow patterns in Bach).
Use case 4: see IST-WORLD project http://www.ist-world.org/ - it shows what can be done using CERIF (www.eurocris.org/cerif)
16 Feb
Andrew Treloar says:
Re Use-case 2, I agree that our ability to access data is constrained by a range of things. See the Data Problems in Published Literature section in http://ands.org.au/prdla2008keynote.pptx.pdf for one typology.
11 Feb
leo waaijers says:
Since I am working outside the academic domain now, I am a typical example of use case 1. In more than 50% of the cases where I want to read an article I first have to produce my credit card number and agree that my bank account will be debited with an amount of Euro 25 or so; the same amount that I pay in the book shop for a complete novel! I would be greatly helped by an Open Access or Public Domain subset of Google Scholar, something like 'Google Scholar Open' or 'Google Scholar/Creative Commons'. Google's browsing is sophisticated, the bibliographic details are good enough and the citations are there. Can't we urge Google to create something like that? It wouldn't be too difficult for them as they are able to index (parts of) the full text as well. And it might be an extra incentive for creators and funders to make their stuff openly accessible.
11 Feb
Wolfram Horstmann says:
Leo, I couldn't agree more -- and I have been asking myself for a long time why they don't offer an "open-full-text" search option (maybe even in Google rather than in Google Scholar because the coverage in the latter is much worse than in the former) that is comparable to the Creative-Commons-Search for images available in 'flickr'. Having said this, I would like to add that a systemic approach based on structured data in repositories (e.g. an OAI-PMH "full-text-set" as proposed in the DRIVER-Guidelines combined with a clear URI policy for referencing the full-text) should be the superior solution in the long run. But this is long-term, technical policy work. And another question relevant in this context: "Why aren't there robust and reliable algorithmic solutions available in the repository domain that sort out the full-text URIs from PMH-records and splash-pages (to separate them from metadata-only records)?" Yes, there are a lot of problems -- but the problems have been known for years now!? Is it really so hard to do this that there is still no "shelf" with solutions for indexing initiatives?
11 Feb
Brian Rea says:
Use case 1: See the INTUTE Repository Search Project - Aggregates metadata from UK open access repositories, extracts and indexes full texts (where possible), automatically generates new metadata based on full text content and then allows users to search all repositories from a single interface.
Use Case 2:
* Why only link to the paper? Suitable markup and indexing [at which point in the repository life cycle?] can take users to relevant sections discussing the data.
* What if article X brings together data in articles X, Y and Z - will ORE style containers help here, and who defines the links in the first instance?
* Name Authority is required here not just for authors but also entities - see KLEIO - named entity recognition, term normalization and linking to community defined unique identifiers
Use Case 3:
* Scalability can be found in text mining over manual solutions but accuracy in the results is clearly dependent on user query definition (assisted?), metadata available and ranking algorithms. Mapping between query and metadata can be challenging.
* Whilst controlled vocabularies offer a good solution for search/browse, these are often incomplete and domain specific, and unless carefully constructed can often fail to match the language used by readers.
* Manually curated metadata can be subjective, incomplete and generally does not cover every possible use of the information held in the object.
* Automated metadata creation can assist here but will have different results at the various stages in the repository lifecycle:
o Assisted authorship tools
o Social networking aspects
o Unsupervised automatic curation after deposit
o Runtime annotation based on context of search
* Semantics can shift between domains and over time (preservation issues). Mapping is required between domains and even across metadata standards, and this needs to be updated over time [by whom author/curator/owner/institution/...?]
* User agents can be of use for repetitive searches
Use case 4: See BioMedExperts but bear in mind issues above for use of controlled vocabularies.
11 Feb
Arunachalam Subbiah says:
Regarding Scientific Open Access Search Engine: I draw your attention to Open J-Gate <www.openj-gate.com>.
Open J-Gate is an electronic gateway to global journal literature in open access domain. Launched in 2006, Open J-Gate is the contribution of Informatics (India) Ltd to promote OAI. Open J-Gate provides seamless access to millions of journal articles available online. Open J-Gate is also a database of journal literature, indexed from 4784 open access journals (including both peer-reviewed journals and trade magazines), with links to full text at Publisher sites.
13 Feb
SUGITA Shigeki says:
How about metadata standardization for harvesting regionally (or globally)? DRIVER guidelines, SWAP, ..?
In Japan, we use the 'junii2' schema as the OAI-PMH metadataFormat, which is stipulated and maintained by the National Institute of Informatics (NII) and used by the nationwide harvester, 'JAIRO'.
Among the features of junii2 are (1) a controlled vocabulary for resource types and version information (unfortunately, not compliant with VIF or other standards) and (2) elements for OpenURL-compliant bibliographic citation.
The aforementioned JAIRO takes advantage of junii2, and the 'AIRway' project (http://airway.lib.hokudai.ac.jp) also benefits from the standard. AIRway is a knowledge base for link resolvers (like CrossRef) and guides non-licensed users to open access copies, such as authors' self-archived drafts in IRs.
In short, AIRway is a database which bundles multiple versions of research papers. It's just a quick fix based on existing technology such as OpenURL, and I hope the same goal will be achieved with OAI-ORE and maybe with persistent identifiers for repository items.
15 Feb
Tom Baker says:
I agree with Carl and Wolfram when they say (at https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind0901&L=REPOSITORIES-INFRASTRUCTURE&P=72): "we envision an access infrastructure where the boundaries between institutional repositories, commercial web sites (e.g. Amazon), and social networking sites are transparent, allowing researchers to explore the relationships between publications, data, and individuals."
All of these things are "on the Web", so I have added "The Web" to the list of infrastructure components above. Maybe this just goes without saying but I think it is worth emphasizing here.
For the Use Cases presented above, are we to assume that information seekers are limiting their searches to a bounded world of repositories? If so, then the challenge is merely to create services and interfaces that will support these requirements. If not, then it means creating repositories that are grounded in URI space and play well in the Web.
I agree that metadata needs to have a formal syntax and defined semantics for processing to be automated but would add that for use outside a bounded world of repositories, metadata should be expressed (or expressible) in the language of URIs and triples on the basis of Web architecture.
15 Feb
Jeremy Frumkin says:
I am wondering if governance / organization should be included somewhere here. What is needed to include a collection / repository within the scope of an international federation of repositories or digital libraries? While this isn't solely an access issue, it relates to Tom's comment above, and how a user's discovery experience is scoped based on their need. It seems that while it is important to break down the artificial barriers between digital libraries and the rest of the web, there is still the need to be able to retain context, at the very least for scoping purposes.
16 Feb
leo waaijers says:
Why should authors switch from their current practice of writing an article that refers to all sorts of additional material (research data, algorithms, visualisations, references, etc.) via hyperlinks to defining a Resource Map in the sense of OAI ORE? No doubt, a Resource Map with well defined triples is more sophisticated and better prepared for the semantic web than a hyperlinked piece of text. But it is also more laborious and complex to produce. What then are the incentives for authors to make the step; what makes them tick?