Which infrastructure components will be needed to make this work effectively? (NB- I've started to link to some of the briefing materials from here- NJ)
We would now like your input to start developing the Action Plan for this area ahead of the workshop. Specifically, we would like you to:
Carl Lagoze and Wolfram Horstmann.
-----------
26 Dec
Alma Swan says:
I am familiar with text-mining developments in the UK, and will be trying to put together the fuller picture showing text-mining initiatives around the world. Does anyone know anything about text-mining technologies in regions/countries with non-Latin alphabet-based languages - e.g. Cyrillic, Arabic, Japanese, Chinese, etc?
05 Feb
James Farnhill says:
Alma, whilst I can't point you to specific initiatives, I know that the National Centre for Text Mining at Manchester has strong links with the Tsujii Labs at the University of Tokyo, so they may be able to point you in the right direction. Sophia Ananiadou (sophia.ananiadou@manchester.ac.uk) would be a good first contact. I hope that is helpful.
06 Feb
Alma Swan says:
Excellent. Thanks, James.
11 Feb
Brian Rea says:
Alma, drop me a line and I'll arrange a call with Sophia and/or Prof Tsujii, who can discuss further some of the work on text mining in Japanese and Chinese.
10 Feb
Keith G Jeffery says:
Use case 1: here it is very important to have descriptive metadata, restrictive (rights) metadata and contextual metadata - the latter to assist in evaluating the potential use of the object(s) for the purpose. The metadata needs to have a formal syntax and defined semantics for processing to be automated.
Use case 2: there is a question about data in papers versus data stored as data. There has been much work in the chemistry domain at Fraunhofer on 'scraping' data from tables and graphs in papers. Personally, I believe it is better to store the raw data in repositories separate from the publication repositories, because of different access patterns, rights, performance, security and much more, and because much more extensive metadata (and access to software) is required for handling research datasets. STFC has proposed an appropriate format and has demonstrated it (also linking datasets to papers in separate repositories). See http://epubs.cclrc.ac.uk/bitstream/744/05__Brian_Matthews___csmd_core_grid_poland.pdf
Use case 3: all experience says that it is necessary to combine a full-text corpus with classificatory information and metadata. However, the metadata must include supportive associative metadata, namely lexicons, dictionaries, thesauri and ontologies - all multilingual - to allow search optimisation for recall and relevance. Mining multimedia is MUCH more difficult (video: a Shakespearean actor exits stage left after killing someone), and similarly audio (phrases in Mozart which follow patterns in Bach).
Use case 4: see the IST-WORLD project http://www.ist-world.org/ - it shows what can be done using CERIF (www.eurocris.org/cerif).
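The "formal syntax and defined semantics" Keith calls for under use case 1 can be made concrete with a small sketch. The following is an illustrative Python fragment (the title, rights and description values are invented) that serialises descriptive, rights and contextual metadata as an oai_dc record, using the standard OAI and Dublin Core namespaces:

```python
import xml.etree.ElementTree as ET

# Standard namespaces for OAI-PMH Dublin Core records.
OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"

def build_record(title, rights, description):
    """Build a minimal oai_dc record carrying descriptive (title),
    restrictive (rights) and contextual (description) metadata."""
    ET.register_namespace("oai_dc", OAI_DC)
    ET.register_namespace("dc", DC)
    root = ET.Element(f"{{{OAI_DC}}}dc")
    for tag, value in (("title", title),
                       ("rights", rights),
                       ("description", description)):
        el = ET.SubElement(root, f"{{{DC}}}{tag}")
        el.text = value
    return ET.tostring(root, encoding="unicode")

print(build_record(
    "Crystal growth measurements",
    "http://creativecommons.org/licenses/by/4.0/",
    "Raw beamline data, suitable for re-analysis"))
```

Because every element has a defined namespace and meaning, an automated agent can evaluate the record without human interpretation, which is exactly the property Keith's use case 1 depends on.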
16 Feb
Andrew Treloar says:
Re use case 2, I agree that our ability to access data is constrained by a range of things. See the 'Data Problems in Published Literature' section in http://ands.org.au/prdla2008keynote.pptx.pdf for one typology.
11 Feb
leo waaijers says:
Since I am now working outside the academic domain, I am exemplary for use case 1. In more than 50% of the cases where I want to read an article, I first have to produce my credit card number and agree that my bank account will be debited by an amount of EUR 25 or so; the same amount that I pay in a book shop for a complete novel! I would be greatly helped by an Open Access or Public Domain subset of Google Scholar, something like 'Google Scholar Open' or 'Google Scholar/Creative Commons'. Google's browsing is sophisticated, the bibliographic details are good enough and the citations are there. Can't we urge Google to create something like that? It wouldn't be too difficult for them, as they are able to index (parts of) the full text as well. And it might be an extra incentive for creators and funders to make their work openly accessible.
11 Feb
Wolfram Horstmann says:
Leo, I couldn't agree more -- and I have been asking myself for a long time why they don't offer an "open full-text" search option (maybe even in Google rather than in Google Scholar, because the coverage in the latter is much worse than in the former) comparable to the Creative Commons search for images available in flickr. Having said this, I would like to add that a systemic approach based on structured data in repositories (e.g. an OAI-PMH "full-text set" as proposed in the DRIVER Guidelines, combined with a clear URI policy for referencing the full text) should be the superior solution in the long run. But this is long-term, technical policy work. And another question relevant in this context: why aren't there robust and reliable algorithmic solutions available in the repository domain that sort out the full-text URIs from PMH records and splash pages (to separate them from metadata-only records)? Yes, there are a lot of problems -- but the problems have been known for years now. Is it really so hard to do this, such that there is still no "shelf" of solutions for indexing initiatives?
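Wolfram's question about sorting full-text URIs out of PMH records can at least be sketched. The fragment below is a toy heuristic in Python, not a production harvester; the repository URLs are invented, and real implementations (e.g. those following the DRIVER Guidelines) would also inspect MIME types and per-repository URL patterns:

```python
from urllib.parse import urlparse

# File extensions that usually signal a directly fetchable full text.
FULLTEXT_EXTENSIONS = (".pdf", ".ps", ".doc", ".odt", ".txt")

def pick_fulltext_uris(identifiers):
    """Heuristically split a record's dc:identifier values into probable
    full-text URIs and probable splash pages / metadata-only links, by
    checking the URL path for a known document file extension."""
    fulltext, other = [], []
    for uri in identifiers:
        path = urlparse(uri).path.lower()
        (fulltext if path.endswith(FULLTEXT_EXTENSIONS) else other).append(uri)
    return fulltext, other

# Example: one splash page, one direct bitstream link (both URLs invented).
ft, rest = pick_fulltext_uris([
    "http://repository.example.org/record/123",
    "http://repository.example.org/bitstream/123/paper.pdf",
])
```

The hard part Wolfram points at is precisely everything this sketch ignores: splash pages that do end in .pdf, bitstreams served without extensions, and access-controlled links that only look open.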
11 Feb
Brian Rea says:
Use case 1: see the INTUTE Repository Search Project - it aggregates metadata from UK open access repositories, extracts and indexes full texts (where possible), automatically generates new metadata based on full-text content and then allows users to search all repositories from a single interface.
Use Case 2:
* Why only link to the paper? Suitable markup and indexing [at which point in the repository life cycle?] can take users to the relevant sections discussing the data.
* What if article X brings together data in articles X, Y and Z - will ORE-style containers help here, and who defines the links in the first instance?
* Name authority is required here, not just for authors but also for entities - see KLEIO: named entity recognition, term normalisation and linking to community-defined unique identifiers.
Use Case 3:
* Text mining scales where manual solutions do not, but accuracy in the results is clearly dependent on the user's query definition (assisted?), the metadata available and the ranking algorithms. Mapping between query and metadata can be challenging.
* Whilst controlled vocabularies offer a good solution for search/browse, they are often incomplete and domain-specific, and unless carefully constructed can often fail to match the language used by readers.
* Manually curated metadata can be subjective, incomplete and generally does not cover every possible use of the information held in the object.
* Automated metadata creation can assist here but will have different results at the various stages in the repository lifecycle:
o Assisted authorship tools
o Social networking aspects
o Unsupervised automatic curation after deposit
o Runtime annotation based on context of search
* Semantics can shift between domains and over time (preservation issues). Mapping is required between domains and even across metadata standards, and this needs to be updated over time [by whom: author/curator/owner/institution/...?]
* User agents can be of use for repetitive searches
Use case 4: see BioMedExperts, but bear in mind the issues above concerning the use of controlled vocabularies.
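The name-authority point under use case 2 above (KLEIO-style named entity recognition and term normalisation) can be illustrated with a toy dictionary-based normaliser. This is a minimal sketch in Python; the authority table, its surface forms and the UniProt identifier are merely illustrative, and a real system would use proper tokenisation and disambiguation rather than substring matching:

```python
# Toy authority file: surface forms -> community-defined unique identifier.
AUTHORITY = {
    "p53": "UniProt:P04637",
    "tp53": "UniProt:P04637",
    "tumor protein p53": "UniProt:P04637",
}

def normalize_terms(text):
    """Scan the text for known surface forms and return the sorted set of
    unique identifiers they map to, so that differently worded mentions
    of the same entity collapse onto one identifier."""
    lowered = text.lower()
    found = {ident for surface, ident in AUTHORITY.items() if surface in lowered}
    return sorted(found)
```

The value for search is that "p53", "TP53" and "tumor protein p53" all resolve to the same identifier, so a query against the identifier retrieves every variant.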
11 Feb
Arunachalam Subbiah says:
Regarding Scientific Open Access Search Engine: I draw your attention to Open J-Gate
<www.openj-gate.com>.
Open J-Gate is an electronic gateway to global journal literature in the open access domain. Launched in 2006, Open J-Gate is the contribution of Informatics (India) Ltd to promoting OAI. It provides seamless access to millions of journal articles available online. Open J-Gate is also a database of journal literature, indexed from 4,784 open access journals (including both peer-reviewed journals and trade magazines), with links to full text at publisher sites.
13 Feb
SUGITA Shigeki says:
How about metadata standardization for harvesting regionally (or globally)? DRIVER guidelines, SWAP, ..?
In Japan, we use the 'junii2' schema as the OAI-PMH metadataFormat, which is stipulated and maintained by the National Institute of Informatics (NII) and used by the nationwide harvester, 'JAIRO.'
Among the features of junii2 are (1) a controlled vocabulary for resource types and version information (unfortunately, not compliant with VIF or other standards) and (2) elements for OpenURL-compliant bibliographic citations.
The aforementioned JAIRO takes advantage of junii2, and the 'AIRway' project (http://airway.lib.hokudai.ac.jp) also benefits from the standard. AIRway is a knowledge base for link resolvers (like CrossRef), and guides non-licensed users to open access copies, such as self-archived authors' drafts in IRs.
In short, AIRway is a database which bundles multiple versions of research papers. It's just a quick fix based on existing technology such as OpenURL, and I hope the same goal will be achieved with OAI-ORE, and maybe with persistent identifiers for repository items.
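The OpenURL-compliant citation elements mentioned above are what a link resolver such as AIRway consumes. Below is a minimal sketch in Python of building an OpenURL 1.0 key/encoded-value (KEV) query for a journal article; the resolver base URL and the sample citation are placeholders:

```python
from urllib.parse import urlencode

def openurl_query(base_url, atitle, jtitle, volume, spage, date, issn):
    """Build an OpenURL 1.0 KEV query describing a journal article, the
    kind of context object a link resolver uses to locate a copy."""
    params = {
        "url_ver": "Z39.88-2004",                      # OpenURL 1.0 standard
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal", # journal metadata format
        "rft.atitle": atitle,
        "rft.jtitle": jtitle,
        "rft.volume": volume,
        "rft.spage": spage,
        "rft.date": date,
        "rft.issn": issn,
    }
    return base_url + "?" + urlencode(params)

# Example with an invented resolver address and citation:
link = openurl_query("http://resolver.example.org/openurl",
                     "Open access overview", "Learned Publishing",
                     "20", "1", "2007", "0953-1513")
```

A resolver that also consults a knowledge base of open access copies (which is AIRway's role) can then redirect a non-licensed user to a self-archived draft instead of a paywall.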
15 Feb
Tom Baker says:
I agree with Carl and Wolfram when they say (at https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind0901&L=REPOSITORIES-INFRASTRUCTURE&P=72): "we envision an access infrastructure where the boundaries between institutional repositories, commercial web sites (e.g. Amazon), and social networking sites are transparent, allowing researchers to explore the relationships between publications, data, and individuals."
All of these things are "on the Web", so I have added "The Web" to the list of infrastructure components above. Maybe this just goes without saying but I think it is worth emphasizing here.
For the Use Cases presented above, are we to assume that information seekers are limiting their searches to a bounded world of repositories? If so, then the challenge is merely to create services and interfaces that will support these requirements. If not, then it means creating repositories that are grounded in URI space and play well in the Web.
I agree that metadata needs to have a formal syntax and defined semantics for processing to be automated, but I would add that for use outside a bounded world of repositories, metadata should be expressed (or expressible) in the language of URIs and triples on the basis of Web architecture.
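Expressing repository metadata "in the language of URIs and triples" could look like the following minimal N-Triples sketch in Python. The record and full-text URIs are placeholders; the predicates are standard Dublin Core terms:

```python
def record_triples(record_uri, title, fulltext_uri):
    """Express a repository record as N-Triples so it can be consumed
    by generic Web clients, not only by repository-aware services."""
    DC_TITLE = "http://purl.org/dc/elements/1.1/title"
    DCT_HAS_FORMAT = "http://purl.org/dc/terms/hasFormat"
    return "\n".join([
        f'<{record_uri}> <{DC_TITLE}> "{title}" .',
        f"<{record_uri}> <{DCT_HAS_FORMAT}> <{fulltext_uri}> .",
    ])

# Example with invented URIs:
print(record_triples("http://repository.example.org/id/123",
                     "A sample article",
                     "http://repository.example.org/files/123.pdf"))
```

Once the record and the full text each have a stable URI, the "bounded world" problem disappears: anything that can dereference a URI can follow these links, without knowing OAI-PMH at all.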
15 Feb
Jeremy Frumkin says:
I am wondering if governance / organization should be included somewhere here. What is needed to include a collection / repository within the scope of an international federation of repositories or digital libraries? While this isn't solely an access issue, it relates to Tom's comment above, and how a user's discovery experience is scoped based on their need. It seems that while it is important to break down the artificial barriers between digital libraries and the rest of the web, there is still the need to be able to retain context, at the very least for scoping purposes.
16 Feb
leo waaijers says:
Why should authors switch from their current practice of writing an article that refers to all sorts of additional material (research data, algorithms, visualisations, references, etc.) via hyperlinks to defining a Resource Map in the sense of OAI-ORE? No doubt, a Resource Map with well-defined triples is more sophisticated and better prepared for the semantic web than a hyperlinked piece of text. But it is also more laborious and complex to produce. What, then, are the incentives for authors to make the step; what makes them tick?
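To make Leo's point about authoring effort concrete, here is a minimal sketch in Python of what even a bare-bones OAI-ORE Resource Map involves (all URIs below are placeholders): a map resource that describes an aggregation, which in turn aggregates the article and its supplementary materials.

```python
ORE = "http://www.openarchives.org/ore/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def resource_map(rem_uri, aggregation_uri, aggregated_uris):
    """Serialise a minimal OAI-ORE Resource Map as N-Triples: the map
    describes an Aggregation, and the Aggregation aggregates each of
    the listed resources (article, data, visualisations, ...)."""
    triples = [
        f"<{rem_uri}> <{ORE}describes> <{aggregation_uri}> .",
        f"<{aggregation_uri}> <{RDF_TYPE}> <{ORE}Aggregation> .",
    ]
    for uri in aggregated_uris:
        triples.append(f"<{aggregation_uri}> <{ORE}aggregates> <{uri}> .")
    return "\n".join(triples)
```

Compared with simply typing a hyperlink into the article text, the author now has to mint URIs for the map and the aggregation and maintain the triple structure, which is exactly the extra labour Leo asks about; the plausible answer is that this work falls to tooling and repositories, not to authors by hand.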