Roberts, Dave; King, David; Rycroft, Simon; Morse, David; Penev, Lyubomir; Agosti, Donat and Smith, Vince
Community web sites: small pieces loosely joined.
In: 4th International Conference on Metadata and Semantics Research, 20-22 Oct 2010, Alcalá de Henares, Madrid, Spain.
Metadata are, in essence, information about a resource that allows its retrieval when needed for a particular purpose, functionally equivalent to an index entry. Some resources, such as pictures, video and sound, cannot be searched directly for particular content, so they need an associated text element that describes the resource and can be used to recover a specific item from amongst many.
This is not, of itself, a difficult thing to do, but it does represent a significant task overhead. People building data resources for a particular purpose will include minimal metadata that is sufficient to solve the immediate task in hand and will not, generally, invest the additional time to build more extensive and more broadly useful metadata.
Scratchpads are an intuitive web application that enables researchers collaboratively to build, share, manage and publish their biodiversity data online. The key concepts here are that many individuals contribute small items of information that are joined in a flexible architecture. By pooling these resources and by sharing in the development of the architecture, collaborative communities build up and the web site becomes the repository and the resource for further work.
Tools for adding metadata
Where resources are not immediately machine-readable, such as images, there are no practical alternatives to tagging 'by hand'. Tools have been developed within the Scratchpad environment that allow bulk annotation of images. A group of users define a data structure to contain the metadata, then multiple images can be selected and information common to the group (e.g. locality, expedition, species name) can be added to the relevant fields in one process. The only advantage that this offers is a reduction in the repetitive labour of tagging many pictures, which, while not a significant advance in IT strategy, is a major benefit to those actually doing the work.
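The bulk-annotation step can be sketched as follows. This is a minimal illustration, not the Scratchpad implementation: the `ImageRecord` class, the `bulk_annotate` function, the file names and the field values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ImageRecord:
    """One image plus its metadata fields (hypothetical structure)."""
    filename: str
    metadata: dict = field(default_factory=dict)


def bulk_annotate(images, selected, common_fields):
    """Copy the shared field values onto every selected image.

    Values already present on an image are kept, so a per-image value
    entered earlier is not overwritten by the group-level value.
    """
    for image in images:
        if image.filename in selected:
            for key, value in common_fields.items():
                image.metadata.setdefault(key, value)
    return images


# Illustrative data only.
images = [
    ImageRecord("img_001.jpg"),
    ImageRecord("img_002.jpg"),
    ImageRecord("img_003.jpg"),
]
bulk_annotate(
    images,
    selected={"img_001.jpg", "img_002.jpg"},
    common_fields={"locality": "Example locality", "expedition": "Example expedition"},
)
```

Only the two selected images receive the shared fields; the third is left untouched and can be tagged separately.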
Organisation of Information
Individuals will easily create or identify resources that are relevant to a particular study domain using one of a range of tools, including personal bibliographies, Google searches, specialised databases such as EMBL, bibliometric tools, e.g. the Web of Science, and so on. From those resources, a term-list can be built which can be used to create a controlled vocabulary. The controlled vocabulary can be used to create a formal ontology. In the task of organising and recovering information the most immediately useful of these stages is the controlled vocabulary, especially if each term is mapped to a set of synonyms (a thesaurus). Each developmental step, however, requires a significant input of effort, e.g. the extraction of a term list from resources. At each stage the person doing the work has to be confident that the organised product will be of enough utility to repay the labour of its creation.
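A controlled vocabulary with synonyms can be represented very simply. The sketch below assumes a plain dictionary of preferred terms and their synonyms; the function names and the vocabulary entries are illustrative and are not taken from any Scratchpad.

```python
def build_thesaurus(vocabulary):
    """Invert {preferred term: [synonyms]} into a lookup from any variant
    (lower-cased) to its preferred term."""
    lookup = {}
    for preferred, synonyms in vocabulary.items():
        lookup[preferred.lower()] = preferred
        for synonym in synonyms:
            lookup[synonym.lower()] = preferred
    return lookup


def normalise(term, lookup):
    """Return the preferred term, or None if the term is not in the vocabulary."""
    return lookup.get(term.strip().lower())


# Illustrative entries only.
vocabulary = {
    "forewing": ["fore wing", "anterior wing"],
    "antenna": ["antennae", "feeler"],
}
lookup = build_thesaurus(vocabulary)
preferred = normalise("Anterior wing", lookup)
```

Mapping every variant to one preferred term is what lets a search for "forewing" also recover resources tagged "anterior wing".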
Analysis of Resources
Users will identify resources that are in some sense relevant to their domain of interest, as described above. We can use a computer to decipher text resources, such as published papers, and identify key structural elements within the resource. This is most effectively done by using clues that conventional NLP techniques do not normally exploit: such techniques generally discard punctuation and typographical cues to leave only the text, which makes looking for key terms far more difficult than it need be.
It is part of the scientific tradition that in formal descriptive writing we use a greater proportion of latinate words than typically occur in general text. Fairly simple rules allow us to identify candidate latin words and we can use the candidates to feed a learning algorithm, especially if we have access to a dictionary describing how those terms are used. Thus we can discriminate a text that is describing a taxon (high proportion of anatomical terms) from one that describes, for example, ecological impacts (few anatomical terms).
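The two ideas in this paragraph can be sketched as below. The suffix list is an assumption made for illustration, since the rules actually used are not given here, and the small anatomical term set stands in for the dictionary mentioned above.

```python
import re

# Assumed suffix rules for illustration; not the authors' rule set.
LATIN_SUFFIXES = ("us", "um", "ae", "is", "ii", "ensis")


def candidate_latin_words(text):
    """Return words longer than four letters that end in an assumed Latin suffix."""
    words = re.findall(r"[A-Za-z]+", text)
    return [w for w in words if len(w) > 4 and w.lower().endswith(LATIN_SUFFIXES)]


def anatomical_proportion(text, anatomical_terms):
    """Return the fraction of words in the text that are known anatomical terms."""
    words = [w.lower() for w in re.findall(r"[A-Za-z]+", text)]
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in anatomical_terms)
    return hits / len(words)


# Illustrative dictionary and texts.
ANATOMICAL = {"pronotum", "scutellum", "femur", "tibia", "antenna"}
description = "Pronotum broad, scutellum triangular, femur and tibia dark."
ecology = "The species declined after the wetland was drained."

candidates = candidate_latin_words(description)
description_score = anatomical_proportion(description, ANATOMICAL)
ecology_score = anatomical_proportion(ecology, ANATOMICAL)
```

In this toy case half the words of the descriptive sentence are anatomical terms and none of the ecological sentence's are; a real classifier would need a threshold or a trained model rather than a direct comparison.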
Once term lists start to be accumulated they can be used to lever greater meaning from a text. For example, we already have long lists of latinised species names, so we can look for those names in a text and seek to establish how they are being used: specifically, if two names occur in close proximity we can ask what the relationship is between them.
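Finding names that occur close together might look like the following sketch. It assumes exact string matching against a known name list and measures proximity in words; the function names, the window size and the sample text are all illustrative.

```python
import re
from itertools import combinations


def find_name_positions(text, known_names):
    """Return sorted (word_offset, name) pairs for each known name found in the text."""
    hits = []
    for name in known_names:
        for match in re.finditer(re.escape(name), text):
            # The number of words before the match gives its word offset.
            offset = len(text[:match.start()].split())
            hits.append((offset, name))
    return sorted(hits)


def names_in_proximity(text, known_names, window=10):
    """Return pairs of different names whose word offsets differ by at most `window`."""
    hits = find_name_positions(text, known_names)
    pairs = []
    for (pos_a, name_a), (pos_b, name_b) in combinations(hits, 2):
        if name_a != name_b and abs(pos_a - pos_b) <= window:
            pairs.append((name_a, name_b))
    return pairs


# Illustrative name list and text.
known_names = ["Apis mellifera", "Varroa destructor", "Bombus terrestris"]
text = (
    "Apis mellifera is parasitised by Varroa destructor. "
    + "filler " * 20
    + "Bombus terrestris was also recorded."
)
pairs = names_in_proximity(text, known_names)
```

The sketch only reports that two names are near each other; deciding what the relationship is (host and parasite, synonym, comparison) remains the harder step described in the text.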
Briefly, if we could isolate the proper nouns from a text block they would provide the "who", "where" and "what" metadata, leaving us with the challenge of deducing the "why".
Building Searchable Resources
There are enormous numbers of potentially suitable XML schemas available, but few that encompass what taxonomists want recorded in metadata. The Plazi project has developed an extension to the widely used NLM schema called TaxPub. The Plazi project and PenSoft Publishers have developed an assisted workflow, not automatic but using productivity tools to reduce the time needed to process a single page to a few minutes.
The use of standard XML schemas is very important because it allows the development of increasingly elegant queries. Whereas the original impetus for the Scratchpads was to mobilise taxonomic information, it quickly became apparent that data such as occurrence records have many more uses than taxonomy alone. The focus of our development efforts is, therefore, to extend the application domain into the environmental arena.
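As a rough illustration of why a shared schema makes querying easier, the sketch below filters occurrence records held in XML. The element names (`records`, `occurrence`, `taxon`, `locality`, `year`) are hypothetical and are not taken from TaxPub or the NLM schema; the data are invented.

```python
import xml.etree.ElementTree as ET

# Invented sample data using hypothetical element names.
SAMPLE = """
<records>
  <occurrence><taxon>Apis mellifera</taxon><locality>Site A</locality><year>2008</year></occurrence>
  <occurrence><taxon>Bombus terrestris</taxon><locality>Site A</locality><year>2008</year></occurrence>
  <occurrence><taxon>Apis mellifera</taxon><locality>Site B</locality><year>2009</year></occurrence>
</records>
"""


def occurrences_for_taxon(xml_text, taxon):
    """Return (locality, year) pairs for every occurrence of the given taxon."""
    root = ET.fromstring(xml_text.strip())
    results = []
    for occurrence in root.findall("occurrence"):
        if occurrence.findtext("taxon") == taxon:
            results.append(
                (occurrence.findtext("locality"), int(occurrence.findtext("year")))
            )
    return results


records = occurrences_for_taxon(SAMPLE, "Apis mellifera")
```

Because every record uses the same element names, the same query works across any collection that follows the schema, which is the point the paragraph makes about occurrence data serving uses beyond taxonomy.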
Ultimately, users will engage with any system that delivers direct and clear benefit to them personally. The underlying Scratchpad database delivers organisational benefits and is vastly easier to maintain than traditional web pages. The authors benefit from increased exposure and international recognition of their expertise. As the consortium behind a particular Scratchpad grows, the underlying database becomes a richer resource that can be used to probe different types of problem. Search structures across many Scratchpads deliver the benefit of access to information originally assembled for a different purpose (taxonomy into ecology and vice versa).
The EU project EDIT has demonstrated that existing technology is easily capable of delivering these benefits, but the barriers are sociological. It seems to be easier to build and retain engagement if progress comes as small incremental steps, each delivering a discrete benefit. Release of a complete, polished solution will generally represent a significant learning curve and require the user to change their work-practice in a significant way.
The principles described above have been brought together with the recent release of a special issue of the journal ZooKeys. Here data were entered into a Scratchpad, then, at the click of a button, rendered into an XML version that was sent to the publisher, who automatically transformed it into a PDF version that was sent to referees. The journal's editorial team only spent time on the paper when referees' comments were received and, when the papers were accepted, further marked up the content to facilitate text-mining. The papers were published only 4 weeks after that first click, in print, PDF, semantically enhanced HTML, and XML versions. The XML version is archived in PubMedCentral and portions of tagged text (e.g., taxon treatments) are automatically harvested and exported to aggregators such as EOL and Plazi.