King, David; Morse, David and Lyal, Chris
(2013).
Abstract
Assessing global challenges surrounding issues such as climate change and invasive species requires a baseline of historical data. In biodiversity we are fortunate that such data exist in a rich body of literature. One such source is the Biologia Centrali-Americana (BCA), which documents the plant and animal life of Central America one hundred years ago and can be compared with contemporary species distributions. This valuable resource has recently been re-keyed and manually marked up by the INOTAXA project (http://www.inotaxa.org/). The 56-volume work is now being curated before wider release.
The manual annotation of the BCA is time-consuming in its initial phases and demands expert review to curate the results. This manual approach to mining historic texts is not viable for large-scale works such as the BCA. Attempts to automate the process are hampered by the lack of suitable corpora against which to develop and then test automated solutions such as text mining. One project, ViBRANT (http://vbrant.eu/), sought to use the scale of the re-keyed data being produced by INOTAXA to develop a solution to this problem. However, this apparently straightforward task has raised many issues because different audiences have different requirements of the markup.
This presentation describes the process by which the BCA is being reworked from digitisation through to a curated document corpus. The intended users are biodiversity scientists, who can use the corpus for taxonomic and biodiversity research, and computer scientists, who can use it to develop new text mining and markup tools. The presentation covers the different requirements of scientists in the two domains, how these affect the markup required of the documents, and how to re-purpose the annotations to meet the needs of different and sometimes disparate scientific audiences.