The Open UniversitySkip to content

Integrating and visualising primary data from prospective and legacy taxonomic literature.

Miller, Jeremy A.; Agosti, Donat; Penev, Lyubomir; Sautter, Guido; Georgiev, Teodor; Catapano, Terry; Patterson, David; King, David; Pereira, Serrano; Vos, Rutger Aldo and Sierra, Soraya (2015). Integrating and visualising primary data from prospective and legacy taxonomic literature. Biodiversity Data Journal, 3, article no. e5063.

Full text available as:
PDF (Version of Record) - Requires a PDF viewer such as GSview, Xpdf or Adobe Acrobat Reader
Download (1MB) | Preview
DOI (Digital Object Identifier) Link:
Google Scholar: Look up in Google Scholar


Specimen data in taxonomic literature are among the highest quality primary biodiversity data. Innovative cybertaxonomic journals are using workflows that maintain data structure and disseminate electronic content to aggregators and other users; such structure is lost in traditional taxonomic publishing. Legacy taxonomic literature is a vast repository of knowledge about biodiversity. Currently, access to that resource is cumbersome, especially for non-specialist data consumers. Markup is a mechanism that makes this content more accessible, and is especially suited to machine analysis. Fine-grained XML (Extensible Markup Language) markup was applied to all (37) open-access articles published in the journal Zootaxa containing treatments on spiders (Order: Araneae). The markup approach was optimized to extract primary specimen data from legacy publications. These data were combined with data from articles containing treatments on spiders published in Biodiversity Data Journal where XML structure is part of the routine publication process. A series of charts was developed to visualize the content of specimen data in XML-tagged taxonomic treatments, either singly or in aggregate. The data can be filtered by several fields (including journal, taxon, institutional collection, collecting country, collector, author, article and treatment) to query particular aspects of the data. We demonstrate here that XML markup using GoldenGATE can address the challenge presented by unstructured legacy data, can extract structured primary biodiversity data which can be aggregated with and jointly queried with data from other Darwin Core-compatible sources, and show how visualization of these data can communicate key information contained in biodiversity literature. We complement recent studies on aspects of biodiversity knowledge using XML structured data to explore 1) the time lag between species discovery and description, and 2) the prevalence of rarity in species descriptions.

Item Type: Journal Item
Copyright Holders: 2015 The Authors
ISSN: 1314-2828
Project Funding Details:
Funded Project NameProject IDFunding Body
pro-iBiosphere312848EU FP7
EU-BON308454EU FP7
Keywords: biodiversity informatics; data mining; open access; spiders; taxonomy; XML markup
Academic Unit/School: Faculty of Science, Technology, Engineering and Mathematics (STEM) > Computing and Communications
Faculty of Science, Technology, Engineering and Mathematics (STEM)
Research Group: Centre for Research in Computing (CRC)
Item ID: 43466
Depositing User: Jane M. Bromley
Date Deposited: 12 Jun 2015 09:37
Last Modified: 10 Apr 2020 09:10
Share this page:


Altmetrics from Altmetric

Citations from Dimensions

Download history for this item

These details should be considered as only a guide to the number of downloads performed manually. Algorithmic methods have been applied in an attempt to remove automated downloads from the displayed statistics but no guarantee can be made as to the accuracy of the figures.

Actions (login may be required)

Policies | Disclaimer

© The Open University   contact the OU