The Open University

ComTax: community-driven curation for taxonomic databases

Morse, David; Yang, Hui; Willis, Alistair; De Roeck, Anne and King, David (2013). ComTax: community-driven curation for taxonomic databases. In: TDWG 2013, 27 Oct - 1 Nov 2013, Florence, Italy.

Full text available as:
PDF (Version of Record), 627kB


This poster presents the work of the ComTax project to develop a community-driven curation process among practicing scientists and citizen scientists. The project provides tools to help scientists identify and validate appropriate taxonomic names from the scanned historical literature. The system operates on scanned documents, typically taken from the Biodiversity Heritage Library, although documents sourced from other repositories could be used.

The system is intended to be used on uncorrected text, that is, the raw output of optical character recognition (OCR) applied to the scanned images. The key stages are:

1. Identify possible taxonomic names in the scanned text using machine learning techniques.

2. Verify the extracted names against existing databases. If a name is present in a database, the source scanned text can be automatically marked up with that name.

3. A name may fail verification either because it is not currently recorded in the verification databases, typically because the old name in the literature has been reclassified, or because erroneous OCR has transcribed the name incorrectly in the scanned text. In either case:

3.1. Present the proposed name to domain experts or citizen scientists for validation or correction, potentially through a voting mechanism to collect expert judgments on the putative taxonomic name.

3.2. Mark up the scanned text with the corrected spelling of the name and offer validated taxonomic names for further use by the community.
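
The stages above can be sketched in a few lines of Python. This is an illustrative sketch only, not the ComTax implementation: the regular expression stands in for the machine learning step the poster describes, `KNOWN_NAMES` is a hypothetical stand-in for a verification database, and the function names, similarity cutoff and voting thresholds are all assumptions chosen for the example.

```python
import re
from collections import Counter
from difflib import get_close_matches

# Hypothetical stand-in for a verification database (not a ComTax resource).
KNOWN_NAMES = {"Quercus robur", "Parus major", "Bellis perennis"}

# Crude pattern for a Latin binomial: capitalised genus, lower-case epithet.
# Assumption: a regex replaces the machine learning step to keep this short.
BINOMIAL = re.compile(r"\b[A-Z][a-z]+ [a-z]{3,}\b")


def find_candidates(text):
    """Stage 1: return possible taxonomic names found in OCR text."""
    return BINOMIAL.findall(text)


def verify(name, known=KNOWN_NAMES, cutoff=0.8):
    """Stages 2 and 3: check a name against the known names.

    Returns ("verified", [name]) on an exact match. Otherwise returns
    ("unverified", suggestions), where suggestions are similar known
    names that a reviewer could be offered as likely OCR corrections.
    """
    if name in known:
        return "verified", [name]
    suggestions = get_close_matches(name, sorted(known), n=3, cutoff=cutoff)
    return "unverified", suggestions


def tally_votes(votes, min_votes=3, min_share=0.6):
    """Stage 3.1: return the agreed spelling, or None if there is no consensus.

    Assumption: consensus means at least `min_votes` votes were cast and
    the most popular spelling received at least `min_share` of them.
    """
    if len(votes) < min_votes:
        return None
    winner, count = Counter(votes).most_common(1)[0]
    return winner if count / len(votes) >= min_share else None


if __name__ == "__main__":
    ocr_text = "Specimens of Quercus robur and Bellis perermis were collected."
    for candidate in find_candidates(ocr_text):
        print(candidate, verify(candidate))
```

In this sketch the misread "Bellis perermis" fails exact verification but is close enough to a known name to be offered to reviewers as a suggested correction. A name that has been reclassified would return no suggestions and would rely entirely on the reviewers' judgment.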

This poster describes the technical challenges facing the ComTax project and highlights potential extensions of the work to the curation of other entities of interest, both in the legacy literature and in other disciplines.

Item Type: Conference or Workshop Item
Copyright Holders: 2013 Open University
Project Funding Details:
Funded Project Name: Not Set
Project ID: Not Set
Funding Body: JISC
Keywords: curation; data; biodiversity; taxonomy
Academic Unit/School: Faculty of Science, Technology, Engineering and Mathematics (STEM) > Computing and Communications
Faculty of Science, Technology, Engineering and Mathematics (STEM)
Research Group: Centre for Research in Computing (CRC)
Item ID: 39139
Depositing User: David King
Date Deposited: 12 Dec 2013 09:46
Last Modified: 07 Dec 2018 23:04