HESML: A scalable ontology-based semantic similarity measures library with a set of reproducible experiments and a replication dataset

Lastra-Díaz, Juan J.; García-Serrano, Ana; Batet, Montserrat; Fernández, Miriam and Chirigati, Fernando (2017). HESML: A scalable ontology-based semantic similarity measures library with a set of reproducible experiments and a replication dataset. Information Systems, 66 pp. 97–118.

Full text available as: PDF (Version of Record)
DOI (Digital Object Identifier) Link: https://doi.org/10.1016/j.is.2017.02.002

Abstract

This work is a detailed companion reproducibility paper of the methods and experiments proposed by Lastra-Díaz and García-Serrano in (2015, 2016) [56–58], which introduces the following contributions: (1) a new and efficient representation model for taxonomies, called PosetHERep, which is an adaptation of the half-edge data structure commonly used to represent discrete manifolds and planar graphs; (2) a new Java software library called the Half-Edge Semantic Measures Library (HESML) based on PosetHERep, which implements most ontology-based semantic similarity measures and Information Content (IC) models reported in the literature; (3) a set of reproducible experiments on word similarity based on HESML and ReproZip with the aim of exactly reproducing the experimental surveys in the three aforementioned works; (4) a replication framework and dataset, called WNSimRep v1, whose aim is to assist the exact replication of most methods reported in the literature; and finally, (5) a set of scalability and performance benchmarks for semantic measures libraries. PosetHERep and HESML are motivated by several drawbacks in the current semantic measures libraries, especially the performance and scalability, as well as the evaluation of new methods and the replication of most previous methods. The reproducible experiments introduced herein are encouraged by the lack of a set of large, self-contained and easily reproducible experiments with the aim of replicating and confirming previously reported results. Likewise, the WNSimRep v1 dataset is motivated by the discovery of several contradictory results and difficulties in reproducing previously reported methods and experiments. 
PosetHERep is a memory-efficient representation for taxonomies that scales linearly with the size of the taxonomy and provides an efficient implementation of most taxonomy-based algorithms used by the semantic measures and IC models, whilst HESML provides an open framework to aid research in the area by offering a simpler and more efficient software architecture than the current software libraries. Finally, we demonstrate that HESML outperforms the state-of-the-art libraries, and that the performance and scalability of those libraries could be significantly improved without caching by using PosetHERep.
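
To make the half-edge idea in the abstract concrete, the following is a minimal Java sketch of a taxonomy stored as paired half-edges. It is an illustration under stated assumptions, not the PosetHERep layout or the HESML API: the class name `HalfEdgeTaxonomySketch` and all of its methods are hypothetical. Each is-a edge becomes two opposite half-edges, and each vertex keeps a single entry point into its list of outgoing half-edges, so storage grows linearly with the number of vertices and edges. The `intrinsicIC` method assumes the intrinsic IC formula of Seco et al., IC(c) = 1 - log(hypo(c) + 1) / log(N), as one example of an IC model that depends only on descendant counts.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy half-edge taxonomy. A sketch only: NOT the PosetHERep or HESML API. */
class HalfEdgeTaxonomySketch {
    // Parallel lists indexed by half-edge id; ids (2k, 2k+1) form an opposite pair.
    private final List<Integer> target = new ArrayList<>();        // vertex the half-edge points to
    private final List<Integer> next = new ArrayList<>();          // next half-edge leaving the same vertex, or -1
    private final List<Boolean> towardParent = new ArrayList<>();  // true for child -> parent
    private final List<Integer> firstOut = new ArrayList<>();      // one outgoing half-edge per vertex, or -1

    /** Creates a vertex (concept) and returns its id. */
    int addVertex() {
        firstOut.add(-1);
        return firstOut.size() - 1;
    }

    /** Adds one is-a edge as two opposite half-edges. */
    void addIsA(int child, int parent) {
        pushHalfEdge(child, parent, true);   // even id: child -> parent
        pushHalfEdge(parent, child, false);  // odd id: parent -> child
    }

    private void pushHalfEdge(int source, int dest, boolean up) {
        int id = target.size();
        target.add(dest);
        towardParent.add(up);
        next.add(firstOut.get(source));      // prepend to the source's outgoing list
        firstOut.set(source, id);
    }

    /** The paired half-edge that runs in the reverse direction. */
    static int opposite(int halfEdge) {
        return halfEdge ^ 1;
    }

    /** Parents (up == true) or children (up == false) of vertex v. */
    List<Integer> neighbours(int v, boolean up) {
        List<Integer> result = new ArrayList<>();
        for (int h = firstOut.get(v); h != -1; h = next.get(h)) {
            if (towardParent.get(h) == up) result.add(target.get(h));
        }
        return result;
    }

    /** v plus all of its ancestors (up == true) or descendants (up == false). */
    Set<Integer> closure(int v, boolean up) {
        Set<Integer> seen = new HashSet<>();
        List<Integer> stack = new ArrayList<>();
        stack.add(v);
        while (!stack.isEmpty()) {
            int current = stack.remove(stack.size() - 1);
            if (seen.add(current)) stack.addAll(neighbours(current, up));
        }
        return seen;
    }

    /** Assumed Seco-style intrinsic IC: 1 - log(hypo + 1) / log(N). */
    double intrinsicIC(int v) {
        int hyponyms = closure(v, false).size() - 1;  // descendants, excluding v itself
        return 1.0 - Math.log(hyponyms + 1) / Math.log(firstOut.size());
    }

    int halfEdgeCount() {
        return target.size();
    }

    public static void main(String[] args) {
        HalfEdgeTaxonomySketch t = new HalfEdgeTaxonomySketch();
        int root = t.addVertex(), a = t.addVertex(), b = t.addVertex(), c = t.addVertex();
        t.addIsA(a, root);
        t.addIsA(b, root);
        t.addIsA(c, a);
        t.addIsA(c, b);  // c has two parents, so the taxonomy is a poset rather than a tree
        System.out.println("ancestors of c: " + t.closure(c, true));
        System.out.println("IC(root) = " + t.intrinsicIC(root) + ", IC(c) = " + t.intrinsicIC(c));
    }
}
```

In this sketch the ancestor and descendant sets are computed by walking half-edges on demand rather than by storing them per concept, which is the kind of trade-off the abstract alludes to when it mentions scalability without caching.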

Item Type: Journal Item
Copyright Holders: 2017 The Authors
ISSN: 0306-4379
Project Funding Details:
Funded Project Name | Project ID | Funding Body
Spanish Musacces | S2015/HUM3494 | Not Set
VEMODALEN | TIN2015-71785-R | Not Set
Keywords: HESML; PosetHERep; Semantic measures library; Ontology-based semantic similarity measures; Intrinsic and corpus-based Information Content models; Reproducible experiments on word similarity; WNSimRep v1 dataset; ReproZip; WordNet-based semantic similarity measures
Academic Unit/School: Faculty of Science, Technology, Engineering and Mathematics (STEM) > Knowledge Media Institute (KMi)
Faculty of Science, Technology, Engineering and Mathematics (STEM)
Interdisciplinary Research Centre: Centre for Research in Computing (CRC)
Centre for Policing Research and Learning (CPRL)
Item ID: 49732
Depositing User: Kay Dave
Date Deposited: 26 Jun 2017 14:20
Last Modified: 26 Jun 2017 14:22
URI: http://oro.open.ac.uk/id/eprint/49732