Deduplication of Scholarly Documents using Locality Sensitive Hashing and Word Embeddings

Gyawali, Bikash; Anastasiou, Lucas and Knoth, Petr (2020). Deduplication of Scholarly Documents using Locality Sensitive Hashing and Word Embeddings. In: Proceedings of The 12th Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, pp. 894–903.

Deduplication is the task of identifying near and exact duplicate data items in a collection. In this paper, we present a novel method for deduplication of scholarly documents. We develop a hybrid model which uses structural similarity (locality sensitive hashing) and meaning representation (word embeddings) of document texts to determine (near) duplicates. Our collection is a subset of multidisciplinary scholarly documents aggregated from research repositories. We identify several issues causing data inaccuracies in such collections and motivate the need for deduplication. In the absence of an existing dataset suitable for studying deduplication of scholarly documents, we create a ground truth dataset of 100K scholarly documents and conduct a series of experiments to empirically establish optimal values for the parameters of our deduplication method. Experimental evaluation shows that our method achieves a macro F1-score of 0.90. We productionise our method as a publicly accessible web API service that deduplicates scholarly documents in real time.
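
As a rough illustration of the hybrid idea in the abstract, the sketch below compares two texts by structural similarity (MinHash signatures, one common locality sensitive hashing family) and by meaning (cosine similarity of averaged word vectors). It is not the authors' implementation: the shingle size, signature length, both thresholds, the rule that either signal alone is enough to flag a duplicate, and all function names are assumptions made for the example, and the embeddings are a toy dictionary standing in for a trained model.

```python
import hashlib
import math
import re

NUM_HASHES = 64  # assumed signature length, not a value from the paper


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (k is an assumption)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(text, num_hashes=NUM_HASHES):
    """MinHash signature: the minimum seeded hash over all shingles, per seed."""
    items = shingles(text)
    if not items:
        return [0] * num_hashes
    return [min(_seeded_hash(seed, s) for s in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def mean_vector(text, embeddings):
    """Average the vectors of known words; None if no word is in the vocabulary."""
    words = [w for w in re.findall(r"\w+", text.lower()) if w in embeddings]
    if not words:
        return None
    dim = len(embeddings[words[0]])
    return [sum(embeddings[w][i] for w in words) / len(words) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_near_duplicate(text_a, text_b, embeddings,
                      lsh_threshold=0.5, emb_threshold=0.9):
    """Assumed combination rule: flag a pair if either signal is strong enough."""
    structural = estimated_jaccard(minhash_signature(text_a),
                                   minhash_signature(text_b))
    if structural >= lsh_threshold:
        return True
    vec_a = mean_vector(text_a, embeddings)
    vec_b = mean_vector(text_b, embeddings)
    if vec_a is None or vec_b is None:
        return False
    return cosine(vec_a, vec_b) >= emb_threshold


# Toy two-dimensional embeddings, purely illustrative.
toy_embeddings = {"cat": [1.0, 0.0], "feline": [1.0, 0.0], "car": [0.0, 1.0]}
print(is_near_duplicate("cat", "feline", toy_embeddings))
print(is_near_duplicate("cat", "car", toy_embeddings))
```

This sketch compares one pair at a time. Over a large collection, the signatures would normally be split into bands and bucketed in an index so that only candidate pairs are compared. The abstract does not say how the two signals are weighted or thresholded, so the values here would need to be tuned on labelled data.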

Item Type: Conference or Workshop Item
Copyright Holders: 2020 European Language Resources Association (ELRA)
Project Funding Details:
Funded Project Name: Deduplication and Dashboard for CORE
Project ID: CLS-087-54 - Jisc ref 4672
Funding Body: JISC (Joint Information Systems Committee)
Extra Information: Held online
Keywords: deduplication; scholarly documents; locality sensitive hashing; word embeddings; digital repositories
Academic Unit/School: Faculty of Science, Technology, Engineering and Mathematics (STEM) > Knowledge Media Institute (KMi)
Faculty of Science, Technology, Engineering and Mathematics (STEM)
Research Group: Big Scientific Data and Text Analytics Group (BSDTAG)
Item ID: 70519
Depositing User: Bikash Gyawali
Date Deposited: 01 Jun 2020 09:32
Last Modified: 06 Jul 2020 17:18