Gyawali, Bikash; Anastasiou, Lucas and Knoth, Petr (2020).
URL: https://www.aclweb.org/anthology/2020.lrec-1.113/
Abstract
Deduplication is the task of identifying near and exact duplicate data items in a collection. In this paper, we present a novel method for deduplication of scholarly documents. We develop a hybrid model which uses structural similarity (locality sensitive hashing) and meaning representation (word embeddings) of document texts to determine (near) duplicates. Our collection is a subset of multidisciplinary scholarly documents aggregated from research repositories. We identify several issues causing data inaccuracies in such collections and motivate the need for deduplication. In the absence of an existing dataset suitable for studying the deduplication of scholarly documents, we create a ground truth dataset of 100K scholarly documents and conduct a series of experiments to empirically establish optimal values for the parameters of our deduplication method. Experimental evaluation shows that our method achieves a macro F1-score of 0.90. We productionise our method as a publicly accessible web API service serving deduplication of scholarly documents in real time.
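The hybrid idea in the abstract, a structural signal from locality sensitive hashing combined with a semantic signal from word embeddings, can be illustrated with a small sketch. This is not the authors' implementation: it assumes MinHash over word shingles as the hashing scheme, a caller-supplied dictionary of toy word vectors as the embeddings, and a simple "either signal passes" decision rule. All function names, parameters and thresholds below are hypothetical.

```python
import hashlib
import math


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (assumed tokenisation: whitespace)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items, num_hashes=64):
    """For each seeded hash function, keep the minimum hash value over all items."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The share of matching signature positions estimates Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def mean_vector(text, embeddings):
    """Average the vectors of the words found in `embeddings`; None if none are found."""
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def is_near_duplicate(text_a, text_b, embeddings,
                      structural_threshold=0.8, semantic_threshold=0.9):
    """Hypothetical rule: flag a pair if either similarity reaches its threshold."""
    structural = estimated_jaccard(
        minhash_signature(shingles(text_a)),
        minhash_signature(shingles(text_b)),
    )
    if structural >= structural_threshold:
        return True
    vec_a = mean_vector(text_a, embeddings)
    vec_b = mean_vector(text_b, embeddings)
    if vec_a is None or vec_b is None:
        return False
    return cosine(vec_a, vec_b) >= semantic_threshold
```

In a sketch like this, the two thresholds and the number of hash functions are the kind of parameters that would be tuned against a ground truth dataset; the values above are placeholders, not figures from the paper.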
About
- Item ORO ID
- 70519
- Item Type
- Conference or Workshop Item
- Project Funding Details
- Funded Project Name: Deduplication and Dashboard for CORE; Project ID: CLS-087-54 - Jisc ref 4672; Funding Body: JISC (Joint Information Systems Committee)
- Extra Information
- Held online
- Keywords
- deduplication; scholarly documents; locality sensitive hashing; word embeddings; digital repositories
- Academic Unit or School
- Faculty of Science, Technology, Engineering and Mathematics (STEM) > Knowledge Media Institute (KMi)
- Faculty of Science, Technology, Engineering and Mathematics (STEM)
- Research Group
- Big Scientific Data and Text Analytics Group (BSDTAG)
- Copyright Holders
- © 2020 European Language Resources Association (ELRA)
- Depositing User
- Lucas Anastasiou