Identifying and Capturing the Semantic Aspects of Citations

Pride, David (2022). Identifying and Capturing the Semantic Aspects of Citations. PhD thesis, The Open University.



This dissertation presents new work on understanding the nature of citations, the different purposes they may serve and how they are used. This research addresses a fundamental question: ‘Should all citations be treated equally?’ It explores the concept of influential citations and examines key features for their identification. It then investigates the key challenges in capturing these semantic aspects of citations and presents solutions that overcome current limitations in the domain.

The dissertation provides an overview of current bibliometrics that use raw citation counts and then details the problems associated with these methodologies. It then presents an overview of two areas in which bibliometrics and citation data are already being applied: information retrieval and research evaluation. The first study (Chapter 3) focuses on the second of these areas, the use of citation data in research evaluation; it is the largest investigation to date into the correlation between peer review and bibliometrics at the institutional / discipline level, using data from the UK’s Research Excellence Framework (REF2014). This study was the first to identify strong correlations between simple citation-based indicators and aggregate peer review results in approximately one third of domains covered by the REF2014 process, notably those domains in which the peer review panels used citation data to inform their peer review decisions.

There is already wide-scale usage of bibliometrics and citation data in research evaluation, not only in exercises such as the REF2014, but also in other Performance Related Funding Exercises (PRFS) globally. Furthermore, citation-based metrics, such as the h-index (Hirsch [2005]) or Journal Impact Factor (JIF) (Garfield [1983]), are being used to measure individual academics, despite the aforementioned limitations and risks associated with these methodologies.

Critically, all of these metrics, without exception, treat all citations equally, even citations which refute or negate previous work. This seems not only illogical, but also limiting. There is far greater opportunity in understanding not only that a particular piece of research was cited, but why it was cited.

The study presented in Chapter 4 then examines the key features in identifying influential citations and replicates a range of features tested in prior works. The work then evaluates the current state of the art in the automatic identification of citation purpose using machine learning and natural language processing techniques and addresses the key challenges in this domain. It demonstrates that datasets compiled by earlier works are of limited size, largely due to the selected annotation methods, and are drawn from one or two domains at most. Experimental results show that this affects the accuracy of classification models built using these datasets.

The dissertation then presents a novel methodology, using first authors as citation annotators, and a new platform for the collection of citations annotated according to both purpose and influence. These tools were used to produce the largest dataset of annotated citations, covering 19 different disciplines, which can be used to foster new research in this domain and improve the performance of classification models.
