Evaluating the Evaluators: Subjective Bias and Consistency in Human Evaluation of Natural Language Generation

Amidei, Jacopo (2021). Evaluating the Evaluators: Subjective Bias and Consistency in Human Evaluation of Natural Language Generation. PhD thesis, The Open University.

DOI: https://doi.org/10.21954/ou.ro.000124fa


The Natural Language Generation (NLG) community relies on shared evaluation techniques to understand progress in the field. Based on an analysis of papers published over 10 years (from 2008 to 2018) in NLG-specific conferences and on an observational study, this thesis identifies shortcomings with existing approaches to reporting the reliability of evaluation studies in NLG. It proposes a new set of methods for identifying judges' bias and reporting reliability, specifically for human intrinsic evaluation of NLG systems.

In this thesis, we propose to use the correlation statistic and Item Response Theory (IRT) to analyse judges' bias for cases that involve a high level of language variability. Both techniques provide insights into the trustworthiness of human judgements. Whereas the correlation statistic offers a way to measure judges' relative consistency, IRT provides a tool to identify judges' bias.

We found support for the use of the correlation statistic through three case studies that show the limits of considering agreement coefficients as the only criterion for checking evaluation reliability. Given the variability of human language (specifically, variability in language interpretation and quality judgement), expecting judges to always arrive at exactly the same judgement seems both unrealistic and over-constrained. Correlation coefficients can be used to measure the extent to which judges follow a systematic pattern in their assessments, even when their individual interpretations of the phenomena are not identical.
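The contrast between agreement and correlation can be illustrated with a minimal sketch. The ratings below are invented for illustration and are not taken from the thesis, and the choice of Cohen's kappa and Pearson's r is an assumption: the abstract does not name the specific coefficients used in the case studies. The hypothetical second judge always scores one point higher than the first, so the two never give the same rating yet follow exactly the same pattern.

```python
from collections import Counter
from math import sqrt


def cohen_kappa(a, b):
    """Cohen's kappa for two raters over the same items (nominal labels)."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of the two raters' marginal label proportions.
    expected = sum(count_a[k] * count_b[k] for k in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)


def pearson(a, b):
    """Pearson correlation coefficient between two rating lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


# Invented ratings (assumption): judge_b is systematically one point higher.
judge_a = [1, 2, 3, 4, 2, 3, 1, 4]
judge_b = [2, 3, 4, 5, 3, 4, 2, 5]

kappa = cohen_kappa(judge_a, judge_b)  # below zero: no identical ratings
r = pearson(judge_a, judge_b)          # maximal: identical rating pattern
print(f"kappa = {kappa:.3f}, r = {r:.3f}")
```

In this toy case the agreement coefficient suggests the judges are unreliable, while the correlation shows they rank the items identically and differ only by a constant offset.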

Regarding IRT, we introduce a new interpretation and application of the technique to describe judges' bias. Using the QG-STEC evaluation dataset, and applying IRT to each judge, we show how to use IRT's probabilistic analysis to compare judges' bias and, as a result, better characterize annotation disagreement. The new approach that we propose can be used, for example, to spot judges who are outliers, improve annotation guidelines, and arrive at an improved interpretation of the agreement coefficients.
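As a rough illustration of the kind of probabilistic comparison IRT allows, the sketch below uses the standard two-parameter logistic (2PL) response function. This is an assumption: the abstract does not say which IRT model the thesis fits, and the mapping used here (each judge has a hypothetical severity threshold `b` and discrimination `a`, and `theta` stands for the latent quality of the rated output) is one possible reading rather than the thesis's actual formulation. All parameter values are invented.

```python
from math import exp


def icc_2pl(theta, a, b):
    """2PL response function: probability of a positive judgement.

    theta: latent quality of the item being judged (hypothetical reading)
    a:     discrimination, how sharply the probability rises with theta
    b:     threshold, the theta at which the probability equals 0.5
    """
    return 1.0 / (1.0 + exp(-a * (theta - b)))


# Invented parameters for two hypothetical judges (not from the thesis).
lenient_judge = {"a": 1.0, "b": -1.0}
strict_judge = {"a": 1.0, "b": 1.0}

# Compare how likely each judge is to rate outputs of varying quality positively.
for theta in (-2.0, 0.0, 2.0):
    p_lenient = icc_2pl(theta, **lenient_judge)
    p_strict = icc_2pl(theta, **strict_judge)
    print(f"theta={theta:+.1f}  lenient={p_lenient:.3f}  strict={p_strict:.3f}")
```

Under this reading, a judge whose fitted curve sits far from those of the other judges would be a candidate outlier, which is one way the comparison described above could be made concrete.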
