Nowak, Stefanie and Rüger, Stefan
How reliable are annotations via crowdsourcing? A study about inter-annotator agreement for multi-label image annotation.
In: The 11th ACM International Conference on Multimedia Information Retrieval (MIR), 29-31 Mar 2010, Philadelphia, USA.
The creation of gold standard datasets is a costly business. Ideally, more than one judgment per document is obtained to ensure high-quality annotations. In this context, we explore how much annotations from experts differ from each other, how different sets of annotations influence the ranking of systems, and whether these annotations can be obtained with a crowdsourcing approach. This study is applied to annotations of images with multiple concepts. A subset of the images employed in the latest ImageCLEF Photo Annotation competition was manually annotated by expert annotators and by non-experts via Mechanical Turk. The inter-annotator agreement is computed at an image-based and concept-based level using majority vote, accuracy and kappa statistics. Further, the Kendall τ and Kolmogorov-Smirnov correlation tests are used to compare the ranking of systems regarding different ground truths and different evaluation measures in a benchmark scenario. Results show that while the agreement between experts and non-experts varies depending on the measure used, its influence on the ranked lists of the systems is rather small. To sum up, the majority vote applied to generate one annotation set out of several opinions is able to filter noisy judgments of non-experts to some extent. The resulting annotation set is of comparable quality to the annotations of experts.
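As a rough illustration of the agreement measures named in the abstract, the following sketch merges several non-expert votes by majority vote and compares the result with an expert annotation using accuracy and a kappa statistic. The data are invented toy values, the function names are hypothetical, and Cohen's kappa is assumed here as one common kappa variant; the abstract does not say which variant or tie-breaking rule the authors used.

```python
from collections import Counter

def majority_vote(votes):
    """Merge binary votes (1 = concept present) into one label.
    Ties are resolved to 0; this tie rule is an assumption."""
    return 1 if sum(votes) * 2 > len(votes) else 0

def accuracy(a, b):
    """Fraction of items on which two annotation lists agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohen_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_o = accuracy(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each annotator's label frequencies.
    p_e = sum(count_a[l] * count_b[l] for l in set(a) | set(b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label
    return (p_o - p_e) / (1 - p_e)

# Toy data: three crowd votes for each of six image-concept pairs.
crowd_votes = [[1, 1, 0], [0, 0, 0], [1, 0, 0], [1, 1, 1], [0, 1, 1], [0, 0, 1]]
expert = [1, 0, 1, 1, 1, 0]

merged = [majority_vote(v) for v in crowd_votes]
acc = accuracy(merged, expert)
kappa = cohen_kappa(merged, expert)
print(merged, round(acc, 3), round(kappa, 3))
```

Accuracy counts raw matches, while kappa discounts the agreement expected by chance, which is one reason the two measures can give different pictures of expert versus non-expert agreement.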
This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution.
Keywords: experimentation; human factors; measurement; performance; inter-annotator agreement; crowdsourcing
Knowledge Media Institute
11 Jan 2011 12:40
23 Oct 2012 21:26