Goweder, Abduelbaset; Poesio, Massimo; De Roeck, Anne and Reynolds, Jeff
Identifying broken plurals in unvowelized Arabic text.
In: 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP-2004), 24-25 July 2004, Barcelona, Spain.
Identifying irregular (so-called broken) plurals in Modern Standard Arabic is a problem for information retrieval (IR) and language engineering applications, but the effect of broken plurals on IR performance has never been examined. Broken plurals (BPs) are formed by altering the singular (as in English: tooth, teeth) through the application of interdigitating patterns to stems, so the singular form cannot be recovered by standard affix-stripping stemming techniques. We developed several methods for BP detection and evaluated them on an unseen test set. We incorporated the BP detection component into a new light-stemming algorithm that conflates both regular and broken plurals with their singular forms. We also evaluated the new light-stemming algorithm in the context of information retrieval, comparing its performance with that of other stemming algorithms.