De Roeck, Anne; Sarkar, Avik and Garthwaite, Paul H.
(2004).
URL: http://mcs.open.ac.uk/anr29/PapersLinks/LREC04DeRo...
Abstract
We motivate the need for dataset profiling in the context of evaluation, and show that textual datasets differ in ways that challenge assumptions about the applicability of techniques. We set out some criteria for useful profiling measures. We argue that distribution patterns of frequent words are useful in profiling genre, and report on a series of experiments with χ²-based measures on the TIPSTER collection, and on textual intranet data. Findings show substantial differences in the distribution of very frequent terms across datasets.
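As an illustration of the kind of measure the abstract refers to, the sketch below computes a plain Pearson χ² statistic over the most frequent words of two token samples. It is not the authors' measure, which the abstract does not specify. The function name `chi_square_statistic`, the `top_n` parameter, and the toy token lists are all hypothetical.

```python
from collections import Counter


def chi_square_statistic(tokens_a, tokens_b, top_n=10):
    """Pearson chi-square statistic over the top_n most frequent words
    of the two samples combined (hypothetical helper, not from the paper)."""
    counts_a = Counter(tokens_a)
    counts_b = Counter(tokens_b)
    combined = counts_a + counts_b
    vocab = [word for word, _ in combined.most_common(top_n)]

    # Restrict totals to the selected words: a 2 x len(vocab) contingency table.
    total_a = sum(counts_a[w] for w in vocab)
    total_b = sum(counts_b[w] for w in vocab)
    if total_a == 0 or total_b == 0:
        raise ValueError("both samples need at least one selected word")
    grand_total = total_a + total_b

    statistic = 0.0
    for word in vocab:
        column_total = counts_a[word] + counts_b[word]
        for observed, row_total in ((counts_a[word], total_a),
                                    (counts_b[word], total_b)):
            # Expected count if the word were distributed identically in both samples.
            expected = row_total * column_total / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Toy samples (assumed data): the same two words, with opposite proportions.
sample_a = ["the", "the", "the", "of"]
sample_b = ["the", "of", "of", "of"]
print(chi_square_statistic(sample_a, sample_b))
```

A larger statistic indicates that the frequent words are distributed less alike in the two samples. Identical samples give a value of zero.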
About
- Item ORO ID
- 22601
- Item Type
- Conference or Workshop Item
- Academic Unit or School
-
Faculty of Science, Technology, Engineering and Mathematics (STEM) > Computing and Communications
Faculty of Science, Technology, Engineering and Mathematics (STEM)
Faculty of Science, Technology, Engineering and Mathematics (STEM) > Mathematics and Statistics
- Research Group
- Centre for Research in Computing (CRC)
- Copyright Holders
- © 2004 The Authors
- Depositing User
- Sarah Frain