De Roeck, Anne; Sarkar, Avik and Garthwaite, Paul H. (2004).
| URL: | http://mcs.open.ac.uk/anr29/PapersLinks/LREC04DeRo... |
|---|---|
Abstract
We motivate the need for dataset profiling in the context of evaluation, and show that textual datasets differ in ways that challenge assumptions about the applicability of techniques. We set out some criteria for useful profiling measures. We argue that distribution patterns of frequent words are useful in profiling genre, and report on a series of experiments with χ²-based measures on the TIPSTER collection, and on textual intranet data. Findings show substantial differences in the distribution of very frequent terms across datasets.
| Item Type: | Conference Item |
|---|---|
| Copyright Holders: | 2004 The Authors |
| Academic Unit/Department: | Mathematics, Computing and Technology; Mathematics, Computing and Technology > Computing; Mathematics, Computing and Technology > Mathematics and Statistics |
| Interdisciplinary Research Centre: | Centre for Research in Computing (CRC) |
| Item ID: | 22601 |
| Depositing User: | Sarah Frain |
| Date Deposited: | 17 Aug 2010 11:46 |
| Last Modified: | 02 Dec 2010 21:01 |
| URI: | http://oro.open.ac.uk/id/eprint/22601 |




