
Frequent term distribution measures for dataset profiling

De Roeck, Anne; Sarkar, Avik and Garthwaite, Paul H. (2004). Frequent term distribution measures for dataset profiling. In: Proceedings of the Fourth International Conference on Language Resources and Evaluation (Lino, M. T.; Xavier, M. F.; Ferreira, F.; Costa, R. and Silva, R. eds.), ELRA, Paris, pp. 1647–1650.

URL: http://mcs.open.ac.uk/anr29/PapersLinks/LREC04DeRo...

Abstract

We motivate the need for dataset profiling in the context of evaluation, and show that textual datasets differ in ways that challenge assumptions about the applicability of techniques. We set out some criteria for useful profiling measures. We argue that distribution patterns of frequent words are useful in profiling genre, and report on a series of experiments with χ²-based measures on the TIPSTER collection, and on textual intranet data. Findings show substantial differences in the distribution of very frequent terms across datasets.
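
Illustrative sketch (not part of the original record): the abstract does not specify the exact χ²-based measures used, so the following is only a generic Pearson χ² test of homogeneity applied to the most frequent terms of two text samples. The function name `chi_square_frequent_terms` and the `top_n` parameter are assumptions introduced here for illustration and may differ from the measures reported in the paper.

```python
from collections import Counter


def chi_square_frequent_terms(tokens_a, tokens_b, top_n=10):
    """Pearson chi-square statistic comparing how the top_n most frequent
    terms (ranked on the combined samples) are distributed across two
    token lists. Returns (statistic, degrees_of_freedom).

    Hypothetical helper for illustration; not the paper's own measure.
    """
    counts_a = Counter(tokens_a)
    counts_b = Counter(tokens_b)

    # Rank terms by their frequency in the two samples taken together.
    combined = counts_a + counts_b
    terms = [term for term, _ in combined.most_common(top_n)]

    # Observed counts: one row per sample, one column per frequent term.
    row_a = [counts_a[t] for t in terms]
    row_b = [counts_b[t] for t in terms]
    total_a, total_b = sum(row_a), sum(row_b)
    if total_a == 0 or total_b == 0:
        raise ValueError("each sample must contain at least one frequent term")
    grand_total = total_a + total_b

    statistic = 0.0
    for obs_a, obs_b in zip(row_a, row_b):
        column_total = obs_a + obs_b
        # Expected counts if both samples shared one term distribution.
        exp_a = total_a * column_total / grand_total
        exp_b = total_b * column_total / grand_total
        statistic += (obs_a - exp_a) ** 2 / exp_a
        statistic += (obs_b - exp_b) ** 2 / exp_b

    # A 2 x k contingency table has (2 - 1) * (k - 1) degrees of freedom.
    return statistic, len(terms) - 1
```

A statistic near zero suggests the frequent terms are distributed similarly in both samples; larger values indicate greater divergence. Tokenisation, the choice of `top_n`, and how samples are drawn from a collection are all left open in this sketch.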

Item Type: Conference Item
Copyright Holders: 2004 The Authors
Academic Unit/Department: Other Departments > Vice-Chancellor's Office
Other Departments
Faculty of Science, Technology, Engineering and Mathematics (STEM) > Computing and Communications
Faculty of Science, Technology, Engineering and Mathematics (STEM)
Faculty of Science, Technology, Engineering and Mathematics (STEM) > Mathematics and Statistics
Interdisciplinary Research Centre: Centre for Research in Computing (CRC)
Item ID: 22601
Depositing User: Sarah Frain
Date Deposited: 17 Aug 2010 11:46
Last Modified: 02 Aug 2016 13:44
URI: http://oro.open.ac.uk/id/eprint/22601