Frequent Term Distribution Measures for Dataset Profiling

De Roeck, Anne; Sarkar, Avik and Garthwaite, Paul (2004). Frequent Term Distribution Measures for Dataset Profiling. Technical Report 2004/06; Department of Computing, The Open University.



We motivate the need for dataset profiling in the context of evaluation, and show that textual datasets differ in ways that challenge assumptions about the applicability of techniques. We set out some criteria for useful profiling measures. We argue that distribution patterns of frequent words are useful in profiling genre, and report on a series of experiments with chi-square-based measures on the TIPSTER collection and on textual intranet data. Findings show substantial differences in the distribution of very frequent terms across datasets.
