A Generalization of the Zipf-Mandelbrot Law in Linguistics

Montemurro, Marcelo A. (2004). A Generalization of the Zipf-Mandelbrot Law in Linguistics. In: Gell-Mann, Murray and Tsallis, Constantino eds. Nonextensive Entropy: Interdisciplinary Applications. Santa Fe Institute studies of Complexity. New York: Oxford University Press, pp. 347–356.

DOI: https://doi.org/10.1093/oso/9780195159769.003.0025

Abstract

Human language evolved by natural mechanisms into an efficient system capable of coding and transmitting highly structured information [12, 13, 14]. As a remarkably complex system it admits many levels of description across its organizational hierarchy [1, 11, 18]. In this context, statistical analysis is a valuable tool for revealing robust structural patterns that may have resulted from its long evolutionary history. In this chapter we shall address the statistical regularities of human language at its most basic level of description, namely the rank-frequency distribution of words. Around 1932 the philologist George Zipf [6, 19, 20] noted the manifestation of several robust power-law distributions arising in different realms of human activity. Among them, the most striking was undoubtedly the one referring to the distribution of word frequencies in human languages. The best way to introduce Zipf's law for words is by means of a concrete example. Let us take a literary work, say, James Joyce's Ulysses, and perform some basic statistics on it, which simply consist in counting all the words present in the text and noting how many occurrences each distinct word form has. For this particular text we arrive at the following numbers: the total number of words N = 268,112, and the number of different word forms V = 28,838. We can now order the list of different words according to decreasing number of occurrences, and assign to each word a rank index s equal to its position in the list, starting from the most frequent word. Some general features of the rank-ordered list of words can be mentioned at this point. First, the top-rank words are functional components of language devoid of direct meaning, such as the article the and prepositions. A few ranks down the list, words more related to the contents of the text start to appear.
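The counting and ranking procedure described above can be sketched in a few lines of Python. This is an illustrative sketch only: the sample text, function name, and tokenization rule are assumptions for the example, not taken from the chapter (a real analysis of Ulysses would need the full text and the authors' exact tokenization conventions).

```python
from collections import Counter
import re


def rank_frequency(text):
    """Tokenize a text, count occurrences of each distinct word form,
    and return (N, ranked), where N is the total number of words and
    ranked is a list of (rank s, word, count) sorted by decreasing
    frequency, with s = 1 for the most frequent word."""
    # Hypothetical tokenizer: lowercase, keep runs of letters/apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    ranked = [
        (rank, word, count)
        for rank, (word, count) in enumerate(counts.most_common(), start=1)
    ]
    return len(words), ranked


# Toy stand-in for a literary text.
sample = "the quick brown fox jumps over the lazy dog the fox and the dog"
n_tokens, ranked = rank_frequency(sample)
print(n_tokens)     # total number of words N
print(len(ranked))  # vocabulary size V (distinct word forms)
print(ranked[0])    # top-ranked entry; here the function word "the"
```

Even in this toy sample the top rank goes to a functional word (*the*), matching the observation in the abstract; Zipf's law concerns how the counts in `ranked` decay with the rank index s.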
