Montemurro, Marcelo A.
(2004).
DOI: https://doi.org/10.1093/oso/9780195159769.003.0025
Abstract
Human language evolved by natural mechanisms into an efficient system capable of coding and transmitting highly structured information [12, 13, 14]. As a remarkably complex system, it allows many levels of description across its organizational hierarchy [1, 11, 18]. In this context, statistical analysis is a valuable tool for revealing robust structural patterns that may have resulted from its long evolutionary history. In this chapter we shall address the statistical regularities of human language at its most basic level of description, namely the rank-frequency distribution of words. Around 1932 the philologist George Zipf [6, 19, 20] noted several robust power-law distributions arising in different realms of human activity. Among them, the most striking was undoubtedly the one concerning the distribution of word frequencies in human languages. The best way to introduce Zipf's law for words is by means of a concrete example. Let us take a literary work, say, James Joyce's Ulysses, and compile some basic statistics on it, which simply consists of counting all the words present in the text and noting how many occurrences each distinct word form has. For this particular text we should arrive at the following numbers: the total number of words N = 268,112, and the number of different word forms V = 28,838. We can now order the list of different words by decreasing number of occurrences and assign to each word a rank index s equal to its position in the list, starting from the most frequent word. Some general features of the rank-ordered list of words can be mentioned at this point. First, the top-ranked words are functional components of language devoid of direct meaning, such as the article "the" and prepositions. A few ranks down the list, words more closely related to the contents of the text start to appear.
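The counting procedure described in the abstract can be sketched in a few lines of Python. This is only an illustration, not the chapter's own method: the function name rank_frequency and the tokenization rule (lowercase runs of letters and apostrophes) are assumptions, and a different rule would give different values of N and V, so the sketch should not be expected to reproduce the Ulysses figures exactly.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return (N, V, ranked) for a text.

    N is the total number of word tokens, V the number of distinct word
    forms, and ranked a list of (s, word, count) tuples ordered by
    decreasing count, with the rank s starting at 1.
    The tokenization below is an assumption made for illustration.
    """
    # Assumed tokenization: lowercase, keep runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    # most_common() sorts by decreasing count; ties keep first-seen order.
    ranked = [
        (s, word, n)
        for s, (word, n) in enumerate(counts.most_common(), start=1)
    ]
    return len(tokens), len(counts), ranked


# Toy input standing in for a full literary text.
sample = "the cat and the dog and the bird"
N, V, ranked = rank_frequency(sample)
for s, word, n in ranked:
    print(s, word, n)
```

On this toy sentence the sketch gives N = 8 and V = 5, with the function word "the" at rank s = 1, in line with the observation that function words occupy the top ranks.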
About
- Item ORO ID
- 79630
- Item Type
- Book Section
- ISBN
- 0-19-515976-4, 978-0-19-515976-9
- Academic Unit or School
- Faculty of Science, Technology, Engineering and Mathematics (STEM) > Mathematics and Statistics
- Faculty of Science, Technology, Engineering and Mathematics (STEM)
- Depositing User
- Marcelo Montemurro