The Fractal Patterns of Words in a Text: A Method for Automatic Keyword Extraction
The Fractal Patterns of Words in a Text: A Method for Automatic Keyword Extraction
Abstract
A text can be considered as a one dimensional array of words. The locations of each word type in this array form a fractal pattern with certain fractal dimension. We observe that important words responsible for conveying the meaning of a text have dimensions considerably different from one, while the fractal dimensions of unimportant words are close to one. We introduce an index quantifying the importance of the words in a given text using their fractal dimensions and then ranking them according to their importance. This index measures the difference between the fractal pattern of a word in the original text relative to a shuffled version. Because the shuffled text is meaningless, the difference between the original and shuffled text can be used to ascertain degree of fractality. The degree of fractality may be used for automatic keyword detection. Words with the degree of fractality higher than a threshold value are assumed to be the retrieved keywords of the text. We measure the efficiency of our method for keywords extraction, making a comparison between our proposed method and two other well-known methods of automatic keyword extraction.
Introduction
Introduction
Language is the human capability for communication via vocal or visual signs. Language can be regarded as a complex system, where words are constituents which interact with each other to form particular patterns. Such patterns represent human thoughts, feelings, will, and knowledge which are called meaning. Human language is unique among other communication systems, because there are a lots of words to express the immaterial and intellectual concepts. In addition, the existence of synonymy, polysemy and so on increases its complexity. Texts, as the written form of language, inherit its complexity. A text can be partially understood through regularities in spatial distribution of words and their frequencies. Research has shown that regularity in a text can be expressed as a power law relationship. One of the most well-known power laws is Zipf's law, which shows that if we rank the words in a text from the most common to the least, the frequency of each word is inversely proportional to its rank. A related law, Heaps' law, shows another universal feature of texts: the number of distinct words in a text
(i.e., number of word types), changes with the text size (i.e., the number of tokens) in the form of a power law. Another level of regularity is evident only through the pattern of words throughout a text. A text is not just a random collection of words; we can only call this collection a text if it has meaning. In other words, the words in a text must be placed in a specific order to impart meaning. Many power laws cannot capture this fact: any random shuffling process drastically destroys the meaning of a text, but Zipf's law remains unchanged and Heaps' law changes only very slightly.
The particular arrangement of words in a specific order arises for two reasons. First, grammatical rules determine where words should be placed within a sentence and specify the position of verbs, nouns, adverbs, and other parts of speech. Grammatical rules make short range correlations between the sequences of words in a sentence. Secondly, a text derives meaning from how the words are arranged throughout. This ordering is called semantic ordering, and acts across the whole range of the text, hence the long-range correlation can be seen between the positions of any word. The broad meaning of a text also means that different word types have different importance in a text. We can distinguish between two kinds of content words in a text: those which are related to the subject of the text (i.e., the important words), and all others that are irrelevant to it. For a text in cosmology, words like universe, space, big-bang, and inflation are important words. Other words such as is, fact, happening, etc., are irrelevant to the topic of the text. Finding an index for quantifying the importance of words in a given text is crucial to detecting keywords automatically, and provides a very useful starting point for text summarization, document categorization, machine translation and other matters related to automatic information retrieval. Automating these processes is of increasing importance given the increasing size of available information yet limited manpower.
In the current paper, we use the concept of fractal to assign an importance value to every word in a given text. A fractal is a mathematical object (e.g., a set of points in Euclidean space) that has repeating patterns at every scales, it means at any magnification there is a smaller piece of the object that is similar to the whole; this property is called self-similarity. The fractal dimension shows how detail of a fractal pattern changes with scale. It is used as an index of complexity. The fractal dimension of a set is equal or less than the topological dimension of space that the set is embedded in it. We claim that the positions of a word type within the text array form a fractal pattern with a specified dimension that is a positive value less than or equal to one. Based on this fact, an index is presented for ranking the vocabulary words of a given text. The difference between the pattern of a word in the original text versus a randomly shuffled version shows its importance: words with a greater differential between the original and shuffled texts are more important. We compare this approach with other more well-known methods of keyword extraction.
In the following section we review previous research reporting a kind of fractal structure in texts, in order to show that our method is novel. Then we review some basic ideas for keyword extraction which are useful for understanding the different principles currently at work in the field. Finally, we describe our method and how it could be evaluated, and report the results for a sample book.