The Fractal Patterns of Words in a Text: A Method for Automatic Keyword Extraction


Abstract

A text can be considered as a one-dimensional array of words. The locations of each word type in this array form a fractal pattern with a certain fractal dimension. We observe that important words, those responsible for conveying the meaning of a text, have dimensions considerably different from one, while the fractal dimensions of unimportant words are close to one. We introduce an index that quantifies the importance of the words in a given text using their fractal dimensions, and we rank the words according to this index. The index measures the difference between the fractal pattern of a word in the original text and in a shuffled version. Because the shuffled text is meaningless, this difference can be used to ascertain the degree of fractality, which may in turn be used for automatic keyword detection. Words with a degree of fractality higher than a threshold value are taken as the retrieved keywords of the text. We measure the efficiency of our method for keyword extraction by comparing it with two other well-known methods of automatic keyword extraction.

Introduction

Language is the human capability for communication via vocal or visual signs. Language can be regarded as a complex system, where words are constituents which interact with each other to form particular patterns. Such patterns represent human thoughts, feelings, will, and knowledge, which together are called meaning. Human language is unique among communication systems because it has a great many words for expressing immaterial and intellectual concepts. In addition, the existence of synonymy, polysemy and so on increases its complexity. Texts, as the written form of language, inherit this complexity. A text can be partially understood through regularities in the spatial distribution of words and their frequencies. Research has shown that regularity in a text can be expressed as a power-law relationship. One of the most well-known power laws is Zipf's law, which shows that if we rank the words in a text from the most common to the least, the frequency of each word is inversely proportional to its rank. A related law, Heaps' law, shows another universal feature of texts: the number of distinct words in a text (i.e., the number of word types) changes with the text size (i.e., the number of tokens) in the form of a power law. Another level of regularity is evident only through the pattern of words throughout a text. A text is not just a random collection of words; we can only call this collection a text if it has meaning. In other words, the words in a text must be placed in a specific order to impart meaning. Many power laws cannot capture this fact: any random shuffling process drastically destroys the meaning of a text, but Zipf's law remains unchanged and Heaps' law changes only very slightly.
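The shuffling argument above can be checked on a small example. The following Python sketch is illustrative only: the toy sentence and the helper names `rank_frequency` and `vocabulary_growth` are assumptions introduced here, not part of the paper. It shows that a random shuffle leaves the rank-frequency list (the quantity described by Zipf's law) untouched and leaves the final vocabulary size unchanged, even though the word order, and hence the meaning, is lost.

```python
import random
from collections import Counter


def rank_frequency(tokens):
    """Word frequencies sorted from most to least common (the Zipf rank-frequency list)."""
    return sorted(Counter(tokens).values(), reverse=True)


def vocabulary_growth(tokens):
    """Number of distinct word types seen after each token (the Heaps curve)."""
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth


# Toy text, invented for illustration only.
text = ("the universe expands and the universe cools "
        "while space stretches and the big bang echoes in space")
tokens = text.split()

# Shuffle a copy with a fixed seed so the run is repeatable.
shuffled = tokens[:]
random.Random(0).shuffle(shuffled)

# Frequencies depend only on counts, so shuffling cannot change them.
same_zipf = rank_frequency(tokens) == rank_frequency(shuffled)

# The growth curve may differ at intermediate points, but it ends at the same vocabulary size.
same_vocab_size = vocabulary_growth(tokens)[-1] == vocabulary_growth(shuffled)[-1]
```

Both flags are true for any shuffle, which is why a measure of word importance has to look at where the occurrences fall rather than at how many there are.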

The particular arrangement of words in a specific order arises for two reasons. First, grammatical rules determine where words should be placed within a sentence and specify the position of verbs, nouns, adverbs, and other parts of speech. Grammatical rules create short-range correlations between the words in a sentence. Second, a text derives meaning from how the words are arranged throughout. This ordering is called semantic ordering, and it acts across the whole range of the text; hence long-range correlations can be seen between the positions of a given word. The broad meaning of a text also means that different word types have different importance in a text. We can distinguish between two kinds of content words in a text: those which are related to the subject of the text (i.e., the important words), and all others, which are irrelevant to it. For a text in cosmology, words like universe, space, big-bang, and inflation are important words. Other words such as is, fact, happening, etc., are irrelevant to the topic of the text. Finding an index for quantifying the importance of words in a given text is crucial to detecting keywords automatically, and provides a very useful starting point for text summarization, document categorization, machine translation and other matters related to automatic information retrieval. Automating these processes is of growing importance given the growing volume of available information and the limited manpower to process it.

In the current paper, we use the concept of a fractal to assign an importance value to every word in a given text. A fractal is a mathematical object (e.g., a set of points in Euclidean space) that has repeating patterns at every scale: at any magnification there is a smaller piece of the object that is similar to the whole. This property is called self-similarity. The fractal dimension shows how the detail of a fractal pattern changes with scale, and it is used as an index of complexity. The fractal dimension of a set is less than or equal to the topological dimension of the space in which the set is embedded. We claim that the positions of a word type within the text array form a fractal pattern with a specified dimension that is a positive value less than or equal to one. Based on this fact, an index is presented for ranking the vocabulary words of a given text. The difference between the pattern of a word in the original text and in a randomly shuffled version shows its importance: words with a greater difference between the original and shuffled texts are more important. We compare this approach with other, more well-known methods of keyword extraction.
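To make the notion of a dimension between zero and one concrete, here is a minimal Python sketch of a box-counting estimate for a set of positions in a one-dimensional array. It is offered under assumptions: this introduction does not say which estimator the authors use, so box counting is a stand-in chosen here, and the names `box_counting_dimension` and `cantor_positions` are hypothetical. The array is covered with boxes of size s, the number N(s) of boxes containing at least one occurrence is counted, and the dimension is taken as the negative least-squares slope of log N(s) against log s.

```python
import math


def box_counting_dimension(positions, box_sizes):
    """Estimate the box-counting dimension of a non-empty set of integer positions.

    For each box size s, count the boxes [k*s, (k+1)*s) that hold at least one
    position, then fit a straight line to log N(s) versus log s.
    """
    log_s = [math.log(s) for s in box_sizes]
    log_n = [math.log(len({p // s for p in positions})) for s in box_sizes]
    mean_s = sum(log_s) / len(log_s)
    mean_n = sum(log_n) / len(log_n)
    cov = sum((x - mean_s) * (y - mean_n) for x, y in zip(log_s, log_n))
    var = sum((x - mean_s) ** 2 for x in log_s)
    return -cov / var


def cantor_positions(levels):
    """Positions in range(3**levels) whose base-3 digits are all 0 or 2."""
    positions = [0]
    for level in range(levels):
        step = 2 * 3 ** level
        positions = positions + [p + step for p in positions]
    return sorted(positions)


levels = 6
length = 3 ** levels                      # a hypothetical "text" of 729 tokens
sizes = [3 ** j for j in range(levels)]   # box sizes 1, 3, 9, 27, 81, 243

# A hypothetical word whose occurrences follow a Cantor-like, self-similar pattern.
cantor_dim = box_counting_dimension(cantor_positions(levels), sizes)

# A hypothetical word that occupies every position of the array.
dense_dim = box_counting_dimension(range(length), sizes)
```

For the Cantor-like pattern the estimate is log 2 / log 3, roughly 0.63, while the pattern that fills the array has dimension one. For real words, which occur a finite and often small number of times, the fitted slope depends on the range of box sizes used, so this sketch should be read as an illustration of the concept rather than as the paper's index.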

In the following section we review previous research reporting a kind of fractal structure in texts, in order to show that our method is novel. Then we review some basic ideas for keyword extraction which are useful for understanding the different principles currently at work in the field. Finally, we describe our method and how it could be evaluated, and report the results for a sample book.

Background and Related Work

Principles for Keyword Extraction

Methods

Evaluation of the Method.

Results.

Ranking the words and keyword detection.

Evaluation of Our Method

Conclusion

Author Contributions

Overview

This paper introduces a novel approach to automatic keyword extraction by analyzing the fractal dimensions of words in a text. It argues that the significance of words can be quantified through their fractal patterns, with implications for various applications in text summarization and document categorization.

Key Points

1. Important words in a text have fractal dimensions significantly different from unimportant ones.
2. An index for ranking words based on their fractal patterns is proposed.
3. The method compares favorably with established techniques for keyword extraction.
4. Fractal dimensions are useful for quantifying the importance of individual words in texts.
5. The study highlights the relevance of self-similarity in text structure for keyword detection.

Details

Authors: Elham Najafi, Amir H. Darooneh
Category: Technology and Engineering
