Finding Minimal Absent Words in Multiple Strings
Explore methods for efficiently identifying minimal absent words across strings.
― 4 min read
Table of Contents
In the field of Strings and sequences, a minimal absent word (MAW) is a string that does not appear in another string, while all its smaller parts do. This idea has many uses in fields like biology, music, and data storage.
This article discusses extending the idea of MAWS to more than one string. We will describe how to find these MAWs efficiently using a method that involves multiple strings.
Basics of MAWs
A string is defined as an absent word for another string if it does not appear in it. A minimal absent word, on the other hand, is a type of absent word where all smaller parts of it can be found in the other string. For example, if we have a string "abc", the MAWs would be strings like "d" or "e" which are not found in "abc".
Research shows that the number of MAWs for a string of a certain length is predictable and has been studied extensively. Various Algorithms have been developed to compute MAWs for single strings, providing faster ways to determine them.
Extending MAWs to Multiple Strings
Our goal is to broaden the concept of MAWs from single strings to a set of strings. To do this, we need to consider a collection of strings, and we will introduce a way to identify which strings can be considered MAWs for this collection.
We will look for MAWs that meet specific conditions. If an absent word fulfills the conditions for being a MAW for all strings in the set, it qualifies. However, if it does not meet the conditions, it is not considered a MAW.
Algorithms for Finding MAWs
To tackle this problem, we will first focus on a smaller number of strings. We will present an algorithm that computes MAWs quickly, using memory efficiently.
Our approach builds a special structure called a Directed Acyclic Word Graph (DAWG). This structure helps to organize the strings and their parts, allowing us to find MAWs efficiently. Constructing the DAWG can be done in linear time, which makes our method practical.
Next, we extend our algorithm to handle more strings. The general case involves more complex operations, but we will show that it can be done in a reasonable time frame, even when dealing with a larger number of strings.
The Role of the DAWG
The DAWG is a crucial part of our approach. It is a data structure that helps keep track of the strings and their substrings. The DAWG can recognize all suffixes (or endings) of the strings in our set.
Using the DAWG, we can quickly determine which absent words are minimal. This is done efficiently, as we do not have to check every possible combination of strings; the DAWG allows for smart searching.
Handling Different Cases
Our algorithm considers various cases based on the labels assigned to the nodes of the DAWG. Each node represents a substring of one of the input strings. By carefully analyzing these nodes and their connections, we can determine which strings qualify as MAWs.
We categorize the cases into different groups. Depending on their labels, we set up rules that allow us to identify potential MAWs without unnecessary comparisons, speeding up the process.
Improving Efficiency
To further enhance our algorithm, we introduce skip links in the DAWG. These skip links allow us to bypass certain comparisons that do not need to be made, which saves time.
By creating lists of relevant edges in the DAWG, we can streamline the search for MAWs. This means we can compute MAWs more quickly than previous methods, even as the number of strings increases.
Conclusion
The work on generalized MAWs sets the stage for better understanding and handling of strings in various applications. By developing efficient algorithms and utilizing structures like the DAWG, we can tackle problems involving multiple strings much faster and with less resource use.
Future research can focus on finding even quicker methods for computing MAWs and exploring more complicated cases, which may further expand the applications of this concept in real-world settings.
The study of MAWs and their properties continues to be a rich area of research with many potential applications, and there is still much to learn about how to effectively compute and use these important structures.
Title: Linear-time computation of generalized minimal absent words for multiple strings
Abstract: A string $w$ is called a minimal absent word (MAW) for a string $S$ if $w$ does not occur as a substring in $S$ and all proper substrings of $w$ occur in $S$. MAWs are well-studied combinatorial string objects that have potential applications in areas including bioinformatics, musicology, and data compression. In this paper, we generalize the notion of MAWs to a set $\mathcal{S} = \{S_1, \ldots, S_k\}$ of multiple strings. We first describe our solution to the case of $k = 2$ strings, and show how to compute the set $\mathsf{M}$ of MAWs in optimal $O(n + |\mathsf{M}|)$ time and with $O(n)$ working space, where $n$ denotes the total length of the strings in $\mathcal{S}$. We then move on to the general case of $k > 2$ strings, and show how to compute the set $\mathsf{M}$ of MAWs in $O(n \lceil k / \log n \rceil + |\mathsf{M}|)$ time and with $O(n (k + \log n))$ bits of working space, in the word RAM model with machine word size $\omega = \log n$. The latter algorithm runs in optimal $O(n + |\mathsf{M}|)$ time for $k = O(\log n)$.
Authors: Kouta Okabe, Takuya Mieno, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai
Last Update: 2023-07-28 00:00:00
Language: English
Source URL: https://arxiv.org/abs/2307.01967
Source PDF: https://arxiv.org/pdf/2307.01967
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arxiv for use of its open access interoperability.