Measuring the similarity or difference between two strings is a fundamental problem in many applications. In bioinformatics, one has to predict the structures of RNA and proteins, classify the functions of molecules, infer the phylogeny of organisms, and search for entries in huge sequence databases. When processing electronic documents, one needs fast and flexible indexing techniques to perform searches. Many similarity measures have been defined for these purposes. The longest common subsequence and the edit distance are the most studied problems in string processing. In this book, we propose an O(min{mN, Mn}) time algorithm for finding a longest common subsequence of strings X and Y with lengths m and n, respectively, and run-length-encoded lengths M and N, respectively. We also improve the time bound for finding the edit distance between strings X and Y to O(min{mN, Mn}). Squares play a central role in word combinatorics, both in theory and in applications. We show how to locate all squares in a run-length-encoded string in O(N log N) time. This time bound is optimal, and it is independent of the length of the original uncompressed string.
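To make the notation concrete, here is a minimal Python sketch of how the quantities m, n, M, and N relate for a small pair of strings. The helper names run_length_encode and lcs_length, and the sample strings, are illustrative assumptions and do not come from the book. The LCS routine is the textbook O(mn) dynamic program, included only as a baseline for comparison; it is not the O(min{mN, Mn}) algorithm proposed in the book.

```python
from itertools import groupby


def run_length_encode(s):
    """Return the runs of s as (character, count) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def lcs_length(x, y):
    """Textbook O(mn) dynamic program for the LCS length (baseline only)."""
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


# Hypothetical sample strings, chosen only for illustration.
x, y = "aaabbbbcc", "aabbccc"
runs_x, runs_y = run_length_encode(x), run_length_encode(y)
m, M = len(x), len(runs_x)  # uncompressed length and number of runs of X
n, N = len(y), len(runs_y)  # uncompressed length and number of runs of Y

print(runs_x, runs_y)
print("mn =", m * n, " min{mN, Mn} =", min(m * N, M * n))
print("LCS length =", lcs_length(x, y))
```

For these strings, m = 9 and n = 7 while M = N = 3, so min{mN, Mn} = 21 compared with mn = 63. The gap widens as the runs get longer, which is the regime where run-length-aware algorithms are expected to pay off.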