Hjalte Wedel Vildhøj

Contact information

E-mail: kd.vwh@vwh

Technical University of Denmark
DTU Compute
Building 322, Office 008
DK-2800 Kongens Lyngby
Denmark
Show map

I'm a postdoc in the section for Algorithms, Logic and Graphs (AlgoLoG) at the Technical University of Denmark, DTU Compute. My PhD advisors were Inge Li Gørtz and Philip Bille. My research is centered around the design of algorithms, data structures and time-space trade-offs for combinatorial pattern matching problems. I am particularly interested in generalizations and variations of exact string matching and indexing.

When not in my office, I'm most likely travelling, riding my bike or hanging out on DTU's climbing wall. I also love to go spearfishing.

Publications

• Boxed Permutation Pattern Matching (pdf)
with Mika Amit, Philip Bille, Patrick Hagge Cording and Inge Li Gørtz
preprint 2016
Abstract: Given permutations $$T$$ and $$P$$ of length $$n$$ and $$m$$, respectively, the Permutation Pattern Matching problem asks to find all $$m$$-length subsequences of $$T$$ that are order-isomorphic to $$P$$. This problem has a wide range of applications but is known to be NP-hard. In this paper, we study the special case, where the goal is to only find the boxed subsequences of $$T$$ that are order-isomorphic to $$P$$. This problem was introduced by Bruner and Lackner who showed that it can be solved in $$O(n^3)$$ time. Cho et al. [CPM 2015] gave an $$O(n^2m)$$ time algorithm and improved it to $$O(n^2 \log m)$$. In this paper we present a solution that uses only $$O(n^2)$$ time. In general, there are instances where the output size is $$\Omega(n^2)$$ and hence our bound is optimal. To achieve our results, we introduce several new ideas including a novel reduction to 2D offline dominance counting. Our algorithm is surprisingly simple and straightforward to implement.
• Dynamic Relative Compression, Dynamic Partial Sums, and Substring Concatenation (pdf)
with Philip Bille, Patrick Hagge Cording, Inge Li Gørtz, Frederik Rye Skjoldjensen and Søren Vind
preprint 2015
Abstract: Given a static reference string $$R$$ and a source string $$S$$, a relative compression of $$S$$ with respect to $$R$$ is an encoding of $$S$$ as a sequence of references to substrings of $$R$$. Relative compression schemes are a classic model of compression and have recently proved very successful for compressing highly-repetitive massive data set such as genomes and web-data. We initiate the study of relative compression in a dynamic setting where the compressed source string $$S$$ is subject to edit operations. The goal is to maintain the compressed representation compactly, while supporting edits and allowing efficient random access to the (uncompressed) source string. We present new data structures, that achieve optimal time for updates and queries while using space linear in the size of the optimal relative compression, for nearly all combinations of parameters. We also present solutions for restricted or extended sets of updates. To achieve these results, we revisit the dynamic partial sums problem and the substring concatenation problem. We present new optimal or near optimal bounds for these problems. Plugging in our new results we also immediately obtain new bounds for the string indexing for patterns with wildcards problem and the dynamic text and static pattern matching problem.
• Sparse Text Indexing in Small Space (pdf)
with Philip Bille, Johannes Fischer, Inge Li Gørtz, Tsvi Kopelowitz and Benjamin Sach
ACM Transactions on Algorithms (2015), to appear
• Conference version: Sparse Suffix Tree Construction in Small Space (pdf) (slides)
with Philip Bille, Johannes Fischer, Inge Li Gørtz, Tsvi Kopelowitz and Benjamin Sach
Proc. 40th International Colloquium on Automata, Languages and Programming (ICALP 2013)
Abstract: We consider the problem of constructing a sparse suffix tree (or suffix array) for $$b$$ suffixes of a given text $$T$$ of length $$n$$, using only $$O(b)$$ words of space during construction. Attempts at breaking the naive bound of $$\Omega(nb)$$ time for this problem can be traced back to the origins of string indexing in 1968. First results were only obtained in 1996, but only for the case where the suffixes were evenly spaced in $$T$$. In this paper there is no constraint on the locations of the suffixes. We show that the sparse suffix tree can be constructed in $$O(n\log^2b)$$ time. To achieve this we develop a technique, which may be of independent interest, that allows to efficiently answer $$b$$ longest common prefix queries on suffixes of $$T$$, using only $$O(b)$$ space. We expect that this technique will prove useful in many other applications in which space usage is a concern. Our first solution is Monte-Carlo and outputs the correct tree with high probability. We then give a Las-Vegas algorithm which also uses $$O(b)$$ space and runs in the same time bounds with high probability when $$b = O(\sqrt{n})$$. Furthermore, additional tradeoffs between the space usage and the construction time for the Monte-Carlo algorithm are given.
• Longest Common Extensions in Sublinear Space (pdf)
with Philip Bille, Inge Li Gørtz, Mathias Bæk Tejs Knudsen and Moshe Lewenstein
Proc. 26th Annual Symposium on Combinatorial Pattern Matching (CPM 2015)
Abstract: The longest common extension problem (LCE problem) is to construct a data structure for an input string $$T$$ of length $$n$$ that supports $$\text{LCE}(i,j)$$ queries. Such a query returns the length of the longest common prefix of the suffixes starting at positions $$i$$ and $$j$$ in $$T$$. This classic problem has a well-known solution that uses $$O(n)$$ space and $$O(1)$$ query time. In this paper we show that for any trade-off parameter $$1 \leq \tau \leq n$$, the problem can be solved in $$O(\frac{n}{\tau})$$ space and $$O(\tau)$$ query time. This significantly improves the previously best known time-space trade-offs, and almost matches the best known time-space product lower bound.
• A Suffix Tree or Not A Suffix Tree? (pdf)
with Tatiana Starikovskaya
Journal of Discrete Algorithms (2015), doi: 10.1016/j.jda.2015.01.005
• Conference version: A Suffix Tree or Not A Suffix Tree? (pdf) (slides: keynote, pdf)
with Tatiana Starikovskaya
Proc. 25th International Workshop on Combinatorial Algorithms (IWOCA 2014)
Abstract: In this paper we study the structure of suffix trees. Given an unlabeled tree $$\tau$$ on $$n$$ nodes and suffix links of its internal nodes, we ask the question "Is $$\tau$$ a suffix tree?", i.e., is there a string $$S$$ whose suffix tree has the same topological structure as $$\tau$$? We place no restrictions on $$S$$, in particular we do not require that $$S$$ ends with a unique symbol. This corresponds to considering the more general definition of implicit or extended suffix trees. Such general suffix trees have many applications and are for example needed to allow efficient updates when suffix trees are built online. Deciding if $$\tau$$ is a suffix tree is not an easy task, because, with no restrictions on the final symbol, we cannot guess the length of a string that realizes $$\tau$$ from the number of leaves. And without an upper bound on the length of such a string, it is not even clear how to solve the problem by an exhaustive search. In this paper, we prove that $$\tau$$ is a suffix tree if and only if it is realized by a string $$S$$ of length $$n-1$$, and we give a linear-time algorithm for inferring $$S$$ when the first letter on each edge is known. This generalizes the work of I et al. [Discrete Appl. Math. 163, 2014].
• Sublinear Space Algorithms for the Longest Common Substring Problem (pdf)
with Tomasz Kociumaka and Tatiana Starikovskaya
Proc. 22nd European Symposium on Algorithms (ESA 2014)
Abstract: Given $$m$$ documents of total length $$n$$, we consider the problem of finding a longest string common to at least $$d \geq 2$$ of the documents. This problem is known as the longest common substring (LCS) problem and has a classic $$O(n)$$ space and $$O(n)$$ time solution (Weiner [FOCS'73], Hui [CPM'92]). However, the use of linear space is impractical in many applications. In this paper we show that for any trade-off parameter $$1 \leq \tau \leq n$$, the LCS problem can be solved in $$O(\tau)$$ space and $$O(n^2/\tau)$$ time, thus providing the first smooth deterministic time-space trade-off from constant to linear space. The result uses a new and very simple algorithm, which computes a $$\tau$$-additive approximation to the LCS in $$O(n^2/\tau)$$ time and $$O(1)$$ space. We also show a time-space trade-off lower bound for deterministic branching programs, which implies that any deterministic RAM algorithm solving the LCS problem on documents from a sufficiently large alphabet in $$O(\tau)$$ space must use $$\Omega(n\sqrt{\log(n/(\tau\log n))/\log\log(n/(\tau\log n)})$$ time.
• Fingerprints in Compressed Strings (pdf)
with Philip Bille, Patrick Hagge Cording, Inge Li Gørtz, Benjamin Sach and Søren Vind
Proc. 13th Algorithms And Data Structures Symposium (WADS 2013)
Abstract: The Karp-Rabin fingerprint of a string is a type of hash value that due to its strong properties has been used in many string algorithms. In this paper we show how to construct a data structure for a string $$S$$ of size $$N$$ compressed by a context-free grammar of size $$n$$ that answers fingerprint queries. That is, given indices $$i$$ and $$j$$, the answer to a query is the fingerprint of the substring $$S[i,j]$$. We present the first $$O(n)$$ space data structures that answer fingerprint queries without decompressing any characters. For Straight Line Programs (SLP) we get $$O(\log N)$$ query time, and for Linear SLPs (an SLP derivative that captures LZ78 compression and its variations) we get $$O(\log \log N)$$ query time. Hence, our data structures has the same time and space complexity as for random access in SLPs. We utilize the fingerprint data structures to solve the longest common extension problem in query time $$O(\log N\log \ell)$$ and $$O(\log \ell \log\log \ell + \log\log N)$$ for SLPs and Linear SLPs, respectively. Here, $$\ell$$ denotes the length of the LCE.
• Time-Space Trade-Offs for the Longest Common Substring Problem (pdf) (slides)
with Tatiana Starikovskaya
Proc. 24th Annual Symposium on Combinatorial Pattern Matching (CPM 2013)
Abstract: The Longest Common Substring problem is to compute the longest substring which occurs in at least $$d \geq 2$$ of $$m$$ strings of total length $$n$$. In this paper we ask the question whether this problem allows a deterministic time-space trade-off using $$O(n^{1+\epsilon})$$ time and $$O(n^{1-\epsilon})$$ space for $$0 \leq \epsilon \leq 1$$. We give a positive answer in the case of two strings ($$d=m=2$$) and $$0 < \epsilon \leq 1/3$$. In the general case where $$2 \leq d \leq m$$, we show that the problem can be solved in $$O(n^{1-\varepsilon})$$ space and $$O( n^{1+\varepsilon}\log^2 n (d \log^2 n + d^2))$$ time for any $$0 \leq \varepsilon < 1/3$$.
• The Hardness of the Functional Orientation 2-Color Problem (pdf)
with Søren Bøg and Morten Stöckel
The Australasian Journal of Combinatorics vol. 56 (2013), pages 225-234
Abstract: We consider the Functional Orientation 2-Color problem, which was introduced by Valiant in his seminal paper on holographic algorithms [SIAM J. Comput., 37(5), 2008]. For this decision problem, Valiant gave a polynomial time holographic algorithm for planar graphs of maximum degree 3, and showed that the problem is NP-complete for planar graphs of maximum degree 10. A recent result on defective graph coloring by Corrêa et al. [Australas. J. Combin., 43, 2009] implies that the problem is already hard for planar graphs of maximum degree 8. Together, these results leave open the hardness question for graphs of maximum degree between 4 and 7. We close this gap by showing that the answer is always yes for arbitrary graphs of maximum degree 5, and that the problem is NP-complete for planar graphs of maximum degree 6. Moreover, for graphs of maximum degree 5, we note that a linear time algorithm for finding a solution exists.
• Time-Space Trade-Offs for Longest Common Extensions (pdf)
with Philip Bille, Inge Li Gørtz and Benjamin Sach
Journal of Discrete Algorithms (2013), doi:10.1016/j.jda.2013.06.003
• Conference version: Time-Space Trade-Offs for Longest Common Extensions (pdf) (slides)
with Philip Bille, Inge Li Gørtz and Benjamin Sach
Proc. 23rd Annual Symposium on Combinatorial Pattern Matching (CPM 2012)
Abstract: We revisit the longest common extension (LCE) problem, that is, preprocess a string $$T$$ into a compact data structure that supports fast LCE queries. An LCE query takes a pair $$(i,j)$$ of indices in $$T$$ and returns the length of the longest common prefix of the suffixes of $$T$$ starting at positions $$i$$ and $$j$$. We study the time-space trade-offs for the problem, that is, the space used for the data structure vs. the worst-case time for answering an LCE query. Let $$n$$ be the length of $$T$$. Given a parameter $$\tau$$, $$1 \leq \tau \leq n$$, we show how to achieve either $$O(\frac{n}{\sqrt{\tau}})$$ space and $$O(\tau)$$ query time, or $$O(\frac{n}{\tau})$$ space and $$O(\tau \log({|\text{LCE}(i,j)|}/{\tau}))$$ query time, where $$|\text{LCE}(i,j)|$$ denotes the length of the LCE returned by the query. These bounds provide the first smooth trade-offs for the LCE problem and almost match the previously known bounds at the extremes when $$\tau=1$$ or $$\tau=n$$. We apply the result to obtain improved bounds for several applications where the LCE problem is the computational bottleneck, including approximate string matching and computing palindromes. We also present an efficient technique to reduce LCE queries on two strings to one string. Finally, we give a lower bound on the time-space product for LCE data structures in the non-uniform cell probe model showing that our second trade-off is nearly optimal.
• String Indexing for Patterns with Wildcards (pdf)
with Philip Bille, Inge Li Gørtz and Søren Vind
Theory of Computing Systems (2013), doi: 10.1007/s00224-013-9498-4
• Conference version: String Indexing for Patterns with Wildcards (pdf) (slides)
with Philip Bille, Inge Li Gørtz and Søren Vind
Proc. 13th Scandinavian Symposium and Workshops on Algorithm Theory (SWAT 2012)
Abstract: We consider the problem of indexing a string $$t$$ of length $$n$$ to report the occurrences of a query pattern $$p$$ containing $$m$$ characters and $$j$$ wildcards. Let $$occ$$ be the number of occurrences of $$p$$ in $$t$$, and $$\sigma$$ the size of the alphabet. We obtain the following results.
• A linear space index with query time $$O(m+\sigma^j \log \log n + occ)$$. This significantly improves the previously best known linear space index by Lam et al. [ISAAC 2007], which requires query time $$\Theta(jn)$$ in the worst case.
• An index with query time $$O(m+j+occ)$$ using space $$O(\sigma^{k^2} n \log^k \log n)$$, where $$k$$ is the maximum number of wildcards allowed in the pattern. This is the first non-trivial bound with this query time.
• A time-space trade-off, generalizing the index described by Cole et al. [STOC 2004].
Our results are obtained using a novel combination of well-known and new techniques, which could be of independent interest.
• String Matching with Variable Length Gaps (pdf)
with Philip Bille, Inge Li Gørtz and David Kofoed Wind
Theoretical Computer Science (2012), doi:10.1016/j.tcs.2012.03.029
• Conference version: String Matching with Variable Length Gaps (pdf) (slides)
with Philip Bille, Inge Li Gørtz and David Kofoed Wind
Proc. 17th Symposium on String Processing and Information Retrieval (SPIRE 2010)
Abstract: We consider string matching with variable length gaps. Given a string $$T$$ and a pattern $$P$$ consisting of strings separated by variable length gaps (arbitrary strings of length in a specified range), the problem is to find all ending positions of substrings in $$T$$ that match $$P$$. This problem is a basic primitive in computational biology applications. Let $$m$$ and $$n$$ be the lengths of $$P$$ and $$T$$, respectively, and let $$k$$ be the number of strings in $$P$$. We present a new algorithm achieving time $$O(n\log k + m +\alpha)$$ and space $$O(m + A)$$, where $$A$$ is the sum of the lower bounds of the lengths of the gaps in $$P$$ and $$\alpha$$ is the total number of occurrences of the strings in $$P$$ within $$T$$. Compared to the previous results this bound essentially achieves the best known time and space complexities simultaneously. Consequently, our algorithm obtains the best known bounds for almost all combinations of $$m$$, $$n$$, $$k$$, $$A$$, and $$\alpha$$. Our algorithm is surprisingly simple and straightforward to implement. We also present algorithms for finding and encoding the positions of all strings in $$P$$ for every match of the pattern.
• Supplementary Notes for Graph Theory I ISBN-13: 9788770782180 (current version)
with David Kofoed Wind
Course Notes for 01227 Graph Theory I
• Master's Thesis: String Indexing for Patterns with Wildcards (pdf)
with Søren Vind
Technical University of Denmark, August 8, 2011
Supervised by Philip Bille and Inge Li Gørtz
Abstract: We consider the problem of indexing a string $$t$$ of length $$n$$ to report the occurrences of a query pattern $$p$$ containing $$m$$ characters and $$j$$ wildcards. Let $$occ$$ be the number of occurrences of $$p$$ in $$t$$, and $$\sigma$$ the size of the alphabet. We obtain the following results.
• A linear space index with query time $$O(m+\sigma^j \log \log n + occ)$$. This significantly improves the previously best known linear space index described by Lam et al. [ISAAC 2007], which requires query time $$\Theta(jn)$$ in the worst case.
• An index with optimal query time $$O(m+j+occ)$$ using space $$O(\sigma^{k^2} n \log^k \log n)$$, where $$k$$ is the maximum number of wildcards allowed in the pattern. This is the first non-trivial bound with this query time.
• A time-space trade-off for the problem which generalizes the index described by Cole et al. [STOC 2004].
The Longest Common Prefix (LCP) data structure introduced by Cole et al. is a key component in our results. We give a detailed explanation and show several new properties of the LCP data structure. Most importantly, we show that not only suffixes, but arbitrary sets of substrings of $$t$$, can be queried, and that unrooted LCP queries can be performed in linear space. Our results are obtained by combining the new properties of the LCP data structure with well-known and new techniques. In particular, we introduce a generalization of the heavy-path decomposition, which could be of independent interest. Finally, we extend our results to allow variable length gaps or optional wildcards in the query pattern, improving upon the only previously known index for this problem by Lam et al. [ISAAC 2007].

My neighbors

The best thing about research is that you get to work with some admirable and brilliant people. Here's my current neighbors in the Erdös collaboration graph: Mika Amit, Philip Bille, Søren Bøg, Patrick Hagge Cording, Johannes Fischer, Inge Li Gørtz, Mathias Bæk Tejs Knudsen, Tomasz Kociumaka, Tsvi Kopelowitz, Moshe Lewenstein, Benjamin Sach, Frederik Rye Skjoldjensen, Tatiana Starikovskaya, Morten Stöckel, Søren Vind and David Kofoed Wind.