
Automated design of efficient search schemes for lossless approximate pattern matching
(2024)
RESEARCH IN COMPUTATIONAL MOLECULAR BIOLOGY, RECOMB 2024.
In Lecture Notes in Computer Science
14758.
p.164-184
- Author
- Luca Renders (UGent) , Lore Depuydt (UGent) , Sven Rahmann and Jan Fostier (UGent)
- Organization
- Project
- Abstract
- We present a novel method for the automated design of search schemes for lossless approximate pattern matching. Search schemes are combinatorial structures that define a series of searches, each of which specifies lower and upper bounds on the number of errors in each part of a partitioned search pattern, and the processing order of these parts. Collectively, these searches guarantee that all approximate occurrences of a search pattern up to a predefined number of k errors are identified. Because generating efficient search schemes is increasingly computationally expensive for larger k, search schemes have been proposed in literature for only up to k = 4 errors. To design search schemes allowing more errors, we combine a greedy algorithm and a new Integer Linear Programming (ILP) formulation. Efficient, ILP-optimal search schemes for up to k = 7 errors are proposed and shown to outperform alternative strategies, both in theory and in practice. Additionally, we propose a technique to dynamically select an appropriate search scheme given a specific search pattern. These combined approaches result in reductions of up to 53% in runtime for higher values of k. We introduce Hato, an open-source software tool (AGPL-3.0 license) to automatically generate search schemes. It implements the greedy algorithm and solves the ILP formulation using CPLEX. Furthermore, we present Columba 1.2, an open-source lossless read mapper (AGPL-3.0 license) implemented in C++. Columba can identify all approximate occurrences of 100 000 Illumina reads (150 bp) in the human reference genome in 24 s (maximum edit distance of 4) and in 75 s (edit distance of 6) using a single CPU core, thereby outperforming existing state-of-the-art tools for lossless approximate matching by a large margin.
- Keywords
- search scheme, integer linear program, approximate pattern matching, sequence alignment, ALIGNMENT, ACCURATE
Downloads
-
(...).pdf
- full text (Published version)
- |
- UGent only
- |
- |
- 474.70 KB
-
8548 acc.pdf
- full text (Accepted manuscript)
- |
- open access
- |
- |
- 431.09 KB
Citation
Please use this url to cite or link to this publication: http://hdl.handle.net/1854/LU-01HYMV2QMMD4067CCK02YX88K6
- MLA
- Renders, Luca, et al. “Automated Design of Efficient Search Schemes for Lossless Approximate Pattern Matching.” RESEARCH IN COMPUTATIONAL MOLECULAR BIOLOGY, RECOMB 2024, vol. 14758, Springer Nature Switzerland, 2024, pp. 164–84, doi:10.1007/978-1-0716-3989-4_11.
- APA
- Renders, L., Depuydt, L., Rahmann, S., & Fostier, J. (2024). Automated design of efficient search schemes for lossless approximate pattern matching. RESEARCH IN COMPUTATIONAL MOLECULAR BIOLOGY, RECOMB 2024, 14758, 164–184. https://doi.org/10.1007/978-1-0716-3989-4_11
- Chicago author-date
- Renders, Luca, Lore Depuydt, Sven Rahmann, and Jan Fostier. 2024. “Automated Design of Efficient Search Schemes for Lossless Approximate Pattern Matching.” In RESEARCH IN COMPUTATIONAL MOLECULAR BIOLOGY, RECOMB 2024, 14758:164–84. Springer Nature Switzerland. https://doi.org/10.1007/978-1-0716-3989-4_11.
- Chicago author-date (all authors)
- Renders, Luca, Lore Depuydt, Sven Rahmann, and Jan Fostier. 2024. “Automated Design of Efficient Search Schemes for Lossless Approximate Pattern Matching.” In RESEARCH IN COMPUTATIONAL MOLECULAR BIOLOGY, RECOMB 2024, 14758:164–184. Springer Nature Switzerland. doi:10.1007/978-1-0716-3989-4_11.
- Vancouver
- 1.Renders L, Depuydt L, Rahmann S, Fostier J. Automated design of efficient search schemes for lossless approximate pattern matching. In: RESEARCH IN COMPUTATIONAL MOLECULAR BIOLOGY, RECOMB 2024. Springer Nature Switzerland; 2024. p. 164–84.
- IEEE
- [1]L. Renders, L. Depuydt, S. Rahmann, and J. Fostier, “Automated design of efficient search schemes for lossless approximate pattern matching,” in RESEARCH IN COMPUTATIONAL MOLECULAR BIOLOGY, RECOMB 2024, Cambridge, MA, USA, 2024, vol. 14758, pp. 164–184.
@inproceedings{01HYMV2QMMD4067CCK02YX88K6, abstract = {{We present a novel method for the automated design of search schemes for lossless approximate pattern matching. Search schemes are combinatorial structures that define a series of searches, each of which specifies lower and upper bounds on the number of errors in each part of a partitioned search pattern, and the processing order of these parts. Collectively, these searches guarantee that all approximate occurrences of a search pattern up to a predefined number of k errors are identified. Because generating efficient search schemes is increasingly computationally expensive for larger k, search schemes have been proposed in literature for only up to k = 4 errors. To design search schemes allowing more errors, we combine a greedy algorithm and a new Integer Linear Programming (ILP) formulation. Efficient, ILP-optimal search schemes for up to k = 7 errors are proposed and shown to outperform alternative strategies, both in theory and in practice. Additionally, we propose a technique to dynamically select an appropriate search scheme given a specific search pattern. These combined approaches result in reductions of up to 53% in runtime for higher values of k. We introduce Hato, an open-source software tool (AGPL-3.0 license) to automatically generate search schemes. It implements the greedy algorithm and solves the ILP formulation using CPLEX. Furthermore, we present Columba 1.2, an open-source lossless read mapper (AGPL-3.0 license) implemented in C++. Columba can identify all approximate occurrences of 100 000 Illumina reads (150 bp) in the human reference genome in 24 s (maximum edit distance of 4) and in 75 s (edit distance of 6) using a single CPU core, thereby outperforming existing state-of-the-art tools for lossless approximate matching by a large margin.}}, author = {{Renders, Luca and Depuydt, Lore and Rahmann, Sven and Fostier, Jan}}, booktitle = {{RESEARCH IN COMPUTATIONAL MOLECULAR BIOLOGY, RECOMB 2024}}, isbn = {{9781071639887}}, issn = {{0302-9743}}, keywords = {{search scheme,integer linear program,approximate pattern matching,sequence alignment,ALIGNMENT,ACCURATE}}, language = {{eng}}, location = {{Cambridge, MA, USA}}, pages = {{164--184}}, publisher = {{Springer Nature Switzerland}}, title = {{Automated design of efficient search schemes for lossless approximate pattern matching}}, url = {{http://doi.org/10.1007/978-1-0716-3989-4_11}}, volume = {{14758}}, year = {{2024}}, }
- Altmetric
- View in Altmetric
- Web of Science
- Times cited: