Advanced search
2 files | 905.79 KB Add to list

Automated design of efficient search schemes for lossless approximate pattern matching

Luca Renders (UGent) , Lore Depuydt (UGent) , Sven Rahmann and Jan Fostier (UGent)
Author
Organization
Project
Abstract
We present a novel method for the automated design of search schemes for lossless approximate pattern matching. Search schemes are combinatorial structures that define a series of searches, each of which specifies lower and upper bounds on the number of errors in each part of a partitioned search pattern, and the processing order of these parts. Collectively, these searches guarantee that all approximate occurrences of a search pattern up to a predefined number of k errors are identified. Because generating efficient search schemes is increasingly computationally expensive for larger k, search schemes have been proposed in literature for only up to k = 4 errors. To design search schemes allowing more errors, we combine a greedy algorithm and a new Integer Linear Programming (ILP) formulation. Efficient, ILP-optimal search schemes for up to k = 7 errors are proposed and shown to outperform alternative strategies, both in theory and in practice. Additionally, we propose a technique to dynamically select an appropriate search scheme given a specific search pattern. These combined approaches result in reductions of up to 53% in runtime for higher values of k. We introduce Hato, an open-source software tool (AGPL-3.0 license) to automatically generate search schemes. It implements the greedy algorithm and solves the ILP formulation using CPLEX. Furthermore, we present Columba 1.2, an open-source lossless read mapper (AGPL-3.0 license) implemented in C++. Columba can identify all approximate occurrences of 100 000 Illumina reads (150 bp) in the human reference genome in 24 s (maximum edit distance of 4) and in 75 s (edit distance of 6) using a single CPU core, thereby outperforming existing state-of-the-art tools for lossless approximate matching by a large margin.
Keywords
search scheme, integer linear program, approximate pattern matching, sequence alignment, ALIGNMENT, ACCURATE

Downloads

  • (...).pdf
    • full text (Published version)
    • |
    • UGent only
    • |
    • PDF
    • |
    • 474.70 KB
  • 8548 acc.pdf
    • full text (Accepted manuscript)
    • |
    • open access
    • |
    • PDF
    • |
    • 431.09 KB

Citation

Please use this url to cite or link to this publication:

MLA
Renders, Luca, et al. “Automated Design of Efficient Search Schemes for Lossless Approximate Pattern Matching.” RESEARCH IN COMPUTATIONAL MOLECULAR BIOLOGY, RECOMB 2024, vol. 14758, Springer Nature Switzerland, 2024, pp. 164–84, doi:10.1007/978-1-0716-3989-4_11.
APA
Renders, L., Depuydt, L., Rahmann, S., & Fostier, J. (2024). Automated design of efficient search schemes for lossless approximate pattern matching. RESEARCH IN COMPUTATIONAL MOLECULAR BIOLOGY, RECOMB 2024, 14758, 164–184. https://doi.org/10.1007/978-1-0716-3989-4_11
Chicago author-date
Renders, Luca, Lore Depuydt, Sven Rahmann, and Jan Fostier. 2024. “Automated Design of Efficient Search Schemes for Lossless Approximate Pattern Matching.” In RESEARCH IN COMPUTATIONAL MOLECULAR BIOLOGY, RECOMB 2024, 14758:164–84. Springer Nature Switzerland. https://doi.org/10.1007/978-1-0716-3989-4_11.
Chicago author-date (all authors)
Renders, Luca, Lore Depuydt, Sven Rahmann, and Jan Fostier. 2024. “Automated Design of Efficient Search Schemes for Lossless Approximate Pattern Matching.” In RESEARCH IN COMPUTATIONAL MOLECULAR BIOLOGY, RECOMB 2024, 14758:164–184. Springer Nature Switzerland. doi:10.1007/978-1-0716-3989-4_11.
Vancouver
1.
Renders L, Depuydt L, Rahmann S, Fostier J. Automated design of efficient search schemes for lossless approximate pattern matching. In: RESEARCH IN COMPUTATIONAL MOLECULAR BIOLOGY, RECOMB 2024. Springer Nature Switzerland; 2024. p. 164–84.
IEEE
[1]
L. Renders, L. Depuydt, S. Rahmann, and J. Fostier, “Automated design of efficient search schemes for lossless approximate pattern matching,” in RESEARCH IN COMPUTATIONAL MOLECULAR BIOLOGY, RECOMB 2024, Cambridge, MA, USA, 2024, vol. 14758, pp. 164–184.
@inproceedings{01HYMV2QMMD4067CCK02YX88K6,
  abstract     = {{We present a novel method for the automated design of search schemes for lossless approximate pattern matching. Search schemes are combinatorial structures that define a series of searches, each of which specifies lower and upper bounds on the number of errors in each part of a partitioned search pattern, and the processing order of these parts. Collectively, these searches guarantee that all approximate occurrences of a search pattern up to a predefined number of k errors are identified. Because generating efficient search schemes is increasingly computationally expensive for larger k, search schemes have been proposed in literature for only up to k = 4 errors.

To design search schemes allowing more errors, we combine a greedy algorithm and a new Integer Linear Programming (ILP) formulation. Efficient, ILP-optimal search schemes for up to k = 7 errors are proposed and shown to outperform alternative strategies, both in theory and in practice. Additionally, we propose a technique to dynamically select an appropriate search scheme given a specific search pattern. These combined approaches result in reductions of up to 53% in runtime for higher values of k.

We introduce Hato, an open-source software tool (AGPL-3.0 license) to automatically generate search schemes. It implements the greedy algorithm and solves the ILP formulation using CPLEX. Furthermore, we present Columba 1.2, an open-source lossless read mapper (AGPL-3.0 license) implemented in C++. Columba can identify all approximate occurrences of 100 000 Illumina reads (150 bp) in the human reference genome in 24 s (maximum edit distance of 4) and in 75 s (edit distance of 6) using a single CPU core, thereby outperforming existing state-of-the-art tools for lossless approximate matching by a large margin.}},
  author       = {{Renders, Luca and Depuydt, Lore and Rahmann, Sven and Fostier, Jan}},
  booktitle    = {{RESEARCH IN COMPUTATIONAL MOLECULAR BIOLOGY, RECOMB 2024}},
  isbn         = {{9781071639887}},
  issn         = {{0302-9743}},
  keywords     = {{search scheme,integer linear program,approximate pattern matching,sequence alignment,ALIGNMENT,ACCURATE}},
  language     = {{eng}},
  location     = {{Cambridge, MA, USA}},
  pages        = {{164--184}},
  publisher    = {{Springer Nature Switzerland}},
  title        = {{Automated design of efficient search schemes for lossless approximate pattern matching}},
  url          = {{http://doi.org/10.1007/978-1-0716-3989-4_11}},
  volume       = {{14758}},
  year         = {{2024}},
}

Altmetric
View in Altmetric
Web of Science
Times cited: