
AFRESh : an adaptive framework for compression of reads and assembled sequences with random access functionality

Tom Paridaens (UGent) , Glenn Van Wallendael (UGent) , Wesley De Neve (UGent) and Peter Lambert (UGent)
(2017) BIOINFORMATICS. 33(10). p.1464-1472
Abstract
Motivation: The past decade has seen the introduction of new technologies that have steadily lowered the cost of genomic sequencing. The cost of sequencing is even dropping significantly faster than the cost of storage and transmission. This motivates a need for continuous improvements in the area of genomic data compression, not only at the level of effectiveness (compression rate), but also at the level of functionality (e.g. random access), configurability (effectiveness versus complexity, coding tool set, etc.) and versatility (support for both sequenced reads and assembled sequences). In that regard, current approaches mostly do not support random access, requiring full files to be transmitted, and they are restricted to either read or sequence compression.
Results: We propose AFRESh, an adaptive framework for no-reference compression of genomic data with random access functionality, targeting the effective representation of the raw genomic symbol streams of both reads and assembled sequences. AFRESh makes use of a configurable set of prediction and encoding tools, extended by a Context-Adaptive Binary Arithmetic Coding scheme (CABAC), to compress raw genetic codes. To the best of our knowledge, our paper is the first to describe an effective implementation of CABAC outside of its original application. By applying CABAC, the compression effectiveness improves by up to 19% for assembled sequences and up to 62% for reads. By applying AFRESh to the genomic symbols of the MPEG genomic compression test set for reads, a compression gain is achieved of up to 51% compared to SCALCE, 42% compared to LFQC and 44% compared to ORCOM. When comparing to generic compression approaches, a compression gain is achieved of up to 41% compared to GNU Gzip and 22% compared to 7-Zip at the Ultra setting. Additionally, when compressing assembled sequences of the Human Genome, a compression gain is achieved of up to 34% compared to GNU Gzip and 16% compared to 7-Zip at the Ultra setting.
Keywords
IBCN

Downloads

  • (...).pdf | full text | UGent only | PDF | 346.20 KB

Citation

MLA
Paridaens, Tom et al. “AFRESh : an Adaptive Framework for Compression of Reads and Assembled Sequences with Random Access Functionality.” BIOINFORMATICS 33.10 (2017): 1464–1472. Print.
APA
Paridaens, T., Van Wallendael, G., De Neve, W., & Lambert, P. (2017). AFRESh : an adaptive framework for compression of reads and assembled sequences with random access functionality. BIOINFORMATICS, 33(10), 1464–1472.
Chicago author-date
Paridaens, Tom, Glenn Van Wallendael, Wesley De Neve, and Peter Lambert. 2017. “AFRESh : an Adaptive Framework for Compression of Reads and Assembled Sequences with Random Access Functionality.” Bioinformatics 33 (10): 1464–1472.
Vancouver
1. Paridaens T, Van Wallendael G, De Neve W, Lambert P. AFRESh : an adaptive framework for compression of reads and assembled sequences with random access functionality. BIOINFORMATICS. 2017;33(10):1464–72.
IEEE
[1] T. Paridaens, G. Van Wallendael, W. De Neve, and P. Lambert, “AFRESh : an adaptive framework for compression of reads and assembled sequences with random access functionality,” BIOINFORMATICS, vol. 33, no. 10, pp. 1464–1472, 2017.
@article{8524498,
  abstract     = {Motivation: The past decade has seen the introduction of new technologies that have steadily lowered the cost of genomic sequencing. The cost of sequencing is even dropping significantly faster than the cost of storage and transmission. This motivates a need for continuous improvements in the area of genomic data compression, not only at the level of effectiveness (compression rate), but also at the level of functionality (e.g. random access), configurability (effectiveness versus complexity, coding tool set, etc.) and versatility (support for both sequenced reads and assembled sequences). In that regard, current approaches mostly do not support random access, requiring full files to be transmitted, and they are restricted to either read or sequence compression. Results: We propose AFRESh, an adaptive framework for no-reference compression of genomic data with random access functionality, targeting the effective representation of the raw genomic symbol streams of both reads and assembled sequences. AFRESh makes use of a configurable set of prediction and encoding tools, extended by a Context-Adaptive Binary Arithmetic Coding scheme (CABAC), to compress raw genetic codes. To the best of our knowledge, our paper is the first to describe an effective implementation of CABAC outside of its original application. By applying CABAC, the compression effectiveness improves by up to 19% for assembled sequences and up to 62% for reads. By applying AFRESh to the genomic symbols of the MPEG genomic compression test set for reads, a compression gain is achieved of up to 51% compared to SCALCE, 42% compared to LFQC and 44% compared to ORCOM. When comparing to generic compression approaches, a compression gain is achieved of up to 41% compared to GNU Gzip and 22% compared to 7-Zip at the Ultra setting. Additionally, when compressing assembled sequences of the Human Genome, a compression gain is achieved of up to 34% compared to GNU Gzip and 16% compared to 7-Zip at the Ultra setting.},
  author       = {Paridaens, Tom and Van Wallendael, Glenn and De Neve, Wesley and Lambert, Peter},
  issn         = {1367-4803},
  journal      = {BIOINFORMATICS},
  keywords     = {IBCN},
  language     = {eng},
  number       = {10},
  pages        = {1464--1472},
  title        = {AFRESh : an adaptive framework for compression of reads and assembled sequences with random access functionality},
  url          = {http://dx.doi.org/10.1093/bioinformatics/btx001},
  volume       = {33},
  year         = {2017},
}
