QQ-SNV: single nucleotide variant detection at low frequency by comparing the quality quantiles

Project
Bioinformatics: from nucleotids to networks (N2N)
Abstract
Background: Next generation sequencing enables studying heterogeneous populations of viral infections. When the sequencing is done at high coverage depth ("deep sequencing"), low frequency variants can be detected. Here we present QQ-SNV (http://sourceforge.net/projects/qqsnv), a logistic regression classifier model developed for the Illumina sequencing platforms that uses the quantiles of the quality scores, to distinguish true single nucleotide variants from sequencing errors based on the estimated SNV probability. To train the model, we created a dataset of an in silico mixture of five HIV-1 plasmids. Testing of our method in comparison to the existing methods LoFreq, ShoRAH, and V-Phaser 2 was performed on two HIV and four HCV plasmid mixture datasets and one influenza H1N1 clinical dataset.
Results: For default application of QQ-SNV, variants were called using a SNV probability cutoff of 0.5 (QQ-SNVD). To improve the sensitivity we used a SNV probability cutoff of 0.0001 (QQ-SNVHS). To also increase specificity, SNVs called were overruled when their frequency was below the 80th percentile calculated on the distribution of error frequencies (QQ-SNVHS-P80). When comparing QQ-SNV versus the other methods on the plasmid mixture test sets, QQ-SNVD performed similarly to the existing approaches. QQ-SNVHS was more sensitive on all test sets but with more false positives. QQ-SNVHS-P80 was found to be the most accurate method over all test sets by balancing sensitivity and specificity. When applied to a paired-end HCV sequencing study, with lowest spiked-in true frequency of 0.5 %, QQ-SNVHS-P80 revealed a sensitivity of 100 % (vs. 40-60 % for the existing methods) and a specificity of 100 % (vs. 98.0-99.7 % for the existing methods). In addition, QQ-SNV required the least overall computation time to process the test sets. Finally, when testing on a clinical sample, four putative true variants with frequency below 0.5 % were consistently detected by QQ-SNVHS-P80 from different generations of Illumina sequencers.
Conclusions: We developed and successfully evaluated a novel method, called QQ-SNV, for highly efficient single nucleotide variant calling on Illumina deep sequencing virology data.
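The three calling modes named in the abstract (QQ-SNVD, QQ-SNVHS, QQ-SNVHS-P80) differ only in the probability cutoff and in whether the 80th-percentile frequency filter is applied. The Python sketch below illustrates that decision rule only; it is not the authors' implementation and does not reproduce the logistic regression classifier. The SNV probabilities and the error frequencies are assumed to be supplied by the caller, the names Candidate, percentile and call_snvs are hypothetical, and the inclusive comparisons and the linear-interpolation percentile are assumptions, since the abstract does not specify them.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Candidate:
    """Hypothetical record for one candidate variant."""
    position: int
    base: str
    frequency: float        # observed variant frequency (fraction of reads)
    snv_probability: float  # classifier output, assumed to be given


def percentile(values: Sequence[float], q: float) -> float:
    """Percentile with linear interpolation, q in [0, 100] (an assumption)."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def call_snvs(
    candidates: Sequence[Candidate],
    prob_cutoff: float,
    error_freqs: Optional[Sequence[float]] = None,
    q: float = 80.0,
) -> List[Candidate]:
    """Keep candidates at or above the probability cutoff.

    If error_freqs is given, additionally overrule calls whose frequency
    lies below the q-th percentile of those error frequencies.
    """
    called = [c for c in candidates if c.snv_probability >= prob_cutoff]
    if error_freqs is not None:
        threshold = percentile(error_freqs, q)
        called = [c for c in called if c.frequency >= threshold]
    return called


# Made-up illustration data, not taken from the paper.
candidates = [
    Candidate(101, "A", 0.0200, 0.93),
    Candidate(215, "G", 0.0040, 0.02),
    Candidate(330, "T", 0.0008, 0.0005),
    Candidate(412, "C", 0.0003, 0.00001),
]
error_freqs = [0.0002, 0.0004, 0.0005, 0.0007, 0.0010, 0.0012]

default_calls = call_snvs(candidates, prob_cutoff=0.5)            # QQ-SNVD-like
sensitive_calls = call_snvs(candidates, prob_cutoff=0.0001)       # QQ-SNVHS-like
filtered_calls = call_snvs(candidates, 0.0001, error_freqs, 80.0) # QQ-SNVHS-P80-like
```

On this toy input the low cutoff admits more candidates than the default one, and the percentile filter then removes the lowest-frequency call again, which mirrors the sensitivity and specificity trade-off described in the Results.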
Keywords
Illumina deep sequencing, Classifier model, Logistic regression, Single nucleotide variant, True variant, GENERATION SEQUENCING DATA, VIRAL POPULATIONS, GENETIC DIVERSITY, DISCOVERY, VIROLOGY, DISEASES, FORMAT, SCORES

Downloads

  • VanderBorghtEtal2015QQ-SNV.pdf (full text, open access, PDF, 730.65 KB)

Citation

Chicago
van der Borght, Koen, Kim Thys, Yves Wetzels, Lieven Clement, Bie Verbist, Joke Reumers, Herman van Vlijmen, and Jeroen Aerssens. 2015. “QQ-SNV: Single Nucleotide Variant Detection at Low Frequency by Comparing the Quality Quantiles.” BMC Bioinformatics 16.
APA
van der Borght, K., Thys, K., Wetzels, Y., Clement, L., Verbist, B., Reumers, J., van Vlijmen, H., et al. (2015). QQ-SNV: single nucleotide variant detection at low frequency by comparing the quality quantiles. BMC Bioinformatics, 16.
Vancouver
1. van der Borght K, Thys K, Wetzels Y, Clement L, Verbist B, Reumers J, et al. QQ-SNV: single nucleotide variant detection at low frequency by comparing the quality quantiles. BMC Bioinformatics. 2015;16.
MLA
van der Borght, Koen, Kim Thys, Yves Wetzels, et al. “QQ-SNV: Single Nucleotide Variant Detection at Low Frequency by Comparing the Quality Quantiles.” BMC Bioinformatics 16 (2015): n. pag. Print.
@article{7180089,
  abstract     = {Background: Next generation sequencing enables studying heterogeneous populations of viral infections. When the sequencing is done at high coverage depth (``deep sequencing''), low frequency variants can be detected. Here we present QQ-SNV (http://sourceforge.net/projects/qqsnv), a logistic regression classifier model developed for the Illumina sequencing platforms that uses the quantiles of the quality scores, to distinguish true single nucleotide variants from sequencing errors based on the estimated SNV probability. To train the model, we created a dataset of an in silico mixture of five HIV-1 plasmids. Testing of our method in comparison to the existing methods LoFreq, ShoRAH, and V-Phaser 2 was performed on two HIV and four HCV plasmid mixture datasets and one influenza H1N1 clinical dataset.
Results: For default application of QQ-SNV, variants were called using a SNV probability cutoff of 0.5 (QQ-SNVD). To improve the sensitivity we used a SNV probability cutoff of 0.0001 (QQ-SNVHS). To also increase specificity, SNVs called were overruled when their frequency was below the 80th percentile calculated on the distribution of error frequencies (QQ-SNVHS-P80). When comparing QQ-SNV versus the other methods on the plasmid mixture test sets, QQ-SNVD performed similarly to the existing approaches. QQ-SNVHS was more sensitive on all test sets but with more false positives. QQ-SNVHS-P80 was found to be the most accurate method over all test sets by balancing sensitivity and specificity. When applied to a paired-end HCV sequencing study, with lowest spiked-in true frequency of 0.5 \%, QQ-SNVHS-P80 revealed a sensitivity of 100 \% (vs. 40-60 \% for the existing methods) and a specificity of 100 \% (vs. 98.0-99.7 \% for the existing methods). In addition, QQ-SNV required the least overall computation time to process the test sets. Finally, when testing on a clinical sample, four putative true variants with frequency below 0.5 \% were consistently detected by QQ-SNVHS-P80 from different generations of Illumina sequencers. 
Conclusions: We developed and successfully evaluated a novel method, called QQ-SNV, for highly efficient single nucleotide variant calling on Illumina deep sequencing virology data.},
  articleno    = {379},
  author       = {van der Borght, Koen and Thys, Kim and Wetzels, Yves and Clement, Lieven and Verbist, Bie and Reumers, Joke and van Vlijmen, Herman and Aerssens, Jeroen},
  issn         = {1471-2105},
  journal      = {BMC Bioinformatics},
  keywords     = {Illumina deep sequencing,Classifier model,Logistic regression,Single nucleotide variant,True variant,GENERATION SEQUENCING DATA,VIRAL POPULATIONS,GENETIC DIVERSITY,DISCOVERY,VIROLOGY,DISEASES,FORMAT,SCORES},
  language     = {eng},
  pages        = {14},
  title        = {QQ-SNV: single nucleotide variant detection at low frequency by comparing the quality quantiles},
  url          = {http://dx.doi.org/10.1186/s12859-015-0812-9},
  volume       = {16},
  year         = {2015},
}
