Ghent University Academic Bibliography

ViVaMBC: estimating viral sequence variation in complex populations from illumina deep-sequencing data using model-based clustering

Bie Verbist, Lieven Clement (UGent), Joke Reumers, Kim Thys, Alexander Vapirev, Willem Talloen, Yves Wetzels, Joris Meys (UGent), Jeroen Aerssens, Luc Bijnens, et al. (2015) BMC Bioinformatics. 16.
abstract
Background: Deep-sequencing allows for an in-depth characterization of sequence variation in complex populations. However, technology associated errors may impede a powerful assessment of low-frequency mutations. Fortunately, base calls are complemented with quality scores which are derived from a quadruplet of intensities, one channel for each nucleotide type for Illumina sequencing. The highest intensity of the four channels determines the base that is called. Mismatch bases can often be corrected by the second best base, i.e. the base with the second highest intensity in the quadruplet. A virus variant model-based clustering method, ViVaMBC, is presented that explores quality scores and second best base calls for identifying and quantifying viral variants. ViVaMBC is optimized to call variants at the codon level (nucleotide triplets) which enables immediate biological interpretation of the variants with respect to their antiviral drug responses.
Results: Using mixtures of HCV plasmids we show that our method accurately estimates frequencies down to 0.5%. The estimates are unbiased when average coverages of 25,000 are reached. A comparison with the SNP-callers V-Phaser2, ShoRAH, and LoFreq shows that ViVaMBC has a superb sensitivity and specificity for variants with frequencies above 0.4%. Unlike the competitors, ViVaMBC reports a higher number of false-positive findings with frequencies below 0.4% which might partially originate from picking up artificial variants introduced by errors in the sample and library preparation step.
Conclusions: ViVaMBC is the first method to call viral variants directly at the codon level. The strength of the approach lies in modeling the error probabilities based on the quality scores. Although the use of second best base calls appeared very promising in our data exploration phase, their utility was limited. They provided a slight increase in sensitivity, which however does not warrant the additional computational cost of running the offline base caller. Apparently a lot of information is already contained in the quality scores enabling the model based clustering procedure to adjust the majority of the sequencing errors. Overall the sensitivity of ViVaMBC is such that technical constraints like PCR errors start to form the bottleneck for low frequency variant detection.
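The abstract's point that error probabilities can be modeled from quality scores can be illustrated with a minimal sketch. It assumes the quality scores are Phred-scaled (Q = -10 log10 P) and that base-call errors within a codon are independent. Both are assumptions made here for illustration, and the function names are hypothetical. This is not the ViVaMBC model itself.

```python
from typing import Sequence


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred-scaled quality score to a base-call error probability.

    Uses the standard Phred definition Q = -10 * log10(P).
    """
    return 10.0 ** (-q / 10.0)


def codon_error_prob(quals: Sequence[float]) -> float:
    """Probability that at least one base in a codon (triplet) is miscalled.

    Assumption (illustrative only): errors at the three positions are
    independent, so P(all correct) is the product of (1 - p_i).
    """
    if len(quals) != 3:
        raise ValueError("a codon needs exactly three quality scores")
    p_all_correct = 1.0
    for q in quals:
        p_all_correct *= 1.0 - phred_to_error_prob(q)
    return 1.0 - p_all_correct


if __name__ == "__main__":
    # Three base calls at Q30 each carry an error probability of about 0.001.
    print(phred_to_error_prob(30))
    print(codon_error_prob([30, 30, 30]))
```

Under these assumptions, a codon read at Q30 on all three bases has roughly a 0.3% chance of containing an error, which is close to the 0.4% to 0.5% variant frequencies discussed in the abstract.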
Please use this url to cite or link to this publication: http://hdl.handle.net/1854/LU-7180099
type
journalArticle (original)
publication status
published
keyword
Illumina sequencing, Codon, Second best base call, Model-based clustering, Viral quasispecies, HAPLOTYPE RECONSTRUCTION, GENETIC DIVERSITY, ERROR-CORRECTION, QUALITY SCORES, HEPATITIS-C, GENERATION, SAMPLE
journal title
BMC Bioinformatics
volume
16
article number
59
pages
11 pages
Web of Science type
Article
Web of Science id
000352474000001
JCR category
MATHEMATICAL & COMPUTATIONAL BIOLOGY
JCR impact factor
2.435 (2015)
JCR rank
10/56 (2015)
JCR quartile
1 (2015)
ISSN
1471-2105
DOI
10.1186/s12859-015-0458-7
project
Bioinformatics: from nucleotids to networks (N2N)
language
English
UGent publication?
yes
classification
A1
copyright statement
I have retained and own the full copyright for this publication
id
7180099
handle
http://hdl.handle.net/1854/LU-7180099
date created
2016-04-12 12:09:01
date last changed
2016-12-21 15:42:50
@article{7180099,
  abstract     = {Background: Deep-sequencing allows for an in-depth characterization of sequence variation in complex populations. However, technology associated errors may impede a powerful assessment of low-frequency mutations. Fortunately, base calls are complemented with quality scores which are derived from a quadruplet of intensities, one channel for each nucleotide type for Illumina sequencing. The highest intensity of the four channels determines the base that is called. Mismatch bases can often be corrected by the second best base, i.e. the base with the second highest intensity in the quadruplet. A virus variant model-based clustering method, ViVaMBC, is presented that explores quality scores and second best base calls for identifying and quantifying viral variants. ViVaMBC is optimized to call variants at the codon level (nucleotide triplets) which enables immediate biological interpretation of the variants with respect to their antiviral drug responses. 
Results: Using mixtures of HCV plasmids we show that our method accurately estimates frequencies down to 0.5\%. The estimates are unbiased when average coverages of 25,000 are reached. A comparison with the SNP-callers V-Phaser2, ShoRAH, and LoFreq shows that ViVaMBC has a superb sensitivity and specificity for variants with frequencies above 0.4\%. Unlike the competitors, ViVaMBC reports a higher number of false-positive findings with frequencies below 0.4\% which might partially originate from picking up artificial variants introduced by errors in the sample and library preparation step. 
Conclusions: ViVaMBC is the first method to call viral variants directly at the codon level. The strength of the approach lies in modeling the error probabilities based on the quality scores. Although the use of second best base calls appeared very promising in our data exploration phase, their utility was limited. They provided a slight increase in sensitivity, which however does not warrant the additional computational cost of running the offline base caller. Apparently a lot of information is already contained in the quality scores enabling the model based clustering procedure to adjust the majority of the sequencing errors. Overall the sensitivity of ViVaMBC is such that technical constraints like PCR errors start to form the bottleneck for low frequency variant detection.},
  articleno    = {59},
  author       = {Verbist, Bie and Clement, Lieven and Reumers, Joke and Thys, Kim and Vapirev, Alexander and Talloen, Willem and Wetzels, Yves and Meys, Joris and Aerssens, Jeroen and Bijnens, Luc and Thas, Olivier},
  issn         = {1471-2105},
  journal      = {BMC Bioinformatics},
  keyword      = {Illumina sequencing,Codon,Second best base call,Model-based clustering,Viral quasispecies,HAPLOTYPE RECONSTRUCTION,GENETIC DIVERSITY,ERROR-CORRECTION,QUALITY SCORES,HEPATITIS-C,GENERATION,SAMPLE},
  language     = {eng},
  pages        = {11},
  title        = {ViVaMBC: estimating viral sequence variation in complex populations from illumina deep-sequencing data using model-based clustering},
  url          = {http://dx.doi.org/10.1186/s12859-015-0458-7},
  volume       = {16},
  year         = {2015},
}

Chicago
Verbist, Bie, Lieven Clement, Joke Reumers, Kim Thys, Alexander Vapirev, Willem Talloen, Yves Wetzels, et al. 2015. “ViVaMBC: Estimating Viral Sequence Variation in Complex Populations from Illumina Deep-sequencing Data Using Model-based Clustering.” BMC Bioinformatics 16.
APA
Verbist, B., Clement, L., Reumers, J., Thys, K., Vapirev, A., Talloen, W., Wetzels, Y., et al. (2015). ViVaMBC: estimating viral sequence variation in complex populations from illumina deep-sequencing data using model-based clustering. BMC Bioinformatics, 16.
Vancouver
1. Verbist B, Clement L, Reumers J, Thys K, Vapirev A, Talloen W, et al. ViVaMBC: estimating viral sequence variation in complex populations from illumina deep-sequencing data using model-based clustering. BMC Bioinformatics. 2015;16.
MLA
Verbist, Bie, Lieven Clement, Joke Reumers, et al. “ViVaMBC: Estimating Viral Sequence Variation in Complex Populations from Illumina Deep-sequencing Data Using Model-based Clustering.” BMC Bioinformatics 16 (2015): n. pag. Print.