A high performance computing approach for genomic prediction

Arne De Coninck (UGent), Jan Fostier (UGent), Steven Maenhout (UGent) and Bernard De Baets (UGent)
Project
HPC-UGent: the central High Performance Computing infrastructure of Ghent University
Project
Bioinformatics: from nucleotids to networks (N2N)
Abstract
In the field of genomic prediction, genotypes of animals or plants are used either to predict phenotypic properties of new crosses or to estimate breeding values (EBVs) for detecting superior parents. Since quantitative traits of importance to breeders are mostly regulated by a large number of quantitative trait loci (QTL), high-density SNP markers are used to genotype individuals. The most frequently applied SNP arrays for cattle contain 50,000 SNP markers, but genotypes with 700,000 SNPs are already available (Cole et al., 2012). Some widely used analysis methods rely on a linear mixed model backbone (Meuwissen et al., 2001), which models the SNP marker effects as random effects drawn from a normal distribution. The resulting estimates of the marker effects are known as best linear unbiased predictions (BLUP), which are linear functions of the response variates. It has been shown that when no major genes contribute to the trait, Bayesian predictions and BLUP result in approximately the same prediction accuracy for the EBVs (Hayes et al., 2009; Legarra et al., 2011; Daetwyler et al., 2013). At present, the number of individuals included in the genomic prediction setting is still an order of magnitude smaller than the number of genetic markers on widely used SNP arrays, so algorithms estimate EBVs directly, which in this case is computationally more efficient than first estimating the marker effects (VanRaden, 2008; Misztal et al., 2009; Piepho, 2009; Shen et al., 2013). Nonetheless, it has been shown theoretically (Hayes et al., 2009) that in order to increase the prediction accuracy of the EBVs for traits with a low heritability, the number of genotyped records should increase dramatically. Most widely used implementations, such as synbreed (Wimmer et al., 2012) and BLUPF90, cannot handle data sets that contain more than a few thousand individuals, since they are limited by the physical memory accessible to the computing processor.
We present DAIRRy-BLUP, a parallel framework that takes advantage of a distributed-memory compute cluster in order to enable the analysis of large-scale datasets. Additionally, results on simulated data illustrate that the use of such large-scale datasets is warranted as it significantly improves the prediction accuracy of EBVs and marker effects.
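The abstract's remark that estimating EBVs directly is cheaper than first estimating marker effects when individuals are fewer than markers can be illustrated with a small NumPy sketch. This is not the DAIRRy-BLUP implementation: it assumes a simplified model with no fixed effects, centered data, and a known variance ratio `lam` (residual variance divided by marker variance), whereas the framework described above also estimates variance components. The function names and the simulated data are hypothetical.

```python
import numpy as np

def marker_blup(Z, y, lam):
    """Marker effects from the p x p system (Z'Z + lam*I) u = Z'y."""
    p = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ y)

def direct_ebv(Z, y, lam):
    """Breeding values from the n x n system, without forming marker effects."""
    n = Z.shape[0]
    K = Z @ Z.T  # n x n cross-product of genotypes
    return K @ np.linalg.solve(K + lam * np.eye(n), y)

# Simulated toy data (assumption): n individuals, p markers, n << p.
rng = np.random.default_rng(0)
n, p = 20, 200
Z = rng.integers(0, 3, size=(n, p)).astype(float)  # genotypes coded 0/1/2
Z -= Z.mean(axis=0)                                # center each marker
u_true = rng.normal(0.0, 0.1, size=p)
y = Z @ u_true + rng.normal(0.0, 1.0, size=n)
y -= y.mean()                                      # center phenotypes
lam = 100.0                                        # assumed known variance ratio

ebv_via_markers = Z @ marker_blup(Z, y, lam)  # solves a 200 x 200 system
ebv_direct = direct_ebv(Z, y, lam)            # solves a 20 x 20 system
print(np.allclose(ebv_via_markers, ebv_direct))
```

Both routes give the same breeding values under these assumptions; the direct route only solves a system whose size is the number of individuals, which is the smaller dimension in the setting described above.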
Keywords
genomic prediction, high performance computing, variance component estimation, distributed-memory architecture

Citation

Chicago
De Coninck, Arne, Jan Fostier, Steven Maenhout, and Bernard De Baets. 2014. “A High Performance Computing Approach for Genomic Prediction.” In 19th National Symposium on Applied Biological Sciences, Proceedings, ed. Nicolas Gengler, 79:115–119.
APA
De Coninck, A., Fostier, J., Maenhout, S., & De Baets, B. (2014). A high performance computing approach for genomic prediction. In N. Gengler (Ed.), 19th National symposium on Applied Biological Sciences, Proceedings (Vol. 79, pp. 115–119). Presented at the 19th National symposium on Applied Biological Sciences.
Vancouver
1. De Coninck A, Fostier J, Maenhout S, De Baets B. A high performance computing approach for genomic prediction. In: Gengler N, editor. 19th National symposium on Applied Biological Sciences, Proceedings. 2014. p. 115–9.
MLA
De Coninck, Arne, Jan Fostier, Steven Maenhout, et al. “A High Performance Computing Approach for Genomic Prediction.” 19th National Symposium on Applied Biological Sciences, Proceedings. Ed. Nicolas Gengler. Vol. 79. 2014. 115–119. Print.
@inproceedings{5929333,
  abstract     = {In the field of genomic prediction, genotypes of animals or plants are used either to predict phenotypic properties of new crosses or to estimate breeding values (EBVs) for detecting superior parents. Since quantitative traits of importance to breeders are mostly regulated by a large number of quantitative trait loci (QTL), high-density SNP markers are used to genotype individuals. The most frequently applied SNP arrays for cattle contain 50,000 SNP markers, but genotypes with 700,000 SNPs are already available (Cole et al., 2012).
Some widely used analysis methods rely on a linear mixed model backbone (Meuwissen et al., 2001), which models the SNP marker effects as random effects drawn from a normal distribution. The resulting estimates of the marker effects are known as best linear unbiased predictions (BLUP), which are linear functions of the response variates. It has been shown that when no major genes contribute to the trait, Bayesian predictions and BLUP result in approximately the same prediction accuracy for the EBVs (Hayes et al., 2009; Legarra et al., 2011; Daetwyler et al., 2013).
At present, the number of individuals included in the genomic prediction setting is still an order of magnitude smaller than the number of genetic markers on widely used SNP arrays, so algorithms estimate EBVs directly, which in this case is computationally more efficient than first estimating the marker effects (VanRaden, 2008; Misztal et al., 2009; Piepho, 2009; Shen et al., 2013). Nonetheless, it has been shown theoretically (Hayes et al., 2009) that in order to increase the prediction accuracy of the EBVs for traits with a low heritability, the number of genotyped records should increase dramatically. Most widely used implementations, such as synbreed (Wimmer et al., 2012) and BLUPF90, cannot handle data sets that contain more than a few thousand individuals, since they are limited by the physical memory accessible to the computing processor. We present DAIRRy-BLUP, a parallel framework that takes advantage of a distributed-memory compute cluster in order to enable the analysis of large-scale datasets. Additionally, results on simulated data illustrate that the use of such large-scale datasets is warranted as it significantly improves the prediction accuracy of EBVs and marker effects.},
  author       = {De Coninck, Arne and Fostier, Jan and Maenhout, Steven and De Baets, Bernard},
  booktitle    = {19th National symposium on Applied Biological Sciences, Proceedings},
  editor       = {Gengler, Nicolas},
  issn         = {1379-1176},
  keyword      = {genomic prediction,high performance computing,variance component estimation,distributed-memory architecture},
  language     = {eng},
  location     = {Gembloux, Belgium},
  number       = {1},
  pages        = {115--119},
  title        = {A high performance computing approach for genomic prediction},
  volume       = {79},
  year         = {2014},
}