
Cleaning the GenBank Arabidopsis thaliana data set

(1996) NUCLEIC ACIDS RESEARCH. 24(2). p.316-320
Abstract
Data driven computational biology relies on the large quantities of genomic data stored in international sequence data banks. However, the possibilities are drastically impaired if the stored data is unreliable. During a project aiming to predict splice sites in the dicot Arabidopsis thaliana, we extracted a data set from the A. thaliana entries in GenBank. A number of simple 'sanity' checks, based on the nature of the data, revealed an alarmingly high error rate. More than 15% of the most important entries extracted did contain erroneous information. In addition, a number of entries had directly conflicting assignments of exons and introns, not stemming from alternative splicing. In a few cases the errors are due to mere typographical misprints, which may be corrected by comparison to the original papers, but errors caused by wrong assignments of splice sites from experimental data are the most common. It is proposed that the level of error correction should be increased and that gene structure sanity checks should be incorporated, also at the submitter level, to avoid or reduce the problem in the future. A non-redundant and error corrected subset of the data for A. thaliana is made available through anonymous FTP.
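The abstract does not list the individual 'sanity' checks, so the following is only an illustrative sketch of what gene structure checks of this kind might look like, not the authors' procedure. It assumes 1-based inclusive exon coordinates on the coding strand (as in a GenBank CDS join), the canonical GT...AG intron boundaries, and a complete CDS running from start codon to stop codon. The function name `check_gene_structure` and the messages it returns are hypothetical.

```python
# Illustrative sketch only: the checks below are assumptions, not the
# checks used in the paper.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_gene_structure(sequence, exons):
    """Return a list of problems found in one annotated gene.

    sequence: genomic DNA of the coding strand.
    exons: list of (start, end) pairs, 1-based and inclusive.
    """
    seq = sequence.upper()
    problems = []

    # Every exon must lie inside the sequence.
    for start, end in exons:
        if not (1 <= start <= end <= len(seq)):
            problems.append(f"exon {start}..{end} is out of range")
    if problems:
        return problems

    # Each intron is the gap between two consecutive exons.
    for (_, end1), (start2, _) in zip(exons, exons[1:]):
        if start2 <= end1 + 1:
            problems.append(f"no intron between {end1} and {start2}")
            continue
        intron = seq[end1:start2 - 1]
        if not intron.startswith("GT"):
            problems.append(f"intron at {end1 + 1}: donor site is not GT")
        if not intron.endswith("AG"):
            problems.append(f"intron at {end1 + 1}: acceptor site is not AG")

    # Splice the exons together and check the reading frame.
    cds = "".join(seq[start - 1:end] for start, end in exons)
    if not cds.startswith("ATG"):
        problems.append("CDS does not begin with ATG")
    if cds[-3:] not in STOP_CODONS:
        problems.append("CDS does not end with a stop codon")
    if len(cds) % 3 != 0:
        problems.append("CDS length is not a multiple of 3")
    else:
        for i in range(0, len(cds) - 3, 3):
            if cds[i:i + 3] in STOP_CODONS:
                problems.append(f"in-frame stop codon at CDS position {i + 1}")

    return problems
```

An entry flagged by such a check is not necessarily wrong (non-canonical splice sites and partial sequences do occur), so a hit is best treated as a prompt to compare the annotation with the original paper.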
Keywords
SEQUENCE, SITES

Downloads

  • 186838 Korning et al. 1996 NucleicAcidsRes24 316.pdf (full text, open access, PDF, 90.25 KB)

Citation

MLA
Korning, Peter G., et al. “Cleaning the GenBank Arabidopsis Thaliana Data Set.” NUCLEIC ACIDS RESEARCH, vol. 24, no. 2, 1996, pp. 316–20, doi:10.1093/nar/24.2.316.
APA
Korning, P. G., Hebsgaard, S. M., Rouzé, P., & Brunak, S. (1996). Cleaning the GenBank Arabidopsis thaliana data set. NUCLEIC ACIDS RESEARCH, 24(2), 316–320. https://doi.org/10.1093/nar/24.2.316
Chicago author-date
Korning, Peter G, Stefan M Hebsgaard, Pierre Rouzé, and Søren Brunak. 1996. “Cleaning the GenBank Arabidopsis Thaliana Data Set.” NUCLEIC ACIDS RESEARCH 24 (2): 316–20. https://doi.org/10.1093/nar/24.2.316.
Chicago author-date (all authors)
Korning, Peter G, Stefan M Hebsgaard, Pierre Rouzé, and Søren Brunak. 1996. “Cleaning the GenBank Arabidopsis Thaliana Data Set.” NUCLEIC ACIDS RESEARCH 24 (2): 316–320. doi:10.1093/nar/24.2.316.
Vancouver
1. Korning PG, Hebsgaard SM, Rouzé P, Brunak S. Cleaning the GenBank Arabidopsis thaliana data set. NUCLEIC ACIDS RESEARCH. 1996;24(2):316–20.
IEEE
[1] P. G. Korning, S. M. Hebsgaard, P. Rouzé, and S. Brunak, “Cleaning the GenBank Arabidopsis thaliana data set,” NUCLEIC ACIDS RESEARCH, vol. 24, no. 2, pp. 316–320, 1996.
@article{186838,
  abstract     = {{Data driven computational biology relies on the large quantities of genomic data stored in international sequence data banks. However, the possibilities are drastically impaired if the stored data is unreliable. During a project aiming to predict splice sites in the dicot Arabidopsis thaliana, we extracted a data set from the A. thaliana entries in GenBank. A number of simple 'sanity' checks, based on the nature of the data, revealed an alarmingly high error rate. More than 15% of the most important entries extracted did contain erroneous information. In addition, a number of entries had directly conflicting assignments of exons and introns, not stemming from alternative splicing. In a few cases the errors are due to mere typographical misprints, which may be corrected by comparison to the original papers, but errors caused by wrong assignments of splice sites from experimental data are the most common. It is proposed that the level of error correction should be increased and that gene structure sanity checks should be incorporated, also at the submitter level, to avoid or reduce the problem in the future. A non-redundant and error corrected subset of the data for A. thaliana is made available through anonymous FTP.}},
  author       = {{Korning, Peter G and Hebsgaard, Stefan M and Rouzé, Pierre and Brunak, Søren}},
  issn         = {{0305-1048}},
  journal      = {{NUCLEIC ACIDS RESEARCH}},
  keywords     = {{SEQUENCE,SITES}},
  language     = {{eng}},
  number       = {{2}},
  pages        = {{316--320}},
  title        = {{Cleaning the GenBank Arabidopsis thaliana data set}},
  url          = {{http://doi.org/10.1093/nar/24.2.316}},
  volume       = {{24}},
  year         = {{1996}},
}
