Ghent University Academic Bibliography

Unravelling the complexity of eukaryotic genomes through gene annotation and evolutionary analysis

Lieven Sterck UGent (2008)
abstract
The genome sequences of eukaryotic species are determined using two intrinsically different methods: the whole genome shotgun (WGS) method and the clone-by-clone approach. In the WGS method, the DNA is broken up into random fragments of different sizes, and fragments of approximately 3000 or 10000 bases are selected. From each of these fragments, 2 paired-end reads are produced. For a typical eukaryotic genome project, millions of these reads are generated in an attempt to cover the whole genome. Afterwards, the reads are assembled to reconstruct the original sequence. In contrast, the clone-by-clone approach uses a highly organised protocol to fractionate the DNA into fragments. Restriction enzymes cut the DNA into pieces of around 150 kbp, and the complete collection of these pieces is referred to as a BAC library. The fragments are then fingerprinted to position them on the genome, and a minimum tiling path is constructed so that as few DNA fragments as possible need to be sequenced.
Once the genomic sequence of an organism has been determined, the biologically functional features in the genome still need to be located. This step, called genome annotation, focuses mainly on predicting the location and structure of the (protein-coding) genes present in the genome. The approaches used to achieve this can be grouped into 3 categories: the ab initio, or intrinsic, methods, which predict genes from information in the genomic sequence itself; the extrinsic approaches, which primarily use protein and transcript alignments; and the comparative approach, which is based on genomic alignments between different genomes. Today, however, the trend is to use integrative prediction tools that combine the information from all 3 general approaches to propose gene models that are maximally consistent with the provided data. One such integrative prediction tool is EuGène. It was designed to mimic a human annotator at work as closely as possible and therefore makes use of state-of-the-art mathematical models and computational methods. These models are, however, specific to each genome, so the software needs to be trained for every genome. This training is a tedious process that requires the manual construction of a training set of curated gene models and the fine-tuning of the parameters that control the behaviour of EuGène.
Although it is an important milestone, an annotated genomic sequence is not an end point. It is the starting point for a whole range of downstream bioinformatics analyses. For instance, annotated genomes are used to elucidate the evolutionary past of an organism, one aspect of which is the duplication history of its genome. When the genome sequence is available, collinear regions within or between genomes can be detected and investigated. In an alternative approach, the time of origin of paralogous gene pairs can be estimated using the number of synonymous substitutions per synonymous site between paralogs as a measure. These dates can be plotted as age histograms of duplicated genes, which can be analysed for the presence of sharp peaks. Such peaks reflect potential large-scale duplication events in the evolutionary history of a genome.
Populus trichocarpa, the model species for tree biotechnology, is the first tree and only the third plant species for which the genomic sequence has been determined. A WGS approach was chosen to sequence its ~500 Mbp genome, resulting in an assembly that contains more than 410 Mbp of the 19 chromosomes. The genome was annotated with different tools, one of which was EuGène. Combining the results of the different annotation efforts yielded a predicted reference set of 45555 genes; potentially another 4000 genes, for which there is some experimental evidence, have to be added to this set. Although poplar contains more genes than Arabidopsis, the relative frequency of protein domains is very similar. Evolutionary analysis of the genome clearly indicates a relatively recent duplication of the genome that is not shared with Arabidopsis. Almost the complete genome is still found in duplicated regions, which is probably due to low evolutionary rates resulting in less gene loss in poplar.
A sequencing effort is also ongoing for Ectocarpus siliculosus, a small multicellular brown alga whose genome is being sequenced by the GenoScope in France. In this project, the WGS method was again applied to determine the genome, which has a relatively small size of 214 Mbp. Little information on the biology of this species is available, which causes several issues in the assembly (e.g. no genetic map information) and in the subsequent annotation of the sequence. The annotation was performed with EuGène, specifically trained for the Ectocarpus genome, and produced 37646 genes. In-depth analysis of this genome is still ongoing, but preliminary research has already provided some interesting results. The genes in Ectocarpus have many exons (>10 per gene) separated by large introns. An initial analysis of the genome history reveals that a large-scale duplication event has occurred. Further research is needed to investigate whether this event might be linked to important evolutionary events, particularly the acquisition of multicellularity. To boost the manual curation of the predicted genes and to accommodate specific demands of the consortium, a new type of genome annotation portal has been developed.
The clone-by-clone sequencing approach is less popular than WGS but is still in use, for instance to sequence the genome of the legume Medicago truncatula. The choice of the clone-by-clone approach is driven by the wish to produce a high-quality reference genome for further research in the agriculturally important legume family. The workload of sequencing the heterochromatin fraction of the 8 chromosomes is distributed over several sequencing centres. This project is also still ongoing but has entered its final year of sequencing. At the moment, 183 Mbp of non-redundant sequence is available, capturing >55% of the gene space. On this sequence, more than 40000 gene models have been predicted in a coordinated effort of several annotation teams. Remarkable in this set is the presence of very small genes, many of which seem to be expressed. Genome organisation in Medicago is remarkable mainly because of a region with a higher than average number of transposons (chromosome 6) and a region with well-conserved remnants of a duplication event (between chromosomes 5 and 8). A genome duplication, of which scattered evidence can be found, happened early in the evolution of the legumes.
Evolutionary analysis of a species can also be conducted in the absence of, or prior to the availability of, the complete genome sequence. By analysing sets of unigenes constructed from ESTs of several different poplar species, it was shown that a genome duplication occurred before the radiation of the species within the genus Populus. The estimated date of this event is in clear disagreement with the available fossil record. This resulted in the hypothesis that the evolutionary rate of Populus was slower than reported for other species and that the date should therefore be older.
Hypotheses about the evolutionary past of organisms can be formulated not only by analysing a single genome but also by comparing several genomes. Comparing the available genome sequences of Medicago and Lotus japonicus allowed questions related to the evolution of the legumes to be addressed. The genome duplication event in the legumes, proposed by earlier research, was given a more precise timing. The comparison also revealed that synteny between Medicago and Lotus is still extensive, in some cases extending to almost whole chromosomes despite the difference in chromosome number.
The genome sequences of eukaryotic species have a complex organisation, shaped by a rampant history of duplication events. The diverse variation in the number and structure of genetic elements, such as protein-coding genes and transposable elements, adds to this complexity. Consequently, the analysis of eukaryotic genomes is intrinsically difficult, even when tackled with state-of-the-art computational methods. However, the increasing number of genome projects will provide the data to eventually unravel the complexity of eukaryotic genomes.
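The age-histogram approach mentioned in the abstract can be illustrated with a minimal sketch. It assumes that the number of synonymous substitutions per synonymous site (Ks) has already been estimated for each paralogous gene pair by some external method. The function names, bin width, Ks cut-off, peak threshold, and example values are all hypothetical and are not taken from the thesis.

```python
def ks_histogram(ks_values, bin_width=0.1, max_ks=1.0):
    """Count paralogous pairs per Ks bin.

    Pairs with Ks outside [0, max_ks) are ignored; the cut-off is an
    assumed guard against saturated, unreliable estimates.
    """
    n_bins = round(max_ks / bin_width)
    counts = [0] * n_bins
    for ks in ks_values:
        if 0 <= ks < max_ks:
            # Clamp the index in case of floating-point rounding at the top edge.
            idx = min(int(ks / bin_width), n_bins - 1)
            counts[idx] += 1
    return counts


def find_peaks(counts, min_count=3):
    """Return indices of interior bins that exceed both neighbours.

    min_count is an assumed noise threshold: bins with fewer pairs are
    not reported even if they are local maxima.
    """
    return [
        i
        for i in range(1, len(counts) - 1)
        if counts[i] >= min_count
        and counts[i] > counts[i - 1]
        and counts[i] > counts[i + 1]
    ]


# Hypothetical Ks values for ten paralogous pairs (illustration only).
example_ks = [0.05, 0.07, 0.12, 0.25, 0.26, 0.27, 0.28, 0.35, 0.55, 0.95]

counts = ks_histogram(example_ks)
peaks = find_peaks(counts)
for i in peaks:
    print(f"candidate duplication peak in Ks bin starting at {i * 0.1:.1f}")
```

A sharp peak only marks a candidate large-scale duplication. Converting a Ks value into an absolute date requires an assumed substitution rate, which is why a slower evolutionary rate, as hypothesised for Populus, shifts the estimated date to an older one.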
Please use this url to cite or link to this publication: http://hdl.handle.net/1854/LU-3007649
author
promoter
UGent
organization
year
type
dissertation (monograph)
subject
pages
213 pages
publisher
Ghent University. Faculty of Sciences
place of publication
Ghent, Belgium
defense location
Zwijnaarde : Technologiepark (FSVM building)
defense date
2008-06-27 14:00
language
English
UGent publication?
yes
classification
D1
additional info
dissertation consists of copyrighted material
copyright statement
I have transferred the copyright for this publication to the publisher
id
3007649
handle
http://hdl.handle.net/1854/LU-3007649
date created
2012-10-05 14:43:13
date last changed
2012-10-08 14:55:20
@phdthesis{3007649,
  author       = {Sterck, Lieven},
  language     = {eng},
  pages        = {213},
  publisher    = {Ghent University. Faculty of Sciences},
  school       = {Ghent University},
  title        = {Unravelling the complexity of eukaryotic genomes through gene annotation and evolutionary analysis},
  year         = {2008},
}

Chicago
Sterck, Lieven. 2008. “Unravelling the Complexity of Eukaryotic Genomes Through Gene Annotation and Evolutionary Analysis”. Ghent, Belgium: Ghent University. Faculty of Sciences.
APA
Sterck, L. (2008). Unravelling the complexity of eukaryotic genomes through gene annotation and evolutionary analysis. Ghent University. Faculty of Sciences, Ghent, Belgium.
Vancouver
1.
Sterck L. Unravelling the complexity of eukaryotic genomes through gene annotation and evolutionary analysis. [Ghent, Belgium]: Ghent University. Faculty of Sciences; 2008.
MLA
Sterck, Lieven. “Unravelling the Complexity of Eukaryotic Genomes Through Gene Annotation and Evolutionary Analysis.” 2008 : n. pag. Print.