In silico approaches to studying transcriptional gene regulation: prediction of transcription factor binding sites and applications thereof

Bart Hooghe (UGent)
(2011)
Author
Promoter
(UGent) and (UGent)
Organization
Abstract
Transcription factor binding sites (TFBSs) are DNA sequences of 6 to 15 base pairs, and their interaction with their binding partners, the transcription factors (TFs), largely determines the observed spatiotemporal gene expression patterns. Accurate in silico identification of TFBSs could thus provide valuable support for research on transcriptional gene regulation, but this has proved to be a difficult task, partly due to a lack of centralized, useful data. Tools that use noisy predictions of TFBSs, however, can already aid in unraveling gene regulatory networks.
• Many DNA sites have been experimentally proven to be bound by a TF, but they are scattered throughout the scientific literature. I joined a community-based effort to tackle the shortage of TFBS data. Collecting these sites and storing them in a central place was necessary to make any progress in modeling the DNA-binding specificity of TFs and in studying transcriptional gene regulatory mechanisms. Before, during and after the three-day RegCreative jamboree, which was organized in our department (November 29th to December 1st, 2006), new records were added to the new database ORegAnno. Furthermore, ontologies were discussed, as well as text-mining strategies for automating data curation. In those discussions, the approach of ORegAnno was taken as a reference point. The database was updated to contain more data and was equipped with a publication queue consisting of papers with high potential for successful curation of one or more regulatory regions.
• I helped to introduce a method that considers two sets of genes that are differentially expressed under the same environmental conditions (tissue or cell type, addition of a TF or other impulse). Such sets of genes can typically be derived from microarray experiments. The method is based on the distance difference matrix concept and simultaneously integrates statistical overrepresentation and co-occurrence of predicted TFBSs in the promoters of the genes, in order to find the secondary TFs responsible for the differential expression. A web interface to our DDM-MDS method is available at http://bioit.dmbr.ugent.be/TFdiff/.
• Orthologous promoter sequences are commonly used to increase the specificity with which potentially functional TFBSs are recognized and to detect possibly important similarities or differences between species. We developed ConTra (conserved TFBSs), a user-friendly web tool that allows the biologist at the bench to interactively visualize TFBSs, predicted using position weight matrix (PWM) libraries, on a promoter alignment of choice. The visualization can be preceded by a simple scoring analysis to explore which TFs are the most likely to bind to the promoter of interest. The ConTra web server is available at http://bioit.dmbr.ugent.be/ConTra/.
• We determined the value of using DNA structural information in sequence-based prediction of TFBSs. Based on the random forest (RF) algorithm, we created a method that utilizes DNA-sequence-dependent structural information in a flexible way. We qualitatively compared the classification accuracy of this so-called biophysical method with the accuracy of methods that use nucleotide identity information only, namely the widely used PWM method and a so-called NPD method, which models nucleotide dependencies between positions with the same RF algorithm. Our results for five TFs with different DNA-binding domains show that the biophysical method alone performs surprisingly well. It complements the NPD method and the PWM method to some extent, and combining all three methods yields a classification accuracy that is higher than that of any single method.
Keywords
structure, DNA, in silico prediction, transcription factor binding sites

Downloads

  • (...).pdf (full text | UGent only | PDF | 3.91 MB)

Citation

Chicago
Hooghe, Bart. 2011. “In Silico Approaches to Studying Transcriptional Gene Regulation: Prediction of Transcription Factor Binding Sites and Applications Thereof”. Ghent, Belgium: Ghent University. Faculty of Sciences.
APA
Hooghe, B. (2011). In silico approaches to studying transcriptional gene regulation: prediction of transcription factor binding sites and applications thereof. Ghent University. Faculty of Sciences, Ghent, Belgium.
Vancouver
1. Hooghe B. In silico approaches to studying transcriptional gene regulation: prediction of transcription factor binding sites and applications thereof. [Ghent, Belgium]: Ghent University. Faculty of Sciences; 2011.
MLA
Hooghe, Bart. “In Silico Approaches to Studying Transcriptional Gene Regulation: Prediction of Transcription Factor Binding Sites and Applications Thereof.” 2011 : n. pag. Print.
@phdthesis{1177432,
  author       = {Hooghe, Bart},
  keywords     = {structure,DNA,in silico prediction,transcription factor binding sites},
  language     = {eng},
  pages        = {VII, 163},
  publisher    = {Ghent University. Faculty of Sciences},
  school       = {Ghent University},
  title        = {In silico approaches to studying transcriptional gene regulation: prediction of transcription factor binding sites and applications thereof},
  year         = {2011},
}