Computational detection of cis-regulatory modules. (1/1011)

MOTIVATION: The transcriptional regulation of a metazoan gene depends on the cooperative action of multiple transcription factors that bind to cis-regulatory modules (CRMs) located in the neighborhood of the gene. By integrating multiple signals, CRMs confer an organism specific spatial and temporal rate of transcription. RESULTS: Based on the hypothesis that genes that are needed in exactly the same conditions might share similar regulatory switches, we have developed a novel methodology to find CRMs in a set of coexpressed or coregulated genes. The ModuleSearcher algorithm finds, for a given gene set, the best-scoring combination of transcription factor binding sites within a sequence window using an A* procedure for tree searching. To keep the level of noise low, we use DNA sequences that are most likely to contain functional cis-regulatory information, namely conserved regions between human and mouse orthologous genes. The ModuleScanner performs genomic searches with a predicted CRM or with a user-defined CRM known from the literature to find possible target genes. The validity of a set of putative targets is checked using Gene Ontology annotations. We demonstrate the use and effectiveness of the ModuleSearcher and ModuleScanner algorithms and test their specificity and sensitivity on semi-artificial data. Next, we search for a module in a cluster of gene expression profiles of human cell cycle genes. AVAILABILITY: The ModuleSearcher is available as a web service within the TOUCAN workbench for regulatory sequence analysis, which can be downloaded from http://www.esat.kuleuven.ac.be/~dna/BioI.

Searching for statistically significant regulatory modules. (2/1011)

MOTIVATION: The regulatory machinery controlling gene expression is complex, frequently requiring multiple, simultaneous DNA-protein interactions. The rate at which a gene is transcribed may depend upon the presence or absence of a collection of transcription factors bound to the DNA near the gene. Locating transcription factor binding sites in genomic DNA is difficult because the individual sites are small and tend to occur frequently by chance. True binding sites may be identified by their tendency to occur in clusters, sometimes known as regulatory modules. RESULTS: We describe an algorithm for detecting occurrences of regulatory modules in genomic DNA. The algorithm, called mcast, takes as input a DNA database and a collection of binding site motifs that are known to operate in concert. mcast uses a motif-based hidden Markov model with several novel features. The model incorporates motif-specific p-values, thereby allowing scores from motifs of different widths and specificities to be compared directly. The p-value scoring also allows mcast to accept only motif occurrences with significance below a user-specified threshold, while still assigning better scores to motif occurrences with lower p-values. mcast can search long DNA sequences, modeling length distributions between motifs within a regulatory module, but ignoring length distributions between modules. The algorithm produces a list of predicted regulatory modules, ranked by E-value. We validate the algorithm using simulated data as well as real data sets from fruit fly and human. AVAILABILITY: http://meme.sdsc.edu/MCAST/paper
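The p-value thresholding described in this abstract can be illustrated with a minimal sketch. The function name `hit_score` and the formula -log2(p / threshold) are assumptions chosen only because they have the two stated properties (occurrences at or above the threshold are rejected, and lower p-values score better); the abstract does not give the scoring function mcast itself uses, and the hidden Markov model is not reproduced here.

```python
import math

def hit_score(p_value, p_threshold):
    """Hypothetical motif-hit score (an assumption, not mcast's documented formula).

    Returns None when the hit is not significant enough to accept;
    otherwise returns a positive score that grows as p_value shrinks.
    Expects 0 < p_value and 0 < p_threshold.
    """
    if p_value >= p_threshold:
        return None  # rejected: significance not below the user threshold
    return -math.log2(p_value / p_threshold)

# Example: score three candidate hits against a threshold of 1e-3.
for p in (1e-2, 1e-4, 1e-6):
    print(p, hit_score(p, 1e-3))
```

Because the score is a function of the p-value alone, hits from motifs of different widths become directly comparable, which is the point the abstract makes.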

Exploring potential target genes of signaling pathways by predicting conserved transcription factor binding sites. (3/1011)

Many cellular signaling pathways induce gene expression by activating specific transcription factor complexes. Conventional approaches to the prediction of transcription factor binding sites lead to a notoriously high number of false discoveries. To alleviate this problem, we consider only binding sites that are conserved in man-mouse genomic sequence comparisons. We employ two alternative methods for predicting binding sites: exact matches to validated binding site sequences and weight matrix scans. We then ask whether there is a characteristic association between a transcription factor, or a set thereof, and a particular group of genes. Our approach is tested on genes that are induced in dendritic cells in response to the cells' exposure to LPS. We chose this example because the underlying signaling pathways are well understood. We demonstrate the benefit of conserved predicted binding sites in interpreting the LPS experiment. Additionally, we find that both methods for the prediction of conserved binding sites complement one another. Finally, our results suggest a distinct role for SRF in the context of LPS-induced gene expression.

Finding optimal degenerate patterns in DNA sequences. (4/1011)

MOTIVATION: Finding transcription factor binding sites in the upstream regions of given genes is an algorithmically interesting and challenging problem in computational biology. A degenerate pattern over a finite alphabet Sigma is a sequence of subsets of Sigma. A string over IUPAC nucleic acid codes is also a degenerate pattern over Sigma = {A, C, G, T}, and is used as one of the major patterns modeling transcription factor binding sites in the upstream regions of genes. However, it is known that the problem of finding a degenerate pattern consistent with both positive and negative string sets is in general NP-complete. Our aim is to devise a heuristic algorithm to find a degenerate pattern which is optimal for positive and negative string sets w.r.t. a given score function. RESULTS: We have proposed an enumerative algorithm called SUPERPOSITION for finding optimal degenerate patterns with a pruning technique, which works with almost all reasonable score functions. The performance score of the algorithm has been compared with those of other popular motif-finding algorithms YMF, MEME and AlignACE on various sets of co-regulated genes of yeast. In the computational experiment, SUPERPOSITION has outperformed the others on several gene sets. AVAILABILITY: The Python script SUPERPOSITION is available at http://www.math.kyushu-u.ac.jp/~om/softwares.html
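The definition of a degenerate pattern given in this abstract can be made concrete with a short sketch. The IUPAC code table is standard; the function names and `toy_score` (fraction of positive strings matched minus fraction of negative strings matched) are hypothetical illustrations and are not the SUPERPOSITION algorithm or its score function.

```python
# Standard IUPAC nucleotide codes and the subsets of {A, C, G, T} they denote.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def to_subsets(pattern):
    """Expand an IUPAC string into a sequence of subsets of the alphabet."""
    return [frozenset(IUPAC[code]) for code in pattern.upper()]

def occurs_in(pattern, sequence):
    """True if the degenerate pattern matches some window of the sequence."""
    subsets = to_subsets(pattern)
    width = len(subsets)
    seq = sequence.upper()
    for start in range(len(seq) - width + 1):
        if all(seq[start + i] in subsets[i] for i in range(width)):
            return True
    return False

def toy_score(pattern, positives, negatives):
    """Hypothetical score: matched fraction of positives minus that of negatives."""
    pos = sum(occurs_in(pattern, s) for s in positives) / len(positives)
    neg = sum(occurs_in(pattern, s) for s in negatives) / len(negatives)
    return pos - neg

print(toy_score("TGASTCA",
                ["AATGACTCATT", "GGTGAGTCACC"],
                ["AATGAATCATT", "ACGTACGTACG"]))
```

An exhaustive search over all IUPAC strings of a given width would call a score function like this once per candidate, which is why the abstract emphasizes pruning.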

Predicting genetic regulatory response using classification. (5/1011)

MOTIVATION: Studying gene regulatory mechanisms in simple model organisms through analysis of high-throughput genomic data has emerged as a central problem in computational biology. Most approaches in the literature have focused either on finding a few strong regulatory patterns or on learning descriptive models from training data. However, these approaches are not yet adequate for making accurate predictions about which genes will be up- or down-regulated in new or held-out experiments. By introducing a predictive methodology for this problem, we can use powerful tools from machine learning and assess the statistical significance of our predictions. RESULTS: We present a novel classification-based method for learning to predict gene regulatory response. Our approach is motivated by the hypothesis that in simple organisms such as Saccharomyces cerevisiae, we can learn a decision rule for predicting whether a gene is up- or down-regulated in a particular experiment based on (1) the presence of binding site subsequences ('motifs') in the gene's regulatory region and (2) the expression levels of regulators such as transcription factors in the experiment ('parents'). Thus, our learning task integrates two qualitatively different data sources: genome-wide cDNA microarray data across multiple perturbation and mutant experiments along with motif profile data from regulatory sequences. We convert the regression task of predicting real-valued gene expression measurements to a classification task of predicting +1 and -1 labels, corresponding to up- and down-regulation beyond the levels of biological and measurement noise in microarray measurements. The learning algorithm employed is boosting with a margin-based generalization of decision trees, alternating decision trees. This large-margin classifier is sufficiently flexible to allow complex logical functions, yet sufficiently simple to give insight into the combinatorial mechanisms of gene regulation. 
We observe encouraging prediction accuracy on experiments based on the Gasch S. cerevisiae dataset, and we show that we can accurately predict up- and down-regulation on held-out experiments. We also show how to extract significant regulators, motifs and motif-regulator pairs from the learned models for various stress responses. Our method thus provides predictive hypotheses, suggests biological experiments, and provides interpretable insight into the structure of genetic regulatory networks. AVAILABILITY: The MLJava package is available upon request to the authors. Supplementary: Additional results are available from http://www.cs.columbia.edu/compbio/geneclass
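The conversion from real-valued expression measurements to +1 and -1 labels described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the inputs are assumed to be log expression ratios, the noise level is assumed to be a single symmetric threshold, and marking within-noise values with 0 so they can be set aside is a choice made here, since the abstract does not say how such values are handled.

```python
def discretize(log_ratios, noise_threshold):
    """Map log expression ratios to +1 (up), -1 (down), or 0 (within noise).

    noise_threshold is a hypothetical symmetric cutoff; the 0 label is an
    assumption used here to mark examples that would not be given a class.
    """
    labels = []
    for value in log_ratios:
        if value > noise_threshold:
            labels.append(1)
        elif value < -noise_threshold:
            labels.append(-1)
        else:
            labels.append(0)
    return labels

print(discretize([1.5, -2.0, 0.3, -0.1], noise_threshold=1.0))
```

Each labeled example would then be paired with its motif and regulator features before being passed to the classifier.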

Inferring quantitative models of regulatory networks from expression data. (6/1011)

MOTIVATION: Genetic networks regulate key processes in living cells. Various methods have been suggested to reconstruct network architecture from gene expression data. However, most approaches are based on qualitative models that provide only rough approximations of the underlying events, and lack the quantitative aspects that are critical for understanding the proper function of biomolecular systems. RESULTS: We present fine-grained dynamical models of gene transcription and develop methods for reconstructing them from gene expression data within the framework of a generative probabilistic model. Unlike previous works, we employ quantitative transcription rates, and simultaneously estimate both the kinetic parameters that govern these rates, and the activity levels of unobserved regulators that control them. We apply our approach to expression datasets from yeast and show that we can learn the unknown regulator activity profiles, as well as the binding affinity parameters. We also introduce a novel structure learning algorithm, and demonstrate its power to accurately reconstruct the regulatory network from those datasets.

Whole-genome analysis of temporal gene expression during foregut development. (7/1011)

We have investigated the cis-regulatory network that mediates temporal gene expression during organogenesis. Previous studies demonstrated that the organ selector gene pha-4/FoxA is critical to establish the onset of transcription of Caenorhabditis elegans foregut (pharynx) genes. Here, we discover additional cis-regulatory elements that function in combination with PHA-4. We use a computational approach to identify candidate cis-regulatory sites for genes activated either early or late during pharyngeal development. Analysis of natural or synthetic promoters reveals that six of these sites function in vivo. The newly discovered temporal elements, together with predicted PHA-4 sites, account for the onset of expression of roughly half of the pharyngeal genes examined. Moreover, combinations of temporal elements and PHA-4 sites can be used in genome-wide searches to predict pharyngeal genes, with more than 85% accuracy for their onset of expression. These findings suggest a regulatory code for temporal gene expression during foregut development and provide a means to predict gene expression patterns based solely on genomic sequence.

Pi class glutathione S-transferase genes are regulated by Nrf2 through an evolutionarily conserved regulatory element in zebrafish. (8/1011)

Pi class GSTs (glutathione S-transferases) are members of the vertebrate GST family of proteins that catalyse the conjugation of GSH to electrophilic compounds. The expression of Pi class GST genes can be induced by exposure to electrophiles. We demonstrated previously that the transcription factor Nrf2 (NF-E2 p45-related factor 2) mediates this induction, not only in mammals, but also in fish. In the present study, we have isolated the genomic region of zebrafish containing the genes gstp1 and gstp2. The regulatory regions of zebrafish gstp1 and gstp2 have been examined by GFP (green fluorescent protein)-reporter gene analyses using microinjection into zebrafish embryos. Deletion and point-mutation analyses of the gstp1 promoter showed that an ARE (antioxidant-responsive element)-like sequence located 50 bp upstream of the transcription initiation site is essential for Nrf2 transactivation. Using EMSA (electrophoretic mobility-shift assay) analysis, we showed that the zebrafish Nrf2-MafK heterodimer specifically bound to this sequence. All the vertebrate Pi class GST genes harbour a similar ARE-like sequence in their promoter regions. We propose that this sequence is a conserved target site for Nrf2 in the Pi class GST genes.