CisOrtho: a program pipeline for genome-wide identification of transcription factor target genes using phylogenetic footprinting. (57/288)

BACKGROUND: All known genomes code for a large number of transcription factors. It is important to develop methods that will reveal how these transcription factors act on a genome-wide level, that is, through what target genes they exert their function. RESULTS: We describe here a program pipeline aimed at identifying transcription factor target genes in whole genomes. Starting from a consensus binding site, represented as a weight matrix, potential sites in a pre-filtered genome are identified and then further filtered by assessing conservation of the putative site in the genome of a related species, a process called phylogenetic footprinting. CisOrtho has been successfully used to identify targets for two homeodomain transcription factors in the genomes of the nematodes Caenorhabditis elegans and Caenorhabditis briggsae. CONCLUSIONS: CisOrtho will identify targets of other nematode transcription factors whose DNA binding specificity is known and can be easily adapted to search other genomes for transcription factor targets.
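The two filtering steps described above (weight-matrix scan, then a conservation check in a related species) can be sketched as follows. This is a minimal illustration and not the CisOrtho code: the count matrix, the pseudocount, the uniform background, the score threshold, and the function names are all hypothetical, and "conserved" is simplified to mean that the orthologous upstream region also contains a site scoring above the threshold.

```python
import math

# Hypothetical count matrix for a 4-bp motif (ATGA); not taken from CisOrtho.
COUNTS = {
    "A": [8, 0, 0, 8],
    "C": [0, 0, 0, 0],
    "G": [0, 0, 8, 0],
    "T": [0, 8, 0, 0],
}


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Turn per-position base counts into log2 odds against a uniform background."""
    width = len(counts["A"])
    matrix = []
    for i in range(width):
        total = sum(counts[base][i] for base in "ACGT") + 4 * pseudocount
        column = {}
        for base in "ACGT":
            frequency = (counts[base][i] + pseudocount) / total
            column[base] = math.log2(frequency / background)
        matrix.append(column)
    return matrix


def scan(sequence, matrix, threshold):
    """Return (start, site, score) for every window scoring at or above threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous bases such as N
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


def conserved_hits(upstream_a, upstream_b, matrix, threshold):
    """Keep species-A hits only if the orthologous region in species B also has a hit.

    Simplifying assumption: conservation is judged by presence of any
    above-threshold site in the ortholog, not by position or alignment.
    """
    hits_a = scan(upstream_a, matrix, threshold)
    hits_b = scan(upstream_b, matrix, threshold)
    return hits_a if hits_b else []


if __name__ == "__main__":
    pwm = log_odds_matrix(COUNTS)
    for start, site, score in conserved_hits("CCATGACC", "GGGATGAT", pwm, 6.0):
        print(start, site, round(score, 2))
```

A real pipeline would also need ortholog assignment and a principled threshold; both are left out of this sketch.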

GeneOrder3.0: software for comparing the order of genes in pairs of small bacterial genomes. (58/288)

BACKGROUND: An increasing number of whole viral and bacterial genomes are being sequenced and deposited in public databases. In parallel to the mounting interest in whole genomes, the number of whole-genome analysis software tools is also increasing. GeneOrder was originally developed to provide an analysis of genes between two genomes, allowing visualization of gene order and synteny comparisons of any small genomes. It was originally developed for comparing virus, mitochondrion and chloroplast genomes. This is now extended to small bacterial genomes of sizes less than 2 Mb. RESULTS: GeneOrder3.0 has been developed and validated successfully on several small bacterial genomes (ca. 580 kb to 1.83 Mb) archived in the NCBI GenBank database. It is an updated web-based "on-the-fly" computational tool allowing gene order and synteny comparisons of any two small bacterial genomes. Analyses of several bacterial genomes show that a large amount of gene and genome rearrangement occurs, as seen with earlier DNA software tools. This can be displayed at the protein level using GeneOrder3.0. Whole-genome alignments of genes are presented in both a table and a dot plot. Comparison at the protein level allows the detection of evolutionarily more distant relationships, since protein sequences are more conserved than DNA sequences. CONCLUSIONS: GeneOrder3.0 allows researchers to perform comparative analysis of gene order and synteny in genomes of sizes up to 2 Mb "on-the-fly." AVAILABILITY: http://binf.gmu.edu/genometools.html and http://pasteur.atcc.org:8050/GeneOrder3.0.

Match/X, A gene expression pattern recognition algorithm used to identify genes which may be related to CDC2 function and cell cycle regulation. (59/288)

Large-scale microarray gene expression studies can provide insight into complex genetic networks and biological pathways. A comprehensive gene expression database was constructed using Affymetrix GeneChip microarrays and RNA isolated from more than 6,400 distinct normal and diseased human tissues. These individual patient samples were grouped into over 700 sample sets based on common tissue and disease morphologies, and each set contained averaged expression data for over 45,000 gene probe sets representing more than 33,000 known human genes. Sample sets were compared to each other in more than 750 normal vs. disease pairwise comparisons. Relative up- or downregulation patterns of genes across these pairwise comparisons provided unique expression fingerprints that could be compared and matched to a gene of interest using the Match/X™ algorithm. This algorithm uses the kappa statistic to compute correlations between genes and calculate a distance score between a gene of interest and all other genes in the database. Using cdc2 as a query gene, we identified several hundred genes that had similar expression patterns and highly correlated distance scores. Most of these genes were known components of the cell cycle involved in G2/M progression, spindle function or chromosome arrangement. Some of the identified genes had unknown biological functions but may be related to cdc2-mediated mechanisms based on their closely correlated distance scores. This algorithm may provide novel insights into unknown gene function based on correlation to expression profiles of known genes and can identify elements of cellular pathways and gene interactions in a high-throughput fashion.
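The kappa-based matching described above can be illustrated with Cohen's kappa computed over categorical up/down/unchanged calls. This is a sketch under stated assumptions, not the Match/X implementation: the abstract does not say which kappa variant is used or how the distance score is defined, so the choice of Cohen's kappa, the three call labels, the distance `1 - kappa`, and all function and gene names below are hypothetical.

```python
from collections import Counter


def cohen_kappa(calls_a, calls_b):
    """Cohen's kappa for two equal-length sequences of categorical calls."""
    if len(calls_a) != len(calls_b) or not calls_a:
        raise ValueError("call lists must be non-empty and of equal length")
    n = len(calls_a)
    # Observed agreement: fraction of comparisons where both genes get the same call.
    observed = sum(a == b for a, b in zip(calls_a, calls_b)) / n
    # Chance agreement from the marginal label frequencies of each gene.
    freq_a = Counter(calls_a)
    freq_b = Counter(calls_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both genes carry one identical label everywhere
    return (observed - expected) / (1.0 - expected)


def rank_by_distance(query_calls, database):
    """Rank genes by an assumed distance of 1 - kappa to the query (smallest first)."""
    scored = [
        (1.0 - cohen_kappa(query_calls, calls), gene)
        for gene, calls in database.items()
    ]
    return sorted(scored)


if __name__ == "__main__":
    # Hypothetical calls across four normal-vs-disease comparisons.
    query = ["up", "up", "down", "none"]
    database = {
        "geneA": ["up", "up", "down", "none"],
        "geneB": ["down", "none", "up", "up"],
    }
    for distance, gene in rank_by_distance(query, database):
        print(gene, round(distance, 3))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when most genes are unchanged in most comparisons.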

Feature selection for splice site prediction: a new method using EDA-based feature ranking. (60/288)

BACKGROUND: The identification of relevant biological features in large and complex datasets is an important step towards gaining insight into the processes underlying the data. Other advantages of feature selection include the ability of the classification system to attain good or even better solutions using a restricted subset of features, and a faster classification. Thus, robust methods for fast feature selection are of key importance in extracting knowledge from complex biological data. RESULTS: In this paper we present a novel method for feature subset selection applied to splice site prediction, based on estimation of distribution algorithms, a generalization of genetic algorithms. From the estimated distribution of the algorithm, a feature ranking is derived. Afterwards this ranking is used to iteratively discard features. We apply this technique to the problem of splice site prediction, and show how it can be used to gain insight into the underlying biological process of splicing. CONCLUSION: We show that this technique proves to be more robust than the traditional use of estimation of distribution algorithms for feature selection: instead of returning a single best subset of features (as they normally do), this method provides a dynamic view of the feature selection process, like the traditional sequential wrapper methods. However, the method is faster than the traditional techniques, and scales better to datasets described by a large number of features.

Multiclass discovery in array data. (61/288)

BACKGROUND: A routine goal in the analysis of microarray data is to identify genes with expression levels that correlate with known classes of experiments. In a growing number of array data sets, it has been shown that there is an over-abundance of genes that discriminate between known classes as compared to expectations for random classes. Therefore, one can search for novel classes in array data by looking for partitions of experiments for which there is an over-abundance of discriminatory genes. We have previously used such an approach in a breast cancer study. RESULTS: We describe the implementation of an unsupervised classification method for class discovery in microarray data. The method allows for discovery of more than two classes. We applied our method to two published microarray data sets: small round blue cell tumors and breast tumors. The method predicts relevant classes in the data sets with high success rates. CONCLUSIONS: We conclude that the proposed method is accurate and efficient in finding biologically relevant classes in microarray data. Additionally, the method is useful for quality control of microarray experiments. We have made the method available as a computer program.

A Bayesian method for identifying missing enzymes in predicted metabolic pathway databases. (62/288)

BACKGROUND: The PathoLogic program constructs Pathway/Genome databases by using a genome's annotation to predict the set of metabolic pathways present in an organism. PathoLogic determines the set of reactions composing those pathways from the enzymes annotated in the organism's genome. Most annotation efforts fail to assign function to 40-60% of sequences. In addition, large numbers of sequences may have non-specific annotations (e.g., thiolase family protein). Pathway holes occur when a genome appears to lack the enzymes needed to catalyze reactions in a pathway. If a protein has not been assigned a specific function during the annotation process, any reaction catalyzed by that protein will appear as a missing enzyme or pathway hole in a Pathway/Genome database. RESULTS: We have developed a method that efficiently combines homology and pathway-based evidence to identify candidates for filling pathway holes in Pathway/Genome databases. Our program not only identifies potential candidate sequences for pathway holes, but combines data from multiple, heterogeneous sources to assess the likelihood that a candidate has the required function. Our algorithm emulates the manual sequence annotation process, considering not only evidence from homology searches, but also considering evidence from genomic context (i.e., is the gene part of an operon?) and functional context (e.g., are there functionally-related genes nearby in the genome?) to determine the posterior belief that a candidate has the required function. The method can be applied across an entire metabolic pathway network and is generally applicable to any pathway database. The program uses a set of sequences encoding the required activity in other genomes to identify candidate proteins in the genome of interest, and then evaluates each candidate by using a simple Bayes classifier to determine the probability that the candidate has the desired function. 
We achieved 71% precision at a probability threshold of 0.9 during cross-validation using known reactions in computationally-predicted pathway databases. After applying our method to 513 pathway holes in 333 pathways from three Pathway/Genome databases, we increased the number of complete pathways by 42%. We made putative assignments to 46% of the holes, including annotation of 17 sequences of previously unknown function. CONCLUSIONS: Our pathway hole filler can be used not only to increase the utility of Pathway/Genome databases to both experimental and computational researchers, but also to improve predictions of protein function.
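The "simple Bayes classifier" step mentioned in this abstract can be sketched as a naive Bayes posterior over a few binary evidence features. Everything concrete here is assumed for illustration: the feature names, the prior, and the conditional probabilities are hypothetical and are not taken from the paper; only the 0.9 acceptance threshold echoes the value quoted above.

```python
def naive_bayes_posterior(prior, p_given_true, p_given_false, evidence):
    """Posterior probability that a candidate has the required function.

    prior:         P(has function) before looking at evidence
    p_given_true:  feature -> P(feature present | has function)
    p_given_false: feature -> P(feature present | lacks function)
    evidence:      feature -> bool (observed presence or absence)
    Features are treated as conditionally independent (the naive Bayes assumption).
    """
    weight_true = prior
    weight_false = 1.0 - prior
    for feature, present in evidence.items():
        pt = p_given_true[feature]
        pf = p_given_false[feature]
        weight_true *= pt if present else 1.0 - pt
        weight_false *= pf if present else 1.0 - pf
    return weight_true / (weight_true + weight_false)


if __name__ == "__main__":
    # Hypothetical probabilities, chosen only to make the example concrete.
    p_true = {"strong_homology": 0.9, "same_operon": 0.6, "pathway_gene_nearby": 0.5}
    p_false = {"strong_homology": 0.1, "same_operon": 0.1, "pathway_gene_nearby": 0.2}
    candidate = {"strong_homology": True, "same_operon": True, "pathway_gene_nearby": False}

    posterior = naive_bayes_posterior(0.05, p_true, p_false, candidate)
    # Accept the candidate as a hole filler only above the 0.9 threshold.
    print(round(posterior, 3), posterior >= 0.9)
```

In practice the conditional probabilities would be estimated from reactions whose enzymes are already known, which is what makes cross-validation against those known reactions possible.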

A transcript finishing initiative for closing gaps in the human transcriptome. (63/288)

We report the results of a transcript finishing initiative, undertaken for the purpose of identifying and characterizing novel human transcripts, in which RT-PCR was used to bridge gaps between paired EST clusters, mapped against the genomic sequence. Each pair of EST clusters selected for experimental validation was designated a transcript finishing unit (TFU). A total of 489 TFUs were selected for validation, and an overall efficiency of 43.1% was achieved. We generated a total of 59,975 bp of transcribed sequences organized into 432 exons, contributing to the definition of the structure of 211 human transcripts. The structure of several transcripts reported here was confirmed during the course of this project, through the generation of their corresponding full-length cDNA sequences. Nevertheless, for 21% of the validated TFUs, a full-length cDNA sequence is not yet available in public databases, and the structure of 69.2% of these TFUs was not correctly predicted by computer programs. The TF strategy provides a significant contribution to the definition of the complete catalog of human genes and transcripts, because it appears to be particularly useful for identification of low abundance transcripts expressed in a restricted set of tissues as well as for the delineation of gene boundaries and alternatively spliced isoforms.

Consensus alignment server for reliable comparative modeling with distant templates. (64/288)

Consensus is a server developed to produce high-quality alignments for comparative modeling, and to identify the alignment regions reliable for copying from a given template. This is accomplished even when target-template sequence identity is as low as 5%. Combining the output from five different alignment methods, the server produces a consensus alignment, with a reliability measure indicated for each position and a prediction of the regions suitable for modeling. Models built using the server predictions are typically within 3 Å rms deviation from the crystal structure. Users can upload a target protein sequence and specify a template (PDB code); if no template is given, the server will search for one. The method has been validated on a large set of homologous protein structure pairs. The Consensus server should prove useful for modelers for whom the structural reliability of the model is critical in their applications. It is currently available at http://structure.bu.edu/cgi-bin/consensus/consensus.cgi.
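One simple way to picture combining several alignments into a consensus with a per-position reliability is majority voting. The sketch below is an assumption about the general idea, not the server's actual scoring, which the abstract does not describe: each alignment is represented as a list giving, for every target residue, the index of the template residue it is paired with (or None for a gap), and reliability is taken to be the fraction of methods that agree. The function names and the 0.8 cutoff are hypothetical.

```python
from collections import Counter


def consensus_alignment(alignments):
    """Majority-vote consensus over several target-to-template alignments.

    Returns one (template_index_or_None, reliability) pair per target position,
    where reliability is the fraction of methods voting for the chosen pairing.
    """
    if not alignments:
        raise ValueError("at least one alignment is required")
    length = len(alignments[0])
    if any(len(alignment) != length for alignment in alignments):
        raise ValueError("all alignments must cover the same target positions")
    consensus = []
    for position in range(length):
        votes = Counter(alignment[position] for alignment in alignments)
        partner, count = votes.most_common(1)[0]
        consensus.append((partner, count / len(alignments)))
    return consensus


def reliable_positions(consensus, min_reliability=0.8):
    """Target positions whose consensus pairing is non-gap and well supported."""
    return [
        position
        for position, (partner, reliability) in enumerate(consensus)
        if partner is not None and reliability >= min_reliability
    ]


if __name__ == "__main__":
    # Five hypothetical alignments of a 3-residue target to a template.
    methods = [
        [0, 1, 2],
        [0, 1, 3],
        [0, 1, 2],
        [0, 2, None],
        [0, 1, 2],
    ]
    consensus = consensus_alignment(methods)
    print(consensus)
    print(reliable_positions(consensus))
```

Positions where the methods disagree are the ones a modeler would avoid copying directly from the template.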