Protein motif extraction with neuro-fuzzy optimization. (57/3779)

MOTIVATION: This work aims to improve the speed and flexibility of protein motif identification. The proposed algorithm is able to extract both rigid and flexible protein motifs. RESULTS: In this work, we present a new algorithm for extracting the consensus pattern, or motif, from a group of related protein sequences. The algorithm first applies a statistical method to find short patterns that occur with high frequency, then trains a neural network to optimize the final classification accuracy. Fuzzy logic is used to increase the flexibility of the protein motifs. C2H2 zinc finger protein and epidermal growth factor protein sequences are used to demonstrate the capability of the proposed algorithm in finding motifs. AVAILABILITY: This program is freely available for academic use by request.
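
The first stage described above, finding short patterns with high frequency, can be illustrated with a minimal sketch. It is not the authors' statistical method, and it omits the neural network and fuzzy logic stages entirely. The function name `frequent_kmers`, the fixed pattern length `k`, the `min_support` threshold, and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter

def frequent_kmers(sequences, k, min_support):
    """Return k-mers that occur in at least min_support of the sequences."""
    support = Counter()
    for seq in sequences:
        # Count each k-mer once per sequence so long sequences do not dominate.
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        support.update(seen)
    return {kmer: n for kmer, n in support.items() if n >= min_support}

# Toy sequences, made up for illustration (not real zinc finger data).
toy = ["ACHCKL", "GGCHCA", "TTCHCE", "MMMMMM"]
hits = frequent_kmers(toy, k=3, min_support=3)
print(hits)
```

Counting support per sequence rather than raw occurrences is one possible design choice; it favors patterns shared across the family over repeats within a single sequence.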

Probabilistic alignment of motifs with sequences. (58/3779)

MOTIVATION: Motif detection is an important component of the classification and annotation of protein sequences. A method for aligning motifs with an amino acid sequence is introduced. The motifs can be described by secondary characteristics (e.g. functional or biophysical) of the signal or pattern to be detected. The results produced are based on the statistical relevance of the alignment. The method was designed to avoid the problems (over-fitting, difficulty of biological interpretation and lack of mathematical soundness) encountered in other currently available methods. RESULTS: The method was tested on lipoprotein signals in B. subtilis and yielded stable results. The signal predictions were consistent with those of other methods where literature was available. AVAILABILITY: An implementation of the motif alignment, refining and bootstrapping is available for public use online at http://www.expasy.org/tools/patoseq/

Tagging gene and protein names in biomedical text. (59/3779)

MOTIVATION: The MEDLINE database of biomedical abstracts contains scientific knowledge about thousands of interacting genes and proteins. Automated text processing can aid in the comprehension and synthesis of this valuable information. The fundamental task of identifying gene and protein names is a necessary first step towards making full use of the information encoded in biomedical text. This remains a challenging task due to the irregularities and ambiguities in gene and protein nomenclature. We propose to approach the detection of gene and protein names in scientific abstracts as part-of-speech tagging, the most basic form of linguistic corpus annotation. RESULTS: We present a method for tagging gene and protein names in biomedical text using a combination of statistical and knowledge-based strategies. This method incorporates automatically generated rules from a transformation-based part-of-speech tagger, and manually generated rules from morphological clues, low frequency trigrams, indicator terms, suffixes and part-of-speech information. Results of an experiment on a test corpus of 56K MEDLINE documents demonstrate that our method to extract gene and protein names can be applied to large sets of MEDLINE abstracts, without the need for special conditions or human experts to predetermine relevant subsets. AVAILABILITY: The programs are available on request from the authors.

TFBS: Computational framework for transcription factor binding site analysis. (60/3779)

MOTIVATION: TFBS is a set of integrated, object-oriented Perl modules for transcription factor binding site detection and analysis. It implements objects representing specificity profile matrices, binding sites and sets thereof, pattern generators, and pattern database interfaces. The modules are interoperable with the BioPerl open source system. AVAILABILITY AND SUPPLEMENTARY INFORMATION: The module package, with documentation and example scripts, is available at http://forkhead.cgb.ki.se/TFBS/
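
To illustrate what a specificity profile matrix does in binding site detection, here is a minimal sketch in Python. It does not use or reproduce the TFBS Perl API. The function names, the pseudocount, the uniform background frequency of 0.25, and the toy count matrix are assumptions for illustration only.

```python
import math

def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2-odds scores."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        matrix.append({
            base: math.log2((column.get(base, 0) + pseudocount) / total / background)
            for base in "ACGT"
        })
    return matrix

def best_site(sequence, matrix):
    """Slide the matrix along the sequence; return (score, offset) of the best window."""
    width = len(matrix)
    scored = [
        (sum(matrix[j][sequence[i + j]] for j in range(width)), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scored)

# Toy count matrix for a TATA-like site (made-up counts, not from a real database).
counts = [{"T": 8}, {"A": 8}, {"T": 7, "A": 1}, {"A": 8}]
pwm = log_odds_matrix(counts)
score, offset = best_site("GGCTATAGG", pwm)
print(offset, round(score, 2))
```

The sketch assumes the sequence contains only the letters A, C, G and T and scans one strand; a practical scanner would also handle ambiguity codes and the reverse complement.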

Image understanding methods in biomedical informatics and digital imaging. (61/3779)

This paper presents new possibilities for applying image recognition and AI methods in biomedical informatics, as well as for the semantically oriented analysis of 2D images of coronary arteries obtained from coronary angiography (coronarography) examinations. In particular, it presents the possibilities for computer analysis and recognition of local stenoses of the lumen of coronary arteries by means of syntactic methods of pattern recognition. Such stenoses result from the appearance of arteriosclerotic plaques, which in turn lead to different forms of ischemic cardiovascular disease. Such diseases may manifest as stable or unstable disturbances of heart rhythm or as infarction. Analysis of the correct morphology of these artery lumina is made possible by syntactic analysis and pattern recognition methods, in particular by an attributed context-free grammar of the look-ahead LR(1) type.

Identification of regulatory elements using a feature selection method. (62/3779)

MOTIVATION: Many methods have been described to identify regulatory motifs in the transcription control regions of genes that exhibit similar patterns of gene expression across a variety of experimental conditions. Here we focus on a single experimental condition, and utilize gene expression data to identify sequence motifs associated with genes that are activated under this experimental condition. We use a linear model with two-way interactions to model gene expression as a function of sequence features (words) present in presumptive transcription control regions. The most relevant features are selected by a feature selection method called stepwise selection with Monte Carlo cross validation. We apply this method to a publicly available dataset of the yeast Saccharomyces cerevisiae, focussing on the 800 basepairs immediately upstream of each gene's translation start site (the upstream control region (UCR)). RESULTS: We successfully identify regulatory motifs that are known to be active under the experimental conditions analyzed, and find additional significant sequences that may represent novel regulatory motifs. We also discuss a complementary method that utilizes gene expression data from a single microarray experiment and allows averaging over a variety of experimental conditions as an alternative to motif finding methods that act on clusters of co-expressed genes. AVAILABILITY: The software is available upon request from the first author or may be downloaded from http://www.stat.berkeley.edu/~sunduz. CONTACT: [email protected]
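
The idea of ranking candidate sequence words by Monte Carlo cross validation can be sketched as follows. This is a simplified illustration, not the authors' software: it scores single words with a one-variable least-squares fit and omits both the two-way interaction terms and the full stepwise loop. The function names, the split settings, and the toy data are assumptions.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return mean_y, 0.0  # constant feature: fall back to the mean
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - slope * mean_x, slope

def mccv_error(xs, ys, n_splits=200, test_fraction=0.3, seed=0):
    """Mean squared test error averaged over repeated random train/test splits."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    n_test = max(1, int(len(xs) * test_fraction))
    total = 0.0
    for _ in range(n_splits):
        rng.shuffle(indices)
        test, train = indices[:n_test], indices[n_test:]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        total += sum((ys[i] - (a + b * xs[i])) ** 2 for i in test) / n_test
    return total / n_splits

# Toy data: expression tracks the count of word A; word B is unrelated.
word_a = [0, 1, 2, 3, 0, 1, 2, 3, 1, 2]
word_b = [1, 0, 2, 1, 3, 0, 1, 2, 3, 0]
expression = [0.1, 1.0, 2.1, 2.9, -0.1, 1.1, 1.9, 3.1, 0.9, 2.0]
errors = {"word_a": mccv_error(word_a, expression),
          "word_b": mccv_error(word_b, expression)}
best = min(errors, key=errors.get)
print(best)
```

In a forward stepwise procedure, a step like this would be repeated: the word (or interaction) with the lowest cross-validated error is added to the model, and the remaining candidates are re-scored given the terms already selected.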

Ratio statistics of gene expression levels and applications to microarray data analysis. (63/3779)

MOTIVATION: Expression-based analysis for large families of genes has recently become possible owing to the development of cDNA microarrays, which allow simultaneous measurement of transcript levels for thousands of genes. For each spot on a microarray, signals in two channels must be extracted from their backgrounds. This requires algorithms to extract signals arising from tagged mRNA hybridized to arrayed cDNA locations and algorithms to determine the significance of signal ratios. RESULTS: This paper focuses on estimation of signal ratios from the two channels, and the significance of those ratios. The key issue is the determination of whether a ratio is significantly high or low in order to conclude whether the gene is upregulated or downregulated. The paper builds on an earlier study that involved a hypothesis test based on a ratio statistic under the supposition that the measured fluorescent intensities subsequent to image processing can be assumed to reflect the signal intensities. Here, a refined hypothesis test is considered in which the measured intensities forming the ratio are assumed to be combinations of signal and background. The new method involves a signal-to-noise ratio, and for a high signal-to-noise ratio the new test reduces (with close approximation) to the original test. The effect of low signal-to-noise ratio on the ratio statistics constitutes the main theme of the paper. Finally, and in this vein, a quality metric is formulated for spots. This measure can be used to decide whether or not a spot ratio should be deleted, or to adjust various measurements to reflect confidence in the quality of the measurement. CONTACT: [email protected]
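
A minimal sketch of the general idea follows: treat each measured intensity as signal plus background, form the ratio from the background-corrected signals, and use a signal-to-noise ratio to flag unreliable spots. It is not the hypothesis test or the quality metric defined in the paper. The function name `spot_ratio`, the definition of signal-to-noise ratio as the weaker corrected signal divided by the background standard deviation, and the `min_snr` threshold are all assumptions.

```python
def spot_ratio(red, green, red_bg, green_bg, bg_sd, min_snr=3.0):
    """Return (ratio, reliable) for one two-channel microarray spot.

    red, green        measured spot intensities (signal plus background)
    red_bg, green_bg  local background estimates for each channel
    bg_sd             standard deviation of the background (assumed shared)
    """
    red_signal = red - red_bg
    green_signal = green - green_bg
    if red_signal <= 0 or green_signal <= 0:
        # No usable signal in at least one channel: the ratio is undefined.
        return None, False
    # Assumed quality rule: the weaker channel must clear min_snr.
    weakest = min(red_signal, green_signal)
    reliable = bg_sd > 0 and weakest / bg_sd >= min_snr
    return red_signal / green_signal, reliable

strong = spot_ratio(1200, 300, 200, 100, 20)  # both channels well above background
weak = spot_ratio(230, 120, 200, 100, 20)     # both channels close to background
print(strong, weak)
```

The second spot shows why low signal-to-noise matters: its ratio of 1.5 is computed from corrected signals no larger than the background noise, so it is flagged as unreliable rather than read as a change in expression.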

ESyPred3D: Prediction of proteins 3D structures. (64/3779)

MOTIVATION: Homology or comparative modeling is currently the most accurate method to predict the three-dimensional structure of proteins. It generally consists of four steps: (1) databank searching to identify the structural homolog, (2) target-template alignment, (3) model building and optimization, and (4) model evaluation. The target-template alignment step is generally accepted as the most critical step in homology modeling. RESULTS: We present here ESyPred3D, a new automated homology modeling program. The method benefits from the improved alignment performance of a new alignment strategy. Alignments are obtained by combining, weighting and screening the results of several multiple alignment programs. The final three-dimensional structure is built using the modeling package MODELLER. ESyPred3D was tested on 13 targets in the CASP4 experiment (Critical Assessment of Techniques for Protein Structure Prediction). Our alignment strategy obtains better results than PSI-BLAST alignments, and ESyPred3D alignments are among the most accurate when compared with those of participants who used the same template. AVAILABILITY: ESyPred3D is available through its web site at http://www.fundp.ac.be/urbm/bioinfo/esypred/ CONTACT: [email protected]; http://www.fundp.ac.be/~lambertc