mRNA:guanine-N7 cap methyltransferases: identification of novel members of the family, evolutionary analysis, homology modeling, and analysis of sequence-structure-function relationships.
BACKGROUND: The 5'-terminal cap structure plays an important role in many aspects of mRNA metabolism. Capping enzymes encoded by viruses and pathogenic fungi are attractive targets for specific inhibitors. There is a large body of experimental data on viral and cellular methyltransferases (MTases) that carry out guanine-N7 (cap 0) methylation, including results of extensive mutagenesis. However, a crystal structure is not available and cap 0 MTases are too diverged from other MTases of known structure to allow straightforward homology-based interpretation of these data. RESULTS: We report a 3D model of cap 0 MTase, developed using sequence-to-structure threading and comparative modeling based on coordinates of the glycine N-methyltransferase. Analysis of the predicted structural features in the phylogenetic context of the cap 0 MTase family allows us to rationalize most of the experimental data available and to propose potential binding sites. We identified a case of correlated mutations in the cofactor-binding site of viral MTases that may be important for rational drug design. Furthermore, database searches and phylogenetic analysis revealed a novel subfamily of hypothetical MTases from plants, distinct from "orthodox" cap 0 MTases. CONCLUSIONS: Computational methods were used to infer the evolutionary relationships and predict the structure of eukaryotic cap MTase. Identification of novel cap MTase homologs suggests candidates for cloning and biochemical characterization, while the structural model will be useful in designing new experiments to better understand the molecular function of cap MTases.
SCOPE: a probabilistic model for scoring tandem mass spectra against a peptide database.
Proteomics, or the direct analysis of the expressed protein components of a cell, is critical to our understanding of cellular biological processes in normal and diseased tissue. A key requirement for its success is the ability to identify proteins in complex mixtures. Recent technological advances in tandem mass spectrometry have made it the method of choice for high-throughput identification of proteins. Unfortunately, the software for unambiguously identifying peptide sequences has not kept pace with the recent hardware improvements in mass spectrometry instruments. Critical for reliable high-throughput protein identification, scoring functions evaluate the quality of a match between experimental spectra and a database peptide. Current scoring function technology relies heavily on ad-hoc parameterization and manual curation by experienced mass spectrometrists. In this work, we propose a two-stage stochastic model for the observed MS/MS spectrum, given a peptide. Our model explicitly incorporates fragment ion probabilities, noisy spectra, and instrument measurement error. We describe how to compute this probability-based score efficiently, using a dynamic programming technique. A prototype implementation demonstrates the effectiveness of the model.
An insight into domain combinations.
Domains are the building blocks of all globular proteins, and are units of compact three-dimensional structure as well as evolutionary units. There is a limited repertoire of domain families, so that these domain families are duplicated and combined in different ways to form the set of proteins in a genome. Proteins are gene products. The processes that produce new genes are duplication and recombination as well as gene fusion and fission. We attempt to gain an overview of these processes by studying the structural domains in the proteins of seven genomes from the three kingdoms of life: Eubacteria, Archaea and Eukaryota. We use the domain and superfamily definitions in the Structural Classification of Proteins (SCOP) database to map pairs of adjacent domains in genome sequences in terms of their superfamily combinations. We find 624 out of the 764 superfamilies in SCOP in these genomes, and the 624 families occur in 585 pairwise combinations. Most families are observed in combination with one or two other families, while a few families are very versatile in their combinatorial behaviour. This type of pattern can be described by a scale-free network. Finally, we study domain repeats, compare the set of domain combinations in the genomes to those in PDB, and discuss the implications for structural genomics.
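The mapping of adjacent domain pairs described above can be illustrated with a minimal Python sketch. It assumes each protein is already annotated as an ordered list of superfamily labels; the labels, the toy proteins, and the function names `adjacent_combinations` and `partner_counts` are hypothetical, and whether the original study treats pairs as ordered or unordered is not stated in the abstract.

```python
from collections import defaultdict

def adjacent_combinations(proteins):
    """Collect the distinct pairs of sequence-adjacent superfamilies.

    proteins: iterable of lists of superfamily labels, in N-to-C order.
    Pairs are kept ordered here; that is an assumption of this sketch.
    """
    pairs = set()
    for domains in proteins:
        for left, right in zip(domains, domains[1:]):
            pairs.add((left, right))
    return pairs

def partner_counts(pairs):
    """Count how many different superfamilies each superfamily neighbours."""
    partners = defaultdict(set)
    for a, b in pairs:
        if a != b:  # ignore repeats of the same superfamily
            partners[a].add(b)
            partners[b].add(a)
    return {sf: len(p) for sf, p in partners.items()}

# Toy annotation, invented for illustration only.
toy_proteins = [["A", "B"], ["A", "C", "B"], ["A", "D"], ["B", "A"]]
toy_pairs = adjacent_combinations(toy_proteins)
toy_partners = partner_counts(toy_pairs)
```

In the toy data, superfamily A plays the role of a versatile family with three partners, while D combines with only one.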
Generating protein interaction maps from incomplete data: application to fold assignment.
MOTIVATION: We present a framework to generate comprehensive overviews of protein-protein interactions. In the post-genomic view of cellular function, each biological entity is seen in the context of a complex network of interactions. Accordingly, we model functional space by representing protein-protein interaction data as undirected graphs. We suggest a general approach to generate interaction maps of cellular networks in the presence of huge amounts of fragmented and incomplete data, and to derive representations of large networks which hide clutter while keeping the essential architecture of the interaction space. This is achieved by contracting the graphs according to domain-specific hierarchical classifications. The key concept here is the notion of induced interaction, which allows the integration, comparison and analysis of interaction data from different sources and different organisms at a given level of abstraction. RESULTS: We apply this approach to compute the overlap between the DIP compendium of interaction data and a dataset of yeast two-hybrid experiments. The architecture of this network is scale-free, as frequently seen in biological networks, and this property persists through many levels of abstraction. Connections in the network can be projected from higher levels of abstraction down to the level of individual proteins. As an example, we describe an algorithm for fold assignment by network context. This method currently predicts protein folds at 30% accuracy without any requirement of detectable sequence similarity of the query protein to a protein of known structure. We used this algorithm to compile a list of structural assignments for previously unassigned genes from yeast. Finally, we discuss ways forward to use interaction networks for the prediction of novel protein-protein interactions. AVAILABILITY: http://www.ebi.ac.uk/~lappe/FoldPred/.
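The idea of contracting an interaction graph along a classification, so that protein-level edges induce class-level edges, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the single-level classification dictionary, the toy protein and class names, and the function `contract_graph` are all hypothetical.

```python
from collections import defaultdict

def contract_graph(edges, classification):
    """Contract an undirected protein graph to the class level.

    edges: iterable of (protein_a, protein_b) pairs.
    classification: dict mapping protein -> class label (one level of a
    hierarchy; the labels used below are invented).
    Returns a dict mapping frozenset({class_a, class_b}) to the number of
    protein-level edges supporting that induced interaction.
    """
    induced = defaultdict(int)
    for a, b in edges:
        # Proteins without a class label cannot be lifted; skip them.
        if a not in classification or b not in classification:
            continue
        key = frozenset((classification[a], classification[b]))
        induced[key] += 1
    return dict(induced)

# Toy data, invented for illustration only.
toy_edges = [("P1", "P2"), ("P1", "P3"), ("P4", "P2"), ("P5", "P6")]
toy_classes = {"P1": "kinase", "P4": "kinase", "P2": "SH3",
               "P3": "SH2", "P5": "SH3"}
induced_edges = contract_graph(toy_edges, toy_classes)
```

Using `frozenset` keys keeps the contracted graph undirected. Two proteins of the same class interacting with each other collapse to a one-element key, which acts as a self-loop on that class.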
Prediction of the coupling specificity of G protein coupled receptors to their G proteins.
G protein coupled receptors (GPCRs) are found in great numbers in most eukaryotic genomes. They are responsible for sensing a staggering variety of structurally diverse ligands, with their activation resulting in the initiation of a variety of cellular signalling cascades. The physiological response that is observed following receptor activation is governed by the guanine nucleotide-binding proteins (G proteins) to which a particular receptor chooses to couple. Previous investigations have demonstrated that the specificity of the receptor-G protein interaction is governed by the intracellular domains of the receptor. Despite many studies, it has proven very difficult to predict de novo, from the receptor sequence alone, the G proteins to which a GPCR is most likely to couple. We have used a data-mining approach, combining pattern discovery with membrane topology prediction, to find patterns of amino acid residues in the intracellular domains of GPCR sequences that are specific for coupling to a particular functional class of G proteins. A prediction system was then built upon these discovered patterns. We report that this approach was successful in predicting the G protein coupling specificity of unknown sequences. Such predictions should be of great use in providing in silico characterisation of newly cloned receptor sequences and for improving the annotation of GPCRs stored in protein sequence databases. AVAILABILITY: http://www.ebi.ac.uk/~croning/coupling.html.
Non-symmetric score matrices and the detection of homologous transmembrane proteins.
Given a transmembrane protein, we wish to find related ones by a database search. Due to the strongly hydrophobic amino acid composition of transmembrane domains, suboptimal results are obtained when general-purpose scoring matrices such as BLOSUM are used. Recently, a transmembrane-specific score matrix called PHAT was shown to perform much better than BLOSUM. In this article, we derive a transmembrane score matrix family, called SLIM, which has several distinguishing features. In contrast to currently used matrices, SLIM is non-symmetric. The asymmetry arises because different background compositions are assumed for the transmembrane query and the unknown database sequences. We describe the mathematical model behind SLIM in detail and show that SLIM outperforms PHAT both on simulated data and in a realistic setting. Since non-symmetric score matrices are a new concept in database search methods, we discuss some important theoretical and practical issues.
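To see why different background compositions for query and database make a score matrix non-symmetric, consider a generic log-odds score s(a, b) = log2( p(a, b) / (q_query(a) · q_db(b)) ). The sketch below is a toy illustration of that general form only; it is not the SLIM derivation, and the two-letter alphabet, all probabilities, and the function name `log_odds_matrix` are invented assumptions.

```python
import math

def log_odds_matrix(joint, bg_query, bg_db):
    """Return s[a][b] = log2(joint[a][b] / (bg_query[a] * bg_db[b])).

    joint: pair probabilities for aligned (query, database) residues.
    bg_query, bg_db: background frequencies assumed for the query and
    for the database sequences, respectively.
    """
    return {
        a: {b: math.log2(joint[a][b] / (bg_query[a] * bg_db[b]))
            for b in bg_db}
        for a in bg_query
    }

# Toy alphabet: L (hydrophobic), K (charged). All numbers are invented.
toy_joint = {"L": {"L": 0.5, "K": 0.1}, "K": {"L": 0.1, "K": 0.3}}
toy_bg_query = {"L": 0.8, "K": 0.2}  # transmembrane-like query composition
toy_bg_db = {"L": 0.5, "K": 0.5}     # general database composition
toy_scores = log_odds_matrix(toy_joint, toy_bg_query, toy_bg_db)
```

Although the toy joint distribution is symmetric, aligning a query L with a database K scores differently from aligning a query K with a database L, because the two denominators differ.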
Improved prediction of the number of residue contacts in proteins by recurrent neural networks.
Knowing the number of residue contacts in a protein is crucial for deriving constraints useful in modeling protein folding, protein structure, and/or scoring remote homology searches. Here we use an ensemble of bi-directional recurrent neural network architectures and evolutionary information to improve the state-of-the-art in contact prediction using a large corpus of curated data. The ensemble is used to discriminate between two different states of residue contacts, characterized by a contact number higher or lower than the average value of the residue distribution. The ensemble achieves performances ranging from 70.1% to 73.1% depending on the radius adopted to discriminate contacts (6 Å to 12 Å). These performances represent gains of 15% to 20% over baseline statistical predictors that always assign an amino acid to the most numerous state, and are 3% to 7% better than any previous method. Combining predictors for different radii further improves the performance. SERVER: http://promoter.ics.uci.edu/BRNN-PRED/.
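The two-state target described above can be made concrete with a short Python sketch that derives contact numbers from coordinates and labels each residue relative to the average. Several details are assumptions of this sketch rather than facts from the abstract: one representative point per residue, no exclusion of sequence neighbours, and the overall mean as the threshold. The function names and toy coordinates are hypothetical.

```python
import math

def contact_numbers(coords, radius):
    """Count, for each residue, the other residues within `radius` (in Å).

    coords: list of (x, y, z) tuples, one representative point per residue.
    """
    counts = []
    for i, ci in enumerate(coords):
        n = sum(1 for j, cj in enumerate(coords)
                if i != j and math.dist(ci, cj) <= radius)
        counts.append(n)
    return counts

def two_state_labels(counts):
    """Label a residue 1 if its contact number exceeds the mean, else 0."""
    mean = sum(counts) / len(counts)
    return [1 if c > mean else 0 for c in counts]

# Toy coordinates on a line, invented for illustration only.
toy_coords = [(0, 0, 0), (3, 0, 0), (6, 0, 0), (20, 0, 0)]
toy_counts = contact_numbers(toy_coords, radius=6.0)
toy_labels = two_state_labels(toy_counts)
```

Changing `radius` (for example between 6 Å and 12 Å) changes both the counts and the mean, which is why a separate predictor per radius is a natural setup.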
Protein-protein interaction map inference using interacting domain profile pairs.
A number of predictive methods have been designed to predict protein interaction from sequence or expression data. On the experimental front, however, high-throughput proteomics technologies are starting to yield large volumes of protein-protein interaction data. High-quality experimental protein interaction maps constitute the natural dataset upon which to build interaction predictions. This motivates the development of the first interaction-based protein interaction map prediction algorithm. A technique to predict protein-protein interaction maps across organisms is introduced, the 'interaction-domain pair profile' method. The method uses a high-quality protein interaction map with interaction domain information as input to predict an interaction map in another organism. It combines sequence similarity searches with clustering based on interaction patterns and interaction domain information. We apply this approach to the prediction of an interaction map of Escherichia coli from the recently published interaction map of the human gastric pathogen Helicobacter pylori. Results are compared with predictions of a second inference method based only on full-length protein sequence similarity - the "naive" method. The domain-based method is shown to i) eliminate a significant number of the naive method's false positives that arise from multi-domain proteins; and ii) increase the sensitivity compared to the naive method by identifying new potential interactions. AVAILABILITY: Contact the authors.
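The "naive" baseline mentioned above, which transfers interactions using only full-length sequence similarity, might look roughly like the following Python sketch. It assumes a precomputed mapping from each source protein to its similar proteins in the target organism; the mapping, the protein identifiers, and the function `naive_transfer` are hypothetical, and the similarity criteria used in the original work are not given in the abstract.

```python
def naive_transfer(source_interactions, homologs):
    """Predict target-organism interactions from source-organism ones.

    source_interactions: iterable of (a, b) pairs of source proteins.
    homologs: dict mapping a source protein to a list of target proteins
    judged similar over their full length (assumed precomputed).
    Returns a set of frozenset pairs of target proteins.
    """
    predicted = set()
    for a, b in source_interactions:
        for a2 in homologs.get(a, ()):
            for b2 in homologs.get(b, ()):
                predicted.add(frozenset((a2, b2)))
    return predicted

# Toy identifiers, invented for illustration only.
toy_interactions = [("hp1", "hp2"), ("hp2", "hp3")]
toy_homologs = {"hp1": ["ec1"], "hp2": ["ec2", "ec5"]}
toy_predicted = naive_transfer(toy_interactions, toy_homologs)
```

Because every target protein similar to a multi-domain source protein inherits all of its interactions regardless of which domain mediates them, this kind of transfer is prone to the false positives that the domain-based method is reported to reduce.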