Homonyms and synonyms in the Dictionary of Interfaces in Proteins (DIP).
MOTIVATION: Should reports on molecular mimicry in particular cases, e.g. responsible for cross-reactivity, be considered as accidental or as a general principle in protein evolution? To answer this question, two types of similarity have to be considered: those in homologues (synonyms) and resemblance between patches from unrelated proteins (homonyms). RESULTS: All interfaces from known protein structures were collected in a comprehensive data bank [Dictionary of Interfaces in Proteins (DIP)]. A fast, sequence-independent, three-dimensional superposition procedure was developed to search automatically for geometrically similar surface areas. Surprisingly, we found a large number of structurally similar interfaces on the surface of unrelated proteins. Even patches from different types of secondary structure were found resembling each other. The putative functional meaning of homonyms is demonstrated with striking examples. (+info)
Dictionary-driven protein annotation.
Computational methods seeking to automatically determine the properties (functional, structural, physicochemical, etc.) of a protein directly from the sequence have long been the focus of numerous research groups. With the advent of advanced sequencing methods and systems, the number of amino acid sequences that are being deposited in the public databases has been increasing steadily. This has in turn generated a renewed demand for automated approaches that can annotate individual sequences and complete genomes quickly, exhaustively and objectively. In this paper, we present one such approach that is centered around and exploits the Bio-Dictionary, a collection of amino acid patterns that completely covers the natural sequence space and can capture functional and structural signals that have been reused during evolution, within and across protein families. Our annotation approach also makes use of a weighted, position-specific scoring scheme that is unaffected by the over-representation of well-conserved proteins and protein fragments in the databases used. For a given query sequence, the method permits one to determine, in a single pass, the following: local and global similarities between the query and any protein already present in a public database; the likeness of the query to all available archaeal/ bacterial/eukaryotic/viral sequences in the database as a function of amino acid position within the query; the character of secondary structure of the query as a function of amino acid position within the query; the cytoplasmic, transmembrane or extracellular behavior of the query; the nature and position of binding domains, active sites, post-translationally modified sites, signal peptides, etc. In terms of performance, the proposed method is exhaustive, objective and allows for the rapid annotation of individual sequences and full genomes. Annotation examples are presented and discussed in Results, including individual queries and complete genomes that were released publicly after we built the Bio-Dictionary that is used in our experiments. Finally, we have computed the annotations of more than 70 complete genomes and made them available on the World Wide Web at http://cbcsrv.watson.ibm.com/Annotations/. (+info)
Protein names precisely peeled off free text.
MOTIVATION: Automatically identifying protein names from the scientific literature is a pre-requisite for the increasing demand in data-mining this wealth of information. Existing approaches are based on dictionaries, rules and machine-learning. Here, we introduced a novel system that combines a pre-processing dictionary- and rule-based filtering step with several separately trained support vector machines (SVMs) to identify protein names in the MEDLINE abstracts. RESULTS: Our new tagging-system NLProt is capable of extracting protein names with a precision (accuracy) of 75% at a recall (coverage) of 76% after training on a corpus, which was used before by other groups and contains 200 annotated abstracts. For our estimate of sustained performance, we considered partially identified names as false positives. One important issue frequently ignored in the literature is the redundancy in evaluation sets. We suggested some guidelines for removing overly inadequate overlaps between training and testing sets. Applying these new guidelines, our program appeared to significantly out-perform other methods tagging protein names. NLProt was so successful due to the SVM-building blocks that succeeded in utilizing the local context of protein names in the scientific literature. We challenge that our system may constitute the most general and precise method for tagging protein names. AVAILABILITY: http://cubic.bioc.columbia.edu/services/nlprot/ (+info)
MinSet: a general approach to derive maximally representative database subsets by using fragment dictionaries and its application to the SCOP database.
MOTIVATION: The size of current protein databases is a challenge for many Bioinformatics applications, both in terms of processing speed and information redundancy. It may be therefore desirable to efficiently reduce the database of interest to a maximally representative subset. RESULTS: The MinSet method employs a combination of a Suffix Tree and a Genetic Algorithm for the generation, selection and assessment of database subsets. The approach is generally applicable to any type of string-encoded data, allowing for a drastic reduction of the database size whilst retaining most of the information contained in the original set. We demonstrate the performance of the method on a database of protein domain structures encoded as strings. We used the SCOP40 domain database by translating protein structures into character strings by means of a structural alphabet and by extracting optimized subsets according to an entropy score that is based on a constant-length fragment dictionary. Therefore, optimized subsets are maximally representative for the distribution and range of local structures. Subsets containing only 10% of the SCOP structure classes show a coverage of >90% for fragments of length 1-4. AVAILABILITY: http://mathbio.nimr.mrc.ac.uk/~jkleinj/MinSet. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. (+info)
ChEBI: a database and ontology for chemical entities of biological interest.
Chemical Entities of Biological Interest (ChEBI) is a freely available dictionary of molecular entities focused on 'small' chemical compounds. The molecular entities in question are either natural products or synthetic products used to intervene in the processes of living organisms. Genome-encoded macromolecules (nucleic acids, proteins and peptides derived from proteins by cleavage) are not as a rule included in ChEBI. In addition to molecular entities, ChEBI contains groups (parts of molecular entities) and classes of entities. ChEBI includes an ontological classification, whereby the relationships between molecular entities or classes of entities and their parents and/or children are specified. ChEBI is available online at http://www.ebi.ac.uk/chebi/ (+info)
Remediation of the protein data bank archive.
The Worldwide Protein Data Bank (wwPDB; wwpdb.org) is the international collaboration that manages the deposition, processing and distribution of the PDB archive. The online PDB archive at ftp://ftp.wwpdb.org is the repository for the coordinates and related information for more than 47 000 structures, including proteins, nucleic acids and large macromolecular complexes that have been determined using X-ray crystallography, NMR and electron microscopy techniques. The members of the wwPDB-RCSB PDB (USA), MSD-EBI (Europe), PDBj (Japan) and BMRB (USA)-have remediated this archive to address inconsistencies that have been introduced over the years. The scope and methods used in this project are presented. (+info)
Detection of IUPAC and IUPAC-like chemical names.