The CATH Dictionary of Homologous Superfamilies (DHS): a consensus approach for identifying distant structural homologues.
A consensus approach has been developed for identifying distant structural homologues. This is based on the CATH Dictionary of Homologous Superfamilies (DHS), a database of validated multiple structural alignments annotated with consensus functional information for evolutionary protein superfamilies (URL: http://www.biochem.ucl.ac.uk/bsm/dhs). Multiple structural alignments have been generated for 362 well-populated superfamilies in the CATH structural domain database and annotated with secondary structure, physicochemical properties, functional sequence patterns and protein-ligand interaction data. Consensus functional information for each superfamily includes descriptions and keywords extracted from SWISS-PROT and the ENZYME database. The Dictionary provides a powerful resource to validate, examine and visualize key structural and functional features of each homologous superfamily. The value of the DHS, for assessing functional variability and identifying distant evolutionary relationships, is illustrated using the pyridoxal-5'-phosphate (PLP) binding aspartate aminotransferase superfamily. The DHS also provides a tool for examining sequence-structure relationships for proteins within each fold group.
Organizing the present, looking to the future: an online knowledge repository to facilitate collaboration.
BACKGROUND: Comprehensive data available in the Canadian province of Manitoba since 1970 have aided study of the interaction between population health, health care utilization, and structural features of the health care system. Given a complex linked database and many ongoing projects, better organization of available epidemiological, institutional, and technical information was needed. OBJECTIVE: The Manitoba Centre for Health Policy and Evaluation wished to develop a knowledge repository to handle data, document research methods, and facilitate both internal communication and collaboration with other sites. METHODS: This evolving knowledge repository consists of both public and internal (restricted access) pages on the World Wide Web (WWW). Information can be accessed using an indexed logical format or queried to allow entry at user-defined points. The main topics are: Concept Dictionary, Research Definitions, Meta-Index, and Glossary. The Concept Dictionary operationalizes concepts used in health research using administrative data, outlining the creation of complex variables. Research Definitions specify the codes for common surgical procedures, tests, and diagnoses. The Meta-Index organizes concepts and definitions according to the Medical Subject Headings (MeSH) system developed by the National Library of Medicine. The Glossary facilitates navigation through the research terms and abbreviations in the knowledge repository. An Education Resources heading presents a web-based graduate course using substantial amounts of material in the Concept Dictionary, a lecture in the Epidemiology Supercourse, and material for Manitoba's Regional Health Authorities. Confidential information (including Data Dictionaries) is available on the Centre's internal website. RESULTS: Use of the public pages has increased dramatically since January 1998, with almost 6,000 page hits from 250 different hosts in May 1999. More recently, the number of page hits has averaged around 4,000 per month, while the number of unique hosts has climbed to around 400. CONCLUSIONS: This knowledge repository promotes standardization and increases efficiency by placing concepts and associated programming in the Centre's collective memory. Collaboration and project management are facilitated.
The role of definitions in biomedical concept representation.
The Foundational Model (FM) of anatomy, developed as an anatomical enhancement of UMLS, classifies anatomical entities in a structural context. Explicit definitions have played a critical role in the establishment of FM classes. Essential structural properties that distinguish a group of anatomical entities serve as the differentiae for defining classes. These, as well as other structural attributes, are introduced as template slots in Protege, a frame-based knowledge acquisition system, and are inherited by descendants of the class. A set of desiderata for formulating definitions has evolved during the instantiation of the FM. We contend that (1) these desiderata generalize to non-anatomical domains and (2) satisfying them in constituent vocabularies of UMLS would enhance the quality of information retrievable through UMLS.
Creating an online dictionary of abbreviations from MEDLINE.
OBJECTIVE: The growth of the biomedical literature presents special challenges for both human readers and automatic algorithms. One such challenge derives from the common and uncontrolled use of abbreviations in the literature. Each additional abbreviation increases the effective size of the vocabulary for a field. Therefore, to create an automatically generated and maintained lexicon of abbreviations, we have developed an algorithm to match abbreviations in text with their expansions. DESIGN: Our method uses a statistical learning algorithm, logistic regression, to score abbreviation expansions based on their resemblance to a training set of human-annotated abbreviations. We applied it to Medstract, a corpus of MEDLINE abstracts in which abbreviations and their expansions have been manually annotated. We then ran the algorithm on all abstracts in MEDLINE, creating a dictionary of biomedical abbreviations. To test the coverage of the database, we used an independently created list of abbreviations from the China Medical Tribune. MEASUREMENTS: We measured the recall and precision of the algorithm in identifying abbreviations from the Medstract corpus. We also measured the recall when searching for abbreviations from the China Medical Tribune against the database. RESULTS: On the Medstract corpus, our algorithm achieves up to 83% recall at 80% precision. Applying the algorithm to all of MEDLINE yielded a database of 781,632 high-scoring abbreviations. Of all the abbreviations in the list from the China Medical Tribune, 88% were in the database. CONCLUSION: We have developed an algorithm to identify abbreviations from text. We are making this available as a public abbreviation server at http://abbreviation.stanford.edu/.
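As a rough illustration of scoring abbreviation expansions with a logistic model, the sketch below pulls "expansion (ABBR)" candidates out of a sentence and scores them. The two features, the weights, and the bias are hypothetical values chosen for the example; the abstract does not specify the features or the fitted coefficients, so this is not the authors' model.

```python
import math
import re

# Hypothetical, hand-set parameters; a real system would fit these
# by logistic regression on human-annotated abbreviation pairs.
WEIGHTS = [4.0, 1.0]
BIAS = -3.0


def candidate_pairs(text):
    """Yield (preceding_words, abbreviation) for each 'words (ABBR)' pattern."""
    for match in re.finditer(r"\(([A-Za-z]{2,10})\)", text):
        abbr = match.group(1)
        words = re.findall(r"[A-Za-z][A-Za-z-]*", text[:match.start()])
        # Assumption: the expansion has one word per abbreviation letter.
        yield words[-len(abbr):], abbr


def features(words, abbr):
    """Two toy features: share of matching initials, and equal length."""
    initials = "".join(w[0].lower() for w in words)
    matches = sum(a == b for a, b in zip(initials, abbr.lower()))
    return [matches / len(abbr), float(len(words) == len(abbr))]


def score(words, abbr):
    """Logistic score in (0, 1) for a candidate expansion."""
    z = BIAS + sum(w * x for w, x in zip(WEIGHTS, features(words, abbr)))
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    sentence = "We used the polymerase chain reaction (PCR) in each assay."
    for words, abbr in candidate_pairs(sentence):
        print(abbr, " ".join(words), round(score(words, abbr), 3))
```

Raising or lowering the score cutoff for accepting a pair trades precision against recall, which is the kind of operating point the reported figures describe.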
Finding relevant references to genes and proteins in Medline using a Bayesian approach.
MOTIVATION: Mining the biomedical literature for references to genes and proteins always involves a tradeoff between high precision with false negatives, and high recall with false positives. Having a reliable method for assessing the relevance of literature mining results is crucial to finding ways to balance precision and recall, and for subsequently building automated systems to analyze these results. We hypothesize that abstracts and titles that discuss the same gene or protein use similar words. To validate this hypothesis, we built a dictionary- and rule-based system to mine Medline for references to genes and proteins, and used a Bayesian metric for scoring the relevance of each reference assignment. RESULTS: We analyzed the entire set of Medline records from 1966 to late 2001, and scored each gene and protein reference using a Bayesian estimated probability (EP) based on word frequency in a training set of 137837 known assignments from 30594 articles to 36197 gene and protein symbols. Two test sets of 148 and 150 randomly chosen assignments, respectively, were hand-validated and categorized as either good or bad. The distributions of EP values, when plotted on a log-scale histogram, are shown to differ markedly between good and bad assignments. Using EP values, recall was 100% at 61% precision (EP = 2 x 10^-5), 63% at 88% precision (EP = 0.008), and 10% at 100% precision (EP = 0.1). These results show that Medline entries discussing the same gene or protein have similar word usage, that our method of assessing this similarity using EP values is valid, and that it enables an EP cutoff value to be determined that accurately and reproducibly balances precision and recall, allowing automated analysis of literature mining results.
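The idea of scoring an assignment by word frequency and then choosing a cutoff can be sketched as follows. The scoring function here is a generic Laplace-smoothed word likelihood, used as a stand-in because the abstract does not give the exact EP formula; the training texts, vocabulary size, and all function names are assumptions made for the example.

```python
import math
from collections import Counter


def word_counts(training_texts):
    """Word frequencies over texts already assigned to one gene symbol."""
    return Counter(w for t in training_texts for w in t.lower().split())


def mean_log_likelihood(counts, vocab_size, text):
    """Average smoothed log-probability of the words in a new text."""
    words = text.lower().split()
    if not words:
        return float("-inf")
    total = sum(counts.values())
    # Add-one smoothing so unseen words get a small non-zero probability.
    return sum(
        math.log((counts[w] + 1) / (total + vocab_size)) for w in words
    ) / len(words)


def precision_recall(scored, cutoff):
    """Precision and recall when keeping assignments with score >= cutoff.

    `scored` is a list of (score, is_good) pairs from hand validation.
    """
    kept = [good for s, good in scored if s >= cutoff]
    true_pos = sum(kept)
    total_good = sum(good for _, good in scored)
    precision = true_pos / len(kept) if kept else 1.0
    recall = true_pos / total_good if total_good else 0.0
    return precision, recall


if __name__ == "__main__":
    counts = word_counts(["tumor suppressor apoptosis", "tumor dna damage"])
    print(mean_log_likelihood(counts, 1000, "tumor apoptosis"))
    print(mean_log_likelihood(counts, 1000, "kidney transport"))
    validated = [(0.9, True), (0.6, True), (0.4, False), (0.2, True)]
    for cutoff in (0.1, 0.5):
        print(cutoff, precision_recall(validated, cutoff))
```

Sweeping the cutoff over a hand-validated set in this way is one simple route to the precision and recall pairs reported at different EP values.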
The Protein Data Bank and structural genomics.
The Protein Data Bank (PDB; http://www.pdb.org/) continues to be actively involved in various aspects of the informatics of structural genomics projects: developing and maintaining the Target Registration Database (TargetDB), organizing data dictionaries that will define the specification for the exchange and deposition of data with the structural genomics centers, and creating software tools to capture data from standard structure determination applications.
This glossary aims to provide readers with some of the key terms needed to consider the relevance of social capital for health, and to introduce some of the debates on the concepts.
Extraction of protein interaction information from unstructured text using a context-free grammar.
MOTIVATION: As research into disease pathology and cellular function continues to generate vast amounts of data pertaining to protein, gene and small molecule (PGSM) interactions, there exists a critical need to capture these results in structured formats allowing for computational analysis. Although many efforts have been made to create databases that store this information in computer-readable form, populating these sources largely requires a manual process of interpreting and extracting interaction relationships from the biological research literature. Efficiently and accurately automating the extraction of interactions from unstructured text would greatly improve the content of these databases and provide a method for managing the continued growth of new literature being published. RESULTS: In this paper, we describe a system for extracting PGSM interactions from unstructured text. By utilizing a lexical analyzer and context-free grammar (CFG), we demonstrate that efficient parsers can be constructed for extracting these relationships from natural language with high rates of recall and precision. Our results show that this technique achieved a recall rate of 83.5% and a precision rate of 93.1% for recognizing PGSM names and a recall rate of 63.9% and a precision rate of 70.2% for extracting interactions between these entities. In contrast to other published techniques, the use of a CFG significantly reduces the complexities of natural language processing by focusing on domain-specific structure as opposed to analyzing the semantics of a given language. Additionally, our approach provides a level of abstraction for adding new rules for extracting other types of biological relationships beyond PGSM relationships. AVAILABILITY: The program and corpus are available by request from the authors.
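To make the lexer-plus-grammar idea concrete, here is a minimal sketch with a toy lexical analyzer and a hand-written recursive-descent parser for two production rules. The entity dictionary, the verb list, and the grammar itself are hypothetical and far smaller than anything the described system would use; they only show how domain-specific structure can drive extraction.

```python
import re

# Hypothetical lexicon for illustration only.
ENTITIES = {"ras", "raf", "mek", "erk"}
VERBS = {"binds", "activates", "inhibits", "phosphorylates"}


def tokenize(text):
    """Lexical analyzer: map words to (type, value); drop unknown words."""
    tokens = []
    for word in re.findall(r"[A-Za-z0-9-]+", text):
        low = word.lower()
        if low in ENTITIES:
            tokens.append(("ENTITY", word))
        elif low in VERBS:
            tokens.append(("VERB", low))
        elif low == "and":
            tokens.append(("AND", low))
    return tokens


def parse_entity_list(tokens, i):
    """Rule: LIST -> ENTITY (AND ENTITY)*. Returns (names, next_index)."""
    if i >= len(tokens) or tokens[i][0] != "ENTITY":
        return None
    names = [tokens[i][1]]
    i += 1
    while (i + 1 < len(tokens) and tokens[i][0] == "AND"
           and tokens[i + 1][0] == "ENTITY"):
        names.append(tokens[i + 1][1])
        i += 2
    return names, i


def extract(sentence):
    """Rule: SENTENCE -> LIST VERB LIST. Returns (subject, verb, object) triples."""
    tokens = tokenize(sentence)
    left = parse_entity_list(tokens, 0)
    if left is None:
        return []
    subjects, i = left
    if i >= len(tokens) or tokens[i][0] != "VERB":
        return []
    verb = tokens[i][1]
    right = parse_entity_list(tokens, i + 1)
    if right is None or right[1] != len(tokens):
        return []
    return [(s, verb, o) for s in subjects for o in right[0]]


if __name__ == "__main__":
    print(extract("Raf phosphorylates MEK and ERK."))
```

New relationship types could be added in this style by introducing further token classes and rules rather than by changing the tokenizer, which mirrors the level of abstraction the abstract mentions.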