Information extraction in molecular biology.
Information extraction has recently become a very active field in bioinformatics, and a number of interesting papers have been published. Most efforts have concentrated on a few specific problems, such as the detection of protein-protein interactions and the analysis of DNA expression arrays, although there are clearly many other areas of potential application (document retrieval, protein functional description, and detection of disease-related genes, to name a few). Paradoxically, these exciting developments have not yet crystallised into general agreement on a set of standard evaluation criteria, like those developed in fields such as protein structure prediction, which makes it very difficult to compare performance across the different systems. In this review we introduce the general field of information extraction, outline the status of its applications in molecular biology, and then discuss some ideas about possible evaluation standards that are needed for the future development of the field.
Predicting transcription factor synergism.
Transcriptional regulation is mediated by a battery of transcription factor (TF) proteins that form complexes involving protein-protein and protein-DNA interactions. Individual TFs bind to their cognate cis-elements, or transcription factor-binding sites (TFBS). TFBS are organized on the DNA proximal to the gene in groups confined to regions of a few hundred base pairs. These groups are referred to as modules. Various modules work together to provide the combinatorial regulation of gene transcription in response to various developmental and environmental conditions. The set of modules constitutes a promoter model. Determining the TFs that preferentially work in concert as part of a module is an essential component of understanding transcriptional regulation. TFs that act synergistically in such a fashion are likely to have their cis-elements co-localized on the genome at specific distances apart. We exploit this notion to predict TF pairs that are likely to be part of a transcriptional module in the human genome sequence. The computational method is validated statistically, using known interacting pairs extracted from the literature. There are 251 TFBS pairs up to 50 bp apart and 70 TFBS pairs up to 200 bp apart that score higher than any of the known synergistic pairs. Further investigation of 50 pairs randomly selected from each of these two sets, using PubMed queries, provided additional supporting evidence from the existing biological literature suggesting TF synergism for these novel pairs.
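The co-localization idea in this abstract can be illustrated with a minimal Python sketch. It only counts how often binding sites of two different TFs fall within a distance cutoff on one sequence; the abstract does not describe the scoring or the statistical validation, so none of that is reproduced here. The function name, the input layout, and the site positions are assumptions made for illustration.

```python
from collections import Counter


def count_colocalized_pairs(sites, max_distance):
    """Count pairs of different TFs whose sites lie within max_distance bp.

    sites: list of (position, tf_name) tuples on a single sequence
    (a hypothetical input layout, not taken from the paper).
    """
    counts = Counter()
    ordered = sorted(sites)  # sort by position so the inner loop can stop early
    for i, (pos_a, tf_a) in enumerate(ordered):
        for pos_b, tf_b in ordered[i + 1:]:
            if pos_b - pos_a > max_distance:
                break  # every later site is even farther away
            if tf_a != tf_b:
                # store each pair in alphabetical order so (A, B) == (B, A)
                counts[tuple(sorted((tf_a, tf_b)))] += 1
    return counts


# Made-up positions, for illustration only.
example_sites = [(100, "SP1"), (130, "NFKB"), (400, "AP1"), (420, "SP1"), (900, "NFKB")]
pairs_within_50 = count_colocalized_pairs(example_sites, 50)
```

Running the same count with cutoffs of 50 bp and 200 bp would mirror the two distance windows mentioned in the abstract; turning raw counts into a score would still require a background model, which is left out of this sketch.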
An intelligent biological information management system.
MOTIVATION: As biomedical researchers amass a plethora of information in a variety of forms resulting from advances in biomedical research, there is a critical need for innovative information management and knowledge discovery tools to sift through these vast volumes of heterogeneous data and analysis tools. In this paper we present a general model for an information management system that is adaptable and scalable, followed by a detailed design and implementation of one component of the model. The prototype, called BioSifter, was applied to problems in the bioinformatics area. RESULTS: BioSifter was tested using 500 documents obtained from the PubMed database on two biological problems related to genetic polymorphism and extracorporeal shockwave lithotripsy. The results indicate that BioSifter is a powerful tool for biological researchers to automatically retrieve relevant text documents from the biological literature based on their interest profile. The results also indicate that the first stage of the information management process, i.e. data-to-information transformation, significantly reduces the size of the information space. The filtered data obtained through BioSifter are relevant as well as much smaller in dimension than the full set of retrieved data. This would in turn significantly reduce the complexity associated with the next-level transformation, i.e. information to knowledge.
Identifying diagnostic studies in MEDLINE: reducing the number needed to read.
OBJECTIVES: The search filters in PubMed have become a cornerstone of information retrieval in evidence-based practice. However, the filter for diagnostic studies is not fully satisfactory, because sensitive searches have low precision. The objective of this study was to construct and validate better search strategies to identify diagnostic articles recorded in MEDLINE, with special emphasis on precision. DESIGN: A comparative, retrospective analysis was conducted. Four medical journals were hand-searched for diagnostic studies published in 1989 and 1994. Four other journals were hand-searched for 1999. The three sets of studies identified were used as gold standards. A new search strategy was constructed and tested using the 1989 subset of studies and validated in both the 1994 and 1999 subsets. We identified candidate text words for search strategies using a word frequency analysis of the abstracts. According to the frequency of identified terms, searches were run for each term independently. The sensitivity, precision, and number needed to read (1/precision) of every candidate term were calculated. Terms with the highest sensitivity x precision product were used as free text terms in combination with the MeSH term "SENSITIVITY AND SPECIFICITY" using the Boolean operator OR. In the 1994 and 1999 subsets, we performed head-to-head comparisons of the currently available PubMed filter with the one we developed. MEASUREMENTS: The sensitivity, precision, and number needed to read (1/precision) were measured for different search filters. RESULTS: The three most frequently occurring truncated terms (diagnos*, predict* and accura*) in combination with the MeSH term "SENSITIVITY AND SPECIFICITY" produced a sensitivity of 98.1% (95% confidence interval: 89.9-99.9%) and a number needed to read of 8.3 (95% confidence interval: 6.7-11.3). In direct comparisons of the new filter with the currently available one in PubMed using the 1994 and 1999 subsets, the new filter achieved better precision (12.0% versus 8.2% in 1994 and 5.0% versus 4.3% in 1999). The 95% confidence intervals for the differences range from 0.05% to 7.5% (p = 0.041) and -1.0% to 2.3% (p = 0.45), respectively. The new filter achieved slightly better sensitivities than the currently available one in both subsets, namely 98.1 and 96.1% (p = 0.32) versus 95.1 and 88.8% (p = 0.125). CONCLUSIONS: The quoted performance of the currently available filter for diagnostic studies in PubMed may be overstated. It appears that even a single external validation may lead to overoptimistic views of a filter's performance. Precision appears to be more unstable than sensitivity. In terms of sensitivity, our filter for diagnostic studies performed slightly better than the currently available one, and it performed better with regard to precision in the 1994 subset. Additional research is required to determine whether these improvements are beneficial to searches in practice.
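The three measures used in this abstract can be computed from a set of retrieved articles and a hand-searched gold standard, as in the short Python sketch below. The function name and the article identifiers are hypothetical; only the definitions (sensitivity, precision, and number needed to read as 1/precision) come from the abstract.

```python
def filter_metrics(retrieved, gold_standard):
    """Return (sensitivity, precision, number needed to read) for one search.

    retrieved: identifiers returned by the search filter.
    gold_standard: identifiers of the hand-searched diagnostic studies.
    """
    retrieved = set(retrieved)
    gold_standard = set(gold_standard)
    true_positives = len(retrieved & gold_standard)

    # Sensitivity: share of gold-standard articles that the filter found.
    sensitivity = true_positives / len(gold_standard) if gold_standard else 0.0
    # Precision: share of retrieved articles that are truly diagnostic studies.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    # Number needed to read is 1/precision; undefined (infinite) at zero precision.
    number_needed_to_read = 1 / precision if precision > 0 else float("inf")
    return sensitivity, precision, number_needed_to_read


# Hypothetical article identifiers, for illustration only.
gold = {1, 2, 3, 4}
hits = {1, 2, 3, 10, 11, 12, 13, 14, 15, 16}
sens, prec, nnr = filter_metrics(hits, gold)
```

As a worked check of the formula, a precision of 12.0% corresponds to 1/0.12, or roughly 8.3 articles read for each relevant one found.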
Use of the Internet and information technology for surgeons and surgical research.
The recent and extensive expansion in the use of computers and the Internet offers great potential for benefit in surgical research and, increasingly, surgical practice. However, despite the usefulness of information technology, much time can be spent achieving little, and its potential can be missed, because of the complexity and excess of information available. In this article, we examine some useful areas relevant to surgeons and surgical research, such as Internet service provision and E-mail, databases, medical Websites, and potential future directions.
Using LOINC to link an EMR to the pertinent paragraph in a structured reference knowledge base.
Intermountain Health Care has integrated the electronic medical record (EMR) with online information resources in order to create easy access to a knowledge base which practicing physicians can use at the point of care. When users are reviewing problems/diagnoses, medications, or clinical laboratory test results, they can conveniently access a "pertinent paragraph" of reference literature that pertains to the clinical data in the EMR. Using terminology first coined by Cimino [1], we call this application the "infobutton." We describe the architectural issues involved in linking our electronic medical record with a structured laboratory knowledge base. The application has been well received, as indicated by anecdotal comments from physicians and by usage of the application.
Finding UMLS Metathesaurus concepts in MEDLINE.
The entire collection of 11.5 million MEDLINE abstracts was processed to extract 549 million noun phrases using a shallow syntactic parser. English-language strings in the 2002 and 2001 releases of the UMLS Metathesaurus were then matched against these phrases using flexible matching techniques. 34% of the Metathesaurus names (occurring in 30% of the concepts) were found in the titles and abstracts of articles in the literature. The matching concepts are fairly evenly split between chemical and non-chemical in nature and span a wide spectrum of semantic types. This paper details the approach taken and the results of the analysis.
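The abstract does not say which flexible matching techniques were used. As one possible illustration, the Python sketch below matches noun phrases to vocabulary names after a simple normalization (lowercasing, treating punctuation as spaces, collapsing whitespace). The normalization rule, the function names, and the placeholder concept identifiers are all assumptions and are not taken from the paper or from the UMLS tools.

```python
import re


def normalize(text):
    """Lowercase, turn punctuation into spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def match_names(noun_phrases, names_to_concepts):
    """Map each noun phrase to a concept when its normalized form equals a name."""
    index = {normalize(name): concept for name, concept in names_to_concepts.items()}
    matches = {}
    for phrase in noun_phrases:
        key = normalize(phrase)
        if key in index:
            matches[phrase] = index[key]
    return matches


# Placeholder names and concept identifiers, for illustration only.
names = {"Myocardial Infarction": "C_EXAMPLE_1", "Beta-Lactamase": "C_EXAMPLE_2"}
phrases = ["myocardial  infarction", "beta lactamase", "chest pain"]
found = match_names(phrases, names)
```

Building the index once and looking up each phrase by its normalized key keeps the cost per phrase constant, which matters when the phrase list runs into the hundreds of millions as described above.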
A literature-based method for assessing the functional coherence of a gene group.
MOTIVATION: Many experimental and algorithmic approaches in biology generate groups of genes that need to be examined for related functional properties. For example, gene expression profiles are frequently organized into clusters of genes that may share functional properties. We evaluate a method, neighbor divergence per gene (NDPG), that uses scientific literature to assess whether a group of genes is functionally related. The method requires only a corpus of documents and an index connecting the documents to genes. RESULTS: We evaluate NDPG on 2796 functional groups generated by the Gene Ontology consortium in four organisms: mouse, fly, worm and yeast. NDPG finds functional coherence in 96, 92, 82 and 45% of the groups (at 99.9% specificity) in yeast, mouse, fly and worm, respectively.