CoPub Mapper: mining MEDLINE based on search term co-publication. (57/562)

BACKGROUND: High-throughput microarray analyses result in many differentially expressed genes that are potentially responsible for the biological process of interest. In order to identify biological similarities between genes, publications from MEDLINE were identified in which pairs of gene names and combinations of gene name with specific keywords were co-mentioned. RESULTS: MEDLINE search strings for 15,621 known genes and 3,731 keywords were generated and validated. PubMed IDs were retrieved from MEDLINE and the relative probability of co-occurrence of all gene-gene and gene-keyword pairs was determined. To assess gene clustering according to literature co-publication, 150 genes consisting of 8 sets with known connections (same pathway, same protein complex, or same cellular localization, etc.) were run through the program. Receiver operating characteristic (ROC) analyses showed that most gene sets were clustered much better than expected by random chance. To test grouping of genes from real microarray data, 221 differentially expressed genes from a microarray experiment were analyzed with CoPub Mapper, which resulted in several relevant clusters of genes with biological process and disease keywords. In addition, all genes versus keywords were hierarchically clustered to reveal a complete grouping of published genes based on co-occurrence. CONCLUSION: The CoPub Mapper program allows for quick and versatile querying of co-published genes and keywords and can be successfully used to cluster predefined groups of genes and microarray data.
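
The abstract does not state how the relative probability of co-occurrence is computed. One common choice, assumed here purely for illustration, is the ratio of the observed number of shared publications to the number expected if the two terms were independent. The function name, the toy PubMed ID sets, and the corpus size below are all hypothetical.

```python
from itertools import combinations

def cooccurrence_ratio(pmids_by_term, total_docs):
    """Observed/expected co-occurrence for every pair of terms.

    pmids_by_term maps a term to the set of document IDs mentioning it.
    Under independence the expected number of shared documents is
    len(A) * len(B) / total_docs, so the ratio is
    shared * total_docs / (len(A) * len(B)).
    """
    ratios = {}
    for a, b in combinations(sorted(pmids_by_term), 2):
        docs_a, docs_b = pmids_by_term[a], pmids_by_term[b]
        if not docs_a or not docs_b:
            continue  # a term with no publications has no defined ratio
        shared = len(docs_a & docs_b)
        ratios[(a, b)] = shared * total_docs / (len(docs_a) * len(docs_b))
    return ratios

# Toy, made-up document IDs (not real PubMed data).
toy = {
    "GENE_A": {1, 2, 3, 4},
    "GENE_B": {3, 4, 5},
    "apoptosis": {9, 10},
}
scores = cooccurrence_ratio(toy, total_docs=100)
for pair, ratio in sorted(scores.items()):
    print(pair, round(ratio, 2))
```

A ratio well above 1 suggests the two terms are co-mentioned more often than chance would predict; the resulting pair scores could then feed a clustering step.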

Integration of text- and data-mining using ontologies successfully selects disease gene candidates. (58/562)

Genome-wide techniques such as microarray analysis, Serial Analysis of Gene Expression (SAGE), Massively Parallel Signature Sequencing (MPSS), linkage analysis and association studies are used extensively in the search for genes that cause diseases, and often identify many hundreds of candidate disease genes. Selection of the most probable of these candidate disease genes for further empirical analysis is a significant challenge. Additionally, identifying the genes that cause complex diseases is problematic due to low penetrance of multiple contributing genes. Here, we describe a novel bioinformatic approach that selects candidate disease genes according to their expression profiles. We use the eVOC anatomical ontology to integrate text-mining of biomedical literature and data-mining of available human gene expression data. To demonstrate that our method is successful and widely applicable, we apply it to a database of 417 candidate genes containing 17 known disease genes. We successfully select the known disease gene for 15 out of 17 diseases and reduce the candidate gene set to 63.3% (+/-18.8%) of its original size. This approach facilitates direct association between genomic data describing gene expression and information from biomedical texts describing disease phenotype, and successfully prioritizes candidate genes according to their expression in disease-affected tissues.
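
The selection idea can be sketched as a set intersection: anatomical terms linked to a disease (from text mining) are compared with the anatomical terms in which each candidate gene is expressed. This is only a minimal illustration under that assumption; the abstract does not describe the actual scoring, and the function name, gene labels, and term lists below are hypothetical.

```python
def select_candidates(disease_terms, expression_terms_by_gene):
    """Keep genes expressed in at least one disease-associated anatomical term.

    Matching is case-insensitive exact matching on term names; a real
    ontology-based method would also use the term hierarchy.
    """
    wanted = {term.lower() for term in disease_terms}
    selected = {}
    for gene, terms in expression_terms_by_gene.items():
        overlap = wanted & {term.lower() for term in terms}
        if overlap:
            selected[gene] = sorted(overlap)
    return selected

# Made-up expression annotations for three hypothetical candidate genes.
toy_expression = {
    "GENE_1": ["retina", "liver"],
    "GENE_2": ["kidney"],
    "GENE_3": ["Retina"],
}
kept = select_candidates(["retina"], toy_expression)
print(kept)
```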

Using the biological taxonomy to access biological literature with PathBinderH. (59/562)

PathBinderH allows users to make queries that retrieve sentences, and the abstracts containing them, from PubMed. Users can also specify biological taxa to limit searches to sentences that mention either the specified taxa or their subordinate taxa in the biological taxonomy. Although the current project requires this function only for plant taxa, the principle is extensible to the entire taxonomy. AVAILABILITY: www.plantgenomics.iastate.edu/PathBinderH. Source code and databases on request.
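
Restricting a search to a taxon and its subordinate taxa amounts to expanding the taxon into all of its descendants before filtering sentences. The sketch below assumes a simple parent-to-children mapping and plain substring matching; it is not PathBinderH's implementation, and the tiny taxonomy and sentences are illustrative only.

```python
def expand_taxon(taxon, children):
    """Return the taxon plus every taxon below it in the hierarchy."""
    found, stack = set(), [taxon]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(children.get(current, []))
    return found

def sentences_mentioning(sentences, taxa):
    """Keep sentences that mention any of the given taxon names."""
    lowered = [name.lower() for name in taxa]
    return [s for s in sentences if any(name in s.lower() for name in lowered)]

# Tiny, simplified taxonomy fragment used only for this example.
children = {
    "Viridiplantae": ["Brassicaceae", "Poaceae"],
    "Brassicaceae": ["Arabidopsis thaliana"],
    "Poaceae": ["Zea mays"],
}
sentences = [
    "Arabidopsis thaliana roots respond to auxin.",
    "Zea mays kernels accumulate starch.",
]
taxa = expand_taxon("Brassicaceae", children)
hits = sentences_mentioning(sentences, taxa)
print(hits)
```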

Building a protein name dictionary from full text: a machine learning term extraction approach. (60/562)

BACKGROUND: The majority of information in the biological literature resides in full-text articles rather than in abstracts. Yet abstracts remain the focus of many publicly available literature data mining tools. Most literature mining tools rely on pre-existing lexicons of biological names, often extracted from curated gene or protein databases. This is a limitation, because such databases have low coverage of the many name variants that are used to refer to biological entities in the literature. RESULTS: We present an approach to recognize named entities in full text. The approach collects high-frequency terms in an article and uses support vector machines (SVM) to identify biological entity names. It is also computationally efficient and robust to noise commonly found in full-text material. We use the method to create a protein name dictionary from a set of 80,528 full-text articles. Only 8.3% of the names in this dictionary match SwissProt description lines. We assess the quality of the dictionary by studying its protein name recognition performance in full text. CONCLUSION: This dictionary term lookup method compares favourably to other published methods, supporting the significance of our direct extraction approach. The method is strong in recognizing name variants not found in SwissProt.

A survey of current work in biomedical text mining. (61/562)

The volume of published biomedical research, and therefore the underlying biomedical knowledge base, is expanding at an increasing rate. Among the tools that can aid researchers in coping with this information overload are text mining and knowledge extraction. Significant progress has been made in applying text mining to named entity recognition, text classification, terminology extraction, relationship extraction and hypothesis generation. Several research groups are constructing integrated flexible text-mining systems intended for multiple uses. The major challenge of biomedical text mining over the next 5-10 years is to make these systems useful to biomedical researchers. This will require enhanced access to full text, better understanding of the feature space of biomedical literature, better methods for measuring the usefulness of systems to users, and continued cooperation with the biomedical research community to ensure that their needs are addressed.

MeSHer: identifying biological concepts in microarray assays based on PubMed references and MeSH terms. (62/562)

MeSHer uses a simple statistical approach to identify biological concepts in the form of Medical Subject Headings (MeSH terms) obtained from the PubMed database that are significantly overrepresented within the identified gene set relative to those associated with the overall collection of genes on the underlying DNA microarray platform. As a demonstration, we apply this approach to gene lists acquired from a published study of the effects of angiotensin II (Ang II) treatment on cardiac gene expression and demonstrate that this approach can aid in the interpretation of the resulting 'significant' gene set. AVAILABILITY: The software is available at http://www.tm4.org. SUPPLEMENTARY INFORMATION: Results from the analysis of significant genes from the published Ang II study.
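
The abstract does not name the statistical test. A standard way to score over-representation of a term in a gene list relative to a background, assumed here for illustration, is the upper tail of the hypergeometric distribution. The function name and the counts in the example are hypothetical.

```python
from math import comb

def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) for a hypergeometric draw.

    N: genes on the array (background)
    K: background genes annotated with the term
    n: genes in the selected list
    k: selected genes annotated with the term
    """
    total = comb(N, n)
    upper = min(K, n)
    # math.comb returns 0 when the lower argument exceeds the upper one,
    # so impossible combinations contribute nothing to the sum.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total

# Made-up counts: 40 of 1000 array genes carry the term, and 8 of them
# appear in a selected list of 50 genes.
p_value = hypergeom_upper_tail(k=8, K=40, n=50, N=1000)
print(f"p = {p_value:.3g}")
```

A small p-value indicates that the term appears in the selected list more often than expected from the array background. When many terms are tested, a multiple-testing correction would normally be applied as well.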

Do you have NIH funding? Then read this. (63/562)

In the "Policy on enhancing public access to archived publications resulting from NIH-funded research," the NIH requests that all publications resulting from primary research supported by NIH grants be deposited in PubMed Central (PMC), the online repository of the National Library of Medicine. The NIH requests that all manuscripts accepted for publication after May 2, 2005 be deposited in PMC, and that those manuscripts be made freely available to the public within 12 months of publication. The JCI supports this policy: we will continue to make all content freely available in PMC immediately upon publication, and the entire JCI archive is freely available through PMC.

MineBlast: a literature presentation service supporting protein annotation by data mining of BLAST results. (64/562)

MineBlast is a web service for literature search and presentation based on data-mining results received from UniProt. Users can submit a simple list of protein sequences via a web-based interface. MineBlast performs a BLASTP search in UniProt to identify names and synonyms based on homologous proteins and subsequently queries PubMed using combined search terms in order to find and present relevant literature.
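
The final step, combining names and synonyms into one PubMed search term, can be sketched as an OR-joined query string. How MineBlast actually combines terms is not described in the abstract; the function below is a hypothetical illustration that uses PubMed's [Title/Abstract] field tag, and the example names are illustrative only.

```python
def build_pubmed_query(names, field="Title/Abstract"):
    """Join protein names and synonyms into a single OR query.

    Blank entries and case-insensitive duplicates are dropped, and each
    name is quoted so that multi-word names are searched as phrases.
    """
    seen, clauses = set(), []
    for name in names:
        cleaned = name.strip()
        key = cleaned.lower()
        if cleaned and key not in seen:
            seen.add(key)
            clauses.append(f'"{cleaned}"[{field}]')
    return " OR ".join(clauses)

# Example names standing in for synonyms collected from BLASTP hits.
query = build_pubmed_query(["Hexokinase-1", "HK1", "hk1", " "])
print(query)
```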