Working more productively: tools for administrative data. (9/73)

OBJECTIVE: This paper describes a web-based resource (http://www.umanitoba.ca/centres/mchp/concept/) that contains a series of tools for working with administrative data. This work in knowledge management represents an effort to document, find, and transfer concepts and techniques, both within the local research group and to a more broadly defined user community. Concepts and associated computer programs are made as "modular" as possible to facilitate easy transfer from one project to another. STUDY SETTING/DATA SOURCES: Tools for working with a registry, longitudinal administrative data, and special files (survey and clinical) from the Province of Manitoba, Canada, in the 1990-2003 period. DATA COLLECTION: Literature review and analyses of web site utilization were used to generate the findings. PRINCIPAL FINDINGS: The Internet-based Concept Dictionary and SAS macros developed in Manitoba are being used in a growing number of research centers. Nearly 32,000 hits from more than 10,200 hosts in a recent month demonstrate broad interest in the Concept Dictionary. CONCLUSIONS: Taken together, the tools make up a knowledge repository and research production system that aids local work and has great potential internationally. Modular software provides considerable efficiency. The merging of documentation and researcher-to-researcher dissemination keeps costs manageable.  (+info)

A "systematics" tool for medical terminologies. (10/73)

Finding the hierarchical relations among multiple terms within medical terminologies that allow a term to have multiple parents is a common task, especially for trainees and for knowledge engineers implementing or maintaining medical logic modules or guidelines. Examples of such terminologies include the UMLS and the Medical Entity Dictionary (MED). In addition, identifying and discriminating among the common ancestors of a list of terms is a recurrent theme, and one that is also a central concern in the science of classification (systematics). Some nearest common ancestors have distinct properties that are valuable for classifying and simplifying lists. Although visual navigation and editing tools exist for the UMLS and the MED, they are browsers that display a large number of relationships unrelated and irrelevant to the task at hand. While algorithms for solving this problem over semantic networks and trees have been well studied in computer science, to our knowledge they have not been paired with a visualization tool in biomedicine. We developed a visualization tool that graphically displays the hierarchical relations of multiple terms and helps identify their nearest common ancestors.  (+info)
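
Below is a minimal Python sketch of the nearest-common-ancestor computation described above, over a toy terminology in which a term may have several parents. The terminology, term names, and helper functions are illustrative assumptions, not the tool's actual data structures.

```python
# Toy poly-hierarchy: child -> set of parents (a DAG, not a tree).
PARENTS = {
    "bacterial pneumonia": {"pneumonia", "bacterial infection"},
    "viral pneumonia": {"pneumonia", "viral infection"},
    "pneumonia": {"lung disease", "infection"},
    "bacterial infection": {"infection"},
    "viral infection": {"infection"},
    "lung disease": {"disease"},
    "infection": {"disease"},
    "disease": set(),
}

def ancestors(term):
    """All ancestors of a term, not including the term itself."""
    seen, stack = set(), list(PARENTS.get(term, ()))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(PARENTS.get(t, ()))
    return seen

def nearest_common_ancestors(terms):
    """Common ancestors that are not themselves ancestors of another
    common ancestor, i.e. the 'nearest' ones to the query terms."""
    common = set.intersection(*(ancestors(t) for t in terms))
    return {a for a in common
            if not any(a in ancestors(b) for b in common if b != a)}

if __name__ == "__main__":
    print(nearest_common_ancestors(["bacterial pneumonia", "viral pneumonia"]))
    # On this toy hierarchy the result is {'pneumonia'}.
```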

The design and implementation of a picklist authoring tool. (11/73)

It is well recognized that controlled medical terminologies play a critical role in Health Information Systems and Clinical Patient Record systems, but the creation and management of customized lists of terms ("picklists") remains a potential obstacle. We have been developing a sophisticated authoring tool that is fully integrated with our terminology server and that will be made available to our system analysts and clinicians.  (+info)

GAPSCORE: finding gene and protein names one word at a time. (12/73)

MOTIVATION: New high-throughput technologies have accelerated the accumulation of knowledge about genes and proteins. However, much knowledge is still stored as written natural language text. Therefore, we have developed a new method, GAPSCORE, to identify gene and protein names in text. GAPSCORE scores words based on a statistical model of gene names that quantifies their appearance, morphology and context. RESULTS: We evaluated GAPSCORE against the Yapex data set and achieved an F-score of 82.5% (83.3% recall, 81.5% precision) for partial matches and 57.6% (58.5% recall, 56.7% precision) for exact matches. Since the method is statistical, users can choose score cutoffs that adjust the performance according to their needs. AVAILABILITY: GAPSCORE is available at http://bionlp.stanford.edu/gapscore/  (+info)
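
A hypothetical sketch of the adjustable-cutoff idea mentioned above: each word receives a score from a simple statistical model of its appearance and morphology, and the caller tunes a threshold to trade recall against precision. The features, weights, and example sentence are invented for illustration and do not reflect GAPSCORE's actual model (which also uses context).

```python
import re

def word_features(word):
    """Surface cues loosely inspired by 'appearance and morphology'."""
    return {
        "has_digit": any(c.isdigit() for c in word),
        "mixed_case": bool(re.search(r"[a-z][A-Z]", word)),
        "has_hyphen": "-" in word,
        "common_suffix": word.lower().endswith(("ase", "ogen")),
    }

# Illustrative log-odds-style weights such a model might learn.
WEIGHTS = {"has_digit": 1.2, "mixed_case": 1.5,
           "has_hyphen": 0.4, "common_suffix": 0.8}

def score(word):
    return sum(WEIGHTS[f] for f, on in word_features(word).items() if on)

def tag_gene_names(text, cutoff=1.0):
    """Lowering the cutoff raises recall at the cost of precision."""
    return [w for w in text.split() if score(w) >= cutoff]

if __name__ == "__main__":
    sentence = "Expression of p53 and Cdc25 phosphatase increased in treated cells"
    print(tag_gene_names(sentence, cutoff=1.0))  # stricter: fewer, surer hits
    print(tag_gene_names(sentence, cutoff=0.5))  # looser: also picks up 'phosphatase'
```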

A simple and practical dictionary-based approach for identification of proteins in Medline abstracts. (13/73)

OBJECTIVE: The aim of this study was to develop a practical and efficient protein identification system for biomedical corpora. DESIGN: The developed system, called ProtScan, utilizes a carefully constructed dictionary of mammalian proteins in conjunction with a specialized tokenization algorithm to identify and tag protein name occurrences in biomedical texts, and also takes advantage of Medline "Name-of-Substance" (NOS) annotation. The dictionaries for ProtScan were constructed semi-automatically from various public-domain sequence databases, followed by an intensive expert curation step. MEASUREMENTS: The recall and precision of the system were determined using 1000 randomly selected and hand-tagged Medline abstracts. RESULTS: The developed system is capable of identifying protein occurrences in Medline abstracts with 98% precision and 88% recall, and of processing approximately 300 abstracts per second. Without utilization of NOS annotation, precision and recall were 98.5% and 84%, respectively. CONCLUSION: The developed system appears to be well suited for protein-based Medline indexing and can help improve biomedical information retrieval. Further approaches to improving ProtScan's recall are also discussed.  (+info)
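
A minimal sketch of dictionary-based tagging with a simple tokenizer and greedy longest-match lookup, in the spirit of the approach described above. The tiny dictionary, the identifiers, and the tokenizer are stand-in assumptions; ProtScan's curated mammalian dictionary, tokenization algorithm, and NOS handling are far more elaborate.

```python
import re

# Stand-in dictionary: token tuple -> identifier.
PROTEIN_DICT = {
    ("tumor", "necrosis", "factor", "alpha"): "P01375",
    ("interleukin", "6"): "P05231",
    ("p53",): "P04637",
}
MAX_LEN = max(len(name) for name in PROTEIN_DICT)

def tokenize(text):
    """Lowercase alphanumeric tokens; punctuation and hyphens are separators."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

def tag_proteins(text):
    """Greedy longest-match scan of the token stream against the dictionary."""
    tokens, hits, i = tokenize(text), [], 0
    while i < len(tokens):
        for span in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + span])
            if candidate in PROTEIN_DICT:
                hits.append((" ".join(candidate), PROTEIN_DICT[candidate]))
                i += span
                break
        else:
            i += 1
    return hits

if __name__ == "__main__":
    abstract = "Interleukin-6 and tumor necrosis factor alpha induce p53."
    print(tag_proteins(abstract))
```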

Protein names precisely peeled off free text. (14/73)

MOTIVATION: Automatically identifying protein names in the scientific literature is a prerequisite for meeting the increasing demand to mine this wealth of information. Existing approaches are based on dictionaries, rules and machine learning. Here, we introduce a novel system that combines a pre-processing dictionary- and rule-based filtering step with several separately trained support vector machines (SVMs) to identify protein names in MEDLINE abstracts. RESULTS: Our new tagging system, NLProt, extracts protein names with a precision (accuracy) of 75% at a recall (coverage) of 76% after training on a corpus that was used before by other groups and contains 200 annotated abstracts. For our estimate of sustained performance, we counted partially identified names as false positives. One important issue frequently ignored in the literature is redundancy in evaluation sets. We suggest guidelines for removing problematic overlaps between training and testing sets. Applying these guidelines, our program appeared to significantly outperform other methods for tagging protein names. NLProt's success is largely due to its SVM building blocks, which succeed in utilizing the local context of protein names in the scientific literature. We venture that our system may constitute the most general and precise method for tagging protein names. AVAILABILITY: http://cubic.bioc.columbia.edu/services/nlprot/  (+info)
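
A toy sketch of the central idea the abstract credits for NLProt's performance: classifying candidate names with an SVM over their local context. The tiny hand-made training set, the features, and the scikit-learn pipeline below are illustrative assumptions, not NLProt's actual model, features, or corpus.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def context_features(tokens, i, window=2):
    """Bag of neighbouring words plus a simple shape cue for token i."""
    feats = {"word=" + tokens[i].lower(): 1,
             "has_digit": int(any(c.isdigit() for c in tokens[i]))}
    for d in range(1, window + 1):
        if i - d >= 0:
            feats["left=" + tokens[i - d].lower()] = 1
        if i + d < len(tokens):
            feats["right=" + tokens[i + d].lower()] = 1
    return feats

# (sentence tokens, index of the candidate token, 1 = protein name)
TRAIN = [
    ("the p53 protein suppresses tumours".split(), 1, 1),
    ("binding of BRCA1 to DNA was measured".split(), 2, 1),
    ("expression of the receptor kinase increased".split(), 4, 1),
    ("the patient protein intake was normal".split(), 2, 0),
    ("samples were stored at minus 80 degrees".split(), 5, 0),
    ("the study was approved by the committee".split(), 2, 0),
]

X = [context_features(toks, i) for toks, i, _ in TRAIN]
y = [label for _, _, label in TRAIN]

model = make_pipeline(DictVectorizer(), LinearSVC())
model.fit(X, y)

test = "levels of the p38 kinase were elevated".split()
print(model.predict([context_features(test, 3)]))  # classify the candidate 'p38'
```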

Improving the performance of dictionary-based approaches in protein name recognition. (15/73)

Dictionary-based protein name recognition is often a first step in extracting information from biomedical documents because it can provide ID information for recognized terms. However, dictionary-based approaches present two fundamental difficulties: (1) false recognition, mainly caused by short names; (2) low recall due to spelling variations. In this paper, we tackle the former problem by using machine learning to filter out false positives, and we present two alternative methods for alleviating the latter problem of spelling variations: approximate string searching, and expanding the dictionary with a probabilistic variant generator, which we propose in this paper. Experimental results on the GENIA corpus revealed that filtering with a naive Bayes classifier greatly improved precision with only a slight loss of recall, yielding a 10.8% improvement in F-measure, and that dictionary expansion with the variant generator gave a further 1.6% improvement, reaching an F-measure of 66.6%.  (+info)
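
One of the two recall fixes discussed above is approximate string searching; the sketch below illustrates that idea by matching spelling variants (hyphenation, spacing, case) against a dictionary after normalization, using a similarity threshold. The toy dictionary, the normalization, and the threshold are assumptions for illustration, not the paper's configuration.

```python
from difflib import SequenceMatcher

DICTIONARY = ["NF-kappa B", "interleukin-2 receptor", "tumor necrosis factor"]

def normalize(s):
    """Drop case, hyphens, and spacing before comparing."""
    return "".join(c.lower() for c in s if c.isalnum())

def similarity(a, b):
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def approximate_lookup(phrase, threshold=0.9):
    """Dictionary entries whose normalized form is close enough to the phrase."""
    return [(entry, round(similarity(phrase, entry), 3))
            for entry in DICTIONARY
            if similarity(phrase, entry) >= threshold]

if __name__ == "__main__":
    # Variants that an exact string match against the dictionary would miss:
    print(approximate_lookup("NFkappaB"))
    print(approximate_lookup("tumour necrosis factor"))
```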

Using name-internal and contextual features to classify biological terms. (16/73)

There has been considerable recent work on recognizing named entities in biomedical text. In this paper, we investigate the named entity classification task, an integral part of named entity extraction. We focus on the different sources of information that can be utilized for classification and assess the extent to which each is effective. To classify a name, we consider features that appear within the name as well as in nearby phrases. We also develop a new strategy based on the context of occurrence and show that it improves the performance of the classification system. We show how our work relates to previous work on named entity classification, both in the biological domain and in generic domains. The experiments were conducted on the GENIA corpus Ver. 3.0 developed at the University of Tokyo, on which we achieve an F-value of 86 under 10-fold cross-validation.  (+info)
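
A brief sketch of how name-internal and contextual features might be combined into a single feature vector for classifying an already-recognized entity, the task described above. The feature names and the example sentence are invented for illustration; they are not the paper's actual feature set.

```python
def entity_features(tokens, start, end, window=3):
    """tokens[start:end] is the recognized name; the rest supplies context."""
    name = tokens[start:end]
    feats = {}
    # Name-internal evidence: words inside the name plus simple suffix cues.
    for w in name:
        feats["in_name=" + w.lower()] = 1
    feats["suffix3=" + name[-1][-3:].lower()] = 1
    feats["name_len=" + str(len(name))] = 1
    # Contextual evidence: words in a window on either side of the name.
    for w in tokens[max(0, start - window):start]:
        feats["left=" + w.lower()] = 1
    for w in tokens[end:end + window]:
        feats["right=" + w.lower()] = 1
    return feats

if __name__ == "__main__":
    sent = "activation of the IL-2 receptor in T cells was reduced".split()
    # The recognized entity is "IL-2 receptor" (tokens 3 and 4).
    print(entity_features(sent, 3, 5))
```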