Mapping knowledge domains: characterizing PNAS. (25/351)

A review of data mining and analysis techniques that can be used for the mapping of knowledge domains is given. Literature mapping techniques can be based on authors, documents, journals, words, and/or indicators. Most mapping questions are related to research assessment or to the structure and dynamics of disciplines or networks. Several mapping techniques are demonstrated on a data set comprising 20 years of papers published in PNAS. Data from a variety of sources are merged to provide unique indicators of the domain bounded by PNAS. By using funding source information and citation counts, it is shown that, on an aggregate basis, papers funded jointly by the U.S. Public Health Service (which includes the National Institutes of Health) and non-U.S. government sources outperform papers funded by other sources, including by the U.S. Public Health Service alone. Grant data from the National Institute on Aging show that, on average, papers from large grants are cited more than those from small grants, with performance increasing with grant amount. A map of the highest performing papers over the 20-year period was generated by using citation analysis. Changes and trends in the subjects of highest impact within the PNAS domain are described. Interactions between topics over the most recent 5-year period are also detailed.

The simultaneous evolution of author and paper networks. (26/351)

There has been a long history of research into the structure and evolution of mankind's scientific endeavor. However, recent progress in applying the tools of science to understand science itself has been unprecedented because only recently has there been access to high-volume and high-quality data sets of scientific output (e.g., publications, patents, grants) and computers and algorithms capable of handling this enormous stream of data. This article reviews major work on models that aim to capture and recreate the structure and dynamics of scientific evolution. We then introduce a general process model that simultaneously grows coauthor and paper citation networks. The statistical and dynamic properties of the networks generated by this model are validated against a 20-year data set of articles published in PNAS. Systematic deviations from a power law distribution of citations to papers are well fit by a model that incorporates a partitioning of authors and papers into topics, a bias for authors to cite recent papers, and a tendency for authors to cite papers cited by papers that they have read. In this TARL model (for topics, aging, and recursive linking), the number of topics is linearly related to the clustering coefficient of the simulated paper citation network.
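As a rough illustration of the three ingredients named in the abstract (topics, aging, and recursive linking), the following Python sketch grows a toy paper citation network. It is not the authors' TARL implementation: the function name, every parameter (`recency_decay`, `follow_prob`, and so on), and the geometric form of the recency bias are assumptions made for illustration, and the coauthor network is omitted entirely.

```python
import random


def grow_citation_network(n_papers=200, n_topics=5, refs_per_paper=3,
                          recency_decay=0.05, follow_prob=0.5, seed=0):
    """Grow a toy citation network; returns (topics, citations).

    topics[i] is the topic of paper i; citations[i] is the set of
    earlier papers that paper i cites. All parameters are hypothetical.
    """
    rng = random.Random(seed)
    topics = []
    citations = []
    for new in range(n_papers):
        # Topics: each paper belongs to one topic and cites only within it.
        topic = rng.randrange(n_topics)
        pool = [p for p in range(new) if topics[p] == topic]
        cited = set()
        if pool:
            # Aging: older papers are less likely to be read (assumed geometric decay).
            weights = [(1.0 - recency_decay) ** (new - p) for p in pool]
            target = min(refs_per_paper, len(pool))
            while len(cited) < target:
                read = rng.choices(pool, weights=weights, k=1)[0]
                cited.add(read)
                # Recursive linking: sometimes also cite a paper that the
                # paper just read cites.
                refs = sorted(citations[read])
                if refs and len(cited) < target and rng.random() < follow_prob:
                    cited.add(rng.choice(refs))
        topics.append(topic)
        citations.append(cited)
    return topics, citations


topics, citations = grow_citation_network()
in_degree = [0] * len(topics)
for refs in citations:
    for r in refs:
        in_degree[r] += 1
print("most cited paper received", max(in_degree), "citations")
```

Varying `n_topics` and comparing the resulting citation distributions is one way to explore how topic partitioning changes the simulated network, although this toy makes no attempt to reproduce the paper's quantitative results.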

The architecture of complex weighted networks. (27/351)

Networked structures arise in a wide array of different contexts such as technological and transportation infrastructures, social phenomena, and biological systems. These highly interconnected systems have recently been the focus of a great deal of attention that has uncovered and characterized their topological complexity. Along with a complex topological structure, real networks display a large heterogeneity in the capacity and intensity of the connections. These features, however, have mainly not been considered in past studies where links are usually represented as binary states, i.e., either present or absent. Here, we study the scientific collaboration network and the world-wide air-transportation network, which are representative examples of social and large infrastructure systems, respectively. In both cases it is possible to assign to each edge of the graph a weight proportional to the intensity or capacity of the connections among the various elements of the network. We define appropriate metrics combining weighted and topological observables that enable us to characterize the complex statistical properties and heterogeneity of the actual strength of edges and vertices. This information allows us to investigate the correlations among weighted quantities and the underlying topological structure of the network. These results provide a better description of the hierarchies and organizational principles at the basis of the architecture of weighted networks.
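The abstract does not spell out its metrics, so the following Python sketch only illustrates the general idea of combining weighted and topological observables. It computes vertex strength (the sum of the weights on a vertex's edges) and one commonly used weighted clustering coefficient, in which each closed triangle around a vertex counts in proportion to the average weight of the two edges joining it to that vertex. The function names and the toy edge list are hypothetical.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected weighted adjacency map from (u, v, weight) triples."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj


def vertex_strength(adj, i):
    """Strength of vertex i: the sum of the weights of its edges."""
    return sum(adj[i].values())


def weighted_clustering(adj, i):
    """Weighted clustering of vertex i.

    Sums (w_ij + w_ih) / 2 over connected neighbor pairs (j, h) and
    normalizes by s_i * (k_i - 1). The factor 2 accounts for each
    unordered pair appearing twice in a sum over ordered pairs.
    """
    k = len(adj[i])
    s = vertex_strength(adj, i)
    if k < 2 or s == 0:
        return 0.0
    total = 0.0
    for j, h in combinations(adj[i], 2):
        if h in adj[j]:
            total += (adj[i][j] + adj[i][h]) / 2.0
    return 2.0 * total / (s * (k - 1))


# Toy graph: a triangle a-b-c with unit weights, plus a heavy edge a-d.
toy_edges = [("a", "b", 1.0), ("a", "c", 1.0), ("b", "c", 1.0), ("a", "d", 2.0)]
adj = build_adjacency(toy_edges)
print(len(adj["a"]), vertex_strength(adj, "a"), weighted_clustering(adj, "a"))
```

In the toy graph, vertex "a" has degree 3 but strength 4, and its weighted clustering (0.25) is lower than the unweighted value of 1/3 because its heaviest edge, to "d", closes no triangle. With all weights equal to 1 the function reduces to the ordinary clustering coefficient.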

A tool for gene expression based PubMed search through combining data sources. (28/351)

We present a new tool for the semi-automated querying of PubMed using a batch of tens to thousands of GenBank accession numbers or UniGene cluster IDs. By combining information from UniGene and SWISS-PROT, microGENIE obtains information on the biological relevance of expressed genes, as identified by micro-array experiments, with minimal user intervention and time investment. AVAILABILITY: microGENIE is freely available from http://www.cs.vu.nl/microgenie. SUPPLEMENTARY INFORMATION: The web site above supplies examples of input and output files.

B.E.A.R. GeneInfo: a tool for identifying gene-related biomedical publications through user modifiable queries. (29/351)

BACKGROUND: Once specific genes are identified through high-throughput genomics technologies, there is a need to sort the final gene list to a manageable size for validation studies. The triaging and sorting of genes often relies on supplemental information related to gene structure, metabolic pathways, and chromosomal location. Yet in disease states where the genes may have no identifiable structural elements, poorly defined metabolic pathways, or limited chromosomal data, flexible systems for obtaining additional data are necessary. In these situations, a tool for searching the biomedical literature using the list of identified genes while simultaneously defining additional search terms would be useful. RESULTS: We have built a tool, BEAR GeneInfo, that allows flexible searches based on the investigator's knowledge of the biological process, thus allowing for data mining that is specific to the scientist's strengths and interests. This tool allows a user to upload a series of GenBank accession numbers, UniGene IDs, LocusLink IDs, or gene names. BEAR GeneInfo takes these IDs, identifies the associated gene names, and uses the lists of gene names to query PubMed. The investigator can add modifying search terms to the query. The subsequent output provides a list of publications, along with the associated reference hyperlinks, for reviewing the identified articles for relevance and interest. An example of the use of this tool in the study of human prostate cancer cells treated with selenium is presented. CONCLUSIONS: This tool can be used to further define a list of genes that have been identified through genomic or genetic studies. Through the use of targeted searches with additional search terms, the investigator can limit the list to genes that match their specific research interests or needs. The tool is freely available on the web at http://prostategenomics.org, and the authors will provide scripts and database components on request.
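The query step described above (gene names combined with investigator-supplied modifying terms) can be sketched in Python as follows. This is not BEAR GeneInfo's code, which the abstract does not show: the helper names, the choice of the `[Title/Abstract]` field tag, and the example gene names are assumptions. The sketch only builds a search string and a URL for NCBI's public E-utilities `esearch` endpoint; it makes no network request.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (public service; usage limits apply).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_term(gene_names, modifiers=()):
    """OR the gene names together, then AND each modifying term onto the query."""
    if not gene_names:
        raise ValueError("at least one gene name is required")
    genes = " OR ".join(f'"{g}"[Title/Abstract]' for g in gene_names)
    term = f"({genes})"
    for m in modifiers:
        term += f' AND "{m}"[Title/Abstract]'
    return term


def build_esearch_url(term, retmax=20):
    """Return an esearch URL for the given term (hypothetical helper)."""
    return ESEARCH + "?" + urlencode({"db": "pubmed", "term": term, "retmax": retmax})


# Illustrative gene names only; they are not taken from the paper's data.
term = build_pubmed_term(["GPX1", "TP53"], modifiers=["prostate", "selenium"])
print(term)
print(build_esearch_url(term))
```

Keeping the modifiers as separate AND clauses mirrors the abstract's point that additional terms narrow a gene list to the investigator's own interests; a real tool would also need the ID-to-gene-name lookup step, which is left out here.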

NLProt: extracting protein names and sequences from papers. (30/351)

Automatically extracting protein names from the literature and linking these names to the associated entries in sequence databases is becoming increasingly important for annotating biological databases. NLProt is a novel system that combines dictionary- and rule-based filtering with several support vector machines (SVMs) to tag protein names in PubMed abstracts. When considering partially tagged names as errors, NLProt still reached a precision of 75% at a recall of 76%. By many criteria our system outperformed other tagging methods significantly; in particular, it proved very reliable even for novel names. Names encountered particularly frequently in Drosophila, such as white, wing and bizarre, constitute an obvious limitation of NLProt. Our method is available both as an Internet server and as a program for download (http://cubic.bioc.columbia.edu/services/NLProt/). Input can be PubMed/MEDLINE identifiers, authors, titles and journals, as well as collections of abstracts, or entire papers.
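To make the strict scoring convention concrete (partially tagged names counted as errors), here is a minimal Python sketch of exact-match precision and recall over character spans. It is an assumed evaluation scheme for illustration, not NLProt's own scoring code; the function name and the toy spans are hypothetical.

```python
def strict_precision_recall(gold_spans, predicted_spans):
    """Exact-match scoring: a predicted span counts only if it equals a gold span.

    Spans are (start, end) character offsets. A partially overlapping
    prediction is both a false positive and a missed gold name.
    """
    gold = set(gold_spans)
    predicted = set(predicted_spans)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy data: two exact hits, one partial tag (10-15 vs. 10-18), one spurious tag.
gold = [(0, 5), (10, 18), (30, 36), (40, 44)]
predicted = [(0, 5), (10, 15), (30, 36), (50, 55)]
print(strict_precision_recall(gold, predicted))
```

Under a lenient scheme the partial tag at offsets 10 to 15 might be credited, which is why figures reported under strict matching are the more conservative ones.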

Unitary or unified taxonomy? (31/351)

Taxonomic data form a substantial, but scattered, resource. The alternative to such a fragmented system is a 'unitary' one of preferred, consensual classifications. For effective access and distribution the (Web) revision for a given taxon would be established at a single Internet site. Although all the international codes of nomenclature currently preclude the Internet as a valid medium of publication, elements of unitary taxonomy (UT) still exist in the paper system. Much taxonomy, unitary or not, already resides on the Web. Arguments for and against adopting a unitary approach are considered and a resolution is attempted. Rendering taxonomy essentially Web-based is as inevitable as it is desirable. Apparently antithetical to the UT proposal is the view that in reality multiple classifications of the same taxon exist, since different taxonomists often hold different concepts of their taxa: a single name may apply to many different (frequently overlapping) circumscriptions and more than one name to a single taxon. However, novel means are being developed on single Internet sites to retain the diversity of multiple concepts for taxa, providing hope that taxonomy may become established as a Web-based information discipline that will unify the discipline and facilitate data access.

An online tutorial for helping nonscience majors read primary research literature in biology. (32/351)

Using primary literature is an effective tool for promoting active learning and critical thinking in science classes. However, it can be challenging to use primary literature in large classes and in classes for nonscience majors. We describe the development and implementation of an online tutorial for helping nonscience majors learn to read primary literature in biology. The tutorial includes content about the scientific process and the structure of scientific papers and provides opportunities for students to practice reading primary literature. We describe the use of the tutorial in Biology of Exercise, a course for nonscience majors. Students used the tutorial outside of class to learn the basic principles involved in reading scientific papers, enabling class sessions to focus on active-learning activities and substantive class discussions.