Medical text representations for inductive learning. (65/4007)

Inductive learning algorithms have been proposed as methods for classifying medical text reports. Many of these proposed techniques differ in the way the text is represented for use by the learning algorithms. Representations often differ in details that may be chosen arbitrarily, yet such differences can significantly affect classification algorithm performance. We examined 8 different data representation techniques used for medical text, and evaluated their use with standard machine learning algorithms. We measured the loss of classification-relevant information due to each representation. Representations that captured status information explicitly resulted in significantly better performance. Algorithm performance was dependent on subtle differences in data representation.  (+info)
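A minimal sketch of the kind of representation difference at issue, not the paper's actual encodings: a plain bag-of-words versus a representation that attaches an explicit status (present/absent) to each finding. The negation cue list and the one-token negation scope are illustrative simplifications.

```python
# Two hypothetical representations of a medical report. The negation
# cues and scoping rule are illustrative, not from the paper.

NEGATION_CUES = {"no", "without", "denies"}

def bag_of_words(report):
    """Plain representation: every token is a binary feature."""
    return {tok: 1 for tok in report.lower().split()}

def status_features(report):
    """Status-aware representation: a finding that follows a negation
    cue becomes 'finding=absent'; otherwise 'finding=present'."""
    feats = {}
    negated = False
    for tok in report.lower().split():
        if tok in NEGATION_CUES:
            negated = True
            continue
        feats[f"{tok}=absent" if negated else f"{tok}=present"] = 1
        negated = False  # cue scopes over the next token only (simplification)
    return feats
```

Under the plain bag-of-words, "no pneumothorax" and "pneumothorax" share the feature `pneumothorax`, so a classifier cannot distinguish an affirmed from a negated finding; the status-aware encoding keeps them separate.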

A data preprocessing framework for supporting probability-learning in dynamic decision modeling in medicine. (66/4007)

Data preprocessing is needed when real-life clinical databases are used as the data sources to learn the probabilities for dynamic decision models. Data preprocessing is challenging as it involves extensive manual effort and time in developing the data operation scripts. This paper presents a framework to facilitate automated and interactive generation of the problem-specific data preprocessing scripts. The framework has three major components: 1) A model parser that parses the decision model definition, 2) A graphical user interface that facilitates the interaction between the user and the system, and 3) A script generator that automatically generates the specific database scripts for the data preprocessing. We have implemented a prototype system of the framework and evaluated its effectiveness via a case study in the clinical domain. Preliminary results demonstrate the practical promise of the framework.  (+info)
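A sketch of the script-generation idea only; the framework's actual parser, interface, and script format are not given in the abstract, and the table, column, and template below are hypothetical. Each probability parameter parsed from the model definition is mapped to a database script via a template.

```python
# Hypothetical template-based script generator: each parsed model
# parameter (state variable, source table, condition) is filled into a
# SQL template that estimates the conditional probabilities.

SQL_TEMPLATE = (
    "SELECT {state}, COUNT(*) * 1.0 / SUM(COUNT(*)) OVER () AS prob\n"
    "FROM {table}\n"
    "WHERE {condition}\n"
    "GROUP BY {state};"
)

def generate_script(param):
    """Fill the template from a parsed model-parameter description."""
    return SQL_TEMPLATE.format(**param)

param = {"state": "outcome", "table": "patient_visits",
         "condition": "treatment = 'A'"}
script = generate_script(param)
```

The point of automating this step is that each dynamic decision model implies many such parameter-specific scripts, which would otherwise be written by hand.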

Disulfide recognition in an optimized threading potential. (67/4007)

An energy potential is constructed and trained to succeed in fold recognition for the general population of proteins as well as an important class which has previously been problematic: small, disulfide-bearing proteins. The potential is modeled on solvation, with the energy a function of side chain burial and the number of disulfide bonds. An accurate disulfide recognition algorithm identifies cysteine pairs which have the appropriate orientation to form a disulfide bridge. The potential has 22 energy parameters which are optimized so the Protein Data Bank (PDB) structure for each sequence in a training set is the lowest in energy out of thousands of alternative structures. One parameter per amino acid type reflects burial preference and a single parameter is used in an overpacking term. Additionally, one optimized parameter provides a favorable contribution for each disulfide identified in a given protein structure. With little training, the potential is >80% accurate in ungapped threading tests using a variety of proteins. The same level of accuracy is observed in a threading test of small proteins which have disulfide bonds. Importantly, the energy potential is also successful with proteins having uncrosslinked cysteines.  (+info)
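A geometric sketch of disulfide detection using a distance criterion only; the paper's algorithm also checks that the cysteine pair has the appropriate orientation. The coordinates and the 2.5 Å cutoff (the S-S bond length is roughly 2.05 Å) are illustrative.

```python
import numpy as np

SS_CUTOFF = 2.5  # Angstroms; illustrative threshold near the ~2.05 A S-S bond

def disulfide_pairs(sg_coords):
    """Return index pairs of cysteine SG atoms close enough to form a
    disulfide bridge (distance criterion only)."""
    pairs = []
    n = len(sg_coords)
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(np.asarray(sg_coords[i], dtype=float)
                               - np.asarray(sg_coords[j], dtype=float))
            if d < SS_CUTOFF:
                pairs.append((i, j))
    return pairs

# Three cysteine SG atoms: 0 and 1 bonded, 2 free (uncrosslinked).
coords = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
```

In the potential described above, each identified pair would contribute one favorable disulfide term, while the free cysteine contributes none, which is how the potential can remain accurate for proteins with uncrosslinked cysteines.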

Accurate prediction of protein functional class from sequence in the Mycobacterium tuberculosis and Escherichia coli genomes using data mining. (68/4007)

The analysis of genomics data needs to become as automated as its generation. Here we present a novel data-mining approach to predicting protein functional class from sequence. This method is based on a combination of inductive logic programming clustering and rule learning. We demonstrate the effectiveness of this approach on the M. tuberculosis and E. coli genomes, and identify biologically interpretable rules which predict protein functional class from information only available from the sequence. These rules predict 65% of the ORFs with no assigned function in M. tuberculosis and 24% of those in E. coli, with an estimated accuracy of 60-80% (depending on the level of functional assignment). The rules are founded on a combination of detection of remote homology, convergent evolution and horizontal gene transfer. We identify rules that predict protein functional class even in the absence of detectable sequence or structural homology. These rules give insight into the evolutionary history of M. tuberculosis and E. coli.  (+info)
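A sketch of how symbolic rules of this kind are applied once learned; the actual rules in the paper are induced by inductive logic programming, and the conditions, thresholds, and class names below are invented for illustration.

```python
# Hypothetical learned rules: each rule is a condition over ORF-derived
# features paired with a predicted functional class. Feature names and
# thresholds are illustrative, not the paper's.

RULES = [
    (lambda orf: orf["psi_blast_hits"] >= 3 and orf["length"] > 200,
     "small-molecule metabolism"),
    (lambda orf: orf["hydrophobicity"] > 0.5,
     "membrane protein"),
]

def predict_class(orf):
    """Return the class of the first matching rule, or None if no rule
    fires (the ORF remains unassigned)."""
    for condition, cls in RULES:
        if condition(orf):
            return cls
    return None
```

Because each rule is an explicit condition, a biologist can inspect why a prediction was made, which is the interpretability advantage claimed over black-box classifiers.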

Support vector machine classification and validation of cancer tissue samples using microarray expression data. (69/4007)

MOTIVATION: DNA microarray experiments, which generate thousands of gene expression measurements, are being used to gather information from tissue and cell samples regarding gene expression differences that will be useful in diagnosing disease. We have developed a new method to analyse this kind of data using support vector machines (SVMs). This analysis consists of both classification of the tissue samples and an exploration of the data for mislabeled or questionable tissue results. RESULTS: We demonstrate the method in detail on samples consisting of ovarian cancer tissues, normal ovarian tissues, and other normal tissues. The dataset consists of expression experiment results for 97,802 cDNAs for each tissue. As a result of computational analysis, a tissue sample is discovered and confirmed to be wrongly labeled. Upon correction of this mistake and the removal of an outlier, perfect classification of tissues is achieved, but not with high confidence. We identify and analyse a subset of genes from the ovarian dataset whose expression is highly differentiated between the types of tissues. To show the robustness of the SVM method, two previously published datasets from other types of tissues or cells are analysed. The results are comparable to those previously obtained. We show that other machine learning methods also perform comparably to the SVM on many of those datasets. AVAILABILITY: The SVM software is available at http://www.cs.columbia.edu/~bgrundy/svm.  (+info)
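A minimal sketch of the SVM idea on toy data, not the authors' software: a linear SVM trained by batch subgradient descent on the regularized hinge loss. Real expression analyses would use an optimized solver and thousands of gene dimensions; the two-dimensional "profiles" here are invented for illustration.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, eta=0.1, epochs=500):
    """Minimize (lam/2)||w||^2 + mean hinge loss by full-batch
    subgradient descent. Labels y must be in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                    # margin violations
        grad_w = lam * w
        grad_b = 0.0
        if viol.any():
            grad_w = grad_w - (y[viol, None] * X[viol]).mean(axis=0)
            grad_b = -y[viol].mean()
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b

def predict(w, b, X):
    """Classify by the sign of the decision function."""
    return np.sign(X @ w + b)

# Toy "expression profiles": two well-separated classes in 2-D.
X = np.array([[2., 2.], [3., 3.], [2., 3.],
              [-2., -2.], [-3., -3.], [-2., -3.]])
y = np.array([1., 1., 1., -1., -1., -1.])
w, b = train_linear_svm(X, y)
```

Samples with small or wrong-signed decision values `X @ w + b` are the natural candidates for the mislabeled-tissue exploration the abstract describes.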

A global geometric framework for nonlinear dimensionality reduction. (70/4007)

Scientists working with large volumes of high-dimensional data, such as global climate patterns, stellar spectra, or human gene distributions, regularly confront the problem of dimensionality reduction: finding meaningful low-dimensional structures hidden in their high-dimensional observations. The human brain confronts the same problem in everyday perception, extracting from its high-dimensional sensory inputs (30,000 auditory nerve fibers or 10^6 optic nerve fibers) a manageably small number of perceptually relevant features. Here we describe an approach to solving dimensionality reduction problems that uses easily measured local metric information to learn the underlying global geometry of a data set. Unlike classical techniques such as principal component analysis (PCA) and multidimensional scaling (MDS), our approach is capable of discovering the nonlinear degrees of freedom that underlie complex natural observations, such as human handwriting or images of a face under different viewing conditions. In contrast to previous algorithms for nonlinear dimensionality reduction, ours efficiently computes a globally optimal solution, and, for an important class of data manifolds, is guaranteed to converge asymptotically to the true structure.  (+info)
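The approach described (the Isomap algorithm) can be sketched in three steps: build a k-nearest-neighbor graph from local Euclidean distances, estimate geodesic distances as shortest paths in that graph, then embed with classical MDS. This is a compact illustration, not the authors' implementation; the semicircle test data are invented.

```python
import numpy as np

def isomap(X, n_neighbors, n_components):
    """Isomap sketch: kNN graph -> shortest-path geodesics -> classical MDS."""
    n = len(X)
    # 1. Pairwise Euclidean distances and k-nearest-neighbor graph.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    graph = np.full((n, n), np.inf)
    np.fill_diagonal(graph, 0.0)
    for i in range(n):
        for j in np.argsort(d[i])[1:n_neighbors + 1]:
            graph[i, j] = graph[j, i] = d[i, j]
    # 2. Geodesic distances via Floyd-Warshall shortest paths.
    for k in range(n):
        graph = np.minimum(graph, graph[:, k, None] + graph[None, k, :])
    # 3. Classical MDS on the geodesic distance matrix.
    H = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * H @ (graph ** 2) @ H        # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:n_components]
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

# Points on a semicircle: a 1-D manifold curled through 2-D space.
t = np.linspace(0.0, np.pi, 10)
X = np.c_[np.cos(t), np.sin(t)]
Y = isomap(X, n_neighbors=2, n_components=1).ravel()
```

On the semicircle, the 1-D embedding recovers arc-length order, and the embedded endpoint separation approximates the geodesic (about 3.1 here) rather than the straight-line chord (2.0), which is exactly where PCA and plain MDS would go wrong.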

Nonlinear dimensionality reduction by locally linear embedding. (71/4007)

Many areas of science depend on exploratory data analysis and visualization. The need to analyze large amounts of multivariate data raises the fundamental problem of dimensionality reduction: how to discover compact representations of high-dimensional data. Here, we introduce locally linear embedding (LLE), an unsupervised learning algorithm that computes low-dimensional, neighborhood-preserving embeddings of high-dimensional inputs. Unlike clustering methods for local dimensionality reduction, LLE maps its inputs into a single global coordinate system of lower dimensionality, and its optimizations do not involve local minima. By exploiting the local symmetries of linear reconstructions, LLE is able to learn the global structure of nonlinear manifolds, such as those generated by images of faces or documents of text.  (+info)
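The LLE algorithm can be sketched directly from the description above: reconstruct each point from its neighbors with weights that sum to one, then find the low-dimensional coordinates best reconstructed by those same weights (a bottom-eigenvector problem). This is a compact sketch, not the authors' code; the regularization constant and the line-in-3-D test data are illustrative.

```python
import numpy as np

def lle(X, n_neighbors, n_components, reg=1e-3):
    """Locally linear embedding sketch: local reconstruction weights,
    then bottom eigenvectors of M = (I - W)^T (I - W)."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d[i])[1:n_neighbors + 1]
        Z = X[nbrs] - X[i]                           # neighbors, centered on x_i
        C = Z @ Z.T                                  # local covariance
        C = C + reg * np.trace(C) * np.eye(len(nbrs))  # regularize (may be singular)
        w = np.linalg.solve(C, np.ones(len(nbrs)))
        W[i, nbrs] = w / w.sum()                     # weights sum to 1
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, 1:n_components + 1]               # skip the constant eigenvector

# A 1-D manifold embedded as a line in 3-D.
t = np.linspace(0.0, 1.0, 12)
X = np.c_[t, 2 * t, 3 * t]
Y = lle(X, n_neighbors=2, n_components=1).ravel()
```

Because the weights are invariant to rotation, scaling, and translation of each neighborhood, the same weights that reconstruct points in the high-dimensional space also characterize the manifold's intrinsic geometry, which is why a single global eigenproblem (with no local minima) suffices.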

The information content of spontaneous retinal waves. (72/4007)

Spontaneous neural activity that is present in the mammalian retina before the onset of vision is required for the refinement of retinotopy in the lateral geniculate nucleus and superior colliculus. This paper explores the information content of this retinal activity, with the goal of determining constraints on the nature of the developmental mechanisms that use it. Through information-theoretic analysis of multielectrode and calcium-imaging experiments, we show that the spontaneous retinal activity present early in development provides information about the relative positions of retinal ganglion cells and can, in principle, be used at retinogeniculate and retinocollicular synapses to refine retinotopy. Remarkably, we find that most retinotopic information provided by retinal waves exists on relatively coarse time scales, suggesting that developmental mechanisms must be sensitive to timing differences from 100 msec up to 2 sec to make optimal use of it. In fact, a simple Hebbian-type learning rule with a correlation window on the order of seconds is able to extract the bulk of the available information. These findings are consistent with bursts of action potentials (rather than single spikes) being the unit of information used during development and suggest new experimental approaches for studying developmental plasticity of the retinogeniculate and retinocollicular synapses. More generally, these results demonstrate how the properties of neuronal systems can be inferred from the statistics of their input.  (+info)
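A toy illustration of the coarse-time-scale point, not the paper's model: a Hebbian-type rule whose weight change counts pre/post spike pairs falling within a symmetric correlation window. The spike times and learning rate are invented; with a window on the order of a second, co-active bursts strengthen a connection while cells firing seconds apart do not.

```python
import numpy as np

def hebbian_update(pre_spikes, post_spikes, window, rate=0.01):
    """Weight change proportional to the number of pre/post spike pairs
    within +/- window seconds of each other (symmetric correlation
    window; no spike-timing asymmetry)."""
    pre = np.asarray(pre_spikes, dtype=float)[:, None]
    post = np.asarray(post_spikes, dtype=float)[None, :]
    pairs = np.abs(pre - post) <= window
    return rate * pairs.sum()

# Spike times in seconds. Cell A bursts nearly synchronously with the
# postsynaptic cell; cell B fires seconds away.
post = [0.0, 0.1, 5.0]
cell_a = [0.05, 0.15, 5.1]   # co-active within ~100 ms
cell_b = [2.5, 7.5]          # outside a 1-s window of every post spike
```

With `window=1.0`, cell A accumulates a positive update and cell B none, so a rule this coarse already distinguishes neighboring from distant retinal ganglion cells, consistent with bursts rather than single spikes carrying the retinotopic information.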