Feature selection and transduction for prediction of molecular bioactivity for drug design. (49/3813)

MOTIVATION: In drug discovery a key task is to identify characteristics that separate active (binding) compounds from inactive (non-binding) ones. An automated prediction system can help reduce the resources necessary to carry out this task. RESULTS: Two methods for prediction of molecular bioactivity for drug design are introduced and shown to perform well on a data set previously studied as part of the KDD (Knowledge Discovery and Data Mining) Cup 2001. The data is characterized by very few positive examples, a very large number of features (describing three-dimensional properties of the molecules) and rather different distributions between training and test data. Two techniques are introduced specifically to tackle these problems: a feature selection method for unbalanced data and a classifier which adapts to the distribution of the unlabeled test data (a so-called transductive method). We show that both techniques improve identification performance and that in conjunction they provide an improvement over using only one of the techniques. Our results suggest the importance of taking into account the characteristics of this data, which may also be relevant in other problems of a similar type.
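
The abstract does not give the scoring rule used for feature selection on unbalanced data. As a minimal sketch of the general idea only, the snippet below assumes binary features and a hypothetical score that rewards a feature for appearing in the few active compounds and penalizes it, weighted by an assumed factor `lam`, for appearing in inactive ones. The function names, the weight, and the toy data are all illustrative assumptions, not the authors' method.

```python
def unbalanced_scores(X, y, lam=3.0):
    """Score each binary feature (assumed scoring rule, for illustration).

    score_j = (# active compounds with feature j)
              - lam * (# inactive compounds with feature j)
    """
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        pos = sum(row[j] for row, label in zip(X, y) if label == 1)
        neg = sum(row[j] for row, label in zip(X, y) if label != 1)
        scores.append(pos - lam * neg)
    return scores


def top_features(X, y, k, lam=3.0):
    """Return the indices of the k highest-scoring features."""
    scores = unbalanced_scores(X, y, lam)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


# Toy data: 2 active compounds (label 1) and 4 inactive ones (label -1).
X = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 0, 1],
]
y = [1, 1, -1, -1, -1, -1]

scores = unbalanced_scores(X, y)
selected = top_features(X, y, k=2)
```

In this toy data the fourth feature is present in every compound, so it carries no information and receives the lowest score despite covering both actives.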

Comparison of three multitrait methods for QTL detection. (50/3813)

A comparison of power and accuracy of estimation of position and QTL effects of three multitrait methods and one single trait method for QTL detection was carried out on simulated data, taking into account the mixture of full and half-sib families. One multitrait method was based on a multivariate function as the penetrance function (MV). The two other multitrait methods were based on univariate analysis of linear combination(s) (LC) of the traits. One was obtained by a principal component analysis (PCA) performed on the phenotypic data. The second was based on a discriminant analysis (DA), which calculates an LC of the traits at each position, maximising the ratio between the genetic and the residual variabilities due to the putative QTL. Due to its number of parameters, MV was less powerful and accurate than the other methods. In general, DA better detected QTL, but it had lower accuracy for the QTL effect estimation when the detection power was low, due to higher bias than the other methods. In this case, PCA was better. Otherwise, PCA was slightly less powerful and accurate than DA. Compared to the single trait method, power can be improved by 30% to 100% with multitrait methods.
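
To illustrate what a PCA-derived linear combination of traits looks like, here is a minimal sketch that computes the first principal component of a small phenotype table by power iteration and uses it as trait weights. The two-trait data, the function names, and the fixed iteration count are assumptions made for illustration; the abstract does not describe the implementation used in the study.

```python
import math


def center(data):
    """Subtract the column (trait) means from every row."""
    n = len(data)
    means = [sum(col) / n for col in zip(*data)]
    return [[v - m for v, m in zip(row, means)] for row in data]


def covariance(centered):
    """Sample covariance matrix of already-centered data."""
    n = len(centered)
    p = len(centered[0])
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
        for i in range(p)
    ]


def first_component(cov, iters=200):
    """Leading eigenvector of cov by power iteration.

    Assumes the start vector is not orthogonal to the leading eigenvector.
    """
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Hypothetical phenotypes: one row per individual, one column per trait.
phenotypes = [[2.0, 1.9], [3.0, 3.2], [4.0, 3.9], [5.0, 5.1], [6.0, 5.9]]

centered = center(phenotypes)
weights = first_component(covariance(centered))
# The linear combination (LC) that a single-trait analysis would then use.
combined = [sum(w * x for w, x in zip(weights, row)) for row in centered]
```

Because the two toy traits are strongly correlated, the weights come out nearly equal, and `combined` behaves like a single summary trait.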

Mining HIV dynamics using independent component analysis. (51/3813)

MOTIVATION: We implement a data mining technique based on the method of Independent Component Analysis (ICA) to generate reliable independent data sets for different HIV therapies. We show that this technique takes advantage of the power of ICA to eliminate the noise generated by artificial interaction of HIV system dynamics. Moreover, the incorporation of the actual laboratory data sets into the analysis phase offers a powerful advantage when compared with other mathematical procedures that consider the general behavior of HIV dynamics. RESULTS: The ICA algorithm has been used to generate different patterns of the HIV dynamics under different therapy conditions. The Kohonen Map has been used to eliminate redundant noise in each pattern to produce a reliable data set for the simulation phase. We show that under potent antiretroviral drugs, the value of the CD4+ cells in infected persons decreases gradually by about 11% every 100 days and the levels of the CD8+ cells increase gradually by about 2% every 100 days. AVAILABILITY: Executable code and data libraries are available by contacting the corresponding author. IMPLEMENTATION: Mathematica 4 has been used to simulate the suggested model. A Pentium III or higher platform is recommended.
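
To make the reported rates concrete, the sketch below projects relative cell levels forward under the assumption that the per-100-day percentage changes compound at a constant rate. The abstract reports only the percentages; the compounding model, the function name, and the normalized starting level of 1.0 are assumptions for illustration.

```python
def project(level, pct_change_per_100_days, days):
    """Project a level forward, assuming the change compounds every 100 days."""
    factor = 1.0 + pct_change_per_100_days / 100.0
    return level * factor ** (days / 100.0)


# Relative levels after 300 days, starting from a normalized level of 1.0.
cd4_300 = project(1.0, -11.0, 300)  # 0.89 ** 3
cd8_300 = project(1.0, 2.0, 300)    # 1.02 ** 3
```

Under this assumption, the CD4+ level after 300 days would be about 70% of its starting value and the CD8+ level about 106%.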

Multidimensional chemical genetic analysis of diversity-oriented synthesis-derived deacetylase inhibitors using cell-based assays. (52/3813)

Systematic chemical genetics aims to explore the space representing interactions between small molecules and biological systems. Beyond measuring binding interactions and enzyme inhibition, measuring changes in the activity of proteins in intact signaling networks is necessary. Toward this end, we are partitioning chemical space into regions with different biological activities using a panel of cell-based assays and small molecule "chemical genetic modifiers." Herein, we report on the use of this methodology for the discovery of 617 small molecule inhibitors of histone deacetylases from a multidimensional screen of an encoded, diversity-oriented synthesis library. Following decoding of chemical tags and resynthesis, we demonstrate the selectivity of one inhibitory molecule (tubacin) toward alpha-tubulin deacetylation and another (histacin) toward histone deacetylation. These small molecules will facilitate dissecting the role of acetylation in a variety of cell biological processes.

Use of phospholipid fatty acids to detect previous self-heating events in stored peat. (53/3813)

The use of the phospholipid fatty acid (PLFA) composition of microorganisms to detect previous self-heating events was studied in naturally self-heated peat and in peat incubated under temperature-controlled conditions. An increased content of total PLFAs was found in self-heated peat compared to that in unheated peat. Two PLFAs, denoted T1 and T2, were detected only in the self-heated peat. Incubation of peat samples at 25 to 55 degrees C for 4 days indicated that T1 and T2 were produced by microorganisms with different optimum temperatures. This was confirmed by isolation of bacteria at 55 degrees C, which produced T2 but not T1. These bacteria produced another PLFA (denoted T3) which coeluted with 18:1omega7. T2 and T3 were identified as omega-cyclohexyltridecanoic acid and omega-cyclohexylundecanoic acid, respectively, indicating that the bacteria belonged to the genus Alicyclobacillus. T1 was tentatively identified as omega-cycloheptylundecanoic acid. T2 was detected 8 h after the peat incubation temperature was increased to 55 degrees C, and maximum levels were found within 5 days of incubation. The PLFA 18:1omega7-T3 increased in proportion to T2. T1 was detected after 96 h at 55 degrees C, and its level increased throughout the incubation period, so that it eventually became one of the dominant PLFAs after 80 days. In peat samples incubated at 55 degrees C and then at 25 degrees C, T1 and T2 disappeared slowly. After 3 months, detectable levels were still found. Incubation at 25 degrees C after heating for 3 days at 55 degrees C decreased the amounts of T2 and 18:1omega7-T3 faster than did incubation at 5 degrees C. Thus, not only the duration and temperature during the heating event but also the storage temperature following heating are important for the detection of PLFAs indicating previous self-heating.

Simultaneous gene clustering and subset selection for sample classification via MDL. (54/3813)

MOTIVATION: Microarray technology allows for the simultaneous monitoring of thousands of genes for each sample. The high-dimensional gene expression data can be used to study similarities of gene expression profiles across different samples to form a gene clustering. The clusters may be indicative of genetic pathways. Parallel to gene clustering is the important application of sample classification based on all or selected gene expressions. Gene clustering and sample classification are often undertaken separately, or in a directional manner (one as an aid for the other). However, such separation of these two tasks may occlude informative structure in the data. Here we present an algorithm for the simultaneous clustering of genes and subset selection of gene clusters for sample classification. We develop a new model selection criterion based on Rissanen's MDL (minimum description length) principle. For the first time, an MDL code length is given for both explanatory variables (genes) and response variables (sample class labels). The final output of the proposed algorithm is a sparse and interpretable classification rule based on cluster centroids or the closest genes to the centroids. RESULTS: Our algorithm for simultaneous gene clustering and subset selection for classification is applied to three publicly available data sets. For all three data sets, we obtain sparse and interpretable classification models based on centroids of clusters. At the same time, these models give test error rates competitive with those of the best reported methods. Compared with classification models based on single gene selections, our rules are stable in the sense that the number of clusters has a small variability and the centroids of the clusters are well correlated (or consistent) across different cross validation samples. We also discuss models where the centroids of clusters are replaced with the genes closest to the centroids. These models show comparable test error rates to models based on single gene selection, but are more sparse as well as more stable. Moreover, we comment on how the inclusion of a classification criterion affects the gene clustering, bringing out class informative structure in the data. AVAILABILITY: The methods presented in this paper have been implemented in the R language. The source code is available from the first author.

Novel clustering algorithm for microarray expression data in a truncated SVD space. (55/3813)

MOTIVATION: This paper introduces the application of a novel clustering method to microarray expression data. Its first stage involves compression of dimensions that can be achieved by applying SVD to the gene-sample matrix in microarray problems. Thus the data (samples or genes) can be represented by vectors in a truncated space of low dimensionality, 4 and 5 in the examples studied here. We find it preferable to project all vectors onto the unit sphere before applying a clustering algorithm. The clustering algorithm used here is the quantum clustering method that has one free scale parameter. Although the method is not hierarchical, it can be modified to allow hierarchy in terms of this scale parameter. RESULTS: We apply our method to three data sets. The results are very promising. On cancer cell data we obtain a dendrogram that reflects correct groupings of cells. In an AML/ALL data set we obtain very good clustering of samples into four classes of the data. Finally, in clustering of genes in yeast cell cycle data we obtain four groups in a problem that is estimated to contain five families. AVAILABILITY: Software is available as Matlab programs at http://neuron.tau.ac.il/~horn/QC.htm.
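
The projection step mentioned above amounts to rescaling each reduced vector to unit length. The sketch below shows only that step, on made-up 4-dimensional vectors standing in for the output of a truncated SVD. The function name and the data are assumptions, and neither the SVD nor the quantum clustering stage is reproduced here.

```python
import math


def to_unit_sphere(vectors):
    """Rescale each vector to Euclidean length 1."""
    projected = []
    for v in vectors:
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            raise ValueError("cannot project a zero vector onto the unit sphere")
        projected.append([x / norm for x in v])
    return projected


# Hypothetical vectors in a truncated 4-dimensional space.
reduced = [
    [3.0, 4.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 2.0],
]
unit = to_unit_sphere(reduced)
```

After this step only the direction of each vector matters, so samples or genes that differ mainly in overall magnitude end up close together before clustering.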

Use of multivariate analysis to assess the nutritional condition of fish larvae from nucleic acids and protein content. (56/3813)

The nutritional condition of turbot larvae (Scophthalmus maximus) was assessed by a multivariate analysis with DNA, RNA, and protein content as input variables. Special attention was given to the time that feeding began and to the timing and duration of starvation. The combination of principal components analysis and stepwise discriminant analysis, both techniques of multivariate analysis, made it possible to allocate the larvae into groups that were defined and identified based on similarities in developmental stage and nutritional condition. The developmental stage was mostly determined by the input variables DNA and protein content, while nutritional condition was determined by the RNA content. In the period studied, the more developed larvae were less resistant to starvation. Furthermore, when initial feeding was delayed as little as 6 h, the variables analyzed were markedly changed, and the effect on the deprived larvae was found to be equivalent to a 3-day delay in development when compared with the larvae fed immediately after mouth opening. Through this technique, new samples of larvae with unknown history might be classified into groups, using their DNA, RNA, and protein content as input values in the defined classification functions. Results were compared to those obtained using RNA/DNA and RNA/dry weight indices, and the multivariate method was considered to be more sensitive and to provide extra information about larval nutritional history and development.