Computer-aided detection and diagnosis at the start of the third millennium. (65/3779)

Computer-aided diagnosis has been under development for more than three decades. The rate of progress appears exponential, with devices focusing on mammography, chest radiographs, and chest CT either recently approved or pending approval. Related technologies improve diagnosis for many other types of medical images, including virtual colonography and vascular imaging, and support automated quantitation of image-derived metrics. A variety of techniques are currently employed with success, likely reflecting the variety of imagery used as well as the variety of tasks. Most areas of medical imaging have seen efforts at computer assistance, and some systems have received FDA approval and can be reimbursed. We anticipate that the rapid advance of these technologies will continue, and that application will broaden to cover much of medical imaging. Acceptance of computer-aided diagnosis technology, and its integration with the electronic radiology practice, are current challenges. These challenges will be overcome, and we expect that computer-aided diagnosis will be routinely applied to medical images.

Bayesian automatic relevance determination algorithms for classifying gene expression data. (66/3779)

MOTIVATION: We investigate two new Bayesian classification algorithms incorporating feature selection. These algorithms are applied to the classification of gene expression data derived from cDNA microarrays. RESULTS: We demonstrate the effectiveness of the algorithms on three gene expression datasets for cancer, showing they compare well with alternative kernel-based techniques. By automatically incorporating feature selection, accurate classifiers can be constructed utilizing very few features and with minimal hand-tuning. We argue that the feature selection is meaningful and some of the highlighted genes appear to be medically important.

Multiple sequence alignment with arbitrary gap costs: computing an optimal solution using polyhedral combinatorics. (67/3779)

Multiple sequence alignment is one of the dominant problems in computational molecular biology. Numerous scoring functions and methods have been proposed, most of which result in NP-hard problems. In this paper we propose for the first time a general formulation for multiple alignment with arbitrary gap costs based on an integer linear program (ILP). In addition, we describe a branch-and-cut algorithm to effectively solve the ILP to optimality. We evaluate the performance of our approach in terms of running time and quality of the alignments using the BAliBase database of reference alignments. The results show that our implementation ranks amongst the best programs developed so far.

Feature subset selection for splice site prediction. (68/3779)

MOTIVATION: The large amount of available annotated Arabidopsis thaliana sequences allows the induction of splice site prediction models with supervised learning algorithms (see Haussler (1998) for a review and references). These algorithms need information sources or features from which the models can be computed. For splice site prediction, the features we consider in this study are the presence or absence of certain nucleotides in close proximity to the splice site. Since it is not known how many and which nucleotides are relevant for splice site prediction, the set of features is chosen large enough that the probability that all relevant information sources are in the set is very high. Using only those features that are relevant for constructing a splice site prediction system might improve the system and might also provide us with useful biological knowledge. Using fewer features will of course also improve the prediction speed of the system. RESULTS: A wrapper-based feature subset selection algorithm using a support vector machine or a naive Bayes prediction method was evaluated against the traditional method for selecting features relevant for splice site prediction. Our results show that this wrapper approach selects features that improve performance compared both with using all features and with using the features selected by the traditional method. AVAILABILITY: The data and additional interactive graphs on the selected feature subsets are available at http://www.psb.rug.ac.be/gps
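The wrapper idea in this abstract can be illustrated with a minimal sketch. The abstract does not state which search strategy the authors used, so the greedy forward search, the hand-rolled Bernoulli naive Bayes, the interleaved cross-validation folds, and all function names below are assumptions made for illustration only. Features are taken to be binary indicators (a nucleotide is present or absent at a position) and labels are 0 or 1.

```python
import math
from typing import List, Sequence, Tuple

Row = Sequence[int]  # one sample: binary presence/absence features


def train_nb(X: Sequence[Row], y: Sequence[int], feats: Sequence[int]) -> dict:
    """Fit a Bernoulli naive Bayes model (Laplace-smoothed) on the chosen features."""
    model = {}
    for c in (0, 1):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = (len(rows) + 1) / (len(X) + 2)
        probs = {f: (sum(r[f] for r in rows) + 1) / (len(rows) + 2) for f in feats}
        model[c] = (prior, probs)
    return model


def predict_nb(model: dict, x: Row) -> int:
    """Return the class with the highest log-posterior (ties go to class 0)."""
    best_class, best_score = 0, -math.inf
    for c in (0, 1):
        prior, probs = model[c]
        score = math.log(prior)
        for f, p in probs.items():
            score += math.log(p if x[f] else 1.0 - p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class


def cv_accuracy(X: Sequence[Row], y: Sequence[int],
                feats: Sequence[int], folds: int = 3) -> float:
    """Cross-validated accuracy; sample i belongs to fold i % folds."""
    correct = 0
    for k in range(folds):
        train = [i for i in range(len(X)) if i % folds != k]
        test = [i for i in range(len(X)) if i % folds == k]
        model = train_nb([X[i] for i in train], [y[i] for i in train], feats)
        correct += sum(predict_nb(model, X[i]) == y[i] for i in test)
    return correct / len(X)


def forward_select(X: Sequence[Row], y: Sequence[int],
                   folds: int = 3) -> Tuple[List[int], float]:
    """Greedy wrapper: repeatedly add the feature that most improves CV accuracy."""
    selected: List[int] = []
    best = cv_accuracy(X, y, selected, folds)
    remaining = set(range(len(X[0])))
    while remaining:
        # Score each candidate by retraining the classifier with it included.
        score, neg_f = max((cv_accuracy(X, y, selected + [f], folds), -f)
                           for f in remaining)
        if score <= best:
            break  # no candidate improves the estimate; stop searching
        selected.append(-neg_f)
        remaining.remove(-neg_f)
        best = score
    return selected, best
```

The defining property of a wrapper method is visible in `forward_select`: each candidate subset is scored by training and evaluating the actual classifier, rather than by a classifier-independent statistic. Swapping in a support vector machine would only require replacing `train_nb` and `predict_nb`.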

Contextual alignment of biological sequences (Extended abstract). (69/3779)

We present a model of contextual alignment of biological sequences. It is an extension of the classical alignment, in which we assume that the cost of a substitution depends on the surrounding symbols. In this model the cost of transforming one sequence into another depends on the order of editing operations. We present efficient algorithms for calculating this cost, as well as reconstructing (the representation of) all the orders of operations which yield this optimal cost. A precise characterization of the families of linear orders which can emerge this way is given.

An automatic block and spot indexing with k-nearest neighbors graph for microarray image analysis. (70/3779)

MOTIVATION: In this paper, we propose a fully automatic block and spot indexing algorithm for microarray image analysis. A microarray is a device which enables a parallel experiment on tens to hundreds of thousands of test genes in order to measure gene expression. Because of the huge size of the experimental data, automated image analysis is gaining importance in microarray image processing systems. Currently, most automated microarray image processing systems require manual block indexing and, in some cases, manual spot indexing. If the microarray image is large and contains a lot of noise, this is very laborious work. In this paper, we show that it is possible to locate the addresses of blocks and spots by applying the Nearest Neighbors Graph Model. We also propose an analytic model for the feasibility of block addressing. Our analytic model is validated by a large body of experimental results. RESULTS: We demonstrate the features of automatic block detection, automatic spot addressing, and correction of the distortion and skewedness of each microarray image.
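The abstract does not describe how its Nearest Neighbors Graph Model is built, so the following is only a generic sketch of a k-nearest-neighbors graph over detected spot centers. The input format (a list of (x, y) centroids), the brute-force distance computation, and the function name `knn_graph` are assumptions for illustration, not the authors' algorithm.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]  # hypothetical (x, y) centroid of a detected spot


def knn_graph(points: Sequence[Point], k: int) -> Dict[int, List[int]]:
    """Map each point index to the indices of its k nearest other points.

    Brute force, O(n^2 log n); ties in distance are broken by lower index.
    """
    graph: Dict[int, List[int]] = {}
    for i, p in enumerate(points):
        others = sorted((math.dist(p, q), j)
                        for j, q in enumerate(points) if j != i)
        graph[i] = [j for _, j in others[:k]]
    return graph
```

On a regularly spaced spot grid, the edges of such a graph mostly connect adjacent spots, while the larger gaps between blocks show up as missing or unusually long edges. That is one plausible reason a neighbor graph is useful for addressing, though the published model may differ.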

ProClust: improved clustering of protein sequences with an extended graph-based approach. (71/3779)

MOTIVATION: The problem of finding remote homologues of a given protein sequence via alignment methods is not fully solved. In fact, the task seems to become more difficult with more data. As the size of the database increases, so does the noise level; the highest alignment scores due to random similarities increase and can be higher than the alignment score between true homologues. Comparing two sequences with an arbitrary alignment method yields a similarity value which may indicate an evolutionary relationship between them. A threshold value is usually chosen to distinguish between true homologue relationships and random similarities. To compensate for the higher probability of spurious hits in larger databases, this threshold is increased. Increasing specificity however leads to decreased sensitivity as a matter of principle. Sensitivity can be recovered by utilizing refined protocols. A number of approaches to this challenge have made use of the fact that proteins are often members of some larger protein family. This can be exploited by using position-specific substitution matrices or profiles, or by making use of transitivity of homology. Transitivity refers to the concept of concluding homology between proteins A and C based on homology between A and a third protein B and between B and C. It has been demonstrated that transitivity can lead to substantial improvement in recognition of remote homologues particularly in cases where the alignment score of A and C is below the noise level. A natural limit to the use of transitivity is imposed by domains. Domains, compact independent sub-units of proteins, are often shared between otherwise distinct proteins, and can cause substantial problems by incorrectly linking otherwise unrelated proteins. RESULTS: We extend a graph-based clustering algorithm which uses an asymmetric distance measure, scaling similarity values based on the length of the protein sequences compared. 
Additionally, the significance of alignment scores is taken into account and used for a filtering step in the algorithm. A post-processing step that merges further clusters based on profile HMMs is proposed. SCOP sequences and their super-family level classification are used as a test set for a clustering computed with our method for the joint data set containing both SCOP and SWISS-PROT. Note that the joint data set includes all multi-domain proteins which contain the SCOP domains and are therefore a potential source of incorrect links. At high specificities, our method compares very favorably with PSI-Blast, which is probably the most widely used tool for finding remote homologues. We demonstrate that using transitivity with as many as twelve intermediate sequences is crucial to achieving this level of performance. Moreover, from analysis of false positives we conclude that our method seems to correctly bound the degree of transitivity used. This analysis also yields explicit guidance in choosing parameters. The heuristics of the asymmetric distance measure used neither solve the multi-domain problem from a theoretical point of view, nor do they avoid all types of problems we have observed in real data. Nevertheless, they do provide a substantial improvement over existing approaches. AVAILABILITY: The complete software source is freely available to all users under the GNU General Public License (GPL) from http://www.bioinformatik.uni-koeln.de/~proclust/download/
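The notion of transitivity with a bounded number of intermediate sequences can be sketched as a depth-limited breadth-first search over a graph of pairwise hits. This is not the ProClust algorithm: it omits the asymmetric, length-scaled distance measure and the significance filtering described above. The input format (a dictionary of above-threshold hits) and the function name are assumptions for illustration.

```python
from collections import deque
from typing import Dict, Set


def transitive_homologues(hits: Dict[str, Set[str]], query: str,
                          max_intermediates: int) -> Dict[str, int]:
    """Return sequences reachable from `query`, mapped to the minimum number
    of intermediate sequences on the path.

    `hits[a]` is the (hypothetical) set of sequences whose alignment score
    with `a` passed the chosen threshold.
    """
    depth = {query: 0}
    queue = deque([query])
    while queue:
        node = queue.popleft()
        # A path of d edges uses d - 1 intermediates, so stop expanding
        # once a node sits max_intermediates + 1 edges from the query.
        if depth[node] == max_intermediates + 1:
            continue
        for neighbour in hits.get(node, ()):
            if neighbour not in depth:
                depth[neighbour] = depth[node] + 1
                queue.append(neighbour)
    return {seq: d - 1 for seq, d in depth.items() if seq != query}
```

With hits A-B, B-C and C-D, a query for A with `max_intermediates=1` reaches B directly and C through B, but not D. Raising the bound extends recall at the risk of crossing a shared domain into an unrelated family, which is the trade-off the abstract discusses.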

An approximate matching algorithm for finding (sub-)optimal sequences in S-attributed grammars. (72/3779)

MOTIVATION: S-attributed grammars (a generalization of classical context-free grammars) provide a versatile formalism for sequence analysis which allows long-range constraints to be expressed; the RNA folding problem is a typical example of application. Efficient algorithms have been developed to solve problems expressed with these tools, which generally compute the optimal attribute of the sequence with respect to the grammar. However, it is often more meaningful and/or interesting from the biological point of view to consider almost optimal attributes as well as approximate sequences; we thus need more flexible and powerful algorithms able to perform these generalized analyses. RESULTS: In this paper we present a basic algorithm which, given a grammar G and a sequence ω, computes the optimal attribute for all (approximate) strings ω' in L(G) such that d(ω, ω') ≤ M, and whose complexity is O(n^(r+1)) in time and O(n^2) in space (r is the maximal length of the right-hand side of any production of G). We will also give some extensions and possible improvements of this algorithm.