Weighted-support vector machines for predicting membrane protein types based on pseudo-amino acid composition. (73/179)

Membrane proteins are generally classified into the following five types: (1) type I membrane proteins, (2) type II membrane proteins, (3) multipass transmembrane proteins, (4) lipid chain-anchored membrane proteins and (5) GPI-anchored membrane proteins. Prediction of membrane protein types has become a growing hot topic in bioinformatics. Currently, we face two critical challenges in this area: first, how to take into account the extremely complicated sequence-order effects, and second, how to deal with the highly uneven sizes of the subsets in a training dataset. In this paper, stimulated by the concept of using the pseudo-amino acid composition to incorporate sequence-order effects, the spectral analysis technique is introduced to represent the statistical sample of a protein. Based on this framework, the weighted support vector machine (SVM) algorithm is applied. The new approach has remarkable power in dealing with the bias that arises when one subset in the training dataset contains many more samples than the others, and it is particularly useful when the focus is on proteins belonging to small subsets. The results obtained by the self-consistency test, jackknife test and independent dataset test are encouraging, indicating that the current approach may serve as a powerful complementary tool to other existing methods for predicting the types of membrane proteins.
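
As an illustration of the weighting idea only (a minimal sketch using scikit-learn, not the authors' implementation), per-class weights can offset uneven subset sizes so that a rare type is not swamped by an abundant one; the feature dimension and toy data below are hypothetical stand-ins for real pseudo-amino acid composition vectors.

    # Minimal sketch, assuming scikit-learn (not the authors' own code):
    # a weighted SVM where per-class weights offset uneven subset sizes.
    import numpy as np
    from sklearn.svm import SVC

    # Hypothetical stand-in data: rows are pseudo-amino acid composition
    # vectors, labels are the five membrane protein types.
    rng = np.random.default_rng(0)
    X = rng.random((200, 40))          # e.g. 40-dimensional PseAAC features
    y = rng.integers(1, 6, size=200)   # types 1-5

    # class_weight='balanced' weights each class by
    # n_samples / (n_classes * n_samples_in_class), so a small subset
    # (e.g. GPI-anchored proteins) contributes comparable total weight.
    clf = SVC(kernel="rbf", class_weight="balanced")
    clf.fit(X, y)
    print(clf.predict(X[:5]))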

Analysis and prediction of leucine-rich nuclear export signals. (74/179)

We present a thorough analysis of nuclear export signals and a prediction server, which we have made publicly available. The machine learning prediction method is a significant improvement over the generally used consensus patterns. Nuclear export signals (NESs) are extremely important regulators of the subcellular location of proteins. This regulation has an impact on transcription and other nuclear processes, which are fundamental to the viability of the cell. NESs are studied in relation to cancer, the cell cycle, cell differentiation and other important aspects of molecular biology. Our conclusion from this analysis is that the most important properties of NESs are the accessibility and flexibility that allow relevant proteins to interact with the signal. Furthermore, we show that residues other than the known hydrophobic ones are important in defining a nuclear export signal. We employ both neural networks and hidden Markov models in the prediction algorithm and verify the method on the most recently discovered NESs. The NES predictor (NetNES) is made available for general use at http://www.cbs.dtu.dk/.
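
For contrast with the machine learning approach, the snippet below scans a sequence with one commonly cited form of the leucine-rich NES consensus (hydrophobic positions drawn from L, I, V, F, M). The many false positives such a pattern produces are exactly what motivates a trained predictor; the example sequence is the classic PKI NES region, used here purely as an illustration.

    # Illustrative only: one commonly cited leucine-rich NES consensus,
    # roughly Phi-X(2,3)-Phi-X(2,3)-Phi-X-Phi with Phi in {L, I, V, F, M}.
    import re

    NES_CONSENSUS = re.compile(r"[LIVFM].{2,3}[LIVFM].{2,3}[LIVFM].[LIVFM]")

    def find_candidate_nes(sequence):
        """Return (start, match) pairs for consensus hits; many are false
        positives, which motivates the NN/HMM approach."""
        return [(m.start(), m.group())
                for m in NES_CONSENSUS.finditer(sequence)]

    # Example: the PKI NES region.
    print(find_candidate_nes("ELALKLAGLDINKTE"))   # [(1, 'LALKLAGLDI')]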

A framework for a distributed, hybrid, multiple-ontology clinical-guideline library, and automated guideline-support tools. (75/179)

Clinical guidelines are a major tool in improving the quality of medical care. However, most guidelines are in free text, not in a formal, executable format, and are not easily accessible to clinicians at the point of care. We introduce a Web-based, modular, distributed architecture, the Digital Electronic Guideline Library (DeGeL), which facilitates gradual conversion of clinical guidelines from text to a formal representation in a chosen target guideline ontology. The architecture supports guideline classification, semantic markup, context-sensitive search, browsing, run-time application, and retrospective quality assessment. The DeGeL hybrid meta-ontology includes elements common to all guideline ontologies, such as semantic classification and domain knowledge; it also includes four content-representation formats: free text, semi-structured text, a semi-formal representation, and a formal representation. These formats support increasingly sophisticated computational tasks. The DeGeL tools for support of guideline-based care operate, at some level, on all guideline ontologies. We have demonstrated the feasibility of the architecture and the tools for several guideline ontologies, including Asbru and GEM.
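
The sketch below illustrates the hybrid idea in our own hypothetical terms, not DeGeL's actual API: a guideline element carries up to four parallel representations, and a tool can pick the most formal one available as text-to-formal conversion progresses.

    # A minimal sketch (names are ours, not DeGeL's) of one element held
    # simultaneously at four levels of formality.
    from dataclasses import dataclass, field
    from enum import Enum

    class Format(Enum):
        FREE_TEXT = 1        # retrieval, browsing
        SEMI_STRUCTURED = 2  # semantic markup, context-sensitive search
        SEMI_FORMAL = 3      # partial machine interpretation
        FORMAL = 4           # run-time application, e.g. in Asbru

    @dataclass
    class GuidelineElement:
        semantic_class: str                          # meta-ontology class
        content: dict = field(default_factory=dict)  # Format -> content

        def most_formal(self):
            """Most formal representation available so far, reflecting
            gradual conversion from text to a formal ontology."""
            present = [f for f in Format if f in self.content]
            return max(present, key=lambda f: f.value) if present else None

    g = GuidelineElement("diabetes management")
    g.content[Format.FREE_TEXT] = "Measure HbA1c every 3 months..."
    g.content[Format.SEMI_STRUCTURED] = {"action": "measure HbA1c",
                                         "interval": "3 months"}
    print(g.most_formal())   # Format.SEMI_STRUCTURED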

A high productivity/low maintenance approach to high-performance computation for biomedicine: four case studies. (76/179)

The rapid advances in high-throughput biotechnologies such as DNA microarrays and mass spectrometry have generated vast amounts of data, ranging from gene expression to proteomics data. The size and complexity involved in analyzing such data demand a significant amount of computing power. High-performance computation (HPC) is an attractive and increasingly affordable approach to help meet this challenge. There is a spectrum of techniques that can be used to achieve computational speedup, with varying degrees of impact in terms of how drastic a change is required to allow the software to run on an HPC platform. This paper describes a high-productivity/low-maintenance (HP/LM) approach to HPC that is based on establishing a collaborative relationship between the bioinformaticist and the HPC expert that respects the former's codes and minimizes the latter's efforts. The goal of this approach is to make it easy for bioinformatics researchers to continue to make iterative refinements to their programs while still being able to take advantage of HPC. The paper describes our experience applying these HP/LM techniques in four bioinformatics case studies: (1) genome-wide sequence comparison using Blast, (2) identification of biomarkers based on statistical analysis of large mass spectrometry data sets, (3) complex genetic analysis involving ordinal phenotypes, and (4) large-scale assessment of the effect of possible errors in analyzing microarray data. The case studies illustrate how the HP/LM approach can be applied to a range of representative bioinformatics applications and how it can lead to significant speedup of computationally intensive applications while making only modest modifications to the programs themselves.
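
The following sketch conveys the HP/LM philosophy under our own assumptions: the researcher's program (here a hypothetical analyze.py with a made-up interface) is invoked unmodified, and only a thin parallel driver is added, so iterative refinement of the original code remains easy.

    # A minimal sketch of the HP/LM idea: parallelize over independent
    # inputs without touching the researcher's program. 'analyze.py' and
    # its inputs are hypothetical placeholders.
    import subprocess
    from concurrent.futures import ProcessPoolExecutor

    INPUTS = [f"sample_{i}.dat" for i in range(32)]  # hypothetical files

    def run_unmodified_program(infile):
        # The original program runs as-is; only this driver is new, so
        # the researcher can keep refining analyze.py independently.
        return subprocess.run(
            ["python", "analyze.py", infile],
            capture_output=True, text=True, check=True,
        ).stdout

    if __name__ == "__main__":
        with ProcessPoolExecutor(max_workers=8) as pool:
            for out in pool.map(run_unmodified_program, INPUTS):
                print(out, end="")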

SS-Wrapper: a package of wrapper applications for similarity searches on Linux clusters. (77/179)

BACKGROUND: Large-scale sequence comparison is a powerful tool for biological inference in modern molecular biology. Comparing new sequences to those in annotated databases is a useful source of functional and structural information about these sequences. Using software such as the basic local alignment search tool (BLAST) or HMMPFAM to identify statistically significant matches between newly sequenced segments of genetic material and those in databases is an important task for most molecular biologists. Search algorithms are intrinsically slow and data-intensive, especially in light of the rapid growth of biological sequence databases driven by high-throughput DNA sequencing techniques. Thus, traditional bioinformatics tools are impractical on PCs and even on dedicated UNIX servers. To take advantage of larger databases and more reliable methods, high-performance computation becomes necessary. RESULTS: We describe the implementation of SS-Wrapper (Similarity Search Wrapper), a package of wrapper applications that can parallelize similarity search applications on a Linux cluster. Our wrapper utilizes a query segmentation-search (QS-search) approach to parallelize sequence database search applications. It takes load balancing between the nodes of the cluster into consideration to maximize resource usage. QS-search is designed to wrap many different search tools, such as BLAST and HMMPFAM, using the same interface. This implementation does not alter the original program, so newly obtained programs and program updates are accommodated easily. Benchmark experiments using QS-search to parallelize BLAST and HMMPFAM showed that QS-search accelerated the performance of these programs almost linearly in proportion to the number of CPUs used. We have also implemented a wrapper that utilizes a database segmentation approach (DS-BLAST), which provides a complementary solution for BLAST searches when the database is too large to fit into the memory of a single node. CONCLUSIONS: Used together, QS-search and DS-BLAST provide a flexible solution for adapting sequential similarity search applications to high-performance computing environments. Their ease of use and their ability to wrap a variety of database search programs provide an analytical architecture to assist both the seasoned bioinformaticist and the wet-bench biologist.
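
A simplified sketch of the query segmentation idea, written from the description above rather than from SS-Wrapper's source: the query FASTA is split into chunks that are searched independently, which is safe because each query sequence's results do not depend on the others. The blastall flags shown are the legacy NCBI BLAST interface; the round-robin split is a crude stand-in for SS-Wrapper's load balancing.

    # QS-search-style query segmentation (a sketch, not SS-Wrapper code).
    import subprocess
    from concurrent.futures import ProcessPoolExecutor

    def split_fasta(path, n_chunks):
        """Split a multi-FASTA file into n_chunks round-robin record
        lists (a crude stand-in for real load balancing)."""
        records, current = [], []
        with open(path) as fh:
            for line in fh:
                if line.startswith(">") and current:
                    records.append("".join(current))
                    current = []
                current.append(line)
        if current:
            records.append("".join(current))
        return [records[i::n_chunks] for i in range(n_chunks)]

    def run_blast(chunk_records, idx):
        chunk_file = f"chunk_{idx}.fa"
        with open(chunk_file, "w") as fh:
            fh.writelines(chunk_records)
        # Legacy BLAST invocation; any tool with a file-in/file-out
        # interface (e.g. hmmpfam) could be wrapped the same way.
        subprocess.run(["blastall", "-p", "blastp", "-d", "nr",
                        "-i", chunk_file, "-o", chunk_file + ".out"],
                       check=True)
        with open(chunk_file + ".out") as fh:
            return fh.read()

    if __name__ == "__main__":
        chunks = split_fasta("queries.fa", 8)
        with ProcessPoolExecutor(max_workers=8) as pool:
            merged = "".join(pool.map(run_blast, chunks, range(8)))
        with open("all_results.out", "w") as out:
            out.write(merged)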

Evaluation of parallel decomposition methods for biomechanical optimizations. (78/179)

As the complexity of musculoskeletal models continues to increase, so will the computational demands of biomechanical optimizations. For this reason, parallel biomechanical optimizations are becoming more common, and most implementations parallelize the optimizer. In this study, an alternate approach is investigated that parallelizes the analysis function (i.e., a kinematic or dynamic simulation) called repeatedly by the optimizer to calculate the cost function and constraints. To evaluate this approach, a system identification problem involving a kinematic ankle joint model was solved using a gradient-based optimizer and three parallel decomposition methods: gradient calculation decomposition, analysis function decomposition, and both methods combined. For a given number of processors, analysis function decomposition exhibited the best performance despite the highest communication and synchronization overhead, while gradient calculation decomposition demonstrated the worst performance because the necessary line searches were not performed in parallel. These findings suggest that the method of parallelization most commonly used for biomechanical optimizations may not be the most efficient, depending on the optimization algorithm used. In many applications, the best computational strategy may be to focus on parallelizing the analysis function.
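
To make the comparison concrete, here is a schematic of gradient calculation decomposition (our own construction, not the study's code): the finite-difference perturbations are evaluated in parallel, one per task, but the serial line searches this leaves untouched are what limited the method in the study.

    # Parallel forward-difference gradient (a sketch; the real analysis
    # function would be an expensive kinematic/dynamic simulation).
    from concurrent.futures import ProcessPoolExecutor
    import numpy as np

    def cost(x):
        # Stand-in for the expensive simulation-based cost function.
        return float(np.sum((x - 1.0) ** 2))

    def _perturbed(args):
        # Each task perturbs one design variable; cost(x) is recomputed
        # per task here only to keep the sketch simple.
        x, i, h = args
        xp = x.copy()
        xp[i] += h
        return (cost(xp) - cost(x)) / h

    def parallel_fd_gradient(x, h=1e-6, workers=4):
        """One finite-difference perturbation per parallel task."""
        with ProcessPoolExecutor(max_workers=workers) as pool:
            return np.array(list(pool.map(
                _perturbed, [(x, i, h) for i in range(len(x))])))

    if __name__ == "__main__":
        x0 = np.zeros(4)
        print(parallel_fd_gradient(x0))   # approx. [-2, -2, -2, -2]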

A parallel graph decomposition algorithm for DNA sequencing with nanopores. (79/179)

MOTIVATION: With the potential availability of nanopore devices that can sense the bases of translocating single-stranded DNA (ssDNA), it is likely that 'reads' of length approximately 10^5 will be available in large numbers and at high speed. We address the problem of complete DNA sequencing using such reads. We assume that approximately 10^2 copies of a DNA sequence are split into single strands that break into randomly sized pieces as they translocate the nanopore in arbitrary orientations. The nanopore senses and reports each individual base that passes through, but all information about orientation and complementarity of the ssDNA subsequences is lost. Random errors (both biological and transduction) in the reads create further complications. RESULTS: We have developed an algorithm that addresses these issues. It can be considered an extreme variation of the well-known Eulerian path approach. It searches over a space of de Bruijn graphs until it finds one in which (a) the impact of errors is eliminated and (b) both possible orientations of the two ssDNA sequences can be identified separately and unambiguously. Our algorithm is able to correctly reconstruct real DNA sequences of the order of 10^6 bases (e.g. the bacterium Mycoplasma pneumoniae) from simulated erroneous reads on a modest workstation in about 1 h. We describe, and give measured timings of, a parallel implementation of this algorithm on the Cray Multithreaded Architecture (MTA-2) supercomputer, whose architecture is ideally suited to this 'unstructured' problem. Our parallel implementation is crucial to the problem of rapidly sequencing long DNA sequences and also to the situation where multiple nanopores are used to obtain a high-bandwidth stream of reads.
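
As a toy illustration of the underlying Eulerian-path machinery only (the paper's algorithm additionally handles read errors, unknown strand orientation and a search over graphs, none of which is shown), the sketch below builds a de Bruijn graph from error-free reads and recovers the sequence with a Hierholzer-style walk.

    # De Bruijn graph assembly in miniature (not the paper's algorithm).
    from collections import defaultdict

    def de_bruijn(reads, k):
        """Map each (k-1)-mer to the (k-1)-mers that follow it."""
        graph = defaultdict(list)
        for read in reads:
            for i in range(len(read) - k + 1):
                kmer = read[i:i + k]
                graph[kmer[:-1]].append(kmer[1:])
        return graph

    def eulerian_walk(graph, start):
        """Hierholzer-style walk; assumes an Eulerian path from start."""
        stack, path = [start], []
        while stack:
            v = stack[-1]
            if graph[v]:
                stack.append(graph[v].pop())
            else:
                path.append(stack.pop())
        return path[::-1]

    reads = ["ACGTC", "TCGAC"]   # hypothetical error-free reads
    walk = eulerian_walk(de_bruijn(reads, 3), "AC")
    print(walk[0] + "".join(node[-1] for node in walk[1:]))  # ACGTCGAC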

DSEARCH: sensitive database searching using distributed computing. (80/179)

We present a distributed and fully cross-platform database search program that allows the user to utilize the idle clock cycles of machines to perform large searches using the most sensitive algorithms. For those in an academic or corporate environment with hundreds of idle desktop machines, DSEARCH can deliver a 'free' database search supercomputer. AVAILABILITY: The software is publicly available under the GNU General Public License from http://www.cs.may.ie/distributed CONTACT: [email protected] SUPPLEMENTARY INFORMATION: Full documentation and a user manual are available from http://www.cs.may.ie/distributed.
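
DSEARCH's internal protocol is not described above, but a generic master/worker sketch of the idle-cycle idea (entirely our own construction, with a placeholder scoring function standing in for a sensitive method such as Smith-Waterman) looks like this:

    # Master/worker distribution of database slices (illustrative only).
    import multiprocessing as mp

    def score(query, subject):
        """Placeholder for a sensitive alignment algorithm."""
        return sum(a == b for a, b in zip(query, subject))

    def worker(query, tasks, results):
        for name, seq in iter(tasks.get, None):   # None = shutdown signal
            results.put((name, score(query, seq)))

    if __name__ == "__main__":
        database = {f"seq{i}": "ACDEFGHIKLMNPQRSTVWY"[i % 5:]
                    for i in range(100)}           # toy database
        tasks, results = mp.Queue(), mp.Queue()
        procs = [mp.Process(target=worker,
                            args=("ACDEFG", tasks, results))
                 for _ in range(4)]                # 4 'idle desktops'
        for p in procs:
            p.start()
        for item in database.items():
            tasks.put(item)
        for _ in procs:
            tasks.put(None)
        hits = sorted((results.get() for _ in database),
                      key=lambda h: -h[1])
        print(hits[:3])                            # best-scoring subjects
        for p in procs:
            p.join()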