Designing hardware for protein sequence analysis. (49/179)

We present the architecture of PROSIDIS, a special-purpose co-processor designed to search a proteome for occurrences of substrings similar to a given 'template string'. Tests show speed-up factors ranging from 5 to 50 with respect to conventional general-purpose processors. AVAILABILITY: The PROSIDIS configuration file and the C code are available at http://www.enea.it/hpcn/php/rosato/
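
The kind of search described above can be illustrated in software with a minimal sketch. The abstract does not state how similarity to the template is measured, so the scoring used here (the number of identical positions in an ungapped window) and the names `find_similar_substrings` and `min_matches` are assumptions for illustration only, not the PROSIDIS design.

```python
def find_similar_substrings(sequence, template, min_matches):
    """Return (start, matches) for every window resembling the template.

    Assumption: similarity is the count of identical positions between
    the template and an ungapped window of the same length.
    """
    m = len(template)
    hits = []
    for start in range(len(sequence) - m + 1):
        window = sequence[start:start + m]
        matches = sum(1 for a, b in zip(window, template) if a == b)
        if matches >= min_matches:
            hits.append((start, matches))
    return hits


if __name__ == "__main__":
    # Toy sequence and template, chosen only to exercise the function.
    print(find_similar_substrings("MKTAYIAKQR", "AYLAK", min_matches=4))
```

Each window is scored independently of the others, which is the property that makes this type of scan a natural candidate for dedicated parallel hardware.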

Batch mode generation of residue-based diagrams of proteins. (50/179)

SUMMARY: Residue-based diagrams of proteins are graphical representations that can be used in protein information systems. These diagrams make it possible to visually integrate different types of biological information. The approach has been used successfully for membrane proteins. We developed the Residue-based diagram generator to (i) make it possible to generate residue-based diagrams of proteins in a batch mode that is compatible with the needs of information system curators, and (ii) allow the generation of residue-based diagrams for families of soluble proteins or domains. AVAILABILITY: Licensed. Royalty-free licenses are granted to non-profit institutions for educational and research purposes. http://icb.mssm.edu/crt/RbDg/index.xml

Soap-HT-BLAST: high throughput BLAST based on Web services. (51/179)

SUMMARY: A high-throughput Basic Local Alignment Search Tool (BLAST) system based on Web services has been implemented. It provides an alternative BLAST service and allows users to perform multiple BLAST queries in one run in a distributed, parallel environment through the Internet. AVAILABILITY: It is available at http://mammoth.bii.a-star.edu.sg/webservices/htblast/index.html and at http://www.bii.a-star.edu.sg/jiren/download.html

Parallel BLAST on split databases. (52/179)

SUMMARY: BLAST programs often run on large SMP machines where multiple threads can work simultaneously and there is enough memory to cache the databases between program runs. A group of programs is described which allows comparable performance to be achieved with a Beowulf configuration in which no node has enough memory to cache a database but the cluster as an aggregate does. To achieve this result, databases are split into equal-sized pieces and stored locally on each node. Each query is run on all nodes in parallel, and the resulting BLAST output files from all nodes are merged to yield the final output. AVAILABILITY: Source code is available from ftp://saf.bio.caltech.edu/
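
The split-and-merge scheme described above can be sketched as follows. This is a simplified illustration, not the authors' programs: the functions `split_database` and `merge_hits` are hypothetical, records are plain `(name, sequence)` tuples, "equal-sized" is taken to mean a near-equal residue count per piece, and hits are reduced to `(subject, score)` pairs. Merging real BLAST reports would also have to handle statistics that depend on the total database size, which is not shown here.

```python
import heapq


def split_database(records, n_pieces):
    """Split (name, sequence) records into n_pieces of near-equal residue count.

    Greedy heuristic (an assumption, not the published method): place the
    longest remaining sequence into the currently smallest piece.
    """
    pieces = [[] for _ in range(n_pieces)]
    heap = [(0, i) for i in range(n_pieces)]  # (residues so far, piece index)
    heapq.heapify(heap)
    for name, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        size, idx = heapq.heappop(heap)
        pieces[idx].append((name, seq))
        heapq.heappush(heap, (size + len(seq), idx))
    return pieces


def merge_hits(per_node_hits, top_n=None):
    """Merge per-node lists of (subject, score) into one list, best score first."""
    merged = sorted(
        (hit for hits in per_node_hits for hit in hits),
        key=lambda hit: (-hit[1], hit[0]),
    )
    return merged if top_n is None else merged[:top_n]


if __name__ == "__main__":
    db = [("a", "AAAA"), ("b", "CC"), ("c", "GGG"), ("d", "T")]
    for piece in split_database(db, 2):
        print([name for name, _ in piece])
    print(merge_hits([[("x", 50), ("y", 10)], [("z", 30)]]))
```

Balancing by residue count rather than by number of sequences keeps the per-node search time roughly even, since the slowest node determines when the merged output is ready.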

A computational pipeline for protein structure prediction and analysis at genome scale. (53/179)

MOTIVATION: Experimental techniques alone cannot keep up with the production rate of protein sequences, while computational techniques for protein structure prediction have matured to the point where they provide reliable structural characterization of proteins at large scale. Integration of multiple computational tools for protein structure prediction can complement experimental techniques. RESULTS: We present an automated pipeline for protein structure prediction. The centerpiece of the pipeline is our threading-based protein structure prediction system PROSPECT. The pipeline consists of a dozen tools for identification of protein domains and signal peptides, protein triage to determine the protein type (membrane or globular), protein fold recognition, generation of atomic structural models, prediction result validation, etc. Different processing and prediction branches are determined automatically by a prediction pipeline manager based on identified characteristics of the protein. The pipeline has been implemented to run in a heterogeneous computational environment as a client/server system with a web interface. Genome-scale applications on Caenorhabditis elegans, Pyrococcus furiosus and three cyanobacterial genomes are presented. AVAILABILITY: The pipeline is available at http://compbio.ornl.gov/proteinpipeline/

Classification of protein quaternary structure with support vector machine. (54/179)

MOTIVATION: Because the gap between the rapidly growing number of known sequences and the slowly accumulating number of known structures keeps widening, an automatic classification process based on primary sequences and known three-dimensional structures has become indispensable. Classifying protein quaternary structure from the primary sequence can provide useful information to biologists, so a fully automatic and reliable classification system is needed. This work looks for effective methods of extracting attributes from primary sequences and for an algorithm that classifies quaternary structure from them. RESULTS: The support vector machine (SVM) and the covariant discriminant algorithm are introduced for the first time to predict quaternary structure properties from protein primary sequences. The algorithms take into account the amino acid composition and the auto-correlation functions based on the amino acid index profile of the primary sequence. We analyzed 472 amino acid indices and selected, as examples, the four with the best performance. Five attribute data sets (COMP, FASG, NISK, WOLS and KYTJ) were thus established from the protein primary sequences. The COMP data set consists of the amino acid composition alone; the FASG, NISK, WOLS and KYTJ data sets consist of the amino acid composition together with the auto-correlation functions of the corresponding amino acid index. In the jackknife test, the overall accuracies of the SVM are 78.5, 87.5, 83.2, 81.7 and 81.9% for the COMP, FASG, NISK, WOLS and KYTJ data sets, respectively, which are 19.6, 7.8, 15.5, 13.1 and 15.8% higher than those of the covariant discriminant algorithm in the same test. The results show that the SVM may be applied to discriminate between the primary sequences of homodimers and non-homodimers, and that the two protein sequence descriptors can reflect quaternary structure information. Compared with Robert Garian's previous investigation, the performance of the SVM is almost equal to that of the decision tree models, and the methods of extracting feature vectors from the primary sequences are superior to Garian's binning function method. AVAILABILITY: Programs are available on request from the authors.
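
The two sequence descriptors named above (amino acid composition and auto-correlation of an amino acid index profile) can be sketched as a feature-extraction step. The abstract does not give the exact auto-correlation formula or the number of lags, so the definition used here, r(n) = (1/(N-n)) * sum of h(i)*h(i+n) over i, the default `max_lag=30`, and the function names are all assumptions. The Kyte-Doolittle hydropathy scale is used as the example index on the assumption that it is the index behind the KYTJ label; the SVM training step is not shown.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values, used here as an example amino acid index.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def composition(seq):
    """Fraction of each of the 20 standard amino acids (20 values)."""
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def autocorrelation(seq, index, max_lag):
    """Auto-correlation of the index profile for lags 1..max_lag.

    Assumed definition: mean product of index values `lag` residues apart.
    Requires len(seq) > max_lag and only standard residues in seq.
    """
    h = [index[aa] for aa in seq]
    n = len(h)
    return [
        sum(h[i] * h[i + lag] for i in range(n - lag)) / (n - lag)
        for lag in range(1, max_lag + 1)
    ]


def feature_vector(seq, index=KYTE_DOOLITTLE, max_lag=30):
    """Composition followed by auto-correlation values, as one flat list."""
    return composition(seq) + autocorrelation(seq, index, max_lag)


if __name__ == "__main__":
    toy = "ACDEFGHIKLMNPQRSTVWY" * 2  # toy sequence, not real data
    print(len(feature_vector(toy, max_lag=5)))
```

Under these assumptions, a COMP-style data set would use only the `composition` part, while the index-based data sets would use the full `feature_vector` with the corresponding index substituted for `KYTE_DOOLITTLE`.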

A strategy for assembling the maize (Zea mays L.) genome. (55/179)

Because the bulk of the maize (Zea mays L.) genome consists of repetitive sequences, sequencing efforts are being targeted to its 'gene-rich' fraction. Traditional assembly programs are inadequate for this approach because they are optimized for a uniform sampling of the genome and inherently lack the ability to differentiate highly similar paralogs. RESULTS: We report the development of bioinformatics tools for the accurate assembly of the maize genome. This software, which is based on innovative parallel algorithms to ensure scalability, assembled 730,974 genomic survey sequence fragments in 4 h using 64 Pentium III 1.26 GHz processors of a commodity cluster. Algorithmic innovations are used to reduce the number of pairwise alignments significantly without sacrificing quality. Clone pair information was used to estimate the error rate for improved differentiation of polymorphisms versus sequencing errors. The assembly was also used to evaluate the effectiveness of various filtering strategies and thereby provide information that can be used to focus subsequent sequencing efforts.

FGDP: functional genomics data pipeline for automated, multiple microarray data analyses. (56/179)

Gene expression microarrays and oligonucleotide GeneChips have provided biologists with a means of measuring, in a single experiment, the expression levels of entire genomes under a variety of conditions. As with any nascent field, there is no single accepted method for analyzing the new data types, and new methods appear monthly. Investigators using the new technology must constantly seek access to the latest tools and explore their data in multiple ways. The functional genomics data pipeline provides an integrated, extendable analysis environment that permits multiple, simultaneous analyses to be performed automatically, together with a web server and interface for presenting results. AVAILABILITY: Source code and executables are available under the GNU public license at http://bioinformatics.fccc.edu/