Evolutionary HMMs: a Bayesian approach to multiple alignment. (25/5719)

MOTIVATION: We review proposed syntheses of probabilistic sequence alignment, profiling and phylogeny. We develop a multiple alignment algorithm for Bayesian inference in the links model proposed by Thorne et al. (1991, J. Mol. Evol., 33, 114-124). The algorithm, described in detail in Section 3, samples from and/or maximizes the posterior distribution over multiple alignments for any number of DNA or protein sequences, conditioned on a phylogenetic tree. The individual sampling and maximization steps of the algorithm require no more computational resources than pairwise alignment. METHODS: We present a software implementation (Handel) of our algorithm and report test results on (i) simulated data sets and (ii) the structurally informed protein alignments of BAliBASE (Thompson et al., 1999, Nucleic Acids Res., 27, 2682-2690). RESULTS: We find that the mean sum-of-pairs score (a measure of residue-pair correspondence) for the BAliBASE alignments is only 13% lower for Handel than for CLUSTALW (Thompson et al., 1994, Nucleic Acids Res., 22, 4673-4680), despite the relative simplicity of the links model (CLUSTALW uses affine gap scores and increased penalties for indels in hydrophobic regions). With reference to these benchmarks, we discuss potential improvements to the links model and implications for Bayesian multiple alignment and phylogenetic profiling. AVAILABILITY: The source code to Handel is freely distributed on the Internet at http://www.biowiki.org/Handel under the terms of the GNU Public License (GPL, 2000, http://www.fsf.org./copyleft/gpl.html).
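The sum-of-pairs score mentioned above can be illustrated with a minimal sketch. It assumes the common definition (the fraction of residue pairs aligned in a reference alignment that are also aligned in the test alignment); the abstract does not state the exact formula used for the BAliBASE benchmark, and the function names and the "-" gap character are hypothetical choices for illustration, not part of Handel.

```python
from itertools import combinations


def aligned_pairs(alignment, gap="-"):
    """Return the set of residue pairs that share a column.

    `alignment` is a list of equal-length gapped strings. Each residue is
    identified by (sequence index, ungapped residue index), so the result
    does not depend on where the gaps are placed.
    """
    counters = [0] * len(alignment)  # next residue index per sequence
    pairs = set()
    for col in range(len(alignment[0])):
        present = []
        for s, row in enumerate(alignment):
            if row[col] != gap:
                present.append((s, counters[s]))
                counters[s] += 1
        # Every pair of residues in this column counts as aligned.
        pairs.update(combinations(present, 2))
    return pairs


def sum_of_pairs_score(test, reference, gap="-"):
    """Fraction of reference residue pairs recovered by the test alignment."""
    ref = aligned_pairs(reference, gap)
    if not ref:
        return 0.0
    return len(ref & aligned_pairs(test, gap)) / len(ref)


reference = ["AC-G", "A-TG"]
test = ["ACG-", "A-TG"]
print(sum_of_pairs_score(test, reference))  # 0.5
```

In this toy case the reference aligns two residue pairs (A with A, G with G) and the test alignment recovers only the first, giving 0.5.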

InterProScan--an integration platform for the signature-recognition methods in InterPro. (26/5719)

InterProScan is a tool that scans given protein sequences against the protein signatures of the InterPro member databases, currently PROSITE, PRINTS, Pfam, ProDom and SMART. The number of signature databases and their associated scanning tools, as well as the further refinement procedures, make the problem complex. InterProScan is designed to be a scalable and extensible system with a robust internal architecture. AVAILABILITY: The Perl-based InterProScan implementation is available from the EBI ftp server (ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/) and the SRS-based InterProScan is available upon request. We provide a public web interface (http://www.ebi.ac.uk/interpro/scan.html) as well as an email submission server ([email protected]).

The HMMTOP transmembrane topology prediction server. (27/5719)

The HMMTOP transmembrane topology prediction server predicts both the localization of helical transmembrane segments and the topology of transmembrane proteins. Recently, several improvements have been introduced to the original method. The user can now submit additional information about segment localization to enhance the prediction power. This option improves the prediction accuracy and helps with the interpretation of experimental results, i.e. in epitope insertion experiments. AVAILABILITY: HMMTOP 2.0 is freely available to non-commercial users at http://www.enzim.hu/hmmtop. Source code is also available upon request to academic users.

VISTRAJ: exploring protein conformational space. (28/5719)

VISTRAJ is an application which allows 3D visualization, manipulation and editing of protein conformational space using probabilistic maps of this space called 'trajectory distributions'. Trajectory distributions serve as input to FOLDTRAJ which samples protein structures based on the represented conformational space. VISTRAJ also allows FOLDTRAJ to be used as a tool for homology model creation, and structures may be generated containing post-translationally modified amino acids. AVAILABILITY: Binaries are freely available for non-profit use as part of the FOLDTRAJ package at ftp://ftp.mshri.on.ca/pub/TraDES/foldtraj/.

BioLayout--an automatic graph layout algorithm for similarity visualization. (29/5719)

Graph layout is extensively used in the fields of mathematics and computer science; however, these ideas and methods have not been extended in a general fashion to the construction of graphs for biological data. To this end, we have implemented a version of the Fruchterman-Reingold graph layout algorithm, extensively modified for the purpose of similarity analysis in biology. This algorithm rapidly and effectively generates clear two-dimensional (2D) or three-dimensional (3D) graphs representing similarity relationships such as protein sequence similarity. The implementation of the algorithm is general and applicable to most types of similarity information for biological data. AVAILABILITY: BioLayout is available for most UNIX platforms at the following web-site: http://www.ebi.ac.uk/research/cgg/services/layout.
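For readers unfamiliar with force-directed layout, the following is a minimal 2D sketch of the textbook Fruchterman-Reingold scheme: all node pairs repel, nodes joined by an edge attract, and a cooling "temperature" caps how far a node may move per iteration. It is not BioLayout's modified version, which the abstract does not describe; the function name, the geometric cooling factor and the default parameters are assumptions made for illustration.

```python
import math
import random


def fruchterman_reingold(nodes, edges, iterations=50, size=1.0, seed=0):
    """Lay out `nodes` (a list) in a size x size square; return {node: (x, y)}."""
    rng = random.Random(seed)
    pos = {n: [rng.uniform(0, size), rng.uniform(0, size)] for n in nodes}
    k = size / math.sqrt(len(nodes))  # ideal distance between nodes
    temp = size / 10.0                # maximum step length, cooled below

    for _ in range(iterations):
        disp = {n: [0.0, 0.0] for n in nodes}

        # Repulsion between every pair of nodes: magnitude k^2 / distance.
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                dx = pos[u][0] - pos[v][0]
                dy = pos[u][1] - pos[v][1]
                dist = max(math.hypot(dx, dy), 1e-9)
                force = k * k / dist
                disp[u][0] += dx / dist * force
                disp[u][1] += dy / dist * force
                disp[v][0] -= dx / dist * force
                disp[v][1] -= dy / dist * force

        # Attraction along each edge: magnitude distance^2 / k.
        for u, v in edges:
            dx = pos[u][0] - pos[v][0]
            dy = pos[u][1] - pos[v][1]
            dist = max(math.hypot(dx, dy), 1e-9)
            force = dist * dist / k
            disp[u][0] -= dx / dist * force
            disp[u][1] -= dy / dist * force
            disp[v][0] += dx / dist * force
            disp[v][1] += dy / dist * force

        # Move each node by at most `temp`, keeping it inside the frame.
        for n in nodes:
            dx, dy = disp[n]
            length = max(math.hypot(dx, dy), 1e-9)
            step = min(length, temp)
            pos[n][0] = min(size, max(0.0, pos[n][0] + dx / length * step))
            pos[n][1] = min(size, max(0.0, pos[n][1] + dy / length * step))

        temp *= 0.95  # geometric cooling (an assumed schedule)

    return {n: (p[0], p[1]) for n, p in pos.items()}


layout = fruchterman_reingold(["a", "b", "c", "d"], [("a", "b"), ("b", "c")])
for name, (x, y) in sorted(layout.items()):
    print(name, round(x, 3), round(y, 3))
```

In a similarity setting, the edges would be pairs of proteins whose similarity exceeds some cutoff, so that related sequences are pulled together while unrelated ones drift apart.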

Automatic rule generation for protein annotation with the C4.5 data mining algorithm applied on SWISS-PROT. (30/5719)

MOTIVATION: The gap between the amount of newly submitted protein data and reliable functional annotation in public databases is growing. Traditional manual annotation by literature curation and sequence analysis tools, without the use of automated annotation systems, is not able to keep up with the ever increasing quantity of data that is submitted. Automated supplements to manually curated databases such as TrEMBL or GenPept cover raw data but provide only limited annotation. To improve this situation, automatic tools are needed that support manual annotation, automatically increase the amount of reliable information and help to detect inconsistencies in manually generated annotations. RESULTS: A standard data mining algorithm was successfully applied to gain knowledge about the Keyword annotation in SWISS-PROT. 11 306 rules were generated, which are provided in a database and can be applied to yet unannotated protein sequences and viewed using a web browser. They rely on the taxonomy of the organism in which the protein was found and on signature matches of its sequence. The statistical evaluation of the generated rules by cross-validation suggests that by applying them to arbitrary proteins 33% of their keyword annotation can be generated with an error rate of 1.5%. The coverage rate of the keyword annotation can be increased to 60% by tolerating a higher error rate of 5%. AVAILABILITY: The results of the automatic data mining process can be browsed on http://golgi.ebi.ac.uk:8080/Spearmint/. Source code is available upon request. CONTACT: [email protected].

Clustering protein sequences--structure prediction by transitive homology. (31/5719)

MOTIVATION: It is widely believed that for two proteins A and B a sequence identity above some threshold implies structural similarity due to a common evolutionary ancestor. Since this is only a sufficient, but not a necessary condition for structural similarity, the question remains what other criteria can be used to identify remote homologues. Transitivity refers to the concept of deducing a structural similarity between proteins A and C from the existence of a third protein B, such that A and B as well as B and C are homologues, as ascertained if the sequence identity between A and B as well as that between B and C is above the aforementioned threshold. It is not fully understood if transitivity always holds and whether transitivity can be extended ad infinitum. RESULTS: We developed a graph-based clustering approach, where transitivity plays a crucial role. We determined all pair-wise similarities for the sequences in the SwissProt database using the Smith-Waterman local alignment algorithm. This data was transformed into a directed graph, where protein sequences constitute vertices. A directed edge was drawn from vertex A to vertex B if the sequences A and B showed similarity, scaled with respect to the self-similarity of A, above a fixed threshold. Transitivity was important in the clustering process, as intermediate sequences were used, limited though by the requirement of having directed paths in both directions between proteins linked over such sequences. The length dependency of the scaling of the alignment scores, implied by the self-similarity, appears to be an effective criterion to avoid clustering errors due to multi-domain proteins. To deal with the resulting large graphs we have developed an efficient library. Methods include the novel graph-based clustering algorithm capable of handling multi-domain proteins and cluster comparison algorithms. Structural Classification of Proteins (SCOP) was used as an evaluation data set for our method, yielding a 24% improvement over pair-wise comparisons in terms of detecting remote homologues. AVAILABILITY: The software is available to academic users on request from the authors. CONTACT: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]. SUPPLEMENTARY INFORMATION: http://www.zaik.uni-koeln.de/~schliep/ProtClust.html.
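The directed-graph construction described above can be sketched on toy data. The sketch below reads "directed paths in both directions" as mutual reachability, i.e. clusters are strongly connected components; that reading, the function names, the score values and the 0.5 threshold are assumptions for illustration and do not reproduce the authors' library or their actual clustering algorithm.

```python
def build_graph(scores, threshold):
    """Draw an edge a -> b when score(a, b) / score(a, a) >= threshold.

    `scores` maps (a, b) to an alignment score and must contain the
    self-similarity (a, a) for every sequence a.
    """
    names = sorted({name for pair in scores for name in pair})
    graph = {name: set() for name in names}
    for (a, b), score in scores.items():
        if a != b and score / scores[(a, a)] >= threshold:
            graph[a].add(b)
    return graph


def reachable(graph, start):
    """All vertices reachable from `start` along directed edges."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def clusters(graph):
    """Group vertices joined by directed paths in both directions."""
    reach = {a: reachable(graph, a) for a in graph}
    assigned = set()
    result = []
    for a in sorted(graph):
        if a in assigned:
            continue
        group = {b for b in reach[a] if a in reach[b]}
        assigned |= group
        result.append(sorted(group))
    return result


# Toy scores: S, S2 and T are short single-domain sequences (self-score 100);
# M is a long multi-domain sequence (self-score 300) that matches S and T.
scores = {
    ("S", "S"): 100, ("S2", "S2"): 100, ("T", "T"): 100, ("M", "M"): 300,
    ("S", "S2"): 80, ("S2", "S"): 80,
    ("S", "M"): 90, ("M", "S"): 90,
    ("T", "M"): 90, ("M", "T"): 90,
}
graph = build_graph(scores, threshold=0.5)
print(clusters(graph))  # [['M'], ['S', 'S2'], ['T']]
```

Because the score is scaled by the self-similarity of the source vertex, S and T each point to M (90/100) but M points to neither (90/300), so the multi-domain sequence M does not merge S and T into one cluster.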

Predicting class II MHC/peptide multi-level binding with an iterative stepwise discriminant analysis meta-algorithm. (32/5719)

MOTIVATION: Predicting peptides that bind to both Major Histocompatibility Complex (MHC) molecules and T cell receptors provides crucial information for vaccine development. An agretope is that portion of a peptide that interacts with an MHC molecule. The identification and prediction of agretopes is the first step towards vaccine design. RESULTS: An iterative stepwise discriminant analysis meta-algorithm is utilized to derive a quantitative motif for classifying potential agretopes as high-, moderate- or non-binders for HLA-DR1, a class II MHC molecule. A large molecular online database provides the input for this data-driven algorithm. The model correctly classifies over 85% of the peptides in the database. AVAILABILITY: Stepwise discriminant analysis software is available commercially in SPSS and BMDP statistical software packages. Peptides known to bind MHC molecules can be downloaded from http://wehih.wehi.edu.au/mhcpep/. Peptides known not to bind HLA-DR1 are available from the author upon request. CONTACT: [email protected].