MOTIVATION: Identification of novel G protein-coupled receptors and other multi-transmembrane proteins from genomic databases using structural features. RESULTS: Here we describe a new algorithm, which we call the quasi-periodic feature classifier (QFC), for identifying multi-transmembrane proteins from genomic databases, with a specific application to identifying G protein-coupled receptors (GPCRs). The QFC algorithm uses concise statistical variables as the 'feature space' to characterize the quasi-periodic physico-chemical properties of multi-transmembrane proteins. For the case of identifying GPCRs, the variables are then used in a non-parametric linear discriminant function to separate GPCRs from non-GPCRs. The algorithm runs in time linearly proportional to the number of sequences, and performance on a test dataset shows 96% positive identification of known GPCRs. The QFC algorithm also works well with short random segments of proteins: it positively identified GPCRs at a level greater than 90% even with segments as short as 100 amino acids. The primary advantage of the algorithm is that it does not directly use primary sequence patterns, which may be subject to sampling bias. The utility of the new algorithm has been demonstrated by the isolation from the Drosophila genome project database of a novel class of seven-transmembrane proteins, which were shown to be the elusive olfactory receptor genes of Drosophila.
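
The abstract does not list the statistical variables or the discriminant weights, so the following is only a minimal sketch of the general idea: summarize the quasi-periodic hydrophobicity of a sequence in a few numbers and score them with a linear function. The choice of the Kyte-Doolittle hydropathy scale, the window size, the cutoff, the three features, and the weights are all assumptions made for illustration and are not the published QFC parameters.

```python
# Illustrative sketch only. The features, window, cutoff, and weights below
# are hypothetical placeholders, not the variables used by QFC.

# Kyte-Doolittle hydropathy scale (higher = more hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=15):
    """Sliding-window mean hydropathy; unknown residues count as 0."""
    values = [KD.get(aa, 0.0) for aa in seq.upper()]
    if len(values) < window:
        return []
    total = sum(values[:window])
    profile = [total / window]
    for i in range(window, len(values)):
        total += values[i] - values[i - window]
        profile.append(total / window)
    return profile


def hydrophobic_segments(profile, cutoff=1.6):
    """Return (start, end) pairs for runs where the profile exceeds cutoff."""
    segments, start = [], None
    for i, value in enumerate(profile):
        if value > cutoff and start is None:
            start = i
        elif value <= cutoff and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(profile)))
    return segments


def periodicity_features(seq, window=15, cutoff=1.6):
    """Three toy features: segment count, mean spacing, spacing std. dev."""
    segments = hydrophobic_segments(hydropathy_profile(seq, window), cutoff)
    centers = [(s + e) / 2 for s, e in segments]
    gaps = [b - a for a, b in zip(centers, centers[1:])]
    if not gaps:
        return [float(len(segments)), 0.0, 0.0]
    mean_gap = sum(gaps) / len(gaps)
    sd_gap = (sum((g - mean_gap) ** 2 for g in gaps) / len(gaps)) ** 0.5
    return [float(len(segments)), mean_gap, sd_gap]


def linear_discriminant(x, weights, bias):
    """Score w.x + b; a positive score is read as 'candidate'."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


if __name__ == "__main__":
    # Toy sequence: seven hydrophobic stretches separated by charged loops.
    toy = ("D" * 20 + "L" * 20) * 7 + "D" * 20
    feats = periodicity_features(toy)
    # Placeholder weights: favor about seven evenly spaced segments.
    print(feats, linear_discriminant(feats, [1.0, 0.0, -0.5], -6.5))
```

Each sequence is processed in a single pass, which is consistent with a running time that grows linearly with the number of sequences. In practice the weights would be fitted on labeled GPCR and non-GPCR sequences rather than set by hand.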

MOTIVATION: An important step in extracting protein sequences from nucleotide sequences is to recognize the points at which protein-coding regions start. These points are called translation initiation sites (TIS). RESULTS: The task of finding TIS can be modeled as a classification problem. We demonstrate the applicability of support vector machines for this task, and show how to incorporate prior biological knowledge by engineering an appropriate kernel function. With the described techniques the recognition performance can be improved by 26% over leading existing approaches. We provide evidence that existing related methods (e.g. ESTScan) could profit from advanced TIS recognition.
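
The abstract does not specify the kernel, so the following is only a sketch of one way prior knowledge might be built into a kernel for fixed-length sequence windows around a candidate start codon: matches that cluster locally are rewarded more than the same number of scattered matches. The function names, the window half-width, and the exponents `d1` and `d2` are assumptions for illustration, not the authors' settings.

```python
# Illustrative sketch only: a string kernel that emphasizes local
# correlations between nucleotide matches. All parameters are hypothetical.


def local_match_kernel(x, y, half_window=2, d1=2, d2=1):
    """Compare two equal-length sequences position by position.

    For each position p, count matches in the window
    [p - half_window, p + half_window], raise that count to d1,
    sum over all positions, and raise the total to d2.
    """
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    n = len(x)
    match = [1 if a == b else 0 for a, b in zip(x, y)]
    total = 0
    for p in range(n):
        lo = max(0, p - half_window)
        hi = min(n, p + half_window + 1)
        total += sum(match[lo:hi]) ** d1
    return total ** d2


def gram_matrix(seqs, **kernel_args):
    """Pairwise kernel values for a list of equal-length sequences."""
    return [[local_match_kernel(a, b, **kernel_args) for b in seqs]
            for a in seqs]


if __name__ == "__main__":
    reference = "AAAAAAAA"
    clustered = "AAAACCCC"   # four matches, adjacent
    scattered = "ACACACAC"   # four matches, spread out
    print(local_match_kernel(reference, clustered))
    print(local_match_kernel(reference, scattered))
```

A Gram matrix built this way could be handed to any support vector machine implementation that accepts precomputed kernels. With `d1 = 1` the local windows no longer amplify clustered matches, and the kernel reduces to a weighted count of matching positions.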

Comparative analysis is one of the most powerful methods available for understanding the diverse and complex systems found in biology, but it is often limited by a lack of comprehensive taxonomic sampling. Despite the recent development of powerful genome technologies capable of producing sequence data in large quantities (witness the recently completed first draft of the human genome), there has been relatively little change in how evolutionary studies are conducted. The application of genomic methods to evolutionary biology is a challenge, in part because gene segments from different organisms are manipulated separately, requiring individual purification, cloning, and sequencing. We suggest that a feasible approach to collecting genome-scale data sets for evolutionary biology (i.e., evolutionary genomics) may consist of combining DNA samples prior to cloning and sequencing, followed by computational reconstruction of the original sequences. This approach will allow the full benefit of automated protocols developed by genome projects to be realized; taxon sampling levels can easily increase to thousands for targeted genomes and genomic regions. Sequence diversity at this level will dramatically improve the quality and accuracy of phylogenetic inference, as well as the accuracy and resolution of comparative evolutionary studies. In particular, it will be possible to make accurate estimates of normal evolution in the context of constant structural and functional constraints (i.e., site-specific substitution probabilities), along with accurate estimates of changes in evolutionary patterns, including pairwise coevolution between sites, adaptive bursts, and changes in selective constraints. These estimates can then be used to understand and predict the effects of protein structure and function on sequence evolution and to predict unknown details of protein structure, function, and functional divergence.
In order to demonstrate the practicality of these ideas and the potential benefit for functional genomic analysis, we describe a pilot project we are conducting to simultaneously sequence large numbers of vertebrate mitochondrial genomes.

The model eukaryote Saccharomyces cerevisiae (budding yeast) has provided significant insight into sterol homeostasis. The study of sterol metabolism in a genetically amenable model organism such as yeast is likely to have an even greater impact and relevance to human disease with the advent of the complete human genome sequence. In addition to the near-complete definition of the sterol biosynthetic pathway, the remarkable conservation of other components of sterol homeostasis is described in this review.

Xylella fastidiosa, a pathogen of citrus, is the first plant pathogenic bacterium for which the complete genome sequence has been published. Inspection of the sequence reveals high relatedness to many genes of other pathogens, notably Xanthomonas campestris. Based on this, we suggest that Xylella possesses certain easily testable properties that contribute to pathogenicity. We also present some general considerations for deriving information on pathogenicity from bacterial genomics.

The vertebrate genome contains a predicted 50 000-100 000 genes, many of unknown function. The recent development of morpholino-based gene knock-down technology in zebrafish has opened the door to the genome-wide assignment of function based on sequence in a model vertebrate. This review describes technical aspects of morpholino use for functional genomics applications, including the potential for multigene targeting and known methodological limitations. We propose that embryos resulting from successful gene inactivation by this agent be given the phenotypic designation 'morphant'. The establishment of a morphant database opens the door to true functional genomics using the vertebrate Danio rerio.

One could almost say that it is the latest fashion to sequence a bacterial genome. However, this would belittle the efforts of those working on these important organisms, whose data will greatly help those working on the prevention of disease in the fields of medicine and agriculture. In this feature we present a guided tour of the latest additions to the 'sequenced microbes' club. Vibrio cholerae is the causative agent of cholera, which is still a threat in countries with poor sanitation and unsafe drinking water. Pseudomonas aeruginosa is responsible for a large proportion of opportunistic human infections, typically infecting those with compromised immune systems, particularly cystic fibrosis patients, patients on respirators, and burn victims. Xylella fastidiosa is a plant pathogen that attacks citrus fruits by blocking the xylem, resulting in juiceless fruits of no commercial value.

The meeting was held on 16-20 July 2000 at the International Convention Centre in Birmingham, UK, and was co-organized by the International Union of Biochemistry and Molecular Biology (IUBMB) and the Federation of European Biochemical Societies (FEBS). Although the meeting had a broad subject area, the emphasis was firmly placed on post-genomic studies, and hence several sessions should be of interest to our readers. We provide highlights of these sessions, bringing you a report on the most exciting and informative presentations.