Relatedness of baculovirus and gypsy retrotransposon envelope proteins.
BACKGROUND: Current evidence suggests that lepidopteran baculoviruses may be divided into two phylogenetic groups based on their envelope fusion proteins. One group utilizes gp64, a low-pH-dependent envelope fusion protein, whereas the other employs a protein family (e.g. LD130 in the Lymantria dispar nucleopolyhedrovirus) that is unrelated to gp64 but is also low-pH-dependent. Database searches with members of the LD130 protein family often record significant levels of homology to envelope proteins from a number of insect retrovirus-like transposable elements of the gypsy class. In this report, the significance of the homology between these two types of envelope proteins is analyzed. RESULTS: The significance of the alignment scores was evaluated using Z-scores, calculated by comparing the observed alignment score to the distribution of scores obtained after one of the sequences was subjected to 100 random shuffles. These analyses resulted in Z-scores of >9 for members of the LD130 family when compared to most gypsy envelope proteins. Furthermore, in addition to significant levels of sequence homology and the presence of predicted signal sequences and transmembrane domains, members of this family contain a possible furin cleavage motif, a conserved motif downstream of this site, predicted coiled-coil domains, and a pattern of conserved cysteine residues. CONCLUSIONS: These analyses provide a link between envelope proteins from a group of insect retrovirus-like elements and a baculovirus protein family that includes low-pH-dependent envelope fusion proteins. The ability of gypsy retroelements to transpose from insect into baculovirus genomes suggests a pathway for the exchange of this protein between these viral families.
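The shuffle-based Z-score described in the RESULTS of this abstract can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the abstract does not name the alignment program or scoring scheme, so the Smith-Waterman scoring parameters (match, mismatch, linear gap), the function names, and the fixed random seed below are all assumptions.

```python
import random
import statistics


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score with a linear gap penalty.

    The scoring values are illustrative assumptions, not taken from the paper.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def shuffle_z_score(a, b, n_shuffles=100, seed=0):
    """Z-score of the observed score against scores for shuffled copies of b."""
    rng = random.Random(seed)
    observed = local_alignment_score(a, b)
    letters = list(b)
    null_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # preserves composition, destroys residue order
        null_scores.append(local_alignment_score(a, "".join(letters)))
    mean = statistics.mean(null_scores)
    sd = statistics.stdev(null_scores)
    if sd == 0:
        # Degenerate null distribution: no meaningful Z-score can be formed.
        return float("inf") if observed > mean else 0.0
    return (observed - mean) / sd
```

Shuffling keeps the amino-acid composition of the sequence fixed, so a high Z-score indicates that the observed score depends on residue order rather than on composition alone.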
Cultural aspects of cancer genetics: setting a research agenda.
BACKGROUND: Anecdotal evidence suggests that people from non-Anglo-Celtic backgrounds are under-represented at familial cancer clinics in the UK, the USA, and Australia. This article discusses cultural beliefs as a potential key barrier to access, reviews previous empirical research on cultural aspects of cancer genetics, draws implications from findings, and sets a research agenda on the inter-relationships between culture, cancer genetics, and kinship. METHODS: The CD-ROM databases MEDLINE, PsychLIT, CINAHL, and Sociological Abstracts were searched from 1980 onwards. RESULTS: Cultural aspects of cancer genetics are the focus of an emerging body of publications. Almost all studies assessed African-American women with a family history of breast cancer, and few studies included more diverse samples, such as Americans of Ashkenazi Jewish background or Hawaiian- and Japanese-Americans. Our analysis of published reports suggests several directions for future research. First, an increased focus on various Asian societies appears warranted. Research outside North America could explore the extent to which findings can be replicated in other multicultural settings. In addition, control group designs are likely to benefit from systematically assessing culture-based beliefs and cultural identity in the "majority culture" group used for comparative purposes. CONCLUSION: More data are needed on which to base the provision of culturally appropriate familial cancer clinic services to ethnically diverse societies. Empirical data will assist with culturally appropriate categorisation of people from other cultures into risk groups based on their family histories and provide the basis for the development of culturally appropriate patient education strategies and materials.
An efficient algorithm for finding short approximate non-tandem repeats.
We study the problem of approximate non-tandem repeat extraction. Given a long subject string S of length N over a finite alphabet Sigma and a threshold D, we would like to find all short substrings of S of length P that repeat with at most D differences, i.e., insertions, deletions, and mismatches. We give a careful theoretical characterization of the set of seeds (i.e., some maximal exact repeats) required by the algorithm, and prove a sublinear bound on their expected numbers. Using this result, we present a sub-quadratic algorithm for finding all short (i.e., of length O(log N)) approximate repeats. The running time of our algorithm is O(D N^(3 pow(epsilon) - 1) log N), where epsilon = D/P and pow(epsilon) is an increasing, concave function that is 0 when epsilon = 0 and about 0.9 for DNA and protein sequences.
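To make the problem statement concrete, here is a brute-force quadratic baseline. It is not the seed-based sub-quadratic algorithm of the paper. It also makes a simplifying assumption: both occurrences are taken as windows of exactly length P and compared by edit distance, and the two windows must not overlap. The function names are hypothetical.

```python
def edit_distance(x, y):
    """Levenshtein distance: insertions, deletions, and mismatches each cost 1."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            curr.append(min(prev[j] + 1,                 # delete cx
                            curr[j - 1] + 1,             # insert cy
                            prev[j - 1] + (cx != cy)))   # match or mismatch
        prev = curr
    return prev[-1]


def naive_approximate_repeats(s, p, d):
    """Return start-position pairs (i, j) of non-overlapping length-p windows
    of s whose edit distance is at most d. Checks all O(N^2) window pairs."""
    hits = []
    for i in range(len(s) - p + 1):
        for j in range(i + p, len(s) - p + 1):
            if edit_distance(s[i:i + p], s[j:j + p]) <= d:
                hits.append((i, j))
    return hits
```

Every pair of windows costs O(P^2) to compare, which is the quadratic behaviour that a seed-filtering approach is meant to avoid.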
Fast optimal leaf ordering for hierarchical clustering.
We present the first practical algorithm for the optimal linear leaf ordering of trees that are generated by hierarchical clustering. Hierarchical clustering has been extensively used to analyze gene expression data, and we show how optimal leaf ordering can reveal biological structure that is not observed with an existing heuristic ordering method. For a tree with n leaves, there are 2^(n-1) linear orderings consistent with the structure of the tree. Our optimal leaf ordering algorithm runs in time O(n^4), and we present further improvements that make the running time of our algorithm practical.
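The count of consistent orderings can be checked on a small example: each of the n-1 internal nodes of a binary tree can be flipped independently, which gives 2^(n-1) leaf orders. The sketch below enumerates them all and picks the one with the smallest sum of distances between adjacent leaves. It is exponential brute force for illustration only, not the O(n^4) dynamic program of the paper. The nested-tuple tree encoding, the objective written as a distance sum, and the function names are assumptions.

```python
from itertools import product


def leaf_orderings(tree):
    """Yield every leaf order reachable by flipping internal nodes.

    A tree is either a leaf label or a 2-tuple (left_subtree, right_subtree).
    """
    if not isinstance(tree, tuple):
        yield [tree]
        return
    left, right = tree
    left_orders = list(leaf_orderings(left))
    right_orders = list(leaf_orderings(right))
    for lo, ro in product(left_orders, right_orders):
        yield lo + ro  # children in original orientation
        yield ro + lo  # children flipped


def ordering_cost(order, dist):
    """Sum of distances between leaves that are adjacent in the order."""
    return sum(dist[a][b] for a, b in zip(order, order[1:]))


def brute_force_optimal_ordering(tree, dist):
    """Return a minimum-cost ordering among all 2^(n-1) consistent ones."""
    return min(leaf_orderings(tree), key=lambda o: ordering_cost(o, dist))


# Example: four leaves placed on a line, clustered as ((a, b), (c, d)).
positions = {"a": 0, "b": 1, "c": 5, "d": 6}
dist = {u: {v: abs(positions[u] - positions[v]) for v in positions}
        for u in positions}
tree = (("a", "b"), ("c", "d"))
best = brute_force_optimal_ordering(tree, dist)
```

For n = 4 the generator yields 8 orderings, matching 2^(n-1). Enumeration quickly becomes infeasible as n grows, which is why a polynomial-time algorithm matters.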
GEST: a gene expression search tool based on a novel Bayesian similarity metric.
Gene expression array technology has made possible the assay of expression levels of tens of thousands of genes at a time; large databases of such measurements are currently under construction. One important use of such databases is the ability to search for experiments whose gene expression levels are similar to those of a query, potentially identifying previously unsuspected relationships among cellular states. Such searches depend crucially on the metric used to assess the similarity between pairs of experiments. The complex joint distribution of gene expression levels, particularly their correlational structure and non-normality, makes simple similarity metrics such as Euclidean distance or correlational similarity scores suboptimal for use in this application. We present a similarity metric for gene expression array experiments that takes into account the complex joint distribution of expression values. We provide a computationally tractable approximation to this measure, and have implemented a database search tool based on it. We discuss implementation issues and efficiency, and we compare our new metric to other standard metrics.
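For reference, the two simple baseline metrics named in this abstract can be written as below. This sketch covers only those baselines. The Bayesian metric itself is not specified in the abstract and is not reproduced here. The function names and the toy profiles are assumptions.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two equal-length expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Toy profiles: y is a scaled and shifted copy of x.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
d_xy = euclidean_distance(x, y)
r_xy = pearson_correlation(x, y)
```

The two baselines can disagree: here the profiles are perfectly correlated yet far apart in Euclidean terms, and neither metric models the joint distribution of expression values across genes.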
New approaches for reconstructing phylogenies from gene order data.
We report on new techniques we have developed for reconstructing phylogenies on whole genomes. Our mathematical techniques include new polynomial-time methods for bounding the inversion length of a candidate tree and new polynomial-time methods for estimating genomic distances which greatly improve the accuracy of neighbor-joining analyses. We demonstrate the power of these techniques through an extensive performance study based on simulating genome evolution under a wide range of model conditions. Combining these new tools with standard approaches (fast reconstruction with neighbor-joining, exploration of all possible refinements of strict consensus trees, etc.) has allowed us to analyze datasets that were previously considered computationally impractical. In particular, we have conducted a complete phylogenetic analysis of a subset of the Campanulaceae family, confirming various conjectures about the relationships among members of the subset and about the principal mechanism of evolution for their chloroplast genome. We give representative results of the extensive experimentation we conducted on both real and simulated datasets in order to validate and characterize our approaches. We find that our techniques provide very accurate reconstructions of the true tree topology even when the data are generated by processes that include a significant fraction of transpositions and when the data are close to saturation.
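As background for the gene-order distances mentioned in this abstract, the sketch below computes the breakpoint distance between two signed gene orders. This is a standard simple measure, shown only for orientation. It is not the new distance estimator or the inversion-length bound that the authors report. The sketch assumes linear genomes with genes labelled 1..n (chloroplast genomes are circular, so this is a simplification), and the function name is hypothetical.

```python
def breakpoint_distance(g1, g2):
    """Count adjacencies of signed gene order g1 that are not preserved in g2.

    Genomes are lists of nonzero signed integers over the same genes 1..n,
    treated as linear and framed by artificial end markers 0 and n + 1.
    An adjacency (a, b) is preserved if g2 contains a, b consecutively or
    the reversed, sign-flipped pair -b, -a.
    """
    if sorted(map(abs, g1)) != sorted(map(abs, g2)):
        raise ValueError("genomes must contain the same set of genes")
    n = len(g1)
    framed1 = [0] + list(g1) + [n + 1]
    framed2 = [0] + list(g2) + [n + 1]
    adjacencies2 = set()
    for a, b in zip(framed2, framed2[1:]):
        adjacencies2.add((a, b))
        adjacencies2.add((-b, -a))  # same adjacency read on the other strand
    return sum((a, b) not in adjacencies2
               for a, b in zip(framed1, framed1[1:]))


# A single inversion of the segment 2..3 creates two breakpoints.
identity = [1, 2, 3, 4, 5]
inverted = [1, -3, -2, 4, 5]
bp = breakpoint_distance(identity, inverted)
```

A single inversion changes at most two adjacencies, which is why breakpoint counts are often used as a cheap proxy when reasoning about inversion distances.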
Designing fast converging phylogenetic methods.
Absolute fast converging phylogenetic reconstruction methods are provably guaranteed to recover the true tree with high probability from sequences that grow only polynomially in the number of leaves, once the edge lengths are bounded arbitrarily from above and below. Only a few methods have been determined to be absolute fast converging; these have all been developed in just the last few years, and most are polynomial time. In this paper, we compare pre-existing fast converging methods as well as some new polynomial time methods that we have developed. Our study, based upon simulating evolution under a wide range of model conditions, establishes that our new methods outperform both neighbor joining and the previous fast converging methods, returning very accurate large trees, when these other methods do poorly.
Inferring subnetworks from perturbed expression profiles.
Genome-wide expression profiles of genetic mutants provide a wide variety of measurements of cellular responses to perturbations. Typical analysis of such data identifies genes affected by perturbation and uses clustering to group genes of similar function. In this paper we discover a finer structure of interactions between genes, such as causality, mediation, activation, and inhibition by using a Bayesian network framework. We extend this framework to correctly handle perturbations, and to identify significant subnetworks of interacting genes. We apply this method to expression data of S. cerevisiae mutants and uncover a variety of structured metabolic, signaling and regulatory pathways.