Thursday, October 27, 2011

HyPhy - Hypothesis testing using phylogenies

1. HyPhy is a scriptable package that can fit statistical evolutionary models to alignments of homologous sequences using maximum likelihood, estimate various parameters that have biological meaning (for example, branch lengths, substitution rates, dN/dS ratios, and recombination breakpoints), and test hypotheses about how the sequences in the alignment have evolved. HyPhy focuses on inference about the evolutionary process. Even though it can do limited alignment and phylogenetic reconstruction, much better specialized programs exist for these purposes.
Here are some of the applications that HyPhy is often used for:
  • Positive and negative selection detection
  • Recombination analysis
  • Detecting co-evolving residues
  • Genomic and multiple-gene evolutionary inference
  • Molecular clock and relative rate tests
  • Nucleotide, protein and codon model selection
  • As a likelihood analysis engine for other software and web services
  • One-off analyses: tasks that no other package does out of the box and that are not worth writing a specialized program for
http://www.datam0nk3y.org/hyphy/doku.php

2. Some of the most popular HyPhy functions (recombination, positive selection detection, etc.) are implemented in a web server hosted at http://www.datamonkey.org

Which codon sites are under diversifying positive or negative selection?
Three different codon-based maximum likelihood methods, SLAC, FEL and REL, can be used to estimate the dN/dS (also known as Ka/Ks or ω) ratio at every codon in the alignment. An exhaustive discussion of each approach can be found in the methodology paper. All methods can also take recombination into account. This is done by screening the sequences for recombination breakpoints, identifying non-recombinant regions and allowing each to have its own phylogenetic tree.
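To make the dN/dS idea concrete, here is a minimal Python sketch of the distinction it rests on: whether a codon change alters the encoded amino acid (nonsynonymous) or not (synonymous). This is not how SLAC, FEL or REL work internally (they fit codon models by maximum likelihood); the function names are made up for illustration, and skipping changes to stop codons is a simplifying assumption.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order
# at each of the three positions.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {}
for i, first in enumerate(BASES):
    for j, second in enumerate(BASES):
        for k, third in enumerate(BASES):
            CODON_TABLE[first + second + third] = AMINO_ACIDS[16 * i + 4 * j + k]


def classify_substitution(codon_from, codon_to):
    """Label a codon change as 'synonymous' or 'nonsynonymous'."""
    aa_from = CODON_TABLE[codon_from.upper()]
    aa_to = CODON_TABLE[codon_to.upper()]
    return "synonymous" if aa_from == aa_to else "nonsynonymous"


def count_single_step_changes(codon):
    """Count synonymous and nonsynonymous one-nucleotide changes from a codon.

    Changes that create a stop codon are skipped (a simplifying assumption).
    """
    codon = codon.upper()
    synonymous = 0
    nonsynonymous = 0
    for position in range(3):
        for base in BASES:
            if base == codon[position]:
                continue
            neighbor = codon[:position] + base + codon[position + 1:]
            if CODON_TABLE[neighbor] == "*":
                continue
            if classify_substitution(codon, neighbor) == "synonymous":
                synonymous += 1
            else:
                nonsynonymous += 1
    return synonymous, nonsynonymous


print(classify_substitution("CTT", "CTC"))  # both codons encode leucine
print(count_single_step_changes("CTT"))
```

Because codons differ in how many of their one-step neighbors are synonymous, dN and dS are rates normalized by these opportunities rather than raw counts of changes.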
Is there evidence of selection in my alignment?
The PARRIS method, developed by Konrad Scheffler and colleagues, extends traditional codon-based likelihood ratio tests to detect if a proportion of sites in the alignment evolve with dN/dS>1. The method takes recombination and synonymous rate variation into account.
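For readers unfamiliar with likelihood ratio tests, the generic calculation can be sketched as below. This is only the textbook test for nested models, not PARRIS itself; the log-likelihood values are illustrative numbers, and the choice of 2 degrees of freedom is an assumption made for the example.

```python
import math


def likelihood_ratio_test(lnl_null, lnl_alt, df):
    """Return (statistic, p_value) for a nested-model likelihood ratio test.

    Uses closed-form chi-square tail probabilities, which exist for
    1 and 2 degrees of freedom; other values are rejected.
    """
    statistic = max(2.0 * (lnl_alt - lnl_null), 0.0)
    if df == 1:
        p_value = math.erfc(math.sqrt(statistic / 2.0))
    elif df == 2:
        p_value = math.exp(-statistic / 2.0)
    else:
        raise ValueError("this sketch only handles df = 1 or df = 2")
    return statistic, p_value


# Illustrative values: a null model without a dN/dS > 1 class
# versus an alternative model that allows one.
stat, p = likelihood_ratio_test(lnl_null=-1000.0, lnl_alt=-997.0, df=2)
print(f"LRT statistic = {stat:.2f}, p = {p:.4f}")
```

A small p-value means the alternative model, which allows some sites to evolve with dN/dS > 1, fits significantly better than the null.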
What is the evolutionary fingerprint of a gene?
The ESD method, described in a recent paper, fits a versatile general discrete bivariate model of site-by-site selective force variation to partition all sites into selective classes, and obtains an approximate posterior distribution of this partitioning. The resulting "noisy" distribution of selective regimes is the evolutionary fingerprint of a gene. The EVF (evolutionary fingerprinting) module implements this procedure, and can also infer which individual sites appear to be positively selected while accounting for parameter estimation error (analogous to the BEB methodology of the PAML package).
Which codon sites are under positive or negative selection at the population level?
The codon-based maximum likelihood IFEL method can investigate whether sequences sampled from a population (e.g. viral sequences from different hosts) have been subject to selective pressure at the population level (i.e. along internal branches). A discussion of the method and its application can be found here
Did selective pressure vary along lineages, i.e. over time?
The codon-based genetic algorithm GABranch method can automatically partition all branches of the phylogeny describing non-recombinant data into groups according to dN/dS. Robust multi-model inference is used to collate results from all models examined during the run to provide confidence intervals on dN/dS for each branch and guard against model misspecification and overfitting (method details).
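One common way to collate results across many models is to weight each model by its information-criterion score and average the quantity of interest. The sketch below uses Akaike weights for this; treating that as the weighting scheme is an assumption made for illustration, and the scores and per-model dN/dS values are made up.

```python
import math


def akaike_weights(scores):
    """Convert information-criterion scores (lower is better) to weights."""
    best = min(scores)
    raw = [math.exp(-0.5 * (score - best)) for score in scores]
    total = sum(raw)
    return [value / total for value in raw]


def model_averaged_estimate(estimates, scores):
    """Average per-model estimates, weighting each model by its Akaike weight."""
    weights = akaike_weights(scores)
    return sum(w * x for w, x in zip(weights, estimates))


# Illustrative values: dN/dS for one branch under three candidate models.
branch_omega = [0.20, 0.35, 1.40]
model_scores = [2501.3, 2502.1, 2510.8]
print(model_averaged_estimate(branch_omega, model_scores))
```

Models with much worse scores contribute almost nothing, so a single poorly supported model with an extreme estimate cannot dominate the average.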
How about episodic diversifying selection (branch-site methods)?
Using a modeling framework that allows efficient estimation with models that permit dN/dS to vary across both sites and lineages, Datamonkey implements two tests geared towards finding lineages and sites subject to episodic diversifying selection (EDS).
The Branch-site REL method identifies those branches where a proportion of sites evolves under EDS. If you are primarily interested in finding which lineages (but don't care about which sites) have experienced EDS, use this method. Alternatively, if you are interested in sites (but don't care about which lineages) subject to EDS, then the MEME method is appropriate.
What about different types of selection?
Protein sequences can be screened for evidence of directional selection using the DEPS method, described here, which is useful when one wants to detect convergent evolution or selective sweeps. For coding sequences, the TOGGLE model, developed by Wayne Delport and colleagues, can detect selection-driven changes that result in amino-acid toggling. A canonical example of this can be found in immune-driven evolution of HIV-1 (escape and reversion).
Which evolutionary model should I use for my data?
For each type of data (nucleotide, amino-acid and codon), Datamonkey implements a separate model selection procedure. For nucleotide data, an exhaustive search is performed over all possible (Markov, time-reversible) models of nucleotide evolution. For protein data, a collection of published empirical models is fitted to the alignment and the best one is selected using AICc. Finally, for coding data, a sophisticated genetic-algorithm procedure described in our recent paper is used to examine thousands of potential models and report the best one and various metrics based on the set of credible models - this feature is implemented in the CMS module.
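As a rough illustration of selection by AICc, the small-sample corrected Akaike information criterion is AICc = 2k - 2 lnL + 2k(k+1)/(n - k - 1), where k is the number of estimated parameters and n is the sample size. In the sketch below the model names are real empirical protein models, but the log-likelihoods and parameter counts are made-up numbers, and taking n to be the number of alignment columns is an assumption.

```python
def aicc(log_likelihood, num_params, sample_size):
    """Small-sample corrected Akaike information criterion (lower is better)."""
    if sample_size - num_params - 1 <= 0:
        raise ValueError("sample size is too small for this many parameters")
    aic = 2.0 * num_params - 2.0 * log_likelihood
    correction = 2.0 * num_params * (num_params + 1) / (sample_size - num_params - 1)
    return aic + correction


def best_model(fits, sample_size):
    """Return the name of the model with the lowest AICc.

    `fits` maps a model name to a (log_likelihood, num_params) pair.
    """
    scores = {}
    for name, (log_likelihood, num_params) in fits.items():
        scores[name] = aicc(log_likelihood, num_params, sample_size)
    return min(scores, key=scores.get)


# Illustrative fits for a 300-column alignment (made-up values).
fits = {
    "JTT": (-5230.0, 17),
    "WAG": (-5210.0, 17),
    "WAG+F": (-5195.0, 36),
}
print(best_model(fits, sample_size=300))
```

In this example the model with the highest log-likelihood does not win, because its extra parameters cost more in penalty than they gain in fit.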
Did any sites co-evolve?
A Bayesian graphical model is deduced from reconstructed substitutions at each branch/site combination to infer conditional evolutionary dependencies of sites in the alignment, i.e. whether a site is more or less likely to experience a non-synonymous substitution at a branch when certain other sites do (or do not) experience non-synonymous substitutions at the same branch. The SPIDERMONKEY method was introduced in the evolutionary context in our paper on the evolution of the phenotypically important and highly variable V3 loop of the envelope glycoprotein in HIV-1.
Has recombination acted upon sequences in an alignment?
Recombination leaves an imprint on sequence alignments: different segments of the alignment may be described by different phylogenetic trees, a pattern called phylogenetic discordance. Datamonkey.org implements two methods: SBP, suitable for answering the question "Is there evidence of recombination in the alignment?", and GARD, which attempts to find all the recombination breakpoints. Both methods are described in this paper. The output of GARD is accepted by most other analyses, and because recombination can mislead phylogenetic analyses that do not account for it, we strongly urge that recombination testing be done on any alignment that is going to be analyzed for positive selection. You can also submit a collection of HIV-1 sequences for recombination screening by a specialized recombination detection algorithm, SCUEAL, described in this paper.
What were the ancestral sequences?
The ASR module implements three different approaches to reconstructing ancestral sequences from simple or partitioned alignments: joint, marginal and sampled (see this paper for a description and original methodology attribution).
3. One feature of HyPhy:

A random effects branch-site model for detecting episodic diversifying selection

http://mbe.oxfordjournals.org/content/early/2011/06/11/molbev.msr125.abstract
