2011年8月28日星期日

population structure - the number of inferred clusters depends on sample size

Clustering methods have been used extensively to unravel cryptic population genetic structure. We investigated the effect of the number of individuals sampled in each location on the resulting number of clusters. Our study was motivated by recent results in Arabidopsis thaliana: studies in which more than one individual was sampled per location apparently have led to a much higher number of clusters than studies where only one individual was sampled in each location, as is generally done in this species. We show, using computer simulations and microsatellite data in A. thaliana, that the number of sampled individuals indeed has a strong impact on the number of resulting clusters. This effect is smaller if the sampled populations have a hierarchical structure. In most cases, sampling 5–10 individuals per population should be enough. The results argue for abandoning the concept of ‘accessions’ in partially selfing organisms.

http://onlinelibrary.wiley.com/doi/10.1111/j.1755-0998.2009.02756.x/full


2011年8月25日星期四

GCTA (Genome-wide Complex Trait Analysis)

http://gump.qimr.edu.au/gcta/



GCTA (Genome-wide Complex Trait Analysis) is designed to estimate the proportion of phenotypic variance explained by genome- or chromosome-wide SNPs for complex traits. GCTA was developed by Jian Yang, Hong Lee, Mike Goddard and Peter Visscher and is maintained in Peter Visscher's lab at the Queensland Institute of Medical Research . GCTA currently supports the following functionalities:
  • Estimate the genetic relationship from genome-wide SNPs;
  • Estimate the inbreeding coefficient from genome-wide SNPs;
  • Estimate the variance explained by all the autosomal SNPs;
  • Partition the genetic variance onto individual chromosomes;
  • Estimate the genetic variance associated with the X-chromosome;
  • Test the effect of dosage compensation on the X-chromosome;
  • Predict the genome-wide additive genetic effects for individual subjects and for individual SNPs;
  • Estimate the LD structure encompassing a list of target SNPs;
  • Simulate GWAS data based upon the observed genotype data.
  • Convert Illumina raw genotype data into PLINK PED format.

2011年8月19日星期五

Logistic random effects regression models: a comparison of statistical packages for binary and ordinal outcomes

different statistical packages were compared on logistic random effects regression models.

Abstract: Background: Logistic random effects models are a popular tool to analyze multilevel also called hierarchical data with a binary or ordinal outcome. Here, we aim to compare different statistical software implementations of these models.

Methods: We used individual patient data from 8509 patients in 231 centers with moderate and severe Traumatic Brain Injury (TBI) enrolled in eight Randomized Controlled Trials (RCTs) and three observational studies. We fitted logistic random effects regression models with the 5-point Glasgow Outcome Scale (GOS) as outcome, both dichotomized as well as ordinal, with center and/or trial as random effects, and as covariates age, motor score, pupil reactivity or trial. We then compared the implementations of frequentist and Bayesian methods to estimate the fixed and random effects. Frequentist approaches included R (lme4), Stata (GLLAMM), SAS (GLIMMIX and NLMIXED), MLwiN (p[R]IGLS) and MIXOR, Bayesian approaches included WinBUGS, MLwiN (MCMC), R package MCMCglmm and SAS experimental procedure MCMC. Three data sets (the full data set and two sub-datasets) were analysed using basically two logistic random effects models with either one random effect for the center or two random effects for center and trial. For the ordinal outcome in the full data set also a proportional odds model with a random center effect was fitted.

Results: The packages gave similar parameter estimates for both the fixed and random effects and for the binary (and ordinal) models for the main study and when based on a relatively large number of level-1 (patient level) data compared to the number of level-2 (hospital level) data. However, when based on relatively sparse data set, i.e. when the numbers of level-1 and level-2 data units were about the same, the frequentist and Bayesian approaches showed somewhat different results. The software implementations differ considerably in flexibility, computation time, and usability. There are also differences in the availability of additional tools for model evaluation, such as diagnostic plots. The experimental SAS (version 9.2) procedure MCMC appeared to be inefficient.

Conclusions: On relatively large data sets, the different software implementations of logistic random effects regression models produced similar results. Thus, for a large data set there seems to be no explicit preference (of course if there is no preference from a philosophical point of view) for either a frequentist or Bayesian approach (if based on vague priors). The choice for a particular implementation may largely depend on the desired flexibility, and the usability of the package. For small data sets the random effects variances are difficult to estimate. In the frequentist approaches the MLE of this variance was often estimated zero with a standard error that is either zero or could not be determined, while for Bayesian methods the estimates could depend on the chosen "noninformative" prior of the variance parameter. The starting value for the variance parameter may be also critical for the convergence of the Markov chain. 

meta-analysis of phynotypic effects - a case (with MCMCglmm used)

http://www.sciencedirect.com/science/article/pii/S0003347211001229
Dominance and plumage traits: meta-analysis and metaregression analysis 

Procedures of Meta-analysis and Metaregression

All meta-analyses were conducted in the statistical software S-Plus (TIBCO; http://www.tibco.com/) and R (version 2.11.1; R Development Core Team 2010), using LMMs to perform a random-effect meta-analysis (Nakagawa et al. 2007). In all analyses, we accounted for the hierarchical structure in the data (e.g. with multiple effect sizes from the same population) by including study population and species as nested random effects (Nakagawa et al. 2007). This statistical procedure allowed us to use multiple effect sizes from a study or population in the same analysis without violating the assumption of independence (Nakagawa & Hauber 2011). Our meta-analytical LMM was calculated as an intercept-only model with the restricted maximum likelihood (REML) method (nlme package; Pinheiro & Bates 2000). Reported P values are for the main effects (intercepts) only.
To determine whether a set of effect sizes was homogeneous, we calculated the residual heterogeneity QREML, as random-effects models were used (Nakagawa et al. 2007). When residual heterogeneity was significant, the variance among effect sizes was greater than expected from sampling error, suggesting the existence of important moderator variables. It is worth emphasizing that even though our metaregressions accounted for some moderator variables (Table 1), it is still possible that other (unaccounted) moderators could have introduced heterogeneity in the data. Furthermore, we decided not to include interaction terms among predictor variables in our metaregressions, as the models would be overparameterized (i.e. models would have too many parameters and not enough data points in each category to be robust; Ginzburg & Jensen 2004). We conducted contrast analyses (LMMs) for all metaregression models to check for the effect of different levels of an explanatory variable on the relationship between dominance and plumage. We show the results of contrast analyses only if the difference between levels of a variable (contrasts) was statistically significant (all other results are in Table A2 in Appendix 2).
We also conducted a randomization test to evaluate the importance of the variance components of our random factors (i.e. study and species). If a variance component was significantly different from zero, it indicated the existence of either study or species effects, the latter of which may, but does not necessarily, imply phylogenetic signal in the data (cf. Hadfield & Nakagawa 2010). We tested the null hypothesis that the variance component = 0, against the alternative hypothesis that the variance component > 0 (Nakagawa & Schielzeth 2010). We randomized the original effect size vector 100 000 times, each followed by fitting the meta-analytical LMM to estimate randomized random factor variance components. The P value was determined as the proportion of randomizations that yielded a variance component larger than or equal to the variance component of the original data. Furthermore, we conducted a phylogenetic meta-analysis described in Hadfield & Nakagawa (2010) to account for the lack of independence across species caused by their evolutionary relationships, using the MCMCglmm package in R (Hadfield 2010). As our conclusions did not change with the use of the phylogenetic meta-analysis, we only present results based on the general meta-analysis (the results of the phylogenetic meta-analysis can be found in Appendix 3).



2011年8月18日星期四

set operation - an R tip

mpg2 <- subset(mpg, cyl != 5 & drv %in% c("4", "f"))

be careful of "drv %in% c("4", "f")".

2011年8月16日星期二

global weather and climate data at your fingertips - RNCEP package

http://onlinelibrary.wiley.com/doi/10.1111/j.2041-210X.2011.00138.x/full#f1

RNCEP, a package of functions in the open-source R language, to access, organise and visualise freely available atmospheric data from two long-term high-quality data sets with global coverage.

These functions retrieve data, via the Internet, for either a desired spatiotemporal extent or interpolated to a point in space and time. The package also contains functions to temporally aggregate data, producing user-defined variables, and to visualise these data on a map.
By making access to atmospheric data and integration with biological data easier and more flexible, we hope to facilitate and encourage the exploration of relationships between biological systems and atmospheric conditions.

See also:
https://sites.google.com/site/michaelukemp/rncep


2011年8月15日星期一

updating vcftools

(1)
--/vcftools_0.1.5/vcftools_0.1.5$ svn checkout https://vcftools.svn.sourceforge.net/svnroot/vcftools vcftools
(2)
--/vcftools_0.1.5/vcftools_0.1.5$ svn update

(3)
--/vcftools_0.1.5/vcftools_0.1.5$ make