advantages of Mixed model/Bayesian/MCMCglmm - gathered from relevant references

Phylogenetic mixed models have mainly been applied to traits which are assumed to be normally distributed (for exceptions, see Felsenstein, 2005; Naya et al., 2006). Generalized linear mixed models extend the linear mixed model to non-Gaussian responses, although model fitting has proved more difficult because the likelihood cannot be obtained in closed form. MCMC techniques solve this
problem by breaking the high-dimensional joint distribution into a series of lower dimensional conditional distributions which are easier to sample from. By repeatedly sampling from these conditional distributions it is possible to very accurately approximate the complete joint distribution, and thereby extract things of interest (often marginal distributions).

Hadfield, J. D., & Nakagawa, S. (2010). General quantitative genetic methods for comparative biology: Phylogenies, taxonomies and multi-trait models for continuous and categorical characters. Journal of Evolutionary Biology, 23(3), 494-508.

The major reason for the popularity of mixed modelling is probably its ability to account for statistical non-independence of data by having random effects as well as fixed effects (the name ‘mixed-effects’ originated from combining these two types of effects) (McCulloch and Searle, 2002). It is difficult to think of an example of any dataset where the data points would be truly independent from one another.


An important advantage of the mixed model approach relates to the statistical inferences drawn from non-normal data. To date, in neurosciences, non-parametric (NP) tests such as Mann-Whitney and Kruskal–Wallis tests have been often used to deal with small samples sizes, where normality cannot be tested, or with truly non-normally distributed data (Janusonis, 2009)

To test for a genetic change in the population, we used Bayesian methods. Use of Bayesian methods allows the time trends to be estimated directly instead of requiring statistics to be based on the best linear unbiased predictions (BLUPs) from a linear model, which incurs problems in terms of error propagation (Ovaskainen et al., 2008; Hadfield, 2010).

J . EVOL. BI O L . 23 ( 2 0 1 0 ) 935–944

Our dataset allowed us to estimate the variance in offspring sex ratio, both at ca 6 days post-hatching and at independence from parental care, and to test how much of the observed variation in sex ratio was accounted for by additive genetic variance as opposed to environmental/non-additive genetic and sampling variance. 

Variance components were estimated by fitting a generalized animal model. An animal model is a specific type of mixed model that explicitly takes into account the resemblance among all relatives. It models an individual's phenotype as a function of a number of fixed and random effects, including a random additive genetic ‘animal’ effect. The variance–covariance structure of the latter is proportional to the pairwise coefficients of relatedness among all individuals in the pedigree. Thereby an animal model allows us to fully exploit all pedigree data, and to simultaneously account for a number of potentially confounding environmental effects [3234]. Fitting an animal model, or any mixed model for that matter, with non-Gaussian traits using (restricted) maximum-likelihood techniques is challenging. Hence, we used Bayesian Markov chain Monte Carlo (MCMC) techniques implemented in the R package MCMCglmm [30,35]

Disentangling the effect of genes, the environment and chance on sex ratio variation in a wild bird population



R - loop over files, and assign appropriate names to objects created in the loop

An good example on: 

(1) loop over files, 
(2) assign appropriate names to objects created in the loop


which(), any(), apply() - three R functions

The following codes just show you the R functions: which(), apply(), and any().




Nped <- BTped[which(apply(BTped, 1, function(x) { any(x == "R187920" | x == "R187921")  })), ]


get genetic location for gene names - using biomaRt library

From the biomaRt vignette:


ensembl=useMart("ensembl", dataset="hsapiens_gene_ensembl")

            filters = c("chromosome_name", "start","end"), 
            values = list(16, 1100000, 1250000), 
            mart = ensembl)

R code - manipulating biplot


## Indicators of Management type
inds<- with(dune.env, indval(dune, Management))

## ordinate
dune.pca<- rda(dune, scale = TRUE)

## exract scores you want to plot presume species and sites
dune.site<- scores(dune.pca, display = "sites", scaling = 3)
dune.spp<- scores(dune.pca, display = "species",
                     scaling = 3)[inds$pval<= 0.05, ]

## plot
plot(dune.pca, display = c("sites","species"), type = "n",
      scaling = 3)
arrows(0, 0, dune.spp[,1], dune.spp[,2], col = "red", length = 0.05)
lab.xloc<- dune.spp[,1] * 1.2
lab.yloc<- dune.spp[,2] * 1.2
text(lab.xloc, lab.yloc, rownames(dune.spp), col = "red", cex = 0.8)

Converting R contingency tables to data frames


A nice guide.

lyx - GUI for Tex/LaTex text editing

If you need math/math relevant stuffs/writing doc/thesis/books in your life. Lyx may be interesting for you. Under Lyx, you do not need be a Tex programmer, but you can benefit from Tex.

Give a try, now? It works on whatever Win, Mac, Linux.



Form row and column sums and means for numeric arrays

Form row and column sums and means for numeric arrays.


colSums (x, na.rm = FALSE, dims = 1)
rowSums (x, na.rm = FALSE, dims = 1)
colMeans(x, na.rm = FALSE, dims = 1)
rowMeans(x, na.rm = FALSE, dims = 1)

model.matrix - design matrix for statistical model

model.matrix(object, data = environment(object),
             contrasts.arg = NULL, xlev = NULL, ...)


priors for a multi-response model in MCMCglmm


modeling nested predictor in MCMCglmm - a nice explanation

If every nest has a unique identifier then MCMCglmm should give the  
same answer (up to Monte Carlo error) for random ~ pairID + nest and  
random ~ pairID + pairID:nest,

zero inflation model for MCMCglmm - here a nice reference


Generate factors by specifying the pattern of their levels

Here, I would like to record the R function for generating factors by specifying the pattern of their levels: 
gl(n, k, length = n*k, labels = 1:n, ordered = FALSE)
# Examples
## First control, then treatment:
gl(2, 8, labels = c("Control", "Treat"))
## 20 alternating 1s and 2s
gl(2, 1, 20)
## alternating pairs of 1s and 2s
gl(2, 2, 20)


Can contemporary and historic processes be separated in the analyses of DNA sequence data? - a discussion



Field guide to next-generation DNA sequencers


The diversity of available 2nd and 3rd generation DNA sequencing platforms is increasing rapidly. Costs for these systems range from <$100 000 to more than $1 000 000, with instrument run times ranging from minutes to weeks. Extensive trade-offs exist among these platforms. I summarize the major characteristics of each commercially available platform to enable direct comparisons. In terms of cost per megabase (Mb) of sequence, the Illumina and SOLiD platforms are clearly superior (≤$0.10/Mb vs. >$10/Mb for 454 and some Ion Torrent chips). In terms of cost per nonmultiplexed sample and instrument run time, the Pacific Biosciences and Ion Torrent platforms excel, with the 454 GS Junior and Illumina MiSeq also notable in this regard. All platforms allow multiplexing of samples, but details of library preparation, experimental design and data analysis can constrain the options. The wide range of characteristics among available platforms provides opportunities both to conduct groundbreaking studies and to waste money on scales that were previously infeasible. Thus, careful thought about the desired characteristics of these systems is warranted before purchasing or using any of them. Updated information from this guide will be maintained at: http://dna.uga.edu/ and http://tomato.biol.trinity.edu/blog/.

Analysing recombination in nucleotide sequences


Throughout the living world, genetic recombination and nucleotide substitution are the primary processes that create the genetic variation upon which natural selection acts. Just as analyses of substitution patterns can reveal a great deal about evolution, so too can analyses of recombination. Evidence of genetic recombination within the genomes of apparently asexual species can equate with evidence of cryptic sexuality. In sexually reproducing species, nonrandom patterns of sequence exchange can provide direct evidence of population subdivisions that prevent certain individuals from mating. Although an interesting topic in its own right, an important reason for analysing recombination is to account for its potentially disruptive influences on various phylogenetic-based molecular evolution analyses. Specifically, the evolutionary histories of recombinant sequences cannot be accurately described by standard bifurcating phylogenetic trees. Taking recombination into account can therefore be pivotal to the success of selection, molecular clock and various other analyses that require adequate modelling of shared ancestry and draw increased power from accurately inferred phylogenetic trees. Here, we review various computational approaches to studying recombination and provide guidelines both on how to gain insights into this important evolutionary process and on how it can be properly accounted for during molecular evolution studies.


really nice plots with the right color combinations

the plots from http://en.wikipedia.org/wiki/Standard_deviation


Data science toolkit - address, coordinates, text, file, IP address, people names

Welcome to the Data Science Toolkit

Steal this server!

Grab this entire site as a free, self-contained, ready-to-run VM

Independence - Never worry about the provider going offline, or charging once you're hooked.

Security - Run on your intranet, so customer information stays within the firewall.

Scalability - No API limits. Run a cluster of as many instances as you need.

Street Address to Coordinates

API: /street2coordinates
Street Address to Location calculates the latitude/longitude coordinates for a postal address.
Currently restricted to the US and UK.
Try it for yourself. Copy and paste some addresses into the box below to see what locations it finds.

File to Text

API: /file2text
Converts PDFs, Word Documents, Excel Spreadsheets to text.
Recovers text from JPEG, PNG or TIFF images of scanned documents.
Try it for yourself. Upload a file to see what text it finds.

Coordinates to Political Areas

API: /coordinates2politics
Returns the country, region, state, county, constituencies and neighborhood a point is inside.
Try it for yourself. Copy and paste some coordinates into the box below to see what it finds.


API: /v1/document
Geodict pulls country, city and region names from unstructured English text, and returns their coordinates.
It emulates the interface to Yahoo's Placemaker , so switching should just mean changing 'http://wherein.yahooapis.com/' to 'http://www.datasciencetoolkit.org/' in your current code.
Try it for yourself. Copy and paste some text into the box below to see what locations it finds.
Extract Locations

IP Address to Coordinates

API: /ip2coordinates
IP Address to Location calculates country, state, city and latitude/longitude coordinates for IP addresses.
Try it for yourself. Copy and paste some IP addresses into the box below to see what locations it finds.

Text to Sentences

API: /text2sentences
Removes the parts of the text that seem to be boilerplate, leaving the real sentences.
Try it for yourself. Copy and paste a large chunk of text into the box below to see what sentences it identifies.
Get Sentences

HTML to Text

API: /html2text
Returns the full text that would actually be displayed in the browser when an HTML document was rendered.
Try it for yourself. Copy and paste your HTML into the box below to convert it into plain text.
Get Text

HTML to Story

API: /html2story
Takes an HTML document representing a news article or similar page, and extracts just the story text.
Try it for yourself. Copy and paste your HTML into the box below to grab the story text.
Extract Story

Text to People

API: /text2people
Spots text fragments that look like people's names or titles, and guesses their gender where possible.
Try it for yourself. Copy and paste your text into the box below to extract people's names.
Find Names

Text to Times

API: /text2times
Spots text fragments that look like times or dates, and converts them into a standard form.
Try it for yourself. Copy and paste your text into the box below to extract times and dates.
Find Times


Evolutionary principles and their practical application



Evolutionary principles are now routinely incorporated into medicine and agriculture. Examples include the design of treatments that slow the evolution of resistance by weeds, pests, and pathogens, and the design of breeding programs that maximize crop yield or quality. Evolutionary principles are also increasingly incorporated into conservation biology, natural resource management, and environmental science. Examples include the protection of small and isolated populations from inbreeding depression, the identification of key traits involved in adaptation to climate change, the design of harvesting regimes that minimize unwanted life-history evolution, and the setting of conservation priorities based on populations, species, or communities that harbor the greatest evolutionary diversity and potential. The adoption of evolutionary principles has proceeded somewhat independently in these different fields, even though the underlying fundamental concepts are the same. We explore these fundamental concepts under four main themes: variation, selection, connectivity, and eco-evolutionary dynamics. Within each theme, we present several key evolutionary principles and illustrate their use in addressing applied problems. We hope that the resulting primer of evolutionary concepts and their practical utility helps to advance a unified multidisciplinary field of applied evolutionary biology.

Package bayesGen - Association analysis of genomic data using a Bayesian shared component model

We propose a shared-component model to tease out the genotype information that is common to cases and controls from the one that is specific to cases. This allows to detect the SNPs (bayesSNPassoc function) or CNVs (bayesCNVassoc function) that show the strongest association with the disease. The model can nevertheless be applied to more than one disease.


R based Computational tools for Bayesian analysis

The increasing number of R-oriented Bayesian computational tools such as MCMCpack, MCMCglmm, DPpackage, R-INLA, spBayes, have made BUGS less and less crucial for day to day Bayesian computation. Honestly, I cannot figure out a single analysis that BUGS can do but at least one of the above mentioned packages cannot.


From a quick look at the web site, R-INLA seems to be able to carry out bayesian analysis without the actual MCMC sampling. This makes it possible to quickly try various different model specifications without hours of waiting time. This package looks very interesting and definitely deserves more attention.


ggplot competition


this is the grand prize for 2010. David Kahle of Rice University, who integrated public crime statistics and Google Maps to create this "violent crime weather map" for the city of Houston, Texas.

Statistical models cheat sheet


statistical analysis decision tree - from google search


Q & A of statistics and R

You could get your statistical and R questions answered by them. Or, communicate/help others by them.

1. R mailing lists

2.stackoverflow - r


4. stackexchange

IPSUR: Introduction to Probability and Statistics Using R

This is a free book, with LaTex, LyX codes for generating the whole book.


hzAnalyzer: detection, quantification, and visualization of contiguous homozygosity in high-density genotyping datasets


基于 R 和VCF format

error bar and barchart with multiple overlaying errorbars

(1) #############################
  1. library(Hmisc)  
  3. set.seed(1)  
  4. x <- 1:10  
  5. y <- x + rnorm(10)  
  6. delta <- runif(10)  
  7. errbar( as.character(x), y, y + delta, y - delta )  


(2) #############################


tips of Math-Stat-Comp from twitter


CompSciFact icon @CompSciFact Computer science
AlgebraFact icon @AlgebraFact Algebra and number theory
RLangTip icon @RLangTip R language and related topics
ProbFact icon @ProbFact Probability
SciPyTip icon @SciPyTip SciPy (Scientific Python)
SansMouse icon @SansMouse Windows keyboard shortcuts
RegexTip icon @RegexTip Regular expression tips
TeXtip icon @TeXtip TeX and LaTeX tips
StatFact icon @StatFact Statistics
TopologyFact icon @TopologyFact Topology and geometry
AnalysisFact icon @AnalysisFact Real and complex analysis

dotchart with confidence intervals


trim VCF file by the times of currence of a SNP

cut -d '   ' -f 1-5 *.vcf |\ #chrom,pos,id,ref,alt
egrep -v "^#" |\ #remove the header
sort |\
uniq -c |\ # -c is for 'count'
egrep "^      5 " |\ #keep the mutations present 5 times
cut -c 9- > snp.txt #remove the count
you can the use this snp.txt to filter out your VCF with
grep -f snp.txt -v sample1.vcf > sample1.filtered.vcf
(it could be slow for a large number of snps) or by using unix 'join -v' (faster , but you'll need to create a extra column in your VCFs to create a uniq key(chrom/position/ref/alt)

You can use vcftools to achieve this:
  1. Start by finding those positions that occur in all the files (replace '3' by the actual number of files)
    vcf-isec -o -n =3 A.vcf.gz B.vcf.gz C.vcf.gz (...) | bgzip -c > intersect.vcf.gz
  2. Exclude from A positions which appear in the intersection
    vcf-isec -c A.vcf.gz intersect.vcf.gz > newA.vcf

convert VCF to BED format


sed -e 's/chr//' file.vcf | awk '{OFS="\t"; if (!/^#/){print $1,$2-1,$2,$4"/"$5,"+"}}'

mutually virtulizating Windows, Linux, Mac OS


创建windows,linux,mac 的虚拟机。


Workshops from UC riverside


SOAPdenovo assembly quality assessment


Is there any tool for drawing a gene/protein diagram


Resources for learning how to deal with next generation sequence



Quantitative genetic parameters for wild stream-living brown trout - using MCMCglmm

Quantitative genetic parameters for wild stream-living brown trout: heritability and parental effects



Adaptability depends on the presence of additive genetic variance for important traits. Yet few estimates of additive genetic variance and heritability are available for wild populations, particularly so for fishes. Here, we estimate heritability of length-at-age for wild-living brown trout (Salmo trutta), based on long-term mark-recapture data and pedigree reconstruction based on large-scale genotyping at 15 microsatellite loci. We also tested for the presence of maternal and paternal effects using a Bayesian version of the Animal model. Heritability varied between 0.16 and 0.31, with reasonable narrow confidence bands, and the total phenotypic variance increased with age. When introducing dam as an additional random effect (accounting for c. 7% of total phenotypic variance), the level of additive genetic variance and heritability decreased (0.12–0.21). Parental size (both for sires and for dams) positively influenced length-at-age for juvenile trout – either through direct parental effects or through genotype-environment correlations. Length-at-age is a complex trait reflecting the effects of a number of physiological, behavioural and ecological processes. Our data show that fitness-related traits such as length-at-age can retain high levels of additive genetic variance even when total phenotypic variance is high.

Bayesian Inference of Genetic Parameters Based on Conditional Decompositions of Multivariate Normal Distributions


It is widely recognized that the mixed linear model is an important tool for parameter estimation in the analysis of complex pedigrees, which includes both pedigree and genomic information, and where mutually dependent genetic factors are often assumed to follow multivariate normal distributions of high dimension. We have developed a Bayesian statistical method based on the decomposition of the multivariate normal prior distribution into products of conditional univariate distributions. This procedure permits computationally demanding genetic evaluations of complex pedigrees, within the user-friendly computer package WinBUGS. To demonstrate and evaluate the flexibility of the method, we analyzed two example pedigrees: a large noninbred pedigree of Scots pine (Pinus sylvestris L.) that includes additive and dominance polygenic relationships and a simulated pedigree where genomic relationships have been calculated on the basis of a dense marker map. The analysis showed that our method was fast and provided accurate estimates and that it should therefore be a helpful tool for estimating genetic parameters of complex pedigrees quickly and reliably. 

Codes for Multivariate Qst Fst analysis


summary statistics in R


Linkage Mapping and Comparative Genomics Using Next-Generation RAD Sequencing of a Non-Model Organism


Population Genomics of Parallel Adaptation in Threespine Stickleback using Sequenced RAD Tags


Metabolic network drawing



Sequencing Technique Aids in Local De Novo Assembly; Yields Haplotype Information


vocabulary for R functions



bioinformatic blogs



accessing MySQL from R


Multiple Y-axis in an R plot



Best file system to share between Linux, Win & MacOS?

FAT32, should be the choice now. FAT32 (or even FAT16) is really fine unless you need single files >4GB or you need POSIX types of features.

NTFS has somewhat limited read/write support on OS X (using Fuse) and also on Linux, I think.

EXT2 has drivers for both Windows and OS X, though, so if you do need large files or modern features, ext2fs is doable, although you will need to install drivers on every non-Linux box you use with the card.

OSX can read/write HFS, HFS+, FAT16, FAT32, and read NTFS.

如果你想创建一个Linux, Windows, Mac OS 都能识别/读写的硬盘分区。现在看来,FAT32 是个不二选择。但这个分区类型不支持单个文件大于 4Gb 的情况。不过,其它分区类型都不可行,据说有些新的分区格式似乎正在开发。



     如果您的 U 盘、移动硬盘既要用于 PC 又要用于苹果电脑,Mac OS X 系统的 HFS+ 和 Windows 的 NTFS 格式显然都不行……HFS+ 在 Windows 下不识别,NTFS 格式的 U 盘、移动硬盘插在苹果电脑上只能读不能写。格式化成 FAT32 显然可以支持两个系统,但单一文件大于 4GB 就歇菜。
    但苹果电脑 Mac OS X 10.6.5 系统做出了一项意义重大的升级:支持 exFAT 磁盘格式,格式化成 exFAT 格式的 U 盘、移动硬盘在 Windows PC 上和苹果电脑 Mac OS X 系统下均能读写,而且支持超过单个体积超过 4 GB 的大文件。

创建exFAT格式分区的方法,(1) 点击苹果电脑屏幕右上角的放大镜按钮,Sportlight 搜索“磁盘工具”,在磁盘工具侧边栏选择要格式化的 U 盘或移动硬盘分区,右侧选择“抹掉”标签,在格式下拉菜单里选择 ExFAT 即可。

(2) Windows 7 和 Vista 默认就支持 exFAT 格式,如果是 XP 系统,要下载 KB955704 补丁:http://www.microsoft.com/downloads/details.aspx?displaylang=zh-cn&FamilyID=1cbe3906-ddd1-4ca2-b727-c2dff5e30f61



unix - alias 命令快捷

给 unix 命令创建快捷,可以省去很多可能很繁琐的操作。比如,进入你的工作目录,每天都要做这样的操作很多次。如果你的目录很深,那么肯定费力。你可以通过创建快捷简化这个工作。

alias maowork="cd /mao/work/folder/"

This means you just type maowork into the terminal, you will be directed to the specific folder you pointed out in the former line.

aliases are not saved permanently if you do not explicitly tell the system to do so. You need to put the alias command line (the former command line) into the ".bash_profile" in your home directory, and source that file.

Automatic SSH/SCP Login without Password

It should be useful. I have not checked it.

unix - processes 进程

Unix run processes in parallel is called multitasking.

# (1) check the running processes

# give you more
ps -T

# (2) understanding the ps output
PID - process ID
TTY - terminal name
STAT - (S) sleep
            (O) running
            (R) wait to run on the processor
            (I) just be started
            (T) paused
            (Z) that is ending
TIME - amount of computing time the process has used
CMD - command used by the process

# (3) killing process
you need to know the process ID of which you would like to kill

kill PID

unix - wildcard 通配符

wildcard 很有价值,它通常用在查询上。下面是几个最为常用的wildcard (通配符):

        与任何存在和不存在的字符或字符串相配,protein*, protein, proteins, protein-sequence
        与任何单一一个字符相配,seq?.txt, seq1.txt
[  ]      与包含在方括号中任意字符或字符界限相配,[A-Z]*,  anything with uppercase head
[!   ]    与上一条相反,[!a-zA-Z], anything but letters

unix one liner - delete a list of files

You may need to delete some files with a general pattern in their names. Here, the unix command:

find -name deletion_pattern.txt | xargs rm

you could substitute "deletion_pattern.txt" with a general pattern of your files needed to be delete.


PLINK/Seq - analysis of genomic variant data

A library for analysis of genomic variant data: