How to Add an Extra Vertical Axis to R Plots


n <- 100="100" br="br">a1 <- br="br" n="n" rnorm="rnorm">a2 <- br="br" n="n" rnorm="rnorm">
x <- br="br" null="null">y <- code="code" null="null">
x[1] <- a1="a1" br="br">y[1] <- a2="a2" code="code">
for (t in 2:n) {
x[t] <- -.2="-.2" .5=".5" a1="a1" br="br" t-1="t-1" t="t" x="x" y="y">y[t] <- a2="a2" br="br" t-1="t-1" t="t" y="y">}
par(mar = c(5,5,2,5))
plot(ts(x), ylab = \"x\")
par(new = T)
plot(ts(y), col = \"red\", axes = F, xlab = NA, ylab = NA)
axis(side = 4)
mtext(side = 4, line = 3, \"y\")


update Zotero in ubuntu

This led to this forum post - http://forums.zotero.org/discussion/21759/update-failure-on-ubuntu-1110/ - and the idea that the problem was due to not having write access caused the problem. However, while granting write permissions to /opt and /opt/zotero was enough to download the update, it was not enough to install the update, and I had the error "The update could not be installed. Please make sure there are no other copies of Zotero running on your computer, and then restart Zotero to try again." My install is in /opt, simply using

# chmod -R o+w /opt/zotero /opt

finally solved the problem. After installing, I undid the procedure to return to the previous state with

# chmod -R o-w /opt/zotero /opt

vcflib - a set of tools for working with VCF



A model-based approach for analysis of spatial structure in genetic data

1. http://www.nature.com/ng/journal/v44/n6/abs/ng.2285.html

Characterizing genetic diversity within and between populations has broad applications in studies of human disease and evolution. We propose a new approach, spatial ancestry analysis, for the modeling of genotypes in two- or three-dimensional space. In spatial ancestry analysis (SPA), we explicitly model the spatial distribution of each SNP by assigning an allele frequency as a continuous function in geographic space. We show that the explicit modeling of the allele frequency allows individuals to be localized on the map on the basis of their genetic information alone. We apply our SPA method to a European and a worldwide population genetic variation data set and identify SNPs showing large gradients in allele frequency, and we suggest these as candidate regions under selection. These regions include SNPs in the well-characterized LCT region, as well as at loci including FOXP2OCA2 and LRP1B.


Inferring the Joint Demographic History of Multiple Populations from Multidimensional SNP Frequency Data

Inferring the Joint Demographic History of Multiple Populations from Multidimensional SNP Frequency Data

Demographic models built from genetic data play important roles in illuminating prehistorical events and serving as null models in genome scans for selection. We introduce an inference method based on the joint frequency spectrum of genetic variants within and between populations. For candidate models we numerically compute the expected spectrum using a diffusion approximation to the one-locus, two-allele Wright-Fisher process, involving up to three simultaneous populations. Our approach is a composite likelihood scheme, since linkage between neutral loci alters the variance but not the expectation of the frequency spectrum. We thus use bootstraps incorporating linkage to estimate uncertainties for parameters and significance values for hypothesis tests. Our method can also incorporate selection on single sites, predicting the joint distribution of selected alleles among populations experiencing a bevy of evolutionary forces, including expansions, contractions, migrations, and admixture. We model human expansion out of Africa and the settlement of the New World, using 5 Mb of noncoding DNA resequenced in 68 individuals from 4 populations (YRI, CHB, CEU, and MXL) by the Environmental Genome Project. We infer divergence between West African and Eurasian populations 140 thousand years ago (95% confidence interval: 40–270 kya). This is earlier than other genetic studies, in part because we incorporate migration. We estimate the European (CEU) and East Asian (CHB) divergence time to be 23 kya (95% c.i.: 17–43 kya), long after archeological evidence places modern humans in Europe. Finally, we estimate divergence between East Asians (CHB) and Mexican-Americans (MXL) of 22 kya (95% c.i.: 16.3–26.9 kya), and our analysis yields no evidence for subsequent migration. Furthermore, combining our demographic model with a previously estimated distribution of selective effects among newly arising amino acid mutations accurately predicts the frequency spectrum of nonsynonymous variants across three continental populations (YRI, CHB, CEU).


Interplay of recombination and selection in the genomes of Chlamydia trachomatis

Interplay of recombination and selection in the genomes of Chlamydia trachomatis


Chlamydia trachomatis is an obligate intracellular bacterial parasite, which causes several severe and debilitating diseases in humans. This study uses comparative genomic analyses of 12 complete published C. trachomatis genomes to assess the contribution of recombination and selection in this pathogen and to understand the major evolutionary forces acting on the genome of this bacterium.


The conserved core genes of C. trachomatis are a large proportion of the pan-genome: we identified 836 core genes in C. trachomatis out of a range of 874-927 total genes in each genome. The ratio of recombination events compared to mutation (ρ/θ) was 0.07 based on ancestral reconstructions using the ClonalFrame tool, but recombination had a significant effect on genetic diversification (r/m = 0.71). The distance-dependent decay of linkage disequilibrium also indicated that C. trachomatis populations behaved intermediately between sexual and clonal extremes. Fifty-five genes were identified as having a history of recombination and 92 were under positive selection based on statistical tests. Twenty-three genes showed evidence of being under both positive selection and recombination, which included genes with a known role in virulence and pathogencity (e.g., ompA, pmps, tarp). Analysis of inter-clade recombination flux indicated non-uniform currents of recombination between clades, which suggests the possibility of spatial population structure in C. trachomatis infections.


C. trachomatis is the archetype of a bacterial species where recombination is relatively frequent yet gene gains by horizontal gene transfer (HGT) and losses (by deletion) are rare. Gene conversion occurs at sites across the whole C. trachomatis genome but may be more often fixed in genes that are under diversifying selection. Furthermore, genome sequencing will reveal patterns of serotype specific gene exchange and selection that will generate important research questions for understanding C. trachomatis pathogenesis.

Evolutionary and population genomics of the cavity causing bacteria Streptococcus mutans

Evolutionary and population genomics of the cavity causing bacteria Streptococcus mutans

Streptococcus mutans is widely recognized as one of the key etiological agents of human dental caries. Despite its role in this important disease our present knowledge of gene content variability across the species and its relationship to adaptation is minimal. Estimates of its demographic history are not available. In this study, we generated genome sequences of 57 Smutans isolates, as well as representative strains of the most closely related species to Smutans (Streptococcus rattiStreptococcus macaccaeStreptococcus criceti), in order to identify the overall structure and potential adaptive features of the dispensable and core components of the genome. We also performed population genetic analyses on the core genome of the species aimed at understanding the demographic history, and impact of selection shaping its genetic variation. The maximum gene content divergence among strains was approximately 23%, with the majority of strains diverging by 5% to 15%. The core genome consisted of 1490 genes and the pan-genome approximately 3296. Maximum likelihood analysis of the synonymous site frequency spectrum (SFS) suggested that theSmutans population started expanding exponentially around 10,000 years ago (95% CI: 3,268 — 14,344 ya), approximately coincidental with the onset of human agriculture. Analysis of the replacement SFS indicated that a majority of these substitutions are under strong negative selection, and the remainder evolved neutrally. A set of 14 genes was identified as being under positive selection, most of which were involved in either sugar metabolism or acid tolerance. Analysis of the core genome suggested that among 73 genes present in all isolates of Smutans but absent in other species of the mutans taxonomic group, the majority can be associated with metabolic processes that could have contributed to the successful adaptation of Smutans to its new niche, the human mouth, and with the dietary changes that accompanied the origin of agriculture.


Inferring the History of Population Size Change from Genome-Wide SNP Data

Dense, genome-wide single-nucleotide polymorphism (SNP) data can be used to reconstruct the demographic history of human populations. However, demographic inferences from such data are complicated by recombination and ascertainment bias. We introduce two new statistics, allele frequency-identity by descent (AF-IBD) and allele frequency-identity by state (AF-IBS), that make use of linkage disequilibrium information and show defined relationships to the time of coalescence. These statistics, when conditioned on the derived allele frequency, are able to infer complex population size changes. Moreover, the AF-IBS statistic, which is based on genome-wide SNP data, is robust to varying ascertainment conditions. We constructed an efficient approximate Bayesian computation (ABC) pipeline based on AF-IBD and AF-IBS that can accurately estimate demographic parameters, even for fairly complex models. Finally, we applied this ABC approach to genome-wide SNP data and inferred the demographic histories of two human populations, Yoruba and French. Our results suggest a rather stable ancestral population size with a mild recent expansion for Yoruba, whereas the French seemingly experienced a long-lasting severe bottleneck followed by a drastic population growth. This approach should prove useful for new insights into populations, especially those with complex demographic histories.



population genomic studies of bacteria - two cases


After the bottleneck: Genome-wide diversification of the Mycobacterium tuberculosis complex by mutation, recombination, and natural selection


Population Genomics of Chlamydia trachomatis: Insights on Drift, Selection, Recombination, and Population Structure

detect recombination from multiple sequence alignment

1. http://www.genetics.org/content/175/3/1251.full

Inference of Bacterial Microevolution Using Multilocus Sequence Data

We describe a model-based method for using multilocus sequence data to infer the clonal relationships of bacteria and the chromosomal position of homologous recombination events that disrupt a clonal pattern of inheritance. The key assumption of our model is that recombination events introduce a constant rate of substitutions to a contiguous region of sequence. The method is applicable both to multilocus sequence typing (MLST) data from a few loci and to alignments of multiple bacterial genomes. It can be used to decide whether a subset of isolates share common ancestry, to estimate the age of the common ancestor, and hence to address a variety of epidemiological and ecological questions that hinge on the pattern of bacterial spread. It should also be useful in associating particular genetic events with the changes in phenotype that they cause. We show that the model outperforms existing methods of subdividing recombinogenic bacteria using MLST data and provide examples from Salmonella and Bacillus. The software used in this article, ClonalFrame, is available from http://bacteria.stats.ox.ac.uk/.

2. http://www.genetics.org/content/186/4/1435.full

Inference of Homologous Recombination in Bacteria Using Whole-Genome Sequences

Bacteria and archaea reproduce clonally, but sporadically import DNA into their chromosomes from other organisms. In many of these events, the imported DNA replaces an homologous segment in the recipient genome. Here we present a new method to reconstruct the history of recombination events that affected a given sample of bacterial genomes. We introduce a mathematical model that represents both the donor and the recipient of each DNA import as an ancestor of the genomes in the sample. The model represents a simplification of the previously described coalescent with gene conversion. We implement a Monte Carlo Markov chain algorithm to perform inference under this model from sequence data alignments and show that inference is feasible for whole-genome alignments through parallelization. Using simulated data, we demonstrate accurate and reliable identification of individual recombination events and global recombination rate parameters. We applied our approach to an alignment of 13 whole genomes from the Bacillus cereus group. We find, as expected from laboratory experiments, that the recombination rate is higher between closely related organisms and also that the genome contains several broad regions of elevated levels of recombination. Application of the method to the genomic data sets that are becoming available should reveal the evolutionary history and private lives of populations of bacteria and archaea. The methods described in this article have been implemented in a computer software package, ClonalOrigin, which is freely available fromhttp://code.google.com/p/clonalorigin/.


Extract private SNPs from multi-sample VCF file


I would use awk to find one sample having a genotype!="0/0"
 gunzip -c my.vcf.gz |
awk -F ' ' '/^#/{print; next;} {n=0; for(i=10;i<=NF;++i) { if($i!="." &&  index($i,"0/0")==0) n++;} if(n==1) print; } '
change the test according to your needs.

The Best Scientific Figures of 2012


Humanity's Recent Evolution

It's easy to think that modern life has slowed human evolution, but the opposite is true. Most of humanity's genetic variation has accumulated in the last few thousand years during a period of extraordinary population growth. As a species, we're more evolvable than ever.
In the figure above, taken from an in-depth genetic analysis of 6,515 people, the amount of genetic variation collectively present at each location in the human genome is tabulated from before (left) and after (right) the population boom.
Citation: “Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants.” By Wenqing Fu, Timothy D. O’Connor, Goo Jun, Hyun Min Kang, Goncalo Abecasis, Suzanne M. Leal, Stacey Gabriel, David Altshuler, Jay Shendure, Deborah A. Nickerson, Michael J. Bamshad, NHLBI Exome Sequencing Project & Joshua M. Akey. Vol. 491, No. 7426, 29 November 2012


NetView - Detect Fine-Scale Population Structures from Genome-Wide Patterns of Variation



readscan - separate host and pathogen read sequences

Readscan is a program to identify pathogen/contaminant sequences in whole genome shotgun sequencing datasets.

scp multiple files - a bash script

filelist=$(ssh 'echo *' scp_user@scp_host:remotedir/) 
for File in "$filelist" 
scp scp_user@scp_host:remotedir/"$FILE" /home/ugen/ 

TBA - work with multiple alignment MAF format

The resulting TBA multiple alignment is now ready for analysis. Several tools (in addition
to the above mentioned maf project) are available to aid you in this process and are
described in the README. Several are worth noting here:

maf order orders the species in each alignment block to some speci ed order. It will also
remove unwanted species from the multiple alignment (rather than rerunning tba
with a subset of species).
mafFind extracts a sub-region of the multiple alignment.
maf2fasta takes a projected MAF le and re-formats it to either MultiPipMaker format or
multi-FASTA format, representing the entire reference species' sequence, but only
5the alignable portions of all other species' sequences. This program is particularly
useful since the MAF speci cation is relatively new
(see http://genome.ucsc.edu/goldenPath/help/maf.html)
and is not yet as universal as the FASTA format.




consist of 由…构成
consist in 在于;存在于
consist with 符合;与…一致

The data consists in 370 founder individuals from the HapMap Phase III datasets, of the CEU, TSI, CHB and JPT populations. 


for two reasons - 两方面的原因

Proper recombination between homologs is critical for two reasons: first, the physical link between homologs helps establish their alignment on the meiotic spindle and correct segregation at the first meiotic division; and second, the exchange of DNA provides a nearly limitless source of genetic diversity


viralzone - get knowledge of virus


extract subset of a fasta file

#fasta file: pa101.fasta

#script: sequence_extractor.sh

# The 1 based sequence extractor - sequence_extractor.sh
# No guarantees offered.

# usage: 
# 1) download the script or copy the contents
# of the script and save it as sequence_extractor.sh
# 2) make it executable: chmod 755 sequence_extractor.sh
# reads from standard input or command line 
# 3) run the script: ./sequence_extractor.sh ps101.fasta 4 6

# create a backup copy of the input fasta file
# and delete the header 
sed -i.tmp -e '1d' $1 || exit $?

# merge the lines 
temp_var1=`awk '{printf $0;}' $1` || exit $?

# select the region
temp_var2=$(((($3-1)-($2-1))+1)) || exit $?

# display the extracted sequence
echo ${temp_var1:$(($2-1)):$temp_var2} && mv $1.tmp $1 || exit $?


the climatic record from Greenland - last glacial maximum


If we accept an earlier colonization into the Americas, the story is not so neat, because there were substantial time periods between 60 and 30 thousand when colonization was possible environmentally—though these were relatively brief. But, if we look at the climatic record from Greenland rather than Antarctica (Fig. 6b), which should be more appropriate for northern latitudes, it would seem that between 55 and 25 thousand years ago, the warmer episodes were short lived extremes in a very rapidly fluctuating climate (Bender et al., 1994)—and perhaps it was the unpredictability of climate that made it difficult to work out how to adapt to the north. By the time the first Americans crossed Beringia, they seem to have learned to deal with such unpredictability because they survived the Younger Dryas fluctuations (Haynes, 2008).

last glacial maximum

Unix and Perl Primer for Biologists - v3.1.1

Unix and Perl Primer for Biologists  

last updated: October 2012
If you download the entire course and uncompress the resulting zip file, then this should create a directory called 'Unix_and_Perl_course'. Inside this directory will be a 'Documentation' folder which has all three versions of the documentation (text, HTML, and PDF). The documentation is mostly aimed to be read from start to finish, though if you are comfortable with Unix you can jump to the sections on Perl.

IDEA - calculate dN dS ratio for multiple sequence, in paralle


The availability of complete genomic sequences for hundreds of organisms promises to make obtaining genome-wide estimates of substitution rates, selective constraints and other molecular evolution variables of interest an increasingly important approach to addressing broad evolutionary questions. Two of the programs most widely used for this purpose are codeml and baseml, parts of the PAML (Phylogenetic Analysis by Maximum Likelihood) suite. A significant drawback of these programs is their lack of a graphical user interface, which can limit their user base and considerably reduce their efficiency.


We have developed IDEA (Interactive Display for Evolutionary Analyses), an intuitive graphical input and output interface which interacts with PHYLIP for phylogeny reconstruction and with codeml and baseml for molecular evolution analyses. IDEA's graphical input and visualization interfaces eliminate the need to edit and parse text input and output files, reducing the likelihood of errors and improving processing time. Further, its interactive output display gives the user immediate access to results. Finally, IDEA can process data in parallel on a local machine or computing grid, allowing genome-wide analyses to be completed quickly.


IDEA provides a graphical user interface that allows the user to follow a codeml or baseml analysis from parameter input through to the exploration of results. Novel options streamline the analysis process, and post-analysis visualization of phylogenies, evolutionary rates and selective constraint along protein sequences simplifies the interpretation of results. The integration of these functions into a single tool eliminates the need for lengthy data handling and parsing, significantly expediting access to global patterns in the data.


Genome sequences reveal divergence times of malaria parasite lineages


The evolutionary history of human malaria parasites (genus Plasmodium) has long been a subject of speculation and controversy. The complete genome sequences of the two most widespread human malaria parasites, P. falciparum and P. vivax, and of the monkey parasite P. knowlesi are now available, together with the draft genomes of the chimpanzee parasite P. reichenowi, three rodent parasites, P. yoelii yoelli, P. berghei and P. chabaudi chabaudi, and one avian parasite, P. gallinaceum.


We present here an analysis of 45 orthologous gene sequences across the eight species that resolves the relationships of major Plasmodium lineages, and provides the first comprehensive dating of the age of those groups.


Our analyses support the hypothesis that the last common ancestor of P. falciparum and the chimpanzee parasite P. reichenowi occurred around the time of the human-chimpanzee divergence. P. falciparum infections of African apes are most likely derived from humans and not the other way around. On the other hand, P. vivax, split from the monkey parasite P. knowlesi in the much more distant past, during the time that encompasses the separation of the Great Apes and Old World Monkeys.


The results support an ancient association between malaria parasites and their primate hosts, including humans.

Anisimova, M - on molecular evolution, and genomics


  • Anisimova, M. 2012. Parametric models of codon evolution in Codon Evolution: mechanisms and models, eds. Cannarozzi G, Schneider A., Oxford University Press link
  • Anisimova, M. and D. Liberles 2012. Detecting and understanding natural selection, in Codon Evolution: mechanisms and models, eds. Cannarozzi G, Schneider A., Oxford University Press link
  • Roth, A., M. Anisimova, and G.M. Cannarozzi 2012. Measuring codon usage bias, in Codon Evolution: mechanisms and models, eds. Cannarozzi G, Schneider A., Oxford University Press link
  • Schirrmeister, B.E., D.A. Dalquen, M. Anisimova and H.C. Bagheri 2012. Gene copy number variation and its significance in cyanobacterial phylogeny. BMC Microbiology 2:177, doi:10.1186/1471-2180-12-177 link
  • Schaper, E., A.V. Kajava, A. Hauser, and M. Anisimova 2012. Repeat or not repeat?—Statistical validation of tandem repeat prediction in genomic sequences Nucl. Acids Res. doi: 10.1093/nar/gks726 link

Evolutionary Genomics: statistical and computational methods

Anisimova, M. (Ed.) 2012. Evolutionary Genomics: statistical and computational methods Springer (Humana Press):

The genetic architecture of adaptations to high altitude in Ethiopia

Although hypoxia is a major stress on physiological processes, several human populations have survived for millennia at high altitudes, suggesting that they have adapted to hypoxic conditions. This hypothesis was recently corroborated by studies of Tibetan highlanders, which showed that polymorphisms in candidate genes show signatures of natural selection as well as well-replicated association signals for variation in hemoglobin levels. We extended genomic analysis to two Ethiopian ethnic groups: Amhara and Oromo. For each ethnic group, we sampled low and high altitude residents, thus allowing genetic and phenotypic comparisons across altitudes and across ethnic groups. Genome-wide SNP genotype data were collected in these samples by using Illumina arrays. We find that variants associated with hemoglobin variation among Tibetans or other variants at the same loci do not influence the trait in Ethiopians. However, in the Amhara, SNP rs10803083 is associated with hemoglobin levels at genome-wide levels of significance. No significant genotype association was observed for oxygen saturation levels in either ethnic group. Approaches based on allele frequency divergence did not detect outliers in candidate hypoxia genes, but the most differentiated variants between high- and lowlanders have a clear role in pathogen defense. Interestingly, a significant excess of allele frequency divergence was consistently detected for genes involved in cell cycle control, DNA damage and repair, thus pointing to new pathways for high altitude adaptations. Finally, a comparison of CpG methylation levels between high- and lowlanders found several significant signals at individual genes in the Oromo.



Demographic processes shaping genetic variation

Demographic processes modulate genome-wide levels and patterns of genetic variation via impacting effective population size independently of natural selection. Such processes include the perturbation of population distributions from external events shaping habitat landscape and internal factors shaping the probability of contemporaneous alleles in a population (coalescence). Several patterns have recently emerged: spatial and temporal heterogeneity in population structure have different influences on the persistence of new mutations and genetic variation, multi-locus analyses indicate that gene flow continues to occur during speciation and the incorporation of demographic processes into models of molecular evolution and association genetics approaches has improved statistical power to detect deviations from neutral-equilibrium expectations and decreased false positive rates.


Quantitative visualization of biological data in Google Earth using R2G2, an R CRAN package


We briefly introduce R2G2, an R CRAN package to visualize spatially explicit biological data within the Google Earth interface. Our package combines a collection of basic graph-editing features, including automated placement of dots, segments, polygons, images (including graphs produced with R), along with several complex three-dimensional (3D) representations such as phylogenies, histograms and pie charts. We briefly present some example data sets and show the immediate benefits in communication gained from using the Google Earth interface to visually explore biological results. The package is distributed with detailed help pages providing examples and annotated source scripts with the hope that users will have an easy time using and further developing this package. R2G2 is distributed onhttp://cran.r-project.org/web/packages.


In UNIX grep a phrase and adjacent lines


grep -A2 SELECT

That will return the line matching and the next two lines, found the answer here - 


You can also do

grep -A2 -i select

so it matches upper or lower case (-i is ignore case)

split file into files by pattern


Buuuu xxx bbb
Kmmmm rrr ssss uuuu
Kwwww zzzz ccc
Roooowwww eeee
Bxxxx jjjj dddd
Kuuuu eeeee nnnn
Rpppp cccc vvvv cccc
Rhhhhhhyyyy tttt
Lhhhh rrrrrssssss
Bffff mmmm iiiii
Ktttt eeeeeee
Kyyyyy iiiii wwww
Rwwww rrrr sssss eeee
Rnnnnn xxxxxxccccc

I like to split the above file into 3 files like below,


Buuuu xxx bbb
Kmmmm rrr ssss uuuu
Kwwww zzzz ccc
Roooowwww eeee


Bxxxx jjjj dddd
Kuuuu eeeee nnnn
Rpppp cccc vvvv cccc
Rhhhhhhyyyy tttt
Lhhhh rrrrrssssss


Bffff mmmm iiiii
Ktttt eeeeeee
Kyyyyy iiiii wwww
Rwwww rrrr sssss eeee
Rnnnnn xxxxxxccccc

Basically the file need to be start with "B" record and start a new file when it come across another "B" record.

awk '/^B/{close("file"f);f++}{print $0 > "file"f}' input.txt

perl -n -e '/^B/ and open FH, ">output_".$n++; print FH;' input.txt

csplit -k input.txt '/^B/' '{99}'


attribute, ascribe - 归因于

1. population history could be attributed to the differentiation among populations.
2. It just fell within the range of the last glacial maximum (LGM), thus supporting that isolation of populations was ascribed to global climate change in Pleistocene.