Proper recombination between homologs is critical for two reasons: first, the physical link between homologs helps establish their alignment on the meiotic spindle and correct segregation at the first meiotic division; and second, the exchange of DNA provides a nearly limitless source of genetic diversity.


#fasta file: pa101.fasta

#script: sequence_extractor.sh

# The 1 based sequence extractor - sequence_extractor.sh
# No guarantees offered.

# usage: 
# 1) download the script or copy the contents
# of the script and save it as sequence_extractor.sh
# 2) make it executable: chmod 755 sequence_extractor.sh
# reads from standard input or command line 
# 3) run the script: ./sequence_extractor.sh ps101.fasta 4 6

# create a backup copy of the input fasta file ($1.tmp)
# and delete the header line from the working copy
sed -i.tmp -e '1d' "$1" || exit $?

# merge the remaining lines into one string
temp_var1=$(awk '{printf "%s", $0}' "$1") || exit $?

# length of the region: end - start + 1 (both 1-based, inclusive)
temp_var2=$(($3 - $2 + 1)) || exit $?

# display the extracted sequence, then restore the original file
echo "${temp_var1:$(($2-1)):$temp_var2}"
mv "$1.tmp" "$1" || exit $?
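
The script above edits the input file in place and restores it from the .tmp backup afterwards, so an interrupted run can leave the FASTA file without its header. A minimal sketch of a variant that never touches the input follows. The function name extract_region is hypothetical, and the sketch assumes a single-record FASTA file whose first line is the header.

```shell
# extract_region FILE START END
# Hypothetical helper: print bases START..END (1-based, inclusive) of a
# single-record FASTA file without modifying the file.
extract_region() {
    local file=$1 start=$2 end=$3
    # Skip the header line, join the remaining lines, then cut the region.
    awk -v s="$start" -v e="$end" '
        NR > 1 { seq = seq $0 }
        END    { print substr(seq, s, e - s + 1) }
    ' "$file"
}
```

Called as `extract_region some.fasta 4 6`, it prints bases 4 to 6, the same region the script above is meant to return.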


the climatic record from Greenland - last glacial maximum


If we accept an earlier colonization into the Americas, the story is not so neat, because there were substantial time periods between 60 and 30 thousand years ago when colonization was possible environmentally, though these were relatively brief. But, if we look at the climatic record from Greenland rather than Antarctica (Fig. 6b), which should be more appropriate for northern latitudes, it would seem that between 55 and 25 thousand years ago, the warmer episodes were short-lived extremes in a very rapidly fluctuating climate (Bender et al., 1994), and perhaps it was the unpredictability of climate that made it difficult to work out how to adapt to the north. By the time the first Americans crossed Beringia, they seem to have learned to deal with such unpredictability because they survived the Younger Dryas fluctuations (Haynes, 2008).

Unix and Perl Primer for Biologists - v3.1.1


last updated: October 2012
If you download the entire course and uncompress the resulting zip file, then this should create a directory called 'Unix_and_Perl_course'. Inside this directory will be a 'Documentation' folder which has all three versions of the documentation (text, HTML, and PDF). The documentation is mostly aimed to be read from start to finish, though if you are comfortable with Unix you can jump to the sections on Perl.

IDEA - calculate dN/dS ratios for multiple sequences, in parallel


The availability of complete genomic sequences for hundreds of organisms promises to make obtaining genome-wide estimates of substitution rates, selective constraints and other molecular evolution variables of interest an increasingly important approach to addressing broad evolutionary questions. Two of the programs most widely used for this purpose are codeml and baseml, parts of the PAML (Phylogenetic Analysis by Maximum Likelihood) suite. A significant drawback of these programs is their lack of a graphical user interface, which can limit their user base and considerably reduce their efficiency.


We have developed IDEA (Interactive Display for Evolutionary Analyses), an intuitive graphical input and output interface which interacts with PHYLIP for phylogeny reconstruction and with codeml and baseml for molecular evolution analyses. IDEA's graphical input and visualization interfaces eliminate the need to edit and parse text input and output files, reducing the likelihood of errors and improving processing time. Further, its interactive output display gives the user immediate access to results. Finally, IDEA can process data in parallel on a local machine or computing grid, allowing genome-wide analyses to be completed quickly.


IDEA provides a graphical user interface that allows the user to follow a codeml or baseml analysis from parameter input through to the exploration of results. Novel options streamline the analysis process, and post-analysis visualization of phylogenies, evolutionary rates and selective constraint along protein sequences simplifies the interpretation of results. The integration of these functions into a single tool eliminates the need for lengthy data handling and parsing, significantly expediting access to global patterns in the data.


Genome sequences reveal divergence times of malaria parasite lineages


The evolutionary history of human malaria parasites (genus Plasmodium) has long been a subject of speculation and controversy. The complete genome sequences of the two most widespread human malaria parasites, P. falciparum and P. vivax, and of the monkey parasite P. knowlesi are now available, together with the draft genomes of the chimpanzee parasite P. reichenowi, three rodent parasites, P. yoelii yoelii, P. berghei and P. chabaudi chabaudi, and one avian parasite, P. gallinaceum.


We present here an analysis of 45 orthologous gene sequences across the eight species that resolves the relationships of major Plasmodium lineages, and provides the first comprehensive dating of the age of those groups.


Our analyses support the hypothesis that the last common ancestor of P. falciparum and the chimpanzee parasite P. reichenowi occurred around the time of the human-chimpanzee divergence. P. falciparum infections of African apes are most likely derived from humans and not the other way around. On the other hand, P. vivax split from the monkey parasite P. knowlesi in the much more distant past, during the time that encompasses the separation of the Great Apes and Old World Monkeys.


The results support an ancient association between malaria parasites and their primate hosts, including humans.

Evolutionary Genomics: statistical and computational methods

Anisimova, M. (Ed.) 2012. Evolutionary Genomics: Statistical and Computational Methods. Springer (Humana Press).

The genetic architecture of adaptations to high altitude in Ethiopia

Although hypoxia is a major stress on physiological processes, several human populations have survived for millennia at high altitudes, suggesting that they have adapted to hypoxic conditions. This hypothesis was recently corroborated by studies of Tibetan highlanders, which showed that polymorphisms in candidate genes show signatures of natural selection as well as well-replicated association signals for variation in hemoglobin levels. We extended genomic analysis to two Ethiopian ethnic groups: Amhara and Oromo. For each ethnic group, we sampled low and high altitude residents, thus allowing genetic and phenotypic comparisons across altitudes and across ethnic groups. Genome-wide SNP genotype data were collected in these samples by using Illumina arrays. We find that variants associated with hemoglobin variation among Tibetans or other variants at the same loci do not influence the trait in Ethiopians. However, in the Amhara, SNP rs10803083 is associated with hemoglobin levels at genome-wide levels of significance. No significant genotype association was observed for oxygen saturation levels in either ethnic group. Approaches based on allele frequency divergence did not detect outliers in candidate hypoxia genes, but the most differentiated variants between high- and lowlanders have a clear role in pathogen defense. Interestingly, a significant excess of allele frequency divergence was consistently detected for genes involved in cell cycle control, DNA damage and repair, thus pointing to new pathways for high altitude adaptations. Finally, a comparison of CpG methylation levels between high- and lowlanders found several significant signals at individual genes in the Oromo.



Demographic processes shaping genetic variation

Demographic processes modulate genome-wide levels and patterns of genetic variation via impacting effective population size independently of natural selection. Such processes include the perturbation of population distributions from external events shaping habitat landscape and internal factors shaping the probability of contemporaneous alleles in a population (coalescence). Several patterns have recently emerged: spatial and temporal heterogeneity in population structure have different influences on the persistence of new mutations and genetic variation, multi-locus analyses indicate that gene flow continues to occur during speciation and the incorporation of demographic processes into models of molecular evolution and association genetics approaches has improved statistical power to detect deviations from neutral-equilibrium expectations and decreased false positive rates.


Quantitative visualization of biological data in Google Earth using R2G2, an R CRAN package


We briefly introduce R2G2, an R CRAN package to visualize spatially explicit biological data within the Google Earth interface. Our package combines a collection of basic graph-editing features, including automated placement of dots, segments, polygons, images (including graphs produced with R), along with several complex three-dimensional (3D) representations such as phylogenies, histograms and pie charts. We briefly present some example data sets and show the immediate benefits in communication gained from using the Google Earth interface to visually explore biological results. The package is distributed with detailed help pages providing examples and annotated source scripts with the hope that users will have an easy time using and further developing this package. R2G2 is distributed on http://cran.r-project.org/web/packages.


In UNIX grep a phrase and adjacent lines


grep -A2 SELECT

That will return the matching line and the next two lines.


You can also do

grep -A2 -i select

so it matches upper or lower case (-i is ignore case)
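
-A only prints lines after each match. The related -B and -C options, available in both GNU and BSD grep, print lines before the match or on both sides of it. A small sketch, using a throwaway input file whose contents are made up for illustration:

```shell
# Build a small throwaway input file (its contents are invented for the demo).
demo=$(mktemp)
printf 'one\ntwo\nSELECT a\nfour\nfive\n' > "$demo"

# -B1: the matching line plus one line before it
before=$(grep -B1 SELECT "$demo")

# -C1: the matching line plus one line of context on each side
around=$(grep -C1 SELECT "$demo")

printf 'with -B1:\n%s\n\nwith -C1:\n%s\n' "$before" "$around"
rm -f "$demo"
```

When several matches are far apart, grep separates the groups of context lines with a line containing `--`.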

split file into files by pattern


Buuuu xxx bbb
Kmmmm rrr ssss uuuu
Kwwww zzzz ccc
Roooowwww eeee
Bxxxx jjjj dddd
Kuuuu eeeee nnnn
Rpppp cccc vvvv cccc
Rhhhhhhyyyy tttt
Lhhhh rrrrrssssss
Bffff mmmm iiiii
Ktttt eeeeeee
Kyyyyy iiiii wwww
Rwwww rrrr sssss eeee
Rnnnnn xxxxxxccccc

I would like to split the above file into 3 files like below,


Buuuu xxx bbb
Kmmmm rrr ssss uuuu
Kwwww zzzz ccc
Roooowwww eeee


Bxxxx jjjj dddd
Kuuuu eeeee nnnn
Rpppp cccc vvvv cccc
Rhhhhhhyyyy tttt
Lhhhh rrrrrssssss


Bffff mmmm iiiii
Ktttt eeeeeee
Kyyyyy iiiii wwww
Rwwww rrrr sssss eeee
Rnnnnn xxxxxxccccc

Basically, each file needs to start with a "B" record, and a new file should be started whenever another "B" record is encountered.

awk '/^B/{close("file"f);f++}{print $0 > "file"f}' input.txt

perl -n -e '/^B/ and open FH, ">output_".$n++; print FH;' input.txt

csplit -k input.txt '/^B/' '{99}'
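
A quick way to check the awk approach is to run it on a short sample in a scratch directory. The sketch below uses an invented seven-line input and the awk one-liner from above, with the output name parenthesized because some awk implementations parse an unparenthesized concatenation after `>` differently. The output names file1, file2, ... follow from the "file"f expression.

```shell
# Work in a scratch directory so the generated files are easy to clean up.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Shortened sample input: three records, each starting with a "B" line.
printf 'B1\nK1\nR1\nB2\nK2\nB3\nR3\n' > input.txt

# At each "B" line close the previous output file and bump the counter,
# then write every line to the current output file.
awk '/^B/{close(("file" f)); f++} {print $0 > ("file" f)}' input.txt

# List what was produced: one file per "B" record.
ls file*
```

Closing each file before moving to the next keeps the number of open file handles at one, which matters when the input contains many records.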


attribute, ascribe - to regard as being caused by

1. Population history could be attributed to the differentiation among populations.
2. It just fell within the range of the last glacial maximum (LGM), thus supporting that isolation of populations was ascribed to global climate change in the Pleistocene.

counteract - to offset, to cancel out

Restricted gene flow could not counteract the effect of genetic drift and resulted in differentiation among populations.

homoplasy - similarity that does not derive from a common ancestor

However, closely related species delimitation based on morphologic analysis might be distorted by a high level of homoplasy (Nyffeler et al., 2005).



With increasing evidence implicating important biological roles of lincRNAs in animal cells (Barsotti and Prives, 2010; Qureshi et al., 2010), a comprehensive genome-wide analysis of plant lincRNA is warranted.

bona fide - in good faith; genuine; sincere

Therefore, as a first step, we reanalyzed these ncRNAs in an attempt to identify bona fide lincRNAs.


update R in Ubuntu linux

Keeping R up to date on Ubuntu linux

R is included as part of the standard Ubuntu distribution, and can be installed with a command like
sudo apt-get install r-base
Obviously the software included as part of the standard distribution usually lags a little behind the latest version, and this is usually quite acceptable for most users most of the time. However, R is evolving quite quickly at the moment, and for various reasons I have decided to skip Ubuntu 12.10 (quantal) and stick with Ubuntu 12.04 (precise) for the time being. Since R 2.14 is included with Ubuntu 12.04, and I’d rather use R 2.15, I’d like to run with the latest R builds on my Ubuntu system.
Fortunately this is very easy, as there is a maintained repository for Ubuntu builds of R on CRAN. Full instructions are provided on CRAN, but here is the quick summary. First you need to know your nearest CRAN mirror – there is a list of mirrors on CRAN. I generally use the Bristol mirror, and so I will use it in the following.
1  sudo su
2  echo "deb http://www.stats.bris.ac.uk/R/bin/linux/ubuntu precise/" >> /etc/apt/sources.list
3  apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
4  apt-get update
5  apt-get upgrade
That’s it. You are updated to the latest version of R, and your system will check for updates in the usual way. There are just two things you may need to edit in line 2 above. The first is the address of the CRAN mirror (here “www.stats.bris.ac.uk”). The second is the name of the Ubuntu distro you are running (here “precise”).
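
The two values that may need editing, the mirror address and the release codename, can be kept as parameters. A small sketch, assuming a hypothetical helper named cran_source_line that takes the mirror's base URL and the codename. It only prints the source line; appending it to /etc/apt/sources.list is left as a separate, deliberate step.

```shell
# cran_source_line MIRROR_BASE_URL CODENAME
# Hypothetical helper: print the apt source line for a CRAN mirror.
cran_source_line() {
    printf 'deb %s/bin/linux/ubuntu %s/\n' "$1" "$2"
}

# Example: the Bristol mirror and the "precise" release used in this post.
cran_source_line http://www.stats.bris.ac.uk/R precise

# On a running system the codename can usually be read with: lsb_release -cs
```

Printing the line first makes it easy to check the mirror path and codename before anything is written to the system configuration.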

MULTIMIX - inferring the local ancestry of admixed individuals from dense genome-wide single nucleotide polymorphism data


We describe a novel method for inferring the local ancestry of admixed individuals from dense genome-wide single nucleotide polymorphism data. The method, called MULTIMIX, allows multiple source populations, models population linkage disequilibrium between markers and is applicable to datasets in which the sample and source populations are either phased or unphased. The model is based upon a hidden Markov model of switches in ancestry between consecutive windows of loci. We model the observed haplotypes within each window using a multivariate normal distribution with parameters estimated from the ancestral panels. We present three methods to fit the model—Markov chain Monte Carlo sampling, the Expectation Maximization algorithm, and a Classification Expectation Maximization algorithm. The performance of our method on individuals simulated to be admixed with European and West African ancestry shows it to be comparable to HAPMIX, the ancestry calls of the two methods agreeing at 99.26% of loci across the three parameter groups. In addition to it being faster than HAPMIX, it is also found to perform well over a range of extent of admixture in a simulation involving three ancestral populations. In an analysis of real data, we estimate the contribution of European, West African and Native American ancestry to each locus in the Mexican samples of HapMap, giving estimates of ancestral proportions that are consistent with those previously reported.


compromise, render

I am prepared to make some concessions on minor details, but I cannot compromise on fundamentals.


The loss of variation and the cost of domestication in genomes of crop species 
may compromise the level of natural defenses against pathogens 
and render them more susceptible than their wild relatives. 

contingent on

The accuracy of repeat genotypes is contingent on the proper mapping of reads to repeat loci.

promise benefit to

Further, analysing repeats in personal genomes promises benefit not just to medical genetics and the diagnosis of repeat-related disorders but also to forensics and genealogy, where shorter and more stable tandem repeats can serve as DNA fingerprints to uniquely identify individuals.


LinuxCast and Codecademy


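LinuxCast: a comprehensive platform for learning and discussing Linux. It offers free, professional Linux videos, tutorials, Q&A and discussion. LinuxCast gives you a new, simple way to learn Linux through videos plus online Q&A, with more professional content, so that learning Linux no longer has to be obscure and difficult.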


Install R and Rstudio in Ubuntu

Installing R in Ubuntu is extremely easy if nothing goes wrong, but if something does, you had better be a very advanced Linux user :-)
Install R
Because the R version in the official Ubuntu repositories is usually about half a year behind the one from r-project.org, it is recommended to install the latest R from the r-project.org repository.
vi /etc/apt/sources.list
# append below line to end of sources.list
# you can view mirror at http://cran.r-project.org/mirrors.html
deb http://ftp.ctex.org/mirrors/CRAN/bin/linux/ubuntu precise/
import the GPG key and install r-base
cd ~
gpg --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys E084DAB9
gpg -a --export E084DAB9 | sudo apt-key add -
apt-get update
apt-get install r-base
Install Oracle DB access package
You can find newer versions of the ROracle and DBI packages on CRAN. ROracle also requires a properly installed Oracle Instant Client.
Manually install ROracle
wget http://cran.r-project.org/src/contrib/DBI_0.2-5.tar.gz
R CMD INSTALL DBI_0.2-5.tar.gz
wget http://cran.r-project.org/src/contrib/ROracle_1.1-5.tar.gz
R CMD INSTALL --configure-args='--with-oci-inc=/opt/oracle/instantclient_11_2/sdk/include' ROracle_1.1-5.tar.gz
Install RStudio Server
apt-get install libssl0.9.8 # must be installed even if you have a newer version
apt-get install libapparmor1 apparmor-utils
wget http://download2.rstudio.org/rstudio-server-0.96.331-i386.deb
dpkg -i rstudio-server-0.96.331-i386.deb
rstudio-server verify-installation
Configure RStudio Server
echo 'rsession-memory-limit-mb=1000' > /etc/rstudio/rserver.conf
echo 'rsession-stack-limit-mb=4' >> /etc/rstudio/rserver.conf
echo 'rsession-process-limit=20' >> /etc/rstudio/rserver.conf
# Only add the line below if you will use proxy mode
echo 'www-address=' >> /etc/rstudio/rserver.conf
groupadd rstudio
Setting the proxy server for RStudio Server
This section is optional and assumes nginx is already installed on the server.
Do not forget to link the configuration into /opt/nginx/conf/vhosts
server {
  listen       80;
  server_name  cvprstudio;
  location / {
    proxy_pass http://localhost:8787;
    proxy_redirect http://localhost:8787/ $scheme://$host/;
  }
}
Setting auto restart and PATH
ln -s /usr/lib/rstudio-server/extras/init.d/debian/rstudio-server /etc/init.d/rstudio-server
vi /etc/init.d/rstudio-server
append below line to /etc/init.d/rstudio-server SCRIPTNAME
Now you can start or restart the server in the standard init.d way
/etc/init.d/rstudio-server restart
Add a user in RStudio
adduser --ingroup rstudio cindy
passwd cindy # set the password
Update package
It is usually better to upgrade r-base and its packages system-wide rather than per user