2011年3月31日星期四

Baysian method to find snps underlying adaptation

(1) Using Environmental Correlations to Identify Loci Underlying Local Adaptation
http://www.eve.ucdavis.edu/gmcoop/Software/Bayenv/Bayenv.html
http://www.genetics.org/cgi/content/short/185/4/1411

Loci involved in local adaptation can potentially be identified by an unusual correlation between allele frequencies and important ecological variables or by extreme allele frequency differences between geographic regions. However, such comparisons are complicated by differences in sample sizes and the neutral correlation of allele frequencies across populations due to shared history and gene flow. To overcome these difficulties, we have developed a Bayesian method that estimates the empirical pattern of covariance in allele frequencies between populations from a set of markers and then uses this as a null model for a test at individual SNPs. In our model the sample frequencies of an allele across populations are drawn from a set of underlying population frequencies; a transform of these population frequencies is assumed to follow a multivariate normal distribution. We first estimate the covariance matrix of this multivariate normal across loci using a Monte Carlo Markov chain. At each SNP, we then provide a measure of the support, a Bayes factor, for a model where an environmental variable has a linear effect on the transformed allele frequencies compared to a model given by the covariance matrix alone. This test is shown through power simulations to outperform existing correlation tests. We also demonstrate that our method can be used to identify SNPs with unusually large allele frequency differentiation and offers a powerful alternative to tests based on pairwise or global FST. Software is available at http://www.eve.ucdavis.edu/gmcoop/.

(2)
Adaptations to Climate in Candidate Genes for Common Metabolic Disorders. Hancock MA, Witonsky DB, Gordon AS, Eshel G, Pritchard JK, Coop G, Di Rienzo A PLoS Genetics 2008.

High-resolution mapping of crossovers - The Coop Lab

High-resolution mapping of crossovers reveals extensive variation in fine-scale recombination patterns among humans. Coop G, Wen X, Ober C, Pritchard JK, Przeworski M. Science, 2008, pdf , Slides, Data

http://www.eve.ucdavis.edu/gmcoop/Publications.html

PRDM9 Is a Major Determinant of Meiotic Recombination Hotspots in Humans and Mice . Baudat F, Buard J, Grey C, Fledel-Alon A, Ober C, Przeworski M, Coop G, de Massy B (2010) Science. See also the papers by Myers et al. and Parvanov et al. that appeared in the same issue.

Understanding the Evolution of Defense Metabolites in Arabidopsis thaliana Using Genome-wide Association Mapping

http://www.genetics.org/cgi/content/full/185/3/991

With the improvement and decline in cost of high-throughput genotyping and phenotyping technologies, genome-wide association (GWA) studies are fast becoming a preferred approach for dissecting complex quantitative traits. Glucosinolate (GSL) secondary metabolites within Arabidopsis spp. can serve as a model system to understand the genomic architecture of quantitative traits. GSLs are key defenses against insects in the wild and the relatively large number of cloned quantitative trait locus (QTL) controlling GSL traits allows comparison of GWA to previous QTL analyses. To better understand the specieswide genomic architecture controlling plant-insect interactions and the relative strengths of GWA and QTL studies, we conducted a GWA mapping study using 96 A. thaliana accessions, 43 GSL phenotypes, and ~230,000 SNPs. Our GWA analysis identified the two major polymorphic loci controlling GSL variation (AOP and MAM) in natural populations within large blocks of positive associations encompassing dozens of genes. These blocks of positive associations showed extended linkage disequilibrium (LD) that we hypothesize to have arisen from balancing or fluctuating selective sweeps at both the AOP and MAM loci. These potential sweep blocks are likely linked with the formation of new defensive chemistries that alter plant fitness in natural environments. Interestingly, this GWA analysis did not identify the majority of previously identified QTL even though these polymorphisms were present in the GWA population. This may be partly explained by a nonrandom distribution of phenotypic variation across population subgroups that links population structure and GSL variation, suggesting that natural selection can hinder the detection of phenotype–genotype associations in natural populations.

对比种间受选择位点的分布

Investigating the role of natural selection on coding sequence evolution in salmonids through NGS data mining.

Identifying DNA regions evolving under the effect of either positive or stabilizing selection, and quantifying the intensity of the selective pressures at play, represent major goals in evolutionary biology. The combination of next generation sequencing with a comparative genomics approach appears particularly promising towards yielding detailed insights into these issues. Here, we applied this strategy to investigate patterns of nucleotide substitutions in five species of the salmonid family (Salmo salar, Onchorynchus mykiss, Salvelinus fontinalis, Salvelinus namaycush, Coregonus clupeaformis) and compare this information with other fishes (Esox lucius, Danio rerio) for which genome information is available in order to infer the role of natural selection on the evolution of protein coding genes. We found evidence for positive selection across the seven fish genomes analysed. More precisely, we i) identified 707 orthologous genes and computed ratio of rates of divergence (dN/dS ratio) to identify those genes that deviated significantly from neutral expectations; ii) associated GO terms to the subset of 58 orthologs found to be under the influence of positive selection and identified 6 biological processes over-represented in this subset of positively selected orthologs, and iii) identified a subset candidates genes that have rapidly evolved under the influence of positive selection within salmonids, thus warranting further investigation in regards to their putative role in the process of adaptive divergence in salmonids.

The Complex Genetic Architecture of the Metabolome

The Complex Genetic Architecture of the Metabolome

http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1001198

Understanding how genetic variation can control phenotypic variation is a fundamental goal of modern biology. We combined genome-wide association mapping with metabolomics in the plant Arabidopsis thaliana to explore how species-wide genetic variation controls metabolism. We identified numerous naturally-variable genes that may influence plant metabolism, often clustering in “hotspots.” These hotspots were proximal to selective sweeps, regions of the genome showing decreased diversity possibly from a strong selective advantage of specific variants within the region. This suggests that metabolism may be connected to the selective advantage. Interestingly, metabolite variation in wild Arabidopsis is highly constrained despite the significant genetic variation, thus providing the plant un-sampled metabolic space if the environment shifts. The observed structuring of genetic and metabolic variation suggests individual convergence upon similar phenotypes via different genotypes, possibly intra-specific parallel evolution. This phenotypic convergence couples with a pattern of genotype—phenotype association consistent with metabolite variation largely controlled by numerous small effect genetic variants. This supports the supposition that large magnitude variation is likely unstable in a complex and interconnected metabolism. If this pattern proves generally applicable to other species, it could present a significant hurdle to identifying genes controlling metabolic trait variation via genome-wide association studies.

Converting R contingency tables to data frames

http://toddjobe.blogspot.com/2010/08/converting-r-contingency-tables-to-data.html

Web GIS in practice - GIS 应用使用指导,很好

Web GIS in practice:

http://www.ij-healthgeographics.com/series/1476-072X-Gis

2011年3月30日星期三

install R/bioconductor package locally - an example

Here I show an example for installing R/bioconductor package locally.

You need to specify the full path for R you want to install the package to, and the full path for library path the package will be install, and also the folder of the uncompressed files of R/bioconductor package.

In this example,
full path for R:/ebio/abt6/jmao/bin/R-2.13/R-alpha/bin/R
package folder: BSgenome/
library path: /ebio/abt6/jmao/bin/R-2.13/Rpacks

jmao@sever:~/bin/R-2.13$ /ebio/abt6/jmao/bin/R-2.13/R-alpha/bin/R CMD INSTALL BSgenome/ --library=/ebio/abt6/jmao/bin/R-2.13/Rpacks
* installing *source* package ‘BSgenome’ ...
** R
** inst
** preparing package for lazy loading

Attaching package: 'IRanges'

The following object(s) are masked from 'package:base':

Map, cbind, eval, intersect, mapply, order, paste, pmax, pmax.int,
pmin, pmin.int, rbind, rep.int, setdiff, table, union

** help
*** installing help indices
** building package indices ...
** testing if installed package can be loaded

* DONE (BSgenome)

R CMD INSTALL - I need to know more about that

$ R CMD INSTALL --help
Usage: R CMD INSTALL [options] pkgs

Install the add-on packages specified by pkgs. The elements of pkgs can
be relative or absolute paths to directories with the package (bundle)
sources, or to gzipped package 'tar' archives. The library tree
to install to can be specified via '--library'. By default, packages are
installed in the library tree rooted at the first directory in
.libPaths() for an R session run in the current environment

Options:
-h, --help print short help message and exit
-v, --version print INSTALL version info and exit
-c, --clean remove files created during installation
--preclean remove files created during a previous run
-d, --debug turn on script and build-help debugging
-l, --library=LIB install packages to library tree LIB
--no-configure do not use the package's configure script
--no-docs do not install HTML, LaTeX or examples help
--html build HTML help
--no-html do not build HTML help
--latex install LaTeX help
--example install R code for help examples
--use-zip-data collect data files in zip archive
--fake do minimal install for testing purposes
--no-lock, --unsafe
install on top of any existing installation
without using a lock directory
--pkglock use a per-package lock directory
--build build binaries of the installed package(s)
--data-compress= none, gzip (default), bzip2 or xz compression
to be used for lazy-loading of data
--resave-data re-save data files as compactly as possible

for Unix
--configure-args=ARGS
set arguments for the configure scripts (if any)
--configure-vars=VARS
set variables for the configure scripts (if any)
--libs-only only install the libs directory
--no-multiarch build only the main architecture
--install-tests install package-specific tests (if any)
--no-R, --no-libs, --no-data, --no-help, --no-demo, --no-exec,
--no-inst
suppress installation of the specified part of the
package for testing or other special purposes

and on Windows only
--auto-zip select whether to zip data automatically

Which of --html or --no-html is the default depends on the build of R:
for this one it is --no-html.

Report bugs to .

iTOL - 交互式建树方法

进化树做得非常漂亮

http://itol.embl.de/

2011年3月28日星期一

download TAIR8 gff files

ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR8_genome_release/TAIR8_gff3/

2011年3月27日星期日

locally install R/bioconductor package

Locally install R/bioconductor package, means not link to web in the process of packages installaion. This is really easy for warm hands, but not for a new comer.

Here, an example. install a packages created by myself, the package tar.gz file is "BSgenome.Alyrata.JGI.v1_1.0.tar.gz".

install.packages(pkgs="BSgenome.Alyrata.JGI.v1_1.0.tar.gz", repos=NULL)

install R without root

Here I show codes for installation R without root. This is especially attractive for who works on server without root.

1. first make directories for (1) R downloading/installation and (2) R packages installation of potential R you will install.

mkdir R # create folder for potential R
mkdir Rpacks # create folder for R packages installation of potential R

cd R
wget http://cran.cnr.berkeley.edu/src/base/R-2/R-2.12.2.tar.gz # download R
tar -xvzf R-2.12.2.tar.gz # uncompress it
cd R-2.12.2

less INSTALL # read instruction file for installation

# configure and installation
./configure
make

# check for proper installation
make check # this may not so necessary.

# else your can invoke your R by type in R in the bin folder of the installation path.


2. Followings are some texts for INSTALL file for instruction of installation
As you are reading this file, you have unpacked the R sources and are
presumably in the top directory. Issue the following commands:

./configure
make

(If your make is not called `make', set the environment variable MAKE to
its name, and use that name throughout these instructions.)
This will take a while, giving you time to read `R-admin.html'.

Then check the built system worked correctly, by

make check

and make the manuals by (as many options as preferred from)

make dvi to create DVI versions
make pdf to create PDF versions
make info to create info files

However, please read the notes in `R-admin.html' about making the
reference manual.

3. the final step is setting up the path for your R in the .bash_profile. This would be done by adding lines like the followings in the file .bash_profile, and then source it.

alias r='/your path/R/R-2.12.2/bin/R'
export R_LIBS=/your path/Rpacks

How to run R on a server without X11, and avoid broken dependencies

http://stackoverflow.com/questions/1710853/how-to-run-r-on-a-server-without-x11-and-avoid-broken-dependencies

install R package without root

R is a nice compannon for stataitical people.From time to time, I have to install various packages to accomplish some tasks. At my desktop (linux server), I have the root access, the installation is pretty simple,

(1)Start R console and fire the installation command,

sudo R

>install.packages(“boa”)

or you can download the package and install it using the command line

sudo R CMD INSTALL boa_1.1.7-2.tar.gz

(2)When working at other server, apparently, I have no root access, have to install packages at my home dir

make a R local dir, download the package and install it

mkdir ~/myRdir

R CMD INSTALL ctv.tar.gz -l $HOME/myRdir

to use it, you have to defined R_LIB environmental variable, I put it at ~/.bashrc

R_LIBS=/home/USER/R
export R_LIBS

then load it is you need as

library(“boa”)

2011年3月26日星期六

exploRase - R GUI multivariate analysis of systems biology data

exploRase is a MetNet tool written in R for the exploratory multivariate analysis of Systems Biology data. It provides a graphical user interface (GUI) on top of the analysis functionality provided by R and the Bioconductor project. exploRase is designed to be accessible to biologists analyzing omics data in the context of metabolic and regulatory networks.

http://metnet.vrac.iastate.edu/MetNet_exploRase.htm

2011年3月24日星期四

automatically find the path for a specific file name

# automatically find the path for a specific file name
find -type f -name "insertion.txt"

# count how many files found
find -type f -name "insertion.txt" | wc -l

cope lined from multiple files of different folders

Q: I have 100 directories and each directory has a file with some name pattern, in it. I need to output the 15th line counted from the bottom of each file. How can I do this without copying all those files into another single directory.

A_1:
for i in `find . -type f -name "pattern"`;
do
tail -15 $i|head -1 >>redirect_somewhere
done

A_2:

Setting up R and LATEX

Setting up R and LATEX

http://statmath.wu.ac.at/software/R/qfin/

Yilmaz Ceylan

Stefan Theußl

February 10, 2011

Contents

[PDF version of this manual]

1 Introduction

This article explains how to set up R and LATEX on various operating systems. We provide a set of instructions for the three major desktop operating systems Windows, Mac OS X, and Linux in Sections 2, 3, and 4, respectively. Section 5 includes references to related articles and further resources. For more information about R please visit the website of the R project at http://www.r-project.org/. Further questions not covered by this article and feedback may be sent to statmath-qfin_AT_wu.ac.at.

2 Windows

If you are using Windows you should follow the steps below in order to successfully setup a basic R development environment.

  1. Install the current version of R (available from http://CRAN.R-project.org/bin/windows/base/).
  2. Set up a working LATEX environment (e.g., MiKTEX http://miktex.org/; preferably installing all available packages).
  3. Install the R for Windows toolchain (known as R-Tools, available from http://www.murdoch-sutherland.com/Rtools/).
  4. Update your environment variables.

LATEX Environment

The most popular option for setting up a LATEX environment on Windows is installing the MiKTEX distribution. We recommend to download the net installer from http://miktex.org/, execute it, and install MiKTEX including all packages (full installation).

Installing the R for Windows Toolchain

R-Tools includes various software needed for successfully building R packages (typically packages containing C or FORTRAN source code). It provides Windows with a set of compilers and various other tools and libraries.

Not included in the R-Tools package is the Microsoft htmlhelp compiler. It is used to produce compiled HTML sites of R help pages. You may download the software from http://msdn.microsoft.com/en-us/library/ms669985(VS.85).aspx.

Updating Environment Variables

Presumably the most important step is to update the PATH environment variable.

For doing so go to: Start Settings Control Panel System Advanced.
Search for the PATH variable and change it appropriately. The paths to the ‘bin’ directories of the following software have to be included (if they were not included automatically):

  • R-Tools
  • MiKTEX
  • htmlhelp
  • R

Note that the order of the entries is important, thus “copy and paste” the paths in the order seen above. Separate them by semicolon. Do not delete other entries, Windows may need them.

Links, from where you can download suitable text editors:

3 MAC OS X

In order to successfully setup a basic development environment for R- related projects under MAC OS X you need to follow these four steps:

  1. Install the current version of R (available from http://CRAN.R-project.org/bin/macosx/ ).
  2. Install a LATEX environment (e.g. TeXShop, teTeX).
  3. Install XCode.
  4. Install the GNU Fortran compiler (gfortran).

LATEX environment

A readily usable environment for typesetting scientific articles with LATEX is provided with the MacTEX distribution available from http://www.tug.org/mactex/2009/. This distribution also includes a text editor named TeXShop.

XCode

Compilers are needed for installing R packages containing source code typically written in C or FORTRAN. A set of C compilers for Mac OS X is provided in XCode the basic development tools for Mac OS X installable either from the DVD containing the operating system that came with your Mac or, after downloading the latest version from the Apple Developer website at http://developer.apple.com/technology/xcode.html.

GNU FORTRAN compiler

Unfortunately, the compiler gfortran is not included in XCode. Luckily, since R version 2.5.0 the compiler is installed with the R Mac OS X installer (see Step 1: Installing the current version of R). Nevertheless, if you experience problems setting up the compiler you may install it manually using the packages provided at http://r.research.att.com/tools/.

4 Linux

As we cannot provide a complete set of instructions for every Linux distribution we concentrate on Debian-based distributions in this article. The most prominent of such Debian-based Linux distributions is called Ubuntu (http://www.ubuntu.com). Nevertheless, the instructions provided in this section should usually work on any Debian-based system.

In order to successfully setup a basic development environment for R under Debian you need to follow these three steps:

  1. Install the current version of R (r-base and r-recommended packages).
  2. Install a LATEX environment (e.g., TeX Live available by installing the texlive-full package).
  3. Install compilers and other development packages.

Installing R

R is provided in binary form as Debian packages called r-base. It can be installed either by using the synaptic package manager or by issuing the following command on the command prompt:

sudo aptitude install r-base r-recommended

For installing more recent versions of R please follow the steps provided in please visit the R on Debian or R on Ubuntu websites http://www.ubuntu.com.

LATEX environment

A readily usable environment for typesetting scientific articles with LATEX is provided with the TeX Live distribution available from www.tug.org/textlive/. Luckily, one can install the complete distribution with the texlive-full package very easily. It can be installed either using the synaptic package manager or by issuing the following command:

sudo aptitude install texlive-full

Compilers and Other Tools

Typically, compilers and other development tools are needed in a reasonable development environment. Those tools are made available on Debian-based system simply by installing the r-base-dev package. It can be installed either using the synaptic package manager or by issuing the following command:

sudo aptitude install r-base-dev

5 Further Information

More information about the topic can be retrieved from the following references.

6 Frequently Asked Questions

  1. Windows (MiKTEX):
    1. How to make MiKTEX recognize ‘Sweave.sty’?

      Answer: Go to your R Installation directory, i.e. the directory where R has been installed to. We call this directory ‘RHOME’. Search for the LATEX style sheet called ‘Sweave.sty’. Typically it is located in the folder ‘<RHOME>\share\texmf\’. Copy the content of this folder to the ‘Sweave’ folder in your “personal” LATEX library. The location of your LATEX library can be seen by going to Start All Programs MiKTeX <Version>Maintenance Settings.

      Go to the Roots tab, and mark the “Show MiKTeX-maintained root directories”. Look at the Description where it says “UserInstall,UserConfig”. The coresponding Path points to the location. Usually it is in ‘<RHOME>\Application Data\MiKTeX\<Version>\tex\latex\’, (<Version> has to be replaced with the installed MiKTEX version), if you don’t have the ‘latex’ and/or the ‘Sweave’ folder, please create it.

    2. How to install further LATEX packages?

      Answer: Go to Start All Programs MiKTeX<Version> Maintenance(Admin) Package Manager(Admin).

      Search for the package you need, then right click on the package name and click the Install.

    3. MikTEX offers downloading missing packages but fails.

      Answer: Note that you have to be able to write to the system library of MiKTEX. Typically this can only be achieved when you login as an Administrator.

7 Acknowledgment

An earlier version of this manual was written by Ksenia Fraczek.

how to forge BSgenome data package

here are some references

http://permalink.gmane.org/gmane.science.biology.informatics.conductor/33026

extract sequences from genome by using R/bioconductor

please reference the following sites for more deep discussions

http://biostar.stackexchange.com/questions/979
http://biostar.stackexchange.com/questions/6358/how-to-extract-specific-coordinates-from-multifasta-file

####################################################
# install packages necessary
# use Arabidopsis thaliana as an example
source("http://www.bioconductor.org/biocLite.R")
biocLite("BSgenome.Athaliana.TAIR.04232008")

# load the package
library(BSgenome.Athaliana.TAIR.04232008)

# touch the funtionalities
Athaliana
seqlengths(Athaliana)
Athaliana$chr1
class(Athaliana)
organism(Athaliana)
alphabetFrequency(Athaliana$chr1)

# get the sequences
start = runif(100, min=1, max =length(Athaliana$chr1)-200)
width = runif(100, min=30, max=200)
result = getSeq(Athaliana, names=chr1, start=start, end=start)
cbind(start, result)

read compressed file into R - very easy

R tip: Save time and space by compressing data files

http://blog.revolutionanalytics.com/2009/12/r-tip-save-time-and-space-by-compressing-data-files.html

git

# git can download the files for you. here shore was downloaded.

git clone git://shore.git.sourceforge.net/gitroot/shore/shore

Zcat, Zless, Zgrep, Zdiff

From - http://www.thegeekstuff.com/2009/05/zcat-zless-zgrep-zdiff-zcmp-zmore-gzip-file-operations-on-the-compressed-files/

In this article let us review how to perform normal file operation on a compressed files using the powerful Linux Z commands.

Some of these z commands uncompresses the file temporarily in the /tmp directory to perform the specified operation. Some of the z commands uncompresses it on the fly to perfom the specified operation. But, under any case, z commands gives the peace of mind, as you don’t want to worry about the overhead of uncompressing the compressed file to perform an operation.

You can do the following normal file operations on the compressed file
  1. Viewing the compressed file with zcat.
  2. Paging the compressed file with zless / zmore.
  3. Searching inside the compressed file with zgrep / zegrep.
  4. Comparison of file using zdiff / zcmp

Example 1: View Compressed File and Uncompress with zcat

Compressing a file using gzip creates a compressed file with *.gz extension. You can view a compressed file with zcat with the following way. Which would be as same as the uncompressed file operation ‘cat filename’. zcat uncompresses the file and shows it in the stdout.

$ zcat filename.gz | more

$ ls -l big-file.*
-rw-r--r-- 1 ramesh ramesh 24853275 May 9 15:14 big-file.txt

$ gzip big-file.txt
[Note: Compress the file]

$ ls -l big-file.*
-rw-r--r-- 1 ramesh ramesh 9275204 May 9 15:14 big-file.txt.gz

$ zcat big-file.txt.gz
[Note: View the file without uncompressing it]

zcat big-file.txt.gz > big-file.txt
[Note: Uncompress the file]

Example 2: View a gzipped file which don’t have the gz suffix.

You can uncompress a gzipped file which don’t have the gz suffix. If you try to uncompress a gzipped file which don’t have the gz suffix with “gunzip” or “gzip -d” command you will face the following error.

gunzip: auth.log: unknown suffix -- ignored

But this zcat will uncompress the file and shows the content as shown below.

$ cat > test-file.txt
This is a test file used for gunzip and zcat testing

zcat is awesome command.

$ gzip test-file.txt

$ mv test-file.txt.gz test-file-no-ext

$ gzip -d test-file-no-ext
gzip: test-file-no-ext: unknown suffix -- ignored

$ zcat test-file-no-ext
This is a test file used for gunzip and zcat testing

zcat is awesome command.

Example 3: Display the file content without worrying about whether it is compressed or not

When you are not sure whether a file is compressed or not, you can still view the file without worrying about it’s compression status as shown below.

In this example, If the input-file is compressed zcat will display the content by uncompressing it. If the input-file is not compressed zcat will display the content as it is.

$ zcat -f input-file

Example 4: Paging the compressed file with zless / zmore.

You can paginate a compressed file with zless command or zmore command as shown below.

$ zcat filename.gz | more
$ zcat filename.gz | less

(or)

$ zless filename.gz
$ zmore filename.gz


Note: To open any kind of file type, refer to our previous article Open & View 10 Different File Types with Linux Less Command – The Ultimate Power of Less.

Example 5: Searching inside the compressed file with zgrep / zegrep.

You can search inside a compressed file with zgrep / zegrep as shown below. This would be as same as the uncompressed file operation ‘grep -i filename’. All the options to the zgrep command will be passed to grep, and the file will be fed to grep command. It may uncompress and feed the file to grep command if needed.

$ cat > test-file.txt
gzip, gunzip, zcat - compress or expand files
zless - file perusal filter for crt viewing of compressed text
zcmp, zdiff - compare compressed files

$ grep -i less test-file.txt
zless - file perusal filter for crt viewing of compressed text

$ gzip test-file.txt

$ zgrep -i less test-file.txt.gz
zless - file perusal filter for crt viewing of compressed text


Note: Become familiar with the grep command by reading our earlier article Get a Grip on the Grep! – 15 Practical Grep Command Examples.

Example 6: Comparison of file using zdiff / zcmp

You can compare two compressed files with zdiff / zcmp as shown below. This would be same as the uncompressed file operation ‘diff file1 file2′.

$ cat > file1.txt
This is line one
This is line two

$ cat > file2.txt
This is line 1
This is line two

$ diff file1.txt file2.txt
1c1
<> This is line 1

$ gzip file1.txt file2.txt

$ zdiff file1.txt.gz file2.txt.gz
1c1
<> This is line 1

2011年3月23日星期三

unix commands - how to remove write-protected

# here code for remove write-protected files

rm -f -r myfilefolder

2011年3月22日星期二

invoke unix command in R and using paste() function

(1) run Unix commands which has quotations ("") in it,
# in unix you can run the code here
awk '{print $1 "\t" $2 "\t" $3}' Bak-2.vcf | head -n3

# in R, you need to run like
system("awk '{print $1 \"\t\" $2 \"\t\" $3}' Bak-2.vcf | head -n3")

# here you need to use symbol (\) which can conserve quotations (") doing like the normal quotations

(2) paste()
# a question: how to generate a unix command like "awk '{print $1 "\t" $2 "\t" $3}' Bak-2.vcf | head -n3" with paste()?

# answer is
paste("awk '{print $1 \"\t\" $2 \"\t\" $3}' Bak-2.vcf | head -n3", sep ="")
# it should work, though it looks not exactly same with the command you use in unix terminal. There is " \" in the output of paste().

"awk '{print $1 \"\t\" $2 \"\t\" $3}' Bak-2.vcf | head -n3"

# it just can be used in system() function to invoke unix.

2011年3月21日星期一

awk and unix scripting

[1]

UNIX Scripting


http://www.tek-tips.com/threadminder.cfm?pid=822

[2]

AWK Forum

http://www.tek-tips.com/threadminder.cfm?pid=271

access shell variables in awk

[1] http://www.tek-tips.com/faqs.cfm?fid=1281

You can use any of the following 3 methods to access shell variables inside an awk script ...

1. Assign the shell variables to awk variables after the body of the script, but before you specfiy the input

awk '{print v1, v2}' v1=$VAR1 v2=$VAR2 input_file

Note: There are a couple of constraints with this method;
- Shell variables assigned using this method are not available in the BEGIN section
- If variables are assigned after a filename, they will not be available when processing that filename ...

e.g.

awk '{print v1, v2}' v1=$VAR1 file1 v2=$VAR2 file2

In this case, v2 is not available to awk when processing file1.

2. Use the -v switch to assign the shell variables to awk variables. This works with nawk, but not with all flavours of awk. On my system (Solaris 2.6) -v cannot be used with /usr/bin/awk but will work with /usr/xpg4/bin/awk.

nawk -v v1=$VAR1 -v v2=$VAR2 '{print v1, v2}' input_file

3. Protect the shell variables from awk by enclosing them with "'" (i.e. double quote - single quote - double quote).

awk '{print "'"$VAR1"'", "'"$VAR2"'"}' input_file


[2] http://zvfak.blogspot.com/search/label/awk

If you have a shell variable in a bash script you can't pass it to AWK just by putting "$" sign in front of it, but you can enclose them with "'" in AWK code and they will be used in AWK with no problem.

for example you have a bed file called "example.bed":

$ cat example.bed
chr1 1000 2000 id1
chr1 4000 5000 id2
chr1 5500 6000 id3

Let's say you want to concatenate a string (in this case "brain_" string) to column 4 of this file, you can do this in AWK as follows:

$ awk '{OFS="\t";$4="brain_"$4; print;}' example.bed
chr1 1000 2000 brain_id1
chr1 4000 5000 brain_id2
chr1 5500 6000 brain_id3

however if you store the string in a variable as follows in the terminal or in a bash script:

$ TISSUE="brain_"

the following will not work,
$ awk '{OFS="\t";$4=$TISSUE$4; print;}' example.bed

but this will :

$ awk '{OFS="\t";$4="'"$TISSUE"'"$4; print;}' example.bed
chr1 1000 2000 brain_id1
chr1 4000 5000 brain_id2
chr1 5500 6000 brain_id3

learning regular expression easily

http://www.regular-expressions.info/reference.html

powerful tools for recombination

(1)A New Method to Reconstruct Recombination Events at a Genomic Scale

http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1001010

(2) Laxmi Parida
http://www.cs.nyu.edu/~parida/res/res.html

2011年3月20日星期日

unix commands - display / read / print out a specific line of a file

# line number is the number of line you want
sed -n "${line_number}p" $filename

# here the number 2 can be substitute with the number of line you want,
# tail -1 control how to print out the line, lines around the specific one.
cat filenames | head -2 | tail -1
head -2 filenames | tail -1
head -n 2 filenames | tail -1
sed -n 2p filenames

# print out a range of lines by line-numbers
awk 'FNR>=20 && FNR<=40' filename awk 'FNR>=101264075 && FNR<=101264080' filename

# zcat for displaying the compressed file
zcat filename | head -101264077 | tail -5

zcat filename | head -99327817 | tail -5

# print the 5th line of file file1.txt
sed -n '5 p' file1.txt

2011年3月19日星期六

uncompress .tar.bz2 files

# it works for me

bunzip2 -c archive.tar.bz2 | tar -xvf -

R plot in ZIM desktop wiki

(1) the blog
http://pvanb.wordpress.com/2011/03/04/r-plot-in-zim-desktop-wiki/

(2) zim
http://www.zim-wiki.org/index.html

manuals in UC riverside - instructive for bioinformatician

http://manuals.bioinformatics.ucr.edu/



http://manuals.bioinformatics.ucr.edu/home/ht-seq#TOC-Rsamtools-SNP-Calling

a study on metabolic network


Copy number alterations among mammalian enzymes cluster in the metabolic network

http://mbe.oxfordjournals.org/content/early/2010/11/03/molbev.msq296.abstract

Gephi - for all kinds of networks and complex systems, dynamic and hierarchical graphs

Gephi, is an interactive visualization and exploration platform for all kinds of networks and complex systems, dynamic and hierarchical graphs. It seems rather feature rich, with built in connectors to database systems, extensive graph coloring, layout, and rendering features, and several analysis tools. 复杂问题作图,有数据库接口。

(1)
Runs on Windows, Linux and Mac OS X. Gephi is open-source and free.

http://gephi.org/

(2) an interactive example
http://diseasome.eu/map.html

(3)
M. Hahsler, S. Chelluboina: Visualizing Association Rules: Introduction to the R-extension Package arulesViz

2011年3月18日星期五

pint

pint

Pairwise INTegration of functional genomics data

http://bioconductor.org/packages/2.7/bioc/html/pint.html

pretty-r: highlight your R code

pretty-R:
http://www.inside-r.org/pretty-r/tool

file.show(), page() and more()

(1) page{utils}
Displays a representation of the object named by x in a pager via file.show.

(2) file.show {base}
Display one or more files.
file.show(file.path(R.home("doc"), "COPYRIGHTS")
(3) more function
http://www.imachordata.com/?p=645

2011年3月17日星期四

on recombination estimating

Two tools here for estimation of recombination:

1. RDP
http://darwin.uvigo.es/rdp/rdp.html
Martin DP, Lemey P, Lott M, Moulton V, Posada D, Lefeuvre P. (2010). RDP3: a flexible and fast computer program for analyzing recombination. Bioinformatics 26, 2462-2463.

2. IRiS algorithm

it is not a program

http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1001010

Recombination brings together DNA sequences that can be very distantly related, and, thus, quite different from each other. This is often cited as a main hurdle for using recombining regions (that is, most of the genome) to reconstruct sequence phylogeny. We have turned this argument around: chromosomes carrying a similar change in sequence pattern are likely to be descendants of the same recombination event, and thus, related. We have devised an algorithm that detects such changes in sequence patterns and identifies the descendants of a recombination event. After some fine-tuning, we have applied it to sequence data in several human populations and have found that recombination events recapitulate the history of these populations. This opens the possibility of adding recombination to the current allele-based analysis of population structure and history. Our method also provides a tool for the genomic analysis of recombination, both because it pinpoints recombination events rather than just estimating recombination rates, and because, being biased towards more recent events, it can offer a glimpse of the fast evolution of recombination.

Plant Interactome Project

http://interactome.dfci.harvard.edu/index.php?page=home

GeneAnswers - A collection of bioconductor methods to visualize gene-list annotations

GeneAnswers, integrates GO, DO, KEGG, and caBIO(NCI, Reactome and Biocarta) as well as gene interaction and entrez keywords search for gene set enrichment analysis. Finally, it can automatically generate a html report for one or multiple groups of gene set enrichment analysis with interactive networks.

http://www.bioconductor.org/packages/2.5/bioc/html/GeneAnswers.html

http://www.biomedcentral.com/1756-0500/3/10/abstract

SPIA, a package that seems to provide sophisticated analysis of pathways (including both, gene set enrichment analysis and pathway perturbation analysis)

http://bioinformatics.oxfordjournals.org/content/25/1/75.abstract

http://www.bioconductor.org/packages/devel/bioc/html/SPIA.html

2011年3月15日星期二

predict Amino Acid exchanges using the NGS genomic data

由碱基变异预测氨基酸变异。


Here are a few standalone tools:

http://snpeff.sourceforge.net/

http://www.openbioinformatics.org/annovar/

ftp://ftp.ensembl.org/pub/misc-scripts/Variant_effect_predictor_1.0/

http://sift.jcvi.org/

SyMAP v3.4: a turnkey synteny system with application to plant genomes

http://nar.oxfordjournals.org/content/early/2011/03/11/nar.gkr123.full?sid=439ab412-7f3c-45ea-9e86-4ff1d0e92a77

compute synteny blocks between a pair of sequenced genomes.
计算一对基因组间的共线性区域。有很好的可视化功能。

2011年3月14日星期一

Adam Auton

Adam, has done much in estimating recombination rate and RR hotspot by using population genomic data.

http://www.adamauton.com/research.shtml

hzAnalyzer - detection, quantification, and visualization of contiguous homozygosity in high-density genotyping datasets

http://genomebiology.com/2011/12/3/R21/abstract

hzAnalyzer, 一个新R package,用来预测、评价、可视化基因组中的连续同源区。

Molecular footprints of local adaptation in two Mediterranean conifers

http://www.ncbi.nlm.nih.gov/pubmed/20656795

they correlated the snps variation at candidate genes and climatic variables. But, in my view, it is not the good correlated they implemented in this study. They should use Bayesian-based tech, it is more robust.

2011年3月11日星期五

Genome-wide patterns of population structure and admixture among populations

Genome-wide patterns of population structure and admixture among Hispanic/Latino populations

PNAS May 5, 2010
http://www.pnas.org/content/early/2010/05/05/0914618107.abstract

作者用了非beyesian 方法来找群体遗传结构。

http://onlinelibrary.wiley.com/doi/10.1002/gepi.20064/abstract

Structure. We used the software FRAPPE, which implements an
expectation-maximization algorithm for estimating individual membership
in clusters (24)

EIGENSOFT

EIGENSOFT -
The EIGENSOFT package combines functionality from our population genetics methods (Patterson et al. 2006) and our EIGENSTRAT stratification method (Price et al. 2006).
群体遗传结构与主成分分析

http://genepath.med.harvard.edu/~reich/Software.htm

characterize the selective sweeps in recent human evolution

Classic Selective Sweeps Were Rare in Recent Human Evolution

http://www.sciencemag.org/content/331/6019/920.abstract

文中详细对比了,选择在基因组中的分化。并把它和exons,nocoding,amino acid subsitutions 联系起来。

Efforts to identify the genetic basis of human adaptations from polymorphism data have sought footprints of “classic selective sweeps” (in which a beneficial mutation arises and rapidly fixes in the population).Yet it remains unknown whether this form of natural selection was common in our evolution. We examined the evidence for classic sweeps in resequencing data from 179 human genomes. As expected under a recurrent-sweep model, we found that diversity levels decrease near exons and conserved noncoding regions. In contrast to expectation, however, the trough in diversity around human-specific amino acid substitutions is no more pronounced than around synonymous substitutions. Moreover, relative to the genome background, amino acid and putative regulatory sites are not significantly enriched in alleles that are highly differentiated between populations. These findings indicate that classic sweeps were not a dominant mode of human adaptation over the past ~250,000 years.

meta-analysis - not so far from you

Getting started with meta-analysis

http://onlinelibrary.wiley.com/doi/10.1111/j.2041-210X.2010.00056.x/abstract

Bayesian SDM - more things we can touch

Fine-scale environmental variation in species distribution modelling: regression dilution, latent variables and neighbourly advice

http://onlinelibrary.wiley.com/doi/10.1111/j.2041-210X.2010.00077.x/full

Conclusion:

We then show how applying our correction to multiple co-occurring species simultaneously increases the accuracy of parameter estimates for each species, as well as estimates for the true environment at each survey plot – a phenomenon we call ‘neighbourly advice’. With a sufficient number of species, the estimates of the true environment at each plot can become extremely accurate.

Our correction for regression dilution could be integrated with models addressing other issues in SDM, e.g. biotic interactions and/or spatial dynamics. We suggest that Bayesian analysis, as employed here to address uncertainty in predictor variables, might offer a flexible toolbox for developing such next-generation species distribution models.

2011年3月10日星期四

gtools - good tools of manipulating data

like gplots good at ploting, gtools is very good at data manipulation.

here is the example,
http://learnr.wordpress.com/2009/05/31/gtools-usefuldata-manipulation-functions/

books in Chinese and English

1.

女士品茶:20世纪统计怎样变革了科学

The lady tasting tae - how statisticians revolutionized science in the twentieth centry
http://www.book118.com/eshu/4/46/79679.html

2.

生命乐章
The music of life -
http://www.wl.cn/6620763

adapting starting values in MCMC using paste()

here is the tip on this topic:

http://jointposterior.blogspot.com/2011/01/quickly-adapt-starting-values-in-mcmc.html

tables in heatmap - good functionality in gplots package

as showed in the blog:

http://socialdatablog.com/heatmap-tables

Now, it is not difficult to display tables in heatmap.

Also, we do need to remember gplots package.

germane

BioStar is a site for asking, answering and discussing bioinformatics questions. We are members of the community and find it very useful. Often questions and answers arise at BioStar that are germane to our readers (end users of genomics resources).

germane /dʒɜːˈmeɪn/ DJ listen /dʒɜːrˈm-/ DJ US listen /dʒɝ'men/ KK US
of ideas, remarks, etc. 想法、言语等 connected with something in an important or appropriate way 与…有密切关系;贴切;恰当 adjective not usually before noun ~ (to sth) formal

2011年3月8日星期二

GenABEL: an R library for Genome-wide association analysis

GenABEL
http://mga.bionet.nsc.ru/~yurii/ABEL/GenABEL/

2011年3月7日星期一

variogram

variogram

  1. 变量图 SIASlit Island Analysis剖面位形法以及自仿射的变量图Variogram 法 这些方法量测的表 166.111.121.20 - Related search
  2. 变差函数 variance 方差 variancecovariance matrix 方差协方差矩阵 variogram 变差函数 www.geochina.net - Related search
  3. 方差图 variogram 方差图 variograph 变差图变压计 variohm 可变电阻器 variole 球颗 www.oilnews.com.cn - Related search

2011年3月6日星期日

Choropleth (等值线图,等高线图) - useful to visualize spatial trends

Here, websites shows how to plot Choropleth:

http://rstats.posterous.com/choropleths-made-easy

http://offensivepolitics.net/blog/category/rstats/

http://media.apps.chicagotribune.com/election-maps/mayor-2011/index.html#41.85358024690839,-87.63826971435547,11

notes on editors in Unix

1. cat
cat (concatenate),
(1) for displaying the content of a text file
(2) redirecting the output of command into a file by using >

press enter let you begin a new line.
press ctrl + D let you end up the file.

2.

notes on awk

# find
find ~ -name "yourfile.txt"

"~" the shortcut for the path to your home directory

2011年3月3日星期四

drawing a haplotype network with HaploNet

drawing a haplotype network with HaploNet

Here, examples

library(pegas)
example(haploNet)

good tips for Unix

(1) http://freeengineer.org/learnUNIXin10minutes.html

(2) http://www.cs.usfca.edu/~parrt/course/601/lectures/unix.util.html

(3) SHELLdorado - your UNIX shell scripting resource

http://www.shelldorado.com/links/index.html#tutorials

sed for you and me

SED, really a powerful tool in Unix. I just begin at learning it.

here, a good tutorial
http://www.grymoire.com/Unix/Sed.html

extract columns from or subset by columns of a big file

here two means:

(1). cut
# extract several columns from a tab-delimited file
# You can try to place a tab between the quotes if you first press " v" then the "" key:
cut -f 1-10 -d "v "

# extract column 1 and column 3 from big_file to a subset_file
cat big_file | cut -f 1,3 -d " " > subset_file

(2). awk
# extract column 1 and 3 with all cell including M of big_file to subset_file
cat Agu-1_fv.txt | grep M | awk '{ print $1, $3}' > subset_file

2011年3月2日星期三

windows softwares installed in Mac OS - DIVA GIS as an example

here for reference to install Windows softwares in Mac OS

http://www.diva-gis.org/diva_on_osx

2011年3月1日星期二

tables,

#################################################################

xtabs

Cross Tabulation

Description

Create a contingency table (optionally a sparse matrix) from cross-classifying factors, usually contained in a data frame, using a formula interface.


####################################################################

table {base} R Documentation
Cross Tabulation and Table Creation

Description

table uses the cross-classifying factors to build a contingency table of the counts at each combination of factor levels.