Thursday, April 28, 2011

print out R vector object as sequence

# example code
x<-1:10
paste(x, sep="", collapse=",")

# output
# Note: paste() with collapse prints the vector as one comma-separated string,
# whereas printing x directly gives R's default display. Compare the two:
> paste(x, sep="", collapse=",") 
[1] "1,2,3,4,5,6,7,8,9,10"
> x 
[1]  1  2  3  4  5  6  7  8  9 10
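
The difference between the sep and collapse arguments is easy to mix up, so here is a small sketch (the "id" prefix is only an illustrative value, not part of the example above):

```r
x <- 1:3
# sep joins the arguments element by element; the result is still a vector
by_sep <- paste("id", x, sep = "_")
# collapse then joins the elements of that vector into one single string
by_collapse <- paste("id", x, sep = "_", collapse = ",")
print(by_sep)
print(by_collapse)
```

So sep alone has no visible effect when only one vector is passed, which is why collapse does the real work in the example above.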
 
##########################################
# Rvector() returns a string of R code that recreates the vector
Rvector = function(x) {
  out = paste("c(",paste(x, sep="", collapse=","),")",sep="")
  cat("\n",out,"\n","\n")
  return(out)
}

# Rmatrix() does the same for a matrix, keeping its dimensions
Rmatrix = function(mat) {
  out = paste("matrix(", Rvector(as.matrix(mat)), "," ,nrow(mat), ",", ncol(mat), ")", sep="")
  cat("\n",out,"\n","\n")
  return(out)
}

moving average - like sliding window calculation of mean

http://rforcancer.drupalgardens.com/content/ggheat-ggplot2-style-heatmap-function
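
As a rough sketch of the idea (not taken from the linked page), a sliding-window mean can be computed in base R with stats::filter(). The helper name moving_average and the window size k are assumptions made for illustration:

```r
# Hypothetical helper: centred moving average over a window of k values
moving_average <- function(x, k = 3) {
  # stats::filter() applies equal weights 1/k across each window;
  # positions without a full window come back as NA
  as.numeric(stats::filter(x, rep(1 / k, k), sides = 2))
}

x <- c(1, 2, 3, 4, 5, 6)
ma <- moving_average(x, k = 3)
print(ma)
```

Using sides = 1 instead would average over the current and previous values only, which is the usual choice for time series.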

ggheat - a ggplot2 style heatmap function

http://rforcancer.drupalgardens.com/content/ggheat-ggplot2-style-heatmap-function

gnmplot - between the biomaRt and ggplot2 packages

gnmplot is a new package that creates publication-quality graphics of genome information. It is really an adaptor sitting between the biomaRt and ggplot2 packages.

http://rforcancer.drupalgardens.com/content/gnmplot

Creating Repetitive Reports

http://learnr.wordpress.com/2009/09/09/brew-creating-repetitive-reports/

determine the number of cluster

http://blog.echen.me/2011/03/14/counting-clusters/

Wednesday, April 27, 2011

find the full path for a specific file name

Finding the full path for a specific file name in a folder is simple and useful.

# list the full path of the file named "mao_test.txt", searching from the current folder
find . -name 'mao_test.txt'

# list the full path of the file named "mao_test.txt", searching from the root folder (the whole computer)
find / -type f -name 'mao_test.txt'
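
The same search can be sketched from inside R with list.files(). The scratch folder below is created only so that the example is self-contained; its name is an assumption:

```r
# Create a scratch folder with a nested file to search for (illustrative names)
root <- file.path(tempdir(), "find_demo")
dir.create(file.path(root, "sub"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(root, "sub", "mao_test.txt"))

# recursive = TRUE descends into subfolders; full.names = TRUE returns paths.
# pattern is a regular expression, so anchor it and escape the dot.
hits <- list.files(root, pattern = "^mao_test\\.txt$",
                   recursive = TRUE, full.names = TRUE)
print(hits)
```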

Monday, April 25, 2011

Tools for visualizing overlap between GO terms

http://biostar.stackexchange.com/questions/1278/tools-for-visualizing-overlap-between-go-terms

Tool to generate proportional Venn diagrams

Several such tools are listed here:
http://biostar.stackexchange.com/questions/7736/tool-to-generate-proportional-venn-diagrams

VennDiagram: a package for the generation of highly-customizable Venn and Euler diagrams in R

http://www.biomedcentral.com/1471-2105/12/35 

 

 

Saturday, April 23, 2011

Bayesian inference in ecology

Ellison, A.M. (2004) Bayesian inference in ecology. Ecology Letters, 7, 509-520.

A friend left a message about my publication and me. Thanks for the best wishes.

Very good!

Whatever is pursued with persistence
will surely be achieved!

Whatever is beautiful
will surely shine!


http://www.planta.cn/forum/viewtopic.php?t=25503

avoiding statistical problems - A protocol for data exploration to avoid common statistical problems

http://onlinelibrary.wiley.com/doi/10.1111/j.2041-210X.2009.00001.x/abstract

Summary

1. While teaching statistics to ecologists, the lead authors of this paper have noticed common statistical problems. If a random sample of their work (including scientific papers) produced before doing these courses were selected, half would probably contain violations of the underlying assumptions of the statistical techniques employed.
2.  Some violations have little impact on the results or ecological conclusions; yet others increase type I or type II errors, potentially resulting in wrong ecological conclusions. Most of these violations can be avoided by applying better data exploration. These problems are especially troublesome in applied ecology, where management and policy decisions are often at stake.
3.  Here, we provide a protocol for data exploration; discuss current tools to detect outliers, heterogeneity of variance, collinearity, dependence of observations, problems with interactions, double zeros in multivariate analysis, zero inflation in generalized linear modelling, and the correct type of relationships between dependent and independent variables; and provide advice on how to address these problems when they arise. We also address misconceptions about normality, and provide advice on data transformations.
4.  Data exploration avoids type I and type II errors, among other problems, thereby reducing the chance of making wrong ecological conclusions and poor recommendations. It is therefore essential for good quality management and policy based on statistical analyses.

Wednesday, April 20, 2011

plot network in R - ggplot2 and qgraph

tips of ggplot2:
 http://r-ecology.blogspot.com/2011/03/basic-ggplot2-network-graphs-ver2.html


qgraph:
https://sites.google.com/site/qgraphproject/home

Creating graphs where the nodes are images

You can pick an image and use it as a node in an R plot. If you want to make striking figures, this topic is appealing:

http://stackoverflow.com/questions/4975681/r-creating-graphs-where-the-nodes-are-images/4978111#4978111

correlation network

 the theory:
http://en.wikipedia.org/wiki/Graph_%28mathematics%29

the blog:
http://www.investuotojas.eu/?p=464
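
A minimal sketch of how a correlation matrix can be turned into a graph, using base R only. The toy data, the variable names and the 0.5 cutoff are all assumptions for illustration and are not taken from the linked blog:

```r
set.seed(1)
# Toy data: variables a and b are strongly correlated, c is independent
a <- rnorm(100)
dat <- data.frame(a = a, b = a + rnorm(100, sd = 0.1), c = rnorm(100))

r <- cor(dat)          # pairwise Pearson correlations
adj <- abs(r) > 0.5    # assumed cutoff: keep an edge when |r| > 0.5
diag(adj) <- FALSE     # no self-loops

# Edge list: one row per pair of connected variables
idx <- which(adj & upper.tri(adj), arr.ind = TRUE)
edges <- data.frame(from = rownames(adj)[idx[, "row"]],
                    to   = colnames(adj)[idx[, "col"]])
print(edges)
```

The adjacency matrix or the edge list can then be passed to a plotting package of your choice.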

knowledgeblog will change the academic publishing process

knowledgeblog has put forward a blog-based strategy for academic publication. That is very interesting and attractive, and I would like to take part.

However, it looks like it is only just beginning; there are no real cases or examples there yet.

http://knowledgeblog.org/

generating animation in R

http://eigensomething.blogspot.com/2011/01/video-from-kaggle-traffic-prediction.html
http://blog.revolutionanalytics.com/2009/06/animate-r-graphics-with-flash.html
http://yihui.name/en/2009/06/creating-tag-cloud-using-r-and-flash-javascript-swfobject/

Flash Tip: Embedding Your SWF in a Web Page

http://animation.about.com/od/flashanimationtutorials/qt/embedswfwebpage.htm

Tuesday, April 19, 2011

detecting population structure by PCA

(1) several papers:

A genealogical interpretation of principal components analysis 

http://www.ncbi.nlm.nih.gov/sites/entrez/19834557?dopt=Abstract&holding=f1000,f1000m,isrctn

Genome-wide patterns of population structure and admixture in West Africans and African Americans

http://www.ncbi.nlm.nih.gov/sites/entrez/20080753?dopt=Abstract&holding=f1000,f1000m,isrctn

Analysis of population structure: a unifying framework and novel methods based on sparse factor analysis

http://www.ncbi.nlm.nih.gov/sites/entrez/20862358?dopt=Abstract&holding=f1000,f1000m,isrctn


(2) adegenet


- a review of applications of multivariate analyses to genetic markers data:


Jombart T, Pontier D, Dufour AB (2009) Genetic markers in the playground of multivariate analysis. Heredity 102: 330-341. doi:10.1038/hdy.2008.130


- the paper presenting the spatial principal component analysis (sPCA, function spca), global and local tests (global.rtest and local.rtest):
Jombart T, Devillard S, Dufour AB, Pontier D (2008) Revealing cryptic spatial patterns in genetic variability by a new multivariate method. Heredity 101: 92-103. doi:10.1038/hdy.2008.34


- the paper presenting the SeqTrack algorithm (seqTrack), and simulations of genealogies of haplotypes (haploGen):


Jombart T, Eggo RM, Dodd PJ, Balloux F (2010) Reconstructing disease outbreaks from genetic data: a graph approach. Heredity. doi:10.1038/hdy.2010.78

- the paper introducing the Discriminant Analysis of Principal Components (DAPC, functions find.clusters and dapc):

Jombart T, Devillard S, Balloux F (2010) Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. BMC Genetics 11:94. doi:10.1186/1471-2156-11-94
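
The basic idea of detecting population structure by PCA can be sketched with base R's prcomp() on simulated genotypes. This is only a toy illustration: it does not use adegenet, and the sample sizes and allele frequencies are assumptions chosen to make the two groups obvious.

```r
set.seed(42)
n_ind <- 20   # individuals per population (assumed toy size)
n_snp <- 50   # number of SNPs

# Genotypes coded as 0/1/2 copies of an allele; the two simulated
# populations have very different allele frequencies (0.1 vs 0.9)
pop1 <- matrix(rbinom(n_ind * n_snp, 2, 0.1), nrow = n_ind)
pop2 <- matrix(rbinom(n_ind * n_snp, 2, 0.9), nrow = n_ind)
geno <- rbind(pop1, pop2)
pop <- rep(c("pop1", "pop2"), each = n_ind)

# Centre each SNP, then run PCA; PC1 should split the two groups
pca <- prcomp(geno, center = TRUE, scale. = FALSE)
pc1 <- pca$x[, 1]
print(tapply(pc1, pop, mean))
```

With real data the separation is rarely this clean, which is where the methods in the papers above come in.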

 

Monday, April 18, 2011

Environmental factors predict adaptive phenotypic differentiation within and between two wild andean tomatoes

the article
Environmental factors predict adaptive phenotypic differentiation within and between two wild andean tomatoes

http://onlinelibrary.wiley.com/doi/10.1111/j.1558-5646.2008.00332.x/full

the lab
http://sites.bio.indiana.edu/~moylelab/publications.html

two books of introductory bayesian statistics

As recommended by this blogger.
http://telliott99.blogspot.com/search/label/bayes

(1) a book by Dennis Lindley entitled Understanding Uncertainty

(2) for a further understanding of Bayesian methods: William Bolstad, Introduction to Bayesian Statistics.

generate invalid variable names

http://4dpiecharts.com/

By wrapping it in single quotes, you can create a syntactically invalid variable name in R. Keep this in mind; it can be useful.
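
A short sketch of the trick; the variable names here are made up for illustration:

```r
# Backticks (or quotes on the left of an assignment) allow names that
# would otherwise be syntactically invalid, such as ones containing spaces
`my var` <- 3
"2nd value" <- 5

# Backticks or get() are needed to read them back
total <- `my var` + get("2nd value")
print(total)

# make.names() shows what R would regard as the valid equivalent
print(make.names(c("my var", "2nd value")))
```

This is handy when column names come from a file header that R would otherwise mangle.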

Rcpp and compiler - speed up R in R-2.13.0

See this post for an introduction:

http://dirk.eddelbuettel.com/blog/2011/04/12/#the_new_r_compiler_package

About the update to R-2.13.0:

http://yihui.name/cn/2011/04/r-updated-to-2-13-0/#more-1970
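
A minimal sketch of the compiler package in use. The function sum_squares is an invented example; note that recent R versions byte-compile functions automatically, so the speed gain from calling cmpfun() by hand may be small there.

```r
library(compiler)  # ships with R from version 2.13.0

# A deliberately loop-heavy function (illustrative only)
sum_squares <- function(n) {
  total <- 0
  for (i in seq_len(n)) total <- total + i^2
  total
}

# cmpfun() returns a byte-compiled version of the same function
sum_squares_c <- cmpfun(sum_squares)

print(sum_squares(1000))
print(sum_squares_c(1000))
```

Both versions return the same value; wrapping each call in system.time() is one way to compare their speed.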

testing for different rates of continuous trait evolution - r8s and Brownie

O'Meara, B.C., C. Ané, M.J. Sanderson, and P.C. Wainwright. 2006. Testing for different rates of continuous trait evolution using likelihood. Evolution 60(5): 922-933.
http://www.brianomeara.info/publications

Rates of phenotypic evolution have changed throughout the history of life, producing variation in levels
of morphological, functional, and ecological diversity among groups. Testing for the presence of these rate shifts is
a key component of evaluating hypotheses about what causes them. In this paper, general predictions regarding changes
in phenotypic diversity as a function of evolutionary history and rates are developed, and tests are derived to evaluate
rate changes. Simulations show that these tests are more powerful than existing tests using standardized contrasts.
The new approaches are distributed in an application called Brownie and in r8s

UNICODE characters

The Adobe Symbol Encoding

http://www.stat.auckland.ac.nz/~paul/R/CM/AdobeSym.html

Drop unused factor levels

When you subset a data.frame object, factor columns keep levels that are no longer used. This blog post explains how to drop them.

http://quantitative-ecology.blogspot.com/2008/02/drop-unused-factor-levels.html
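
One common way to handle this, not necessarily the approach of the linked post, is droplevels(). A small sketch with made-up data:

```r
df <- data.frame(group = factor(c("a", "b", "c", "a")), value = 1:4)

# Subsetting keeps all original levels, even those no longer present
sub <- df[df$group != "c", ]
print(levels(sub$group))

# droplevels() (available since R 2.12.0) removes the unused levels;
# factor(sub$group) has the same effect on a single column
sub <- droplevels(sub)
print(levels(sub$group))
```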

Sunday, April 17, 2011

some simple plots on SNPs and Indels using ggplot2








Good guide on methods to find orthologous genes of a species

What is the best method to find orthologous genes of a species?

http://biostar.stackexchange.com/questions/7591/what-is-the-best-method-to-find-orthologous-genes-of-a-species

DAVID - an easy way to go from gene lists to functional information

The Database for Annotation, Visualization and Integrated Discovery (DAVID) v6.7 is an update to the sixth version of our original web-accessible programs. DAVID now provides a comprehensive set of functional annotation tools for investigators to understand the biological meaning behind large lists of genes. For any given gene list, DAVID tools are able to:


Identify enriched biological themes, particularly GO terms
Discover enriched functional-related gene groups
Cluster redundant annotation terms
Visualize genes on BioCarta & KEGG pathway maps
Display related many-genes-to-many-terms on 2-D view.
Search for other functionally related genes not in the list
List interacting proteins
Explore gene names in batch
Link gene-disease associations
Highlight protein functional domains and motifs
Redirect to related literatures
Convert gene identifiers from one type to another.
And more


http://david.abcc.ncifcrf.gov/summary.jsp

Numerous public sources of protein and gene annotation have been parsed and integrated into DAVID 6.7. DAVID 6.7 contains information on over 1.5 million genes from more than 65,000 species. A list of protein or gene identifiers can be uploaded all at once to extract and summarize functional annotation associated with group of genes or with each individual gene. Data can be displayed in chart or table format or downloaded to the user's hard drive.

Saturday, April 16, 2011

Parallel Processing - introduction for R users

Here is a good introduction to parallel processing for R users:

http://www.stat.umn.edu/~charlie/parallel/

I found the batch processing section of this introduction really valuable.

I copy some lines from it here:
(1)
This is really old stuff (from 1975). But not everyone knows it.
If you do the following at a unix prompt
nohup nice -n 19 some job &
where "some job" is replaced by an actual job, then
the job will run in background (because of &).
the job will not be killed when you log out (because of nohup).
the job will have low priority (because of nice -n 19).
(2)
For example, if foo.R is a plain text file containing R commands,
then
nohup nice -n 19 R CMD BATCH --vanilla foo.R &
executes the commands and puts the printout in the file foo.Rout.
And
nohup nice -n 19 R CMD BATCH --no-restore foo.R &
executes the commands, puts the printout in the file foo.Rout,
and saves all created R objects in the file .RData.
(3)
nohup nice -n 19 R CMD BATCH foo.R &
is a really bad idea! It reads in all the objects in the file .RData (if
one is present) at the beginning. So you have no idea whether
the results are reproducible.
Always use --vanilla or --no-restore except when debugging.
(4)
This idiom has nothing to do with R. If foo is a compiled C or
C++ or Fortran main program that doesn't have command line
arguments (or a shell, Perl, Python, or Ruby script), then
nohup nice -n 19 foo &
runs it. And
nohup nice -n 19 foo < foo.in > foo.out &
runs it taking input from the file foo.in and placing output in the
file foo.out. Regular output and error messages are interspersed
and not necessarily in order.
nohup nice -n 19 foo < foo.in > foo.out 2> foo.err &
puts the error messages in a separate file.
(5)
Don't omit the nice -n 19. If you omit it, and we notice it, you'll
be in trouble. Or if we got up on the wrong side of bed that
morning, we'll just kill your jobs.
(6)
We've got lots of computers, and each one has eight processors
(so eight jobs can run simultaneously).
That allows a lot of parallel processing without knowing anything
more than how to background a job.

circos - Hive Plots for genomics

I have mentioned Circos on this blog before.

http://mkweb.bcgsc.ca/circos/

Circos is a software package for visualizing data and information. It visualizes data in a circular layout — this makes Circos ideal for exploring relationships between objects or positions. There are other reasons why a circular layout is advantageous, not the least being the fact that it is attractive.

Circos is ideal for creating publication-quality infographics and illustrations with a high data-to-ink ratio, richly layered data and pleasant symmetries. You have fine control over each element in the figure to tailor its focus points and detail to your audience.

Hive Plots - Rational Network Visualization — Farewell to hairballs

http://mkweb.bcgsc.ca/linnet/

The hive plot is a rational visualization method for drawing networks. Nodes are mapped to and positioned on radially distributed linear axes based on network structural properties. Edges are drawn as curved links. Simple and interpretable.

The purpose of the hive plot is to establish a new baseline for network visualization. Unlike hairballs, hive plots are excellent at managing the visual complexity arising from large number of edges and exposing both trends and outlier patterns in network structure.
Summary

Network visualizations are notoriously difficult to interpret. Their canonical representation in a visual form is the so-called hairball, which can be accidentally informative, but cannot be relied upon to consistently reveal meaningful patterns.

VIZBI 2011

As biological data grows rapidly in volume and complexity, biologists rely increasingly on computational visualization to gain insight from data. VIZBI 2011 will bring together researchers developing and using computational tools to visualize data from genomes, transcripts, proteins, cells, organisms, and populations.

The VIZBI 2011 workshop will be held March 16-18 at the Broad Institute, Cambridge MA, USA. The workshop will review the state of the art and highlight current and future challenges in visualization across this broad range of biological research areas.

In addition to the workshop, there will be a VIZBI Art & Biology Evening on Thursday, March 17, and a tutorial day on Saturday 19 March 2011.

http://vizbi.org/2011/Posters/Collection/

iEvoBio

iEvoBio aims to be a forum bringing together biologists working in evolution, systematics, and biodiversity, with software developers, and mathematicians. The goal of iEvoBio is both to catalyse the development of new tools, and to increase awareness of the possibilities offered by existing technologies (ranging from standards and reusable toolkits to mega-scale data analysis to rich visualization). The meeting extends over two full days and features traditional elements, including a keynote presentation at the beginning of each day and contributed talks, as well as more dynamic and interactive elements, such as a challenge, lightning talk-style sessions, a software bazaar, and Birds-of-a-Feather gatherings.

http://ievobio.org/

Plotting images on a grid

http://stackoverflow.com/questions/4860417/plotting-images-on-a-grid

Thursday, April 14, 2011

checking the seed files of BSgenome data packages forged by the Bioconductor project

To make a BSgenome data package, you need to prepare a seed file. If you have no idea how to write one, checking the seed files of existing packages is a good way to start.

Here is some code from the manual of the BSgenome package that lets you inspect the seed files used for the packages forged by the Bioconductor project.

#############################################################################
# check the seed files from GentlemanLab
> library(BSgenome)
Loading required package: IRanges

Attaching package: 'IRanges'

The following object(s) are masked from 'package:base':

cbind, eval, intersect, Map, mapply, order, paste, pmax, pmax.int,
pmin, pmin.int, rbind, rep.int, setdiff, table, union

Loading required package: GenomicRanges
Loading required package: Biostrings
> seed_files<-system.file("extdata", "GentlemanLab", package="BSgenome")
> list.files(seed_files, pattern="-seed$")
[1] "BSgenome.Amellifera.BeeBase.assembly4-seed"
[2] "BSgenome.Amellifera.UCSC.apiMel2-seed"
[3] "BSgenome.Athaliana.TAIR.01222004-seed"
[4] "BSgenome.Athaliana.TAIR.04232008-seed"
[5] "BSgenome.Athaliana.TAIR.TAIR9-seed"
[6] "BSgenome.Btaurus.UCSC.bosTau3-seed"
[7] "BSgenome.Btaurus.UCSC.bosTau4-seed"
[8] "BSgenome.Celegans.UCSC.ce2-seed"
[9] "BSgenome.Celegans.UCSC.ce6-seed"
[10] "BSgenome.Cfamiliaris.UCSC.canFam2-seed"
[11] "BSgenome.Dmelanogaster.UCSC.dm2-seed"
[12] "BSgenome.Dmelanogaster.UCSC.dm3-seed"
[13] "BSgenome.Drerio.UCSC.danRer5-seed"
[14] "BSgenome.Drerio.UCSC.danRer6-seed"
[15] "BSgenome.Drerio.UCSC.danRer7-seed"
[16] "BSgenome.Ecoli.NCBI.20080805-seed"
[17] "BSgenome.Gaculeatus.UCSC.gasAcu1-seed"
[18] "BSgenome.Ggallus.UCSC.galGal3-seed"
[19] "BSgenome.Hsapiens.UCSC.hg17-seed"
[20] "BSgenome.Hsapiens.UCSC.hg18-seed"
[21] "BSgenome.Hsapiens.UCSC.hg19-seed"
[22] "BSgenome.influenza.NCBI.20100628-seed"
[23] "BSgenome.Mmusculus.UCSC.mm8-seed"
[24] "BSgenome.Mmusculus.UCSC.mm9-seed"
[25] "BSgenome.Ptroglodytes.UCSC.panTro2-seed"
[26] "BSgenome.Rnorvegicus.UCSC.rn4-seed"
[27] "BSgenome.Scerevisiae.UCSC.sacCer1-seed"
[28] "BSgenome.Scerevisiae.UCSC.sacCer2-seed"
> seed_files
[1] "/ebio/abt6/jmao/bin/R-2.13/Rpacks/BSgenome/extdata/GentlemanLab"
> rn4_seed<-list.files(seed_files, pattern="rn4", full.names=TRUE)
> cat(readLines(rn4_seed), sep="\n")
Package: BSgenome.Rnorvegicus.UCSC.rn4
Title: Rattus norvegicus (Rat) full genome (UCSC version rn4)
Description: Rattus norvegicus (Rat) full genome as provided by UCSC (rn4, Nov. 2004) and stored in Biostrings objects.
Version: 1.3.17
organism: Rattus norvegicus
species: Rat
provider: UCSC
provider_version: rn4
release_date: Nov. 2004
release_name: Baylor College of Medicine HGSC v3.4
source_url: http://hgdownload.cse.ucsc.edu/goldenPath/rn4/bigZips/
organism_biocview: Rattus_norvegicus
BSgenomeObjname: Rnorvegicus
seqnames: paste("chr", c(1:20, "X", "M", "Un", paste(c(1:20, "X", "Un"), "_random", sep="")), sep="")
circ_seqs: "chrM"
mseqnames: paste("upstream", c("1000", "2000", "5000"), sep="")
nmask_per_seq: 4
SrcDataFiles1: sequences: chromFa.tar.gz, upstream1000.fa.gz, upstream2000.fa.gz, upstream5000.fa.gz
from http://hgdownload.cse.ucsc.edu/goldenPath/rn4/bigZips/
SrcDataFiles2: AGAPS masks: all the chr*_gap.txt.gz files from ftp://hgdownload.cse.ucsc.edu/goldenPath/rn4/database/
RM masks: http://hgdownload.cse.ucsc.edu/goldenPath/rn4/bigZips/chromOut.tar.gz
TRF masks: http://hgdownload.cse.ucsc.edu/goldenPath/rn4/bigZips/chromTrf.tar.gz
PkgExamples: Rnorvegicus
seqlengths(Rnorvegicus)
Rnorvegicus$chr1 # same as Rnorvegicus[["chr1"]]
seqs_srcdir: /home/hpages/BSgenomeForge/srcdata/BSgenome.Rnorvegicus.UCSC.rn4/seqs
masks_srcdir: /home/hpages/BSgenomeForge/srcdata/BSgenome.Rnorvegicus.UCSC.rn4/masks
>

Wednesday, April 13, 2011

spatial bayesian modeling - Andrew Finley

See his site for the spBayes package and for spatial Bayesian modeling:

http://blue.for.msu.edu/index.html

There are also tutorials:
http://blue.for.msu.edu/courses.html

spatially-varying coefficients models for analysis of ecological data

Comparing spatially-varying coefficients models for analysis of ecological data with non-stationary and anisotropic residual dependence

http://onlinelibrary.wiley.com/doi/10.1111/j.2041-210X.2010.00060.x/abstract

Summary

1. When exploring spatially complex ecological phenomena using regression models it is often unreasonable to assume a single set of regression coefficients can capture space-varying and scale-dependent relationships between covariates and the outcome variable. This is especially true when conducting analysis across large spatial domains, where there is an increased propensity for anisotropic dependence structures and non-stationarity in the underlying spatial processes.

2. Geographically weighted regression (GWR) and Bayesian spatially-varying coefficients (SVC) are the most common methods for modelling such data. This paper compares these methods for modelling data generated from non-stationary processes. The comparison highlights some strengths and limitations of each method and aims to assist those who seek appropriate methods to better understand spatially complex ecological systems. Both synthetic and ecological data sets are used to facilitate the comparison.

3. Results underscored the need for the postulated model to approximate the underlying mechanism generating the data. Further, results show GWR and SVC can produce very different regression coefficient surfaces and hence dramatically different conclusions can be drawn regarding the impact of covariates. The trade-off between the richer inferential framework of SVC models and computational demands is also discussed.

Tuesday, April 12, 2011

picking a subset of SNPs

picking a subset of SNPs for PCA, structure ...

http://gettinggeneticsdone.blogspot.com/2011/03/prune-gwas-data-in-rstats.html

next-generation sequencing applications have been published in Bioinformatics

'Bioinformatics for Next Generation Sequencing' virtual issue

http://www.oxfordjournals.org/our_journals/bioinformatics/nextgenerationsequencing.html

Data hand tools

A data task illustrates the importance of simple and flexible tools.
http://radar.oreilly.com/2011/04/data-hand-tools.html

A good guide for data analysis.

Monday, April 11, 2011

PRGmatic: an efficient pipeline for collating genome-enriched second-generation sequencing data using a ‘provisional-reference genome’

Second-generation sequencing is increasingly being used in combination with genome-enrichment techniques to amplify a large number of loci in many individuals for the purpose of population genetic and phylogeographic analysis. Compiling all the necessary tools to analyse these data is complex and time-consuming. Here, we assemble a set of programs and pipe them together with Perl, enabling research laboratories without a dedicated bioinformatician to utilize second-generation sequencing. User input is a folder of the second-generation sequencing reads sorted by individual (in FASTA format) and pipeline output is a folder of multi-FASTA files that correspond to loci (with 2 alleles called per individual). Additional output includes a summary file of the number of individuals per locus, observed and expected heterozygosity for each locus, distribution of multiple hits and summary statistics (θ, Tajima’s D, etc.). This user-friendly, open source pipeline, which requires no a priori reference genome because it constructs its own, allows the user to set various parameters (e.g. minimum coverage) in the dependent programs (CAP3, BWA, SAMtools and VarScan) and facilitates evaluation of the nature and quality of data collected prior to analysis in software packages.

Sunday, April 10, 2011

how to get the list of genes involved in a biological process

http://biostar.stackexchange.com/questions/7323/how-to-get-the-list-of-genes-involved-in-a-biological-process

how to compare metabolic pathways

http://biostar.stackexchange.com/questions/7403/how-to-compare-metabolic-pathways

here is a good guide on comparing metabolic pathways

The Roots of Bioinformatics in Theoretical Biology

http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1002021

From the late 1980s onward, the term “bioinformatics” mostly has been used to refer to computational methods for comparative analysis of genome data. However, the term was originally more widely defined as the study of informatic processes in biotic systems. In this essay, I will trace this early history (from a personal point of view) and I will argue that the original meaning of the term is re-emerging.

Friday, April 8, 2011

link your poster with barcode

http://blog.postersession.com/2011/03/29/qr-codes-on-a-research-poster/

Devil in the details - discussion from Nature on reproducible research

I agree with the call for reproducible research, at least for data-focused studies. At a minimum, the raw data supporting the results should be openly accessible.

http://www.nature.com/nature/journal/v470/n7334/full/470305b.html

Thursday, April 7, 2011

VAT - annotation on genomic variants with function and frequency

The Variant Annotation Tool (VAT) consists of a set of modules to annotate genetic variants including single nucleotide polymorphisms (SNPs), small insertions and deletions (indels), and structural variants (SVs). This software package also contains a program to aggregate these variants at the gene level. Subsequently, an image is generated for each gene to visualize the functional impact of the annotated variants. This information can then be viewed and shared using a web-interface. In addition to annotation of the coding variants, this tool also integrates allele frequencies and genotype data providing population-specific information from published high quality variation databases such as 1000 Genomes Project

VAT works on the VCF file format.

http://info.gersteinlab.org/VAT

for loop in R

R Language Definition
http://cran.r-project.org/doc/manuals/R-lang.html


The syntax of the for loop is

for ( name in vector )
statement1

where vector can be either a vector or a list. For each element in vector the variable name is set to the value of that element and statement1 is evaluated. A side effect is that the variable name still exists after the loop has concluded and it has the value of the last element of vector that the loop was evaluated for.
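
A tiny sketch of both points, looping over a list and the loop variable surviving the loop (the names values and item are illustrative):

```r
# Looping over a list: each element is assigned to `item` in turn
values <- list(1, "two", 3)
for (item in values) {
  print(item)
}

# The loop variable survives the loop and holds the last element
print(item)
```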

################################################################
# an example

Suppose you have 60 DNAcopy objects that you need to plot. The
following code will do the trick.

The first step is to create a vector with the names of the 60 objects:

object.list <- paste("DNAcopy", 1:60, sep="")

Then loop over the names, plotting each object to its own pdf file:

for (I in object.list) {
  pdf(file=paste(I, ".pdf", sep=""), height=10, width=10)
  # I holds only the object's name, so fetch the object itself with get()
  plot(get(I), plot.type="w")
  dev.off()
}

discard rows of one file by comparing between two files in the first two columns

(1) the question
For example, I have two comma-delimited files (three columns each):

file_1
1,2,A
1,3,A
1,4,t
1,5,A
2,3,c
2,7,A
2,9,g
3,1,A
3,5,h
3,7,A

file_2
1,1,c
1,3,A
1,4,m
1,5,A
2,6,u
2,7,A
2,9,p
3,1,A
3,5,i
3,7,A


I want to discard the rows of file_1 whose first two columns match
those of a row in file_2. After discarding, I have

1,2,A
2,3,c

(2) the solution

A.
All the easier methods assume the files are sorted ...

I would be tempted to turn the second comma into a different delimiter so the
first two columns can become a single 'key'

sed 's/,/|/g2' file_1 > file_11
cut -f1 -d \| file_11 > file_1key
cut -f1,2 -d ',' file_2 > file_2key
comm -23 file_1key file_2key > file_3key
join -t \| file_11 file_3key

B.

# extract the 'keys' of files 'a' and 'b' (keys = columns 1 and 2) and
# place them in new files
$ cut -d "," -f 1,2 a > a12
$ cut -d "," -f 1,2 b > b12

# treat file 'b12' as a list of patterns for grep, and ask grep to show
# the lines of 'a12' that *don't* match any of them. Store these unique
# keys in a new file:
$ grep -vf b12 a12 > keys_a

# finally, use this new file as a set of patterns for grep to extract
# the matching lines from file 'a'
$ grep -f keys_a a
1,2,A
2,3,c
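
C.

The same filtering can also be sketched in R, which does not need sorted input and compares whole keys rather than substrings. The two files are inlined here so the example is self-contained; with real files you would pass their names to read.csv() instead.

```r
# The two example files, inlined as text
txt_1 <- "1,2,A\n1,3,A\n1,4,t\n1,5,A\n2,3,c\n2,7,A\n2,9,g\n3,1,A\n3,5,h\n3,7,A"
txt_2 <- "1,1,c\n1,3,A\n1,4,m\n1,5,A\n2,6,u\n2,7,A\n2,9,p\n3,1,A\n3,5,i\n3,7,A"
file_1 <- read.csv(text = txt_1, header = FALSE, colClasses = "character")
file_2 <- read.csv(text = txt_2, header = FALSE, colClasses = "character")

# Build one key per row from the first two columns
key_1 <- paste(file_1$V1, file_1$V2, sep = ",")
key_2 <- paste(file_2$V1, file_2$V2, sep = ",")

# Keep the rows of file_1 whose key does not occur in file_2
kept <- file_1[!(key_1 %in% key_2), ]
write.table(kept, sep = ",", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
```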

Learn to use the Galaxy resource with this free tutorial

Galaxy is an excellent online genome analysis tool that combines the power of existing genome annotation databases with a simple web portal with a variety of tools and algorithms, to enable users to search remote resources, combine data from independent queries, prepare, manipulate and analyze the data using a large suite of analysis tools. A history workflow is created for every analysis providing a record ensuring reproducibility of results, and the opportunity to share workflows with other Galaxy users

http://www.openhelix.com//cgi/tutorialInfo.cgi?id=82

Wednesday, April 6, 2011

mmap - Map Pages of Memory

mmap: Map Pages of Memory

R interface to POSIX mmap and Window's MapViewOfFile

ff - R package for memory-efficient storage of large data on disk and fast access functions

(1) see here for more examples
http://ff.r-forge.r-project.org/

(2) abstract
The ff package provides data structures that are stored on disk but behave (almost) as if they were in RAM by transparently mapping only a section (pagesize) in main memory - the effective virtual memory consumption per ff object. ff supports R's standard atomic data types 'double', 'logical', 'raw' and 'integer' and non-standard atomic types boolean (1 bit), quad (2 bit unsigned), nibble (4 bit unsigned), byte (1 byte signed with NAs), ubyte (1 byte unsigned), short (2 byte signed with NAs), ushort (2 byte unsigned), single (4 byte float with NAs). For example 'quad' allows efficient storage of genomic data as an 'A','T','G','C' factor. The unsigned types support 'circular' arithmetic. There is also support for close-to-atomic types 'factor', 'ordered', 'POSIXct', 'Date' and custom close-to-atomic types. ff not only has native C-support for vectors, matrices and arrays with flexible dimorder (major column-order, major row-order and generalizations for arrays). There is also a ffdf class not unlike data.frames and import/export filters for csv files. ff objects store raw data in binary flat files in native encoding, and complement this with metadata stored in R as physical and virtual attributes. ff objects have well-defined hybrid copying semantics, which gives rise to certain performance improvements through virtualization. ff objects can be stored and reopened across R sessions. ff files can be shared by multiple ff R objects (using different data en/de-coding schemes) in the same process or from multiple R processes to exploit parallelism. A wide choice of finalizer options allows to work with 'permanent' files as well as creating/removing 'temporary' ff files completely transparent to the user. On certain OS/Filesystem combinations, creating the ff files works without notable delay thanks to using sparse file allocation. 
Several access optimization techniques such as Hybrid Index Preprocessing and Virtualization are implemented to achieve good performance even with large datasets, for example virtual matrix transpose without touching a single byte on disk. Further, to reduce disk I/O, 'logicals' and non-standard data types get stored native and compact on binary flat files i.e. logicals take up exactly 2 bits to represent TRUE, FALSE and NA. Beyond basic access functions, the ff package also provides compatibility functions that facilitate writing code for ff and ram objects and support for batch processing on ff objects (e.g. as.ram, as.ff, ffapply). ff interfaces closely with functionality from package 'bit': chunked looping, fast bit operations and coercions between different objects that can store subscript information ('bit', 'bitwhich', ff 'boolean', ri range index, hi hybrid index). This allows to work interactively with selections of large datasets and quickly modify selection criteria. Further high-performance enhancements can be made available upon request.

big data in R

Here is a blog that discusses memory usage in R at length:

http://biostatmatt.com/weblog/page/4

english-communication-for-scientists - from Scitable

Scitable by Nature Education

English Communication for Scientists

http://www.nature.com/scitable/ebooks/english-communication-for-scientists-14053993/contents

AFLPdat - a collection of R functions which facilitates the handling of dominant genotypic data

http://www.nhm.uio.no/english/research/ncb/aflpdat/

PGDSpider - A program for converting data between population genetics programs

This is a really good tool for population genetic studies. It converts your input files between the formats used by different population genetics programs.

PGDSpider

http://www.cmpg.unibe.ch/software/PGDSpider/

Linux / Unix Command: ftp

NAME
ftp - Internet file transfer program
SYNOPSIS
ftp [-pinegvd] [host]
pftp [-inegvd] [host]
SEE ALSO
rcp(1), scp(1), cp(1), ftpd(8)
DESCRIPTION
Ftp is the user interface to the Internet standard File Transfer Protocol. The program allows a user to transfer files to and from a remote network site.

Options may be specified at the command line, or to the command interpreter.

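You connect to a remote server much as you would with ssh, but ftp only transfers files and sends everything, including the password, unencrypted.

put - upload a local file to the ftp server
get - download a remote file to the local computer
quit - close the connection and leave ftp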

Tuesday, April 5, 2011

Easyfig: a genome comparison visualizer

http://bioinformatics.oxfordjournals.org/content/27/7/1009.short?rss=1

http://easyfig.sourceforge.net/

RKML

http://www.omegahat.org/RKML/GoogleEarth/

An introduction to R: software for statistical modelling and computing, course notes

http://www.csiro.au/resources/Rcoursenotes.html

Analysing spatial point patterns in R

http://www.csiro.au/resources/Spatial-Point-Patterns-in-R.html

R and Google Earth

http://www.omegahat.org/GoogleEarth/CityTemperatures/

Examples and code for R and Google Earth.

circos - tools for circular visualization

A software package for visualizing data and information in a circular layout.

Introduction:
http://mkweb.bcgsc.ca/circos/

Tutorial:
http://mkweb.bcgsc.ca/circos/tutorials/lessons/

Evolutionary Systems Biology Lab

Jaume Bertranpetit is the leader of the Evolutionary Systems Biology Lab

http://www.ibe.upf-csic.es/ibe/research/research-groups/bertranpetit.html

They have done, and are still doing, much excellent work on human genetics and genomics.

Here is a blog of a member of this lab:
http://bioinfoblog.it/

pajek - network plot

pajek wiki:

http://pajek.imfm.si/doku.php?id=pajek

Monday, April 4, 2011

Interactome networks and their importance

Pascal Braun
http://ccsb.dfci.harvard.edu/web/www/ccsb/publications/2011_papers.html

We recently completed mapping of the first binary interactome network for the reference plant Arabidopsis thaliana. Using tools of graph theory we identify biologically relevant network communities from which a picture of the overall interactome network organization starts to emerge. Combination of interaction and comparative genomics data yielded insights into network evolution, and biological inspection resulted in many hypotheses for unknown proteins and revealed unexpected connectivity between previously studied components of phytohormone signaling pathways.

Interactome Networks and Human Disease
http://www.cell.com/abstract/S0092-8674%2811%2900130-9
Summary

Complex biological systems and cellular networks may underlie most genotype to phenotype relationships. Here, we review basic concepts in network biology, discussing different types of interactome networks and the insights that can come from analyzing them. We elaborate on why interactome networks are important to consider in biology, how they can be mapped and integrated with each other, what global properties are starting to emerge from interactome network models, and how these properties may relate to human disease.

lingua franca for biology?, et al.

de novo, a Latin expression meaning "from the beginning," "afresh," "anew," "beginning again."
ex vivo, (Latin: "out of the living") refers to that which takes place outside an organism.
lingua franca, a language used for communication between speakers of different languages.
in papyro, referring to experiments or studies carried out only on paper.
in silico, an expression used to mean "performed on computer or via computer simulation."
in situ, a Latin phrase that translates literally as "in position."
in utero, a Latin term literally meaning "in the uterus." In biology, the phrase describes the state of an embryo or fetus.
in vitro, (Latin: "within glass") refers to work performed not in a living organism but in a controlled environment, such as a test tube or Petri dish.
in vivo, (Latin: "within the living") refers to experimentation using a whole, living organism as opposed to a partial or dead organism.

EIGENSTRAT and EIGENSOFT

The EIGENSTRAT method uses principal components analysis to explicitly model ancestry differences between cases and controls along continuous axes of variation; the resulting correction is specific to a candidate marker's variation in frequency across ancestral populations, minimizing spurious associations while maximizing power to detect true associations. The EIGENSOFT package has a built-in plotting script and supports multiple file formats and quantitative phenotypes.
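
The idea in the paragraph above can be sketched in a few lines of base R. This is not the EIGENSOFT implementation: the simulated genotypes, the two hypothetical populations, the choice of K = 2 principal components and all variable names are assumptions made for illustration. The sketch computes principal components of a genotype matrix, removes the ancestry signal from one marker and from the phenotype by regression on those components, and then forms a trend-style statistic from the adjusted values.

```r
set.seed(1)

n <- 200                                   # individuals (assumed)
m <- 500                                   # markers (assumed)
pop <- rep(c(0, 1), each = n / 2)          # two hypothetical ancestral populations

# Allele frequencies that differ slightly between the two populations.
p0 <- runif(m, 0.1, 0.9)
p1 <- pmin(pmax(p0 + rnorm(m, 0, 0.1), 0.05), 0.95)

# Genotype matrix: n rows (individuals) x m columns (markers), coded 0/1/2.
G <- sapply(seq_len(m), function(j) {
  rbinom(n, 2, ifelse(pop == 0, p0[j], p1[j]))
})

# Case/control status confounded with ancestry, not caused by any marker.
pheno <- rbinom(n, 1, ifelse(pop == 0, 0.3, 0.7))

# Continuous axes of variation: top K principal components of the genotypes.
K <- 2
pcs <- prcomp(G, center = TRUE, scale. = FALSE)$x[, 1:K]

# Adjust one candidate marker and the phenotype for ancestry.
g_adj <- residuals(lm(G[, 1] ~ pcs))
y_adj <- residuals(lm(pheno ~ pcs))

# Trend-style association statistic on the adjusted values.
stat <- (n - K - 1) * cor(g_adj, y_adj)^2
```

Because each marker is regressed on the components separately, the size of the correction depends on how strongly that marker's frequency varies across ancestries, which is the marker-specific behaviour the description emphasizes.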

http://genepath.med.harvard.edu/~reich/Software.htm

Sunday, April 3, 2011

The wild Evolution Group

A strong team in evolutionary research

http://wildevolution.biology.ed.ac.uk/index.html

Some topics we are currently interested in are:
- What determines variance in fitness between individuals?
- How do life-history trade-offs contribute to phenotypic diversity?
- What are the genetic and environmental determinants of senescence?
- How is climate change affecting evolutionary and ecological processes?
- Can genetic effects be sexually-antagonistic?

WEG is part of the Institute of Evolutionary Biology in the University of Edinburgh.


Jarrod Hadfield
Loeske Kruuk
Sue Lewis
Dan Nussey
Josephine Pemberton
Alastair Wilson

Friday, April 1, 2011

Unix for Windows computers

If you need a Unix environment on a Windows computer, you may have to start thinking about alternatives such as

(1) Cygwin (creates a Unix-like environment on top of your Windows machine),

(2) VirtualBox (allows running several operating systems in parallel as virtual machines),

(3) Wubi (allows easy installation of a dual-booting Ubuntu on a Windows machine), or

(4) directly installing a Linux distro on your machine.

another guide to working with VCF files

http://helix.nih.gov/Applications/vcftools.html

sqldf and RSQLite

http://code.google.com/p/sqldf/

access and query sequences with R

(1) How do I access and query entire genome sequences with R

http://biostar.stackexchange.com/questions/357/how-do-i-access-and-query-entire-genome-sequences-with-r

(2) seqinr
http://seqinr.r-forge.r-project.org/

Dolph Schluter - ecological speciation using stickleback (fish) as systems

I need to follow his work on ecological speciation.

http://www.zoology.ubc.ca/~schluter/


Here is some good course material from the lab:

Biology 418 - Evolutionary Ecology
The course presents an overview of current knowledge and modern research into evolutionary processes acting on contemporary populations; the ecological basis of adaptation; and the consequences of natural selection for population and community dynamics and evolution. Three approaches to the study of evolutionary ecology are introduced: predictive and optimization models; the comparative method; and direct measurement of natural selection in the wild.

Biology 548b - Quantitative methods in Ecology & Evolution
Biology 548b is a graduate course on quantitative methods for data analysis in ecology and evolution. The format is a mixture of lectures/discussions on methodological topics and practical workshops using the R package. Topics include graphics, experimental design, statistical model fitting, model selection, computer-intensive methods, meta-analysis, multivariate and phylogenetic comparative methods. Graduate students are assumed to have taken an introductory undergraduate statistics course at some point in their careers. We begin at a fairly basic level using a general linear model approach.

R Tutorials

Many nice R guides, for example on ANOVA, are here:

http://rtutorialseries.blogspot.com/