2012年7月31日星期二

Batch R jobs - examples

http://www.math.ncu.edu.tw/~chenwc/R_note/index.php?item=batch

Batch jobs

There are two examples for batching jobs in the page. The first one will demonstrate different batching method, and the second will demonstrate a batching method with extra arguments. Some of Linux knowledges are required for understanding these commands, and they are also useful when you have to process bunch of jobs. More ways to batch jobs can be found at the page "Batch more" in this web site and the page " Rscript" in the HPSC web site.

Example 1:

For MS Windows user
Use Rterm.exe or Rcmd.exe to substitute R in the following command mode.
Hello world !
First, create an R code file "hello_1.r" contains the following.
# File name: hello_1.r a <- c("Hello", "world", "!") print(a) b <- paste(a, collapse = " ") print(b)
Here, it sets an array variable "a" contains 3 elements, "Hello", "world", and "!", and prints them. Italso uses the function "paste" to join all elements in array "a" by " " (space) and then it sets the arrayto a variable "b" and prints it.
Batch From Script
Use the command mode to send "hello_1.r" as a batch job.

SHELL> R CMD BATCH --vanilla --slave hello_1.r output_file

or

SHELL> nohup R CMD BATCH --vanilla --slave hello_1.r output_file &

By default, the output will be saved in "hello_1.r.Rout" if an "output_file" isn't assigned. The output file (hello_1.r.Rout) will as the following.
> a <- c("Hello", "world", "!") > print(a) [1] "Hello" "world" "!" > b <- paste(a, collapse = " ") > print(b) [1] "Hello world !" >

Batch From Command
Use the command mode to send "hello_1.r" as a batch job from [STDIN] and show the output on [SDTOUT]. The followings have similar commands, and can provide same results, but they can be used in different purposes.

	`SHELL> R --vanilla --slave < hello_1.r > output_file`
	or
	`SHELL> cat hello_1.r \| R --vanilla --slave > output_file`
	or
	`SHELL> echo 'source("hello_1.r")' \| R --vanilla --slave > output_file`

By default, the output will be show on the screen, you can use ">" to redirect the output into "output_file". The output will like this.

[1] "Hello" "world" "!"
[1] "Hello world !"

Example 2:

Hello world, MOLAS !
First, create an R code file "hello_2.r" contains the following.
# File name: hello_2.r a <- c("Hello", "world,", argv, "!") print(a) b <- paste(a, collapse = " ") print(b)
Here, it sets an array variable "a" contains 4 elements, "Hello", "world,", argv, and "!", and printsthem. The variable "argv" will be passed from STDIN command. It also uses the function "paste" tojoin all elements in array "a" by " " (space) and sets to a variable "b" and prints it.
Pass Variable From STDIN
Use command mode to send "hello_2.r" for batch job from STDIN and show the output on SDTOUT. The variable "argv" as "MOLAS" is assigned before the "hello_2.r" is passed to R.

SHELL> echo 'argv <- "MOLAS"; source("hello_2.r")' | R --vanilla --slave > output_file

By default, the output will be shown on the screen, you can use ">" to redirect the output into an"output_file". The output will like this.
[1] "Hello" "world," "MOLAS" "!" [1] "Hello world, MOLAS !"

Background process in Linux
echo, nohup, &, |, <, >

NGS QC Toolkit

NGS QC Toolkit: A Toolkit for Quality Control of Next Generation Sequencing Data

Next generation sequencing (NGS) technologies provide ahigh-throughput means to generate large amount ofsequence data. However, quality control (QC) of sequencedata generated from these technologies is extremelyimportant for meaningful downstream analysis. Further,highly efficient and fast processing tools are required tohandle the large volume of datasets. Here, we havedeveloped an application, NGS QC Toolkit, for quality checkand filtering of high-quality data. This toolkit is a standaloneand open source application freely available at http://www.nipgr.res.in/ngsqctoolkit.html. All the tools in the application have been implemented in Perl programming language. The toolkit is comprised of user-friendly tools for QC of sequencing data generated using Roche 454 and Illumina platforms, and additional tools to aid QC (sequence format converter and trimming tools) and analysis (statistics tools). A variety of options have been provided to facilitate the QC at user-defined parameters. The toolkit is expected to be very useful for the QC of NGS data to facilitate better downstream analysis.

choose computer for NGS analysis

1. http://www.biostars.org/post/show/5664/hardware-needed-to-analyse-microarray-data-with-rbioconductor/

Many analysis procedures are memory hungry, 8GB of RAM is inadequate. Go for 64GB or 96GB.

The speed of individual CPU cores is less important than the number of them. Thus it is far better to buy a configurationwith more and somewhat slower processors than fewer but the fastest ones.

I've done analysis of Affy exon arrays (using XPS). You do not need a monster computer to do this analysis. What you doneed is a big pile of RAM and adequate hard drive space. For example, you could buy an ordinary box (e.g. a DellPrecision workstation). Buy 16 gigs of ram from NewEgg, put in two 2-terrabyte hard drives, and load a 64-bit linux suchas Ubuntu. This machine would also be perfectly adequate for many ordinary sorts of microarray analysis.

If you find yourselves doing a lot more complex bioinformatics analysis you will need multiple machines, but at that pointthe best bet is to use someone else's hardware-- either a local compute cluster or cloud machines. However, it soundslike you're not there yet, and there's no reason to spend $10,000+ on a machine at this point.

2. http://www.biostars.org/post/show/43240/hardware-benchmarking-tasks-for-a-high-performance-bioinformatics-cluster/

3. http://www.biostars.org/post/show/2604/hardware-suitable-for-generic-nextgen-sequencing-processing/

Okay, well then I'll go ahead and throw some info out there in the hopes that it's useful to you.

What I can tell you is that the cluster we share time on has 8-core machines with 16GB of RAM each and they'resufficient for most of our needs. We don't do much assembly, but we do do a ton of other genomic processing, rangingfrom mapping short reads all the way up to snp calling and pathway inference. I also still do a fair amount of arrayprocessing.

Using most cluster management tools, (PBS, LSF, whatever), it should be possible to allow a user to reserve more thanone CPU per node, effectively giving them up to 16 GB for a process if they reserve the whole node. Yeah, that meanssome lost cycles, but I don't seem to use it that often - 2GB is still sufficient for most things I run. It'd also be good toset up a handful of machines with a whole lot of RAM - maybe 64GB? That gives users who are doing things likeassembly or loading huge networks into RAM some options.

I more often run into limits on I/O. Giving each machine a reasonably sized scratch disc and encouraging your users tomake smart use of it is a good idea. Network filesystems can be bogged down really quickly when a few dozen nodesare all reading and writing data. If you're going to be doing lots of really I/O intensive stuff (and dealing with short reads,you probably will be), it's probably worth looking into faster hard drives. Certainly 7200RPM, if not 10k. Last time I looked15k drives were available, but not worth it in terms of price/performance. That may have changed.

I won't get into super-detail on the specs - you'll have to price that out and see where the sweet spot is. I also won't tellyou how many nodes to get, because again, that depends on your funding. I will say that if you're talking a small clusterfor a small lab, it may make sense to just get 3 or 4 machines with 32 cores and a bunch of RAM, and not worry abouttrying to set up a shared filesystem, queue, etc - it really can be a headache to maintain. If you'll be supporting a largeruserbase, though, then you may find a better price point at less CPUs per node, and have potentially fewer problemswith disk I/O (because you'll have less CPUs per HD).

People who know more about cluster maintenance and hardware than I do, feel free to chime in with additions or corrections.

2012年7月26日星期四

learning perl - resources

1. The Perlmonks have a large tutorial section.

2. This is a great introduction to Perl:

http://sofiarobb.com/learning-perl-toc/

3. Beginning Perl

4. Beginning Perl for Bioinformatics

Annotation of genes on new genome

Emerging fungal threats to animal, plant and ecosystem health

对，，起作用， - account for, be responsible for

Although low population densities may account for the detection of some lineages only within relatively small regions of Africa, other factors may be responsible for the minimal degrees of movement observed for fairly common lineages, such as MSV-A4V (zones 13 and 14 in Fig. 2M), corroborating the commonly observed localization of MSV-A4 genomes to southern Africa (39, 69).

variationtoolkit - good tools for NGS operations

http://code.google.com/p/variationtoolkit/

it works on general format of BAM, VCF et al. for many general objectives.

an instructive shell scripts - filtering FASTA sequences

filtering FASTA sequences by chromosomes names from a big fasta file

You can do the following:

chr=$(seq 1 1 22)" X Y"
for i in $chr ; do 
    grep ">" input.fa | grep "|$i|" | awk 'BEGIN {FS=">"} {print $2}' > selection_$i
    samtools faidx input.fa `cat selection_$i` > output_$i.fa
done

BioAwk - fasta, fastq, SAM, BED, GFF aware awk programming

http://www.biostars.org/post/show/47751/bioawk-fasta-fastq-sam-bed-gff-aware-awk-programming/

Bioawk is an extension to Brian Kernighan's awk created by Heng Li that adds support for several common biological data formats, including optionally gzip'ed BED, GFF, SAM, VCF, FASTA/Q as well as generic TAB-delimited formats with the column names.

Code

The source code can be found at: bioawk GitHub page. Users will need to download and run make to compile it. In the examples below it is assumed that this version of awk is being used.

Documentation

There is a a short manual page in the main distribution and a longer HTML formatted help page

Examples

Extract unmapped reads without header:

awk -c sam 'and($flag,4)' aln.sam.gz

Extract mapped reads with header:

awk -c sam -H '!and($flag,4)'

Reverse complement FASTA:

awk -c fastx '{ print ">"$name;print revcomp($seq) }' seq.fa.gz

Create FASTA from SAM (uses revcomp if FLAG & 16)::

samtools view aln.bam | \
    awk -c sam '{ s=$seq; if(and($flag, 16)) {s=revcomp($seq) } print ">"$qname"\n"s}'

Get the %GC from FASTA:

awk -c fastx '{ print ">"$name; print gc($seq) }' seq.fa.gz

Get the mean Phred quality score from FASTQ:

awk -c fastx '{ print ">"$name; print meanqual($qual) }' seq.fq.gz

Take column name from the first line (where "age" appears in the first line of input.txt):

awk -c header '{ print $age }' input.txt

2012年7月25日星期三

时间逝去 - elapse

Please do not contact our office again before 28 days have elapsed.

Three roads diverged? Routes to phylogeographic inference

inferring epidemiological processes from genetic data

The landscape genetics of infectious disease emergence and spread

The spread of parasites is inherently a spatial process often embedded in physically complex landscapes. It is therefore not surprising thatinfectious disease researchers are increasingly taking a landscape genetics perspective to elucidate mechanisms underlying basicecological processes driving infectious disease dynamics and to understand the linkage between spatially dependent populationprocesses and the geographic distribution of genetic variation within both hosts and parasites. The increasing availability of geneticinformation on hosts and parasites when coupled to their ecological interactions can lead to insights for predicting patterns of diseaseemergence, spread and control. Here, we review research progress in this area based on four different motivations for the application oflandscape genetics approaches: (i) assessing the spatial organization of genetic variation in parasites as a function of environmentalvariability, (ii) using host population genetic structure as a means to parameterize ecological dynamics that indirectly influence parasitepopulations, for example, gene flow and movement pathways across heterogeneous landscapes and the concurrent transport of infectiousagents, (iii) elucidating the temporal and spatial scales of disease processes and (iv) reconstructing and understanding infectious diseaseinvasion. Throughout this review, we emphasize that landscape genetic principles are relevant to infection dynamics across a range ofscales from within host dynamics to global geographic patterns and that they can also be applied to unconventional ‘landscapes’ such asheterogeneous contact networks underlying the spread of human and livestock diseases. We conclude by discussing some general considerations and problems for inferring epidemiological processes from genetic data and try to identify possible future directions andapplications for this rapidly expanding field.

订阅：博文 (Atom)

	`SHELL> R CMD BATCH --vanilla --slave hello_1.r output_file`
	or
	`SHELL> nohup R CMD BATCH --vanilla --slave hello_1.r output_file &`