2012年7月31日星期二

Batch R jobs - examples

http://www.math.ncu.edu.tw/~chenwc/R_note/index.php?item=batch

 Batch jobs

There are two examples for batching jobs in the page. The first one will demonstrate different batching method, and the second will demonstrate a batching method with extra arguments. Some of Linux knowledges are required for understanding these commands, and they are also useful when you have to process bunch of jobs. More ways to batch jobs can be found at the page "Batch more" in this web site and the page " Rscript" in the HPSC web site. 

Example 1:
  1. For MS Windows user
    Use Rterm.exe or Rcmd.exe to substitute R in the following command mode. 
  2. Hello world !
    First, create an R code file "hello_1.rcontains the following

        
    # File name: hello_1.r
    a <- c("Hello", "world", "!")
    print(a)
    b <- paste(a, collapse = " ")
    print(b)
    

    Hereit sets an array variable "acontains 3 elements, "Hello", "world", and "!", and prints themItalso uses the function "pasteto join all elements in array "aby " " (spaceand then it sets the arrayto a variable "band prints it
  3. Batch From Script
    Use the command mode to send "hello_1.r" as a batch job. 

        SHELL> R CMD BATCH --vanilla --slave hello_1.r output_file
    or
        SHELLnohup R CMD BATCH --vanilla --slave hello_1.r output_file &

    By default, the output will be saved in "hello_1.r.Rout" if an "output_file" isn't assigned. The output file (hello_1.r.Rout) will as the following. 

        
    > a <- c("Hello", "world", "!")
    > print(a)
    [1] "Hello" "world" "!"
    > b <- paste(a, collapse = " ")
    > print(b)
    [1] "Hello world !"
    >
    

  4. Batch From Command
    Use the command mode to send "hello_1.r" as a batch job from [STDIN] and show the output on [SDTOUT]. The followings have similar commands, and can provide same results, but they can be used in different purposes. 

        SHELL> R --vanilla --slave < hello_1.r > output_file
        or
        SHELLcat hello_1.r | R --vanilla --slave > output_file
        or
        SHELLecho 'source("hello_1.r")' | R --vanilla --slave > output_file

    By default, the output will be show on the screen, you can use ">" to redirect the output into "output_file". The output will like this. 

        
    [1] "Hello" "world" "!"
    [1] "Hello world !"
    


Example 2:
  1. Hello world, MOLAS !
    First, create an R code file "hello_2.r" contains the following. 

        
    # File name: hello_2.r
    a <- c("Hello", "world,", argv, "!")
    print(a)
    b <- paste(a, collapse = " ")
    print(b)
    

    Hereit sets an array variable "acontains 4 elements, "Hello", "world,", argvand "!", and printsthemThe variable "argvwill be passed from STDIN commandIt also uses the function "pastetojoin all elements in array "aby " " (spaceand sets to a variable "band prints it
  2. Pass Variable From STDIN
    Use command mode to send "hello_2.r" for batch job from STDIN and show the output on SDTOUT. The variable "argv" as "MOLAS" is assigned before the "hello_2.r" is passed to R. 

        SHELLecho 'argv <- "MOLAS"; source("hello_2.r")' | R --vanilla --slave > output_file

    By defaultthe output will be shown on the screenyou can use ">" to redirect the output into an"output_file". The output will like this

        
    [1] "Hello" "world," "MOLAS" "!"
    [1] "Hello world, MOLAS !"
    

  • Background process in Linux
    echo, nohup, &, |, <, > 

NGS QC Toolkit


NGS QC Toolkit: A Toolkit for Quality Control of Next Generation Sequencing Data

Next generation sequencing (NGStechnologies provide ahigh-throughput means to generate large amount ofsequence dataHoweverquality control (QCof sequencedata generated from these technologies is extremelyimportant for meaningful downstream analysisFurther,highly efficient and fast processing tools are required tohandle the large volume of datasetsHerewe havedeveloped an applicationNGS QC Toolkitfor quality checkand filtering of high-quality dataThis toolkit is a standaloneand open source application freely available at http://www.nipgr.res.in/ngsqctoolkit.htm​l. All the tools in the application have been implemented in Perl programming language. The toolkit is comprised of user-friendly tools for QC of sequencing data generated using Roche 454 and Illumina platforms, and additional tools to aid QC (sequence format converter and trimming tools) and analysis (statistics tools). A variety of options have been provided to facilitate the QC at user-defined parameters. The toolkit is expected to be very useful for the QC of NGS data to facilitate better downstream analysis.

choose computer for NGS analysis

1. http://www.biostars.org/post/show/5664/hardware-needed-to-analyse-microarray-data-with-rbioconductor/


Many analysis procedures are memory hungry, 8GB of RAM is inadequateGo for 64GB or 96GB.
The speed of individual CPU cores is less important than the number of themThus it is far better to buy a configurationwith more and somewhat slower processors than fewer but the fastest ones.
I've done analysis of Affy exon arrays (using XPS). You do not need a monster computer to do this analysisWhat you doneed is a big pile of RAM and adequate hard drive spaceFor exampleyou could buy an ordinary box (e.ga DellPrecision workstation). Buy 16 gigs of ram from NewEggput in two 2-terrabyte hard drivesand load a 64-bit linux suchas UbuntuThis machine would also be perfectly adequate for many ordinary sorts of microarray analysis.
If you find yourselves doing a lot more complex bioinformatics analysis you will need multiple machinesbut at that pointthe best bet is to use someone else's hardware-- either a local compute cluster or cloud machinesHoweverit soundslike you're not there yetand there's no reason to spend $10,000+ on a machine at this point.


2. http://www.biostars.org/post/show/43240/hardware-benchmarking-tasks-for-a-high-performance-bioinformatics-cluster/

3. http://www.biostars.org/post/show/2604/hardware-suitable-for-generic-nextgen-sequencing-processing/


Okay, well then I'll go ahead and throw some info out there in the hopes that it's useful to you.
What I can tell you is that the cluster we share time on has 8-core machines with 16GB of RAM each and they'resufficient for most of our needsWe don't do much assemblybut we do do a ton of other genomic processingrangingfrom mapping short reads all the way up to snp calling and pathway inferenceI also still do a fair amount of arrayprocessing.
Using most cluster management tools, (PBSLSFwhatever), it should be possible to allow a user to reserve more thanone CPU per nodeeffectively giving them up to 16 GB for a process if they reserve the whole nodeYeahthat meanssome lost cyclesbut I don't seem to use it that often - 2GB is still sufficient for most things I runIt'd also be good toset up a handful of machines with a whole lot of RAM - maybe 64GB? That gives users who are doing things likeassembly or loading huge networks into RAM some options.
I more often run into limits on I/OGiving each machine a reasonably sized scratch disc and encouraging your users tomake smart use of it is a good ideaNetwork filesystems can be bogged down really quickly when a few dozen nodesare all reading and writing dataIf you're going to be doing lots of really I/O intensive stuff (and dealing with short reads,you probably will be), it's probably worth looking into faster hard drivesCertainly 7200RPM, if not 10k. Last time I looked15k drives were availablebut not worth it in terms of price/performanceThat may have changed.
I won't get into super-detail on the specs - you'll have to price that out and see where the sweet spot isI also won't tellyou how many nodes to getbecause againthat depends on your fundingI will say that if you're talking a small clusterfor a small labit may make sense to just get 3 or 4 machines with 32 cores and a bunch of RAMand not worry abouttrying to set up a shared filesystemqueueetc - it really can be a headache to maintainIf you'll be supporting a largeruserbasethoughthen you may find a better price point at less CPUs per nodeand have potentially fewer problemswith disk I/O (because you'll have less CPUs per HD).
People who know more about cluster maintenance and hardware than I do, feel free to chime in with additions or corrections.

2012年7月26日星期四

learning perl - resources

1. The Perlmonks have a large tutorial section.

2. This is a great introduction to Perl:
3. Beginning Perl

4. Beginning Perl for Bioinformatics

Annotation of genes on new genome

Annotation of genes on new genome

Emerging fungal threats to animal, plant and ecosystem health


Emerging fungal threats to animalplant and ecosystem health

对,,起作用, - account for, be responsible for


Although low population densities may account for the detection of some lineages only within relatively small regions of Africa, other factors may be responsible for the minimal degrees of movement observed for fairly common lineages, such as MSV-A4V (zones 13 and 14 in Fig. 2M), corroborating the commonly observed localization of MSV-A4 genomes to southern Africa (39, 69).

variationtoolkit - good tools for NGS operations

http://code.google.com/p/variationtoolkit/

it works on general format of BAM, VCF et al. for many general objectives.

an instructive shell scripts - filtering FASTA sequences

filtering FASTA sequences by chromosomes names from a big fasta file


You can do the following:
chr=$(seq 1 1 22)" X Y"
for i in $chr ; do 
    grep ">" input.fa | grep "|$i|" | awk 'BEGIN {FS=">"} {print $2}' > selection_$i
    samtools faidx input.fa `cat selection_$i` > output_$i.fa
done

BioAwk - fasta, fastq, SAM, BED, GFF aware awk programming



Bioawk is an extension to Brian Kernighan's awk created by Heng Li that adds support for several common biological data formats, including optionally gzip'ed BED, GFF, SAM, VCF, FASTA/Q as well as generic TAB-delimited formats with the column names.
Code
The source code can be found at: bioawk GitHub page. Users will need to download and run make to compile it. In the examples below it is assumed that this version of awk is being used.
Documentation
There is a a short manual page in the main distribution and a longer HTML formatted help page
Examples
Extract unmapped reads without header:
awk -c sam 'and($flag,4)' aln.sam.gz
Extract mapped reads with header:
awk -c sam -H '!and($flag,4)'
Reverse complement FASTA:
awk -c fastx '{ print ">"$name;print revcomp($seq) }' seq.fa.gz
Create FASTA from SAM (uses revcomp if FLAG & 16)::
samtools view aln.bam | \
    awk -c sam '{ s=$seq; if(and($flag, 16)) {s=revcomp($seq) } print ">"$qname"\n"s}'
Get the %GC from FASTA:
awk -c fastx '{ print ">"$name; print gc($seq) }' seq.fa.gz
Get the mean Phred quality score from FASTQ:
awk -c fastx '{ print ">"$name; print meanqual($qual) }' seq.fq.gz
Take column name from the first line (where "age" appears in the first line of input.txt):
awk -c header '{ print $age }' input.txt

2012年7月25日星期三

时间逝去 - elapse

Please do not contact our office again before 28 days have elapsed.

Three roads diverged? Routes to phylogeographic inference


Three roads diverged? Routes to phylogeographic inference

inferring epidemiological processes from genetic data

The landscape genetics of infectious disease emergence and spread

The spread of parasites is inherently a spatial process often embedded in physically complex landscapesIt is therefore not surprising thatinfectious disease researchers are increasingly taking a landscape genetics perspective to elucidate mechanisms underlying basicecological processes driving infectious disease dynamics and to understand the linkage between spatially dependent populationprocesses and the geographic distribution of genetic variation within both hosts and parasitesThe increasing availability of geneticinformation on hosts and parasites when coupled to their ecological interactions can lead to insights for predicting patterns of diseaseemergencespread and controlHerewe review research progress in this area based on four different motivations for the application oflandscape genetics approaches: (iassessing the spatial organization of genetic variation in parasites as a function of environmentalvariability, (iiusing host population genetic structure as a means to parameterize ecological dynamics that indirectly influence parasitepopulationsfor examplegene flow and movement pathways across heterogeneous landscapes and the concurrent transport of infectiousagents, (iiielucidating the temporal and spatial scales of disease processes and (ivreconstructing and understanding infectious diseaseinvasionThroughout this reviewwe emphasize that landscape genetic principles are relevant to infection dynamics across a range ofscales from within host dynamics to global geographic patterns and that they can also be applied to unconventional ‘landscapes’ such asheterogeneous contact networks underlying the spread of human and livestock diseasesWe conclude by discussing some general considerations and problems for inferring epidemiological processes from genetic data and try to identify possible future directions andapplications for this rapidly expanding field.