Thursday, September 27, 2012

tips on NGS data processing

  1. http://crocea.mednet.ucla.edu/research/variation/next-gen-sequence-how-to
  2. one reference genome in one fasta file
    1. FileFormatExchange.combineMultiFastaIntoOne() in variation.src.misc
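    2. alternatively, if the per-chromosome fasta files are already clean, a plain shell concatenation does the same job (a minimal sketch, assuming the files are named chr*.fa in the current directory):
      • cat chr*.fa > hsRef.fa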
  3. bwa index
    1. bwa index -a bwtsw hsRef.fa
      • "bwtsw" is for genome >2Gb
  4. move all alignment short reads into one directory
    1. for all subspecies
      • for j in aethiops cynosurus pygerythrus sabaeus tantalus; do echo $j; for i in `find /u/home/eeskin/namtran/Experiment/Genome/Freimer/Chlorocebus/INPUT/$j/ -name sequence*.tar.gz -print`; do echo $i; ln -s $i ./$j ; done; done
    2. for sequences stored in bam format on WUSTL server
      1. run shell/wgetGivenURL.sh to fetch sequence bam files from WUSTL
      2. run picard to convert bam to paired-end fastq
        • java -jar ~/script//vervet/bin/picard-tools/SamToFastq.jar INPUT=gerald_62FGFAAXX_3.bam F=gerald_62FGFAAXX_3_1.fastq F2=gerald_62FGFAAXX_3_2.fastq
  5. bwa alignment (aln or bwtsw; single-end or paired-end)
    1. vervetdb.src.MpiBWA.py includes these steps
      1. align
        1. input fastq files can be in plain text or gzipped.
        2. NOTE: For bwtsw, one read can appear in more than one alignment in the final bam file, e.g. 212904 unique reads among 217507 mapped reads (max redundant read count = 4), i.e. ~2% of reads appear in >1 alignment. A quick way to count this is sketched in item 3 below.
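        3. a quick check of unique vs. mapped read counts (a minimal sketch, assuming samtools is on the PATH and alignment.bam is the final bam):
          • samtools view -F 4 alignment.bam | cut -f1 | sort -u | wc -l    # number of unique mapped read names
          • samtools view -c -F 4 alignment.bam    # total number of mapped records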
      2. convert sam to bam
        1. samtools view -F 4 -Sbh alignment.sam > alignment.bam
          • "-F 4" instructs samtools to remove unmapped reads
    2. merge all bam files into one bam (have to be sorted if "bwa sort" is used)
      1. perl -e 'print "\@RG\tID:ga\tSM:hs\tLB:ga\tPL:Illumina\n\@RG\tID:454\tSM:hs\tLB:454\tPL:454\n"' > rg.txt
        • This tags the 454 and Illumina results by read group (RG). Usually not necessary.
      2. samtools merge -rh rg.txt - *.bam | samtools rmdup - - | samtools rmdup -s - aln.bam
        • duplicate removal for single-end reads is not recommended when coverage is very deep, as this step may discard too many good alignments.
        • "-rh rg.txt" is again to add RG header into the final bam file. Without it, the command is: samtools merge merged.bam  *.bam
      3. samtools merge - *.bam | samtools view -bh -F 4 - -o aln.bam
        • merge all bam files and remove unaligned reads
        • WATCH "-" in "samtools merge". It instructs "merge" to output to stdout.
        • If read group (RG) tags are present in original bam files, samtools merge only takes RG info in the header from the first bam file.
      4. picard merge combines the headers from all bam files, rather than taking only the header from the first bam (as samtools merge does).
        • java -jar ~/script//vervet/bin/picard-tools/MergeSamFiles.jar INPUT=ref_illu_vs_1MbBAC_by_aln.RG.bam INPUT=454_vs_ref_1MbBAC.F4.RG.bam INPUT=cynosurus_vs_1MbBAC_by_aln.RG.bam USE_THREADING=true ASSUME_SORTED=true OUTPUT=454_illu_6_sub_vs_1MbBAC.RG.bam
    3. remove unmapped reads from the bam file:
      1. samtools view -F 4 -b -h -o 454_vs_ref_1MbBAC_c30.F4.bam 454_vs_ref_1MbBAC_c30.bam
    4. select one region out of whole-genome bam and sort it as well
      1. crocea@mahogany:/Network/Data/vervet/ref$ samtools view -u 454_vs_hg19_20101230.bam 9606_chr9:124,000,000-124,210,000 | samtools sort - 454_vs_hg19_20101230.chr9_124M_124.2M
        • option -u makes samtools output uncompressed BAM, which reduces computing time.
  6. convert bam into sam for checking
    1. samtools view -h alignment.bam > alignment.sam
      • "-h" : include the header in the output
  7. view alignment through samtools tview
    1. crocea@mahogany:/Network/Data/vervet/ref$ samtools index 454_vs_hg19_20101230.chr9_124M_124.2M.bam
    2. crocea@mahogany:/Network/Data/vervet/ref$ samtools tview 454_vs_hg19_20101230.chr9_124M_124.2M.bam ../../NCBI/hs_genome.fasta
  8. view alignment through IGV
    1. select "Import Genome ..." to import a fasta file and create a GenomeName.genome (i.e. myHG19.genome) file
      1. Fetch the gene annotation file in BED format from the UCSC Table Browser:
        • choose a genome (e.g. hg19) as assembly
        • choose group: Genes and Gene Prediction tracks
        • choose track: UCSC Genes (more isoforms, less accurate) or RefSeq Genes (fewer isoforms, more accurate)
        • choose output format: BED (or any other format you'd like, but BED can be displayed within the genome browser)
        • enter a file name in the text box below and click "get output"
        • you are then referred to another page, where you may choose to get all exons, additional intron coordinates, just coding exons, 5'UTR exons, etc.
    2. index the bam file first (required)
      • crocea@mahogany:/Network/Data/vervet/ref$ samtools index 454_vs_hg19_20101230.chr13_31.67M_32.01M.bam
    3. crocea@banyan:/Network/Data/vervet/ref$ java -jar /home/crocea/script/vervet/bin/IGVTools/igvtools.jar count 454_vs_hg19_20101230.bam 454_vs_hg19_20101230.depth ../../NCBI/myHG19.genome
  9. variant calling through samtools
    1. samtools mpileup -D -d 40 -ugf .../NCBI/hs_genome.fasta      454_vs_hg19_20101230.chr9_124M_124.2M.bam 454_vs_hg19_20101230.chr13_31.67M_32.01M.bam 454_vs_hg19_20101230.chr20_4.085M_4.231M.bam | bcftools view -bvcg -p 1 - > 454_vs_hg19_20101230_3eQTL.raw.bcf
      • "-d 40" instructs mpileup to, at a position, read maximally INT reads per input BAM. [250]
      • "-p 1" : -p FLOAT variant if P(ref|D)
      • "-D" instructs mpileup to output per-sample read depth.
      • If each bam has a read group added, samtools uses it to label that sample's genotype column. If not, the filename is used instead.
      • samtools outputs genotype columns in the order of the input bams.
    2. crocea@banyan:/Network/Data/vervet/ref$ bcftools view 454_vs_hg19_20101230_3eQTL.raw.bcf | vcfutils.pl varFilter -D100 > 454_vs_hg19_20101230_3eQTL_D100.raw.vcf
    3. Note 2011-9-21: samtools assigns 0/0 to zero-read-coverage loci while GATK just assigns ./. (missing).
  10. variant calling through GATK
    1. add a read group to one bam file (if multiple, could use "samtools merge" exemplified above) http://seqanswers.com/forums/showthread.php?t=4180
      • echo -e "@RG\tID:454\tSM:hs\tLB:454\tPL:454" > rg.txt
      • samtools view -h 454_vs_hg19.3eQTL.bam | cat rg.txt - | awk '{ if (substr($1,1,1)=="@") print; else printf "%s\tRG:Z:454\n",$0; }' | samtools view -bhS - > 454_vs_hg19.3eQTL.RG454.bam
      • alternatives include "~/script/vervet/src/addRGToBAM.sh", AddOrReplaceReadGroups.jar from picard
    2. create the fasta sequence dict file:
      • package from picard
       java -jar ~/script/vervet/bin/picard-tools/CreateSequenceDictionary.jar R=1Mb-BAC.fasta O=1Mb-BAC.dict
    3. create the fasta index file
      1. samtools faidx 1Mb-BAC.fasta
    4. index the bam file
      1. samtools index 454_vs_ref_1MbBAC.F4.bam
    5. make sure the header of the bam file contains the same set of "SQ"s (reference sequences) as the fasta file. Otherwise GATK will either produce an empty vcf (when the bam header contains more sequences than the fasta file and the first fasta record is not the first SQ in the header) or complain that some reference sequence is not found (when the fasta file contains more).
      1. make sure reads in this bam file are all aligned to references from this set of "SQ"s
      2. For paired-end reads, also throw away reads whose mates are not aligned to this set of "SQ"s; otherwise the bam file is considered corrupt.
      3. use picard's ValidateSamFile.jar to validate (although not everything reported is fatal)
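        • a minimal invocation sketch (not from the original notes; it assumes the picard-tools path used elsewhere in this post): java -jar ~/script/vervet/bin/picard-tools/ValidateSamFile.jar INPUT=454_vs_ref_1MbBAC.F4.bam MODE=SUMMARY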
    6. user@mahogany:/Network/Data/vervet/ref$ java -jar ~/script/vervet/bin/GenomeAnalysisTK-1.0.4705/GenomeAnalysisTK.jar -I 454_vs_ref_1MbBAC.F4.bam -R ~/script/vervet/data/ref/BAC/1Mb-BAC.fa -T UnifiedGenotyper --out 454_vs_ref_1MbBAC.F4.GATK.vcf -U -S SILENT
      • -S,--validation_strictness : how strict to be with validation (STRICT|LENIENT|SILENT)
      • -U,--unsafe : if set, enables unsafe operations; nothing will be checked at runtime. For expert users only who know what they are doing. (ALLOW_UNINDEXED_BAM|ALLOW_EMPTY_INTERVAL_LIST|ALLOW_UNSET_BAM_SORT_ORDER|NO_READ_ORDER_VERIFICATION|ALLOW_SEQ_DICT_INCOMPATIBILITY|ALL)
      • -nt,--num_threads : how many threads to allocate to this analysis
      • GATK outputs genotype columns in the order of read-groups, rather than order of "-I" when you have multiple input BAM files.
      • The order of genotype columns in the vcf matters because vcf-isect (intersection) from vcftools requires that input VCF files have the same set of genotype columns, in the same order.
  11. base quality score re-calibration
    1. java -Xmx12g -jar ~/script/vervet/bin/GenomeAnalysisTK-1.0.4705/GenomeAnalysisTK.jar -l INFO  -R ../../../../NCBI/hs_genome.fasta -I 454_vs_hg19.3eQTL.minPerBaseAS0.4.minMapQ125.score2.bam -T CountCovariates   -cov ReadGroupCovariate    -cov QualityScoreCovariate    -cov CycleCovariate    -cov DinucCovariate -recalFile  454_vs_hg19.3eQTL.minPerBaseAS0.4.minMapQ125.score2.recal_data.csv -B:mask,VCF 454_vs_hg19.3eQTL.minPerBaseAS0.4.minMapQ125.score2.GATK.vcf
    2. java -Xmx12g -jar ~/script/vervet/bin/GenomeAnalysisTK-1.0.4705/GenomeAnalysisTK.jar -l INFO  -R ../../../../NCBI/hs_genome.fasta -I 454_vs_hg19.3eQTL.minPerBaseAS0.4.minMapQ125.score2.bam -T TableRecalibration  --out 454_vs_hg19.3eQTL.minPerBaseAS0.4.minMapQ125.score2.recal.bam -recalFile 454_vs_hg19.3eQTL.minPerBaseAS0.4.minMapQ125.score2.recal_data.csv
  12. analyze quality covariates by GATK's AnalyzeCovariates.jar
    1. analyze original data
      1. run CountCovariates on the original bam file
      2. java -jar ~/script/vervet/bin/GenomeAnalysisTK-1.0.4705/AnalyzeCovariates.jar  -recalFile 454_vs_hg19.3eQTL.minPerBaseAS0.4.minMapQ125.score2.recal_data.csv -resources ~/script/vervet/bin//GenomeAnalysisTK-1.0.4705/resources -outputDir beforeCalibration/
    2. analyze GATK-calibrated data
      1. run CountCovariates on the table-re-calibrated bam file
      2. run AnalyzeCovariates.jar  as above
  13. Calculate coverage per sample, per library, per readgroup, per interval
    1. java -jar ~/script/gatk/dist/GenomeAnalysisTK.jar -T DepthOfCoverage -R /Network/Data/vervet/db/individual_sequence/524_superContigsMinSize2000.fasta -I /Network/Data/vervet/db/individual_alignment/592_52_vs_524_by_2.bam -I /Network/Data/vervet/db/individual_alignment/637_106_vs_524_by_2.bam -pt sample -o /Network/Data/vervet/vervetPipeline/tmp/aln_592_637_depthOfCoverage --omitDepthOutputAtEachBase -L /tmp/contig.interval_list
    2. the interval file contains one entry like the following per line (positions are one-based; bed format does not work, despite what http://www.broadinstitute.org/gsa/wiki/index.php/Input_files_for_the_GATK#Intervals says):
      1. Contig1:1-13722612
    3. Among the outputs, PREFIX.sample_summary and PREFIX.sample_interval_summary are the important ones. Replace "sample" with readgroup or library if another partition type is selected.
    4. Multiple partition types can be handled by adding more "-pt"s, e.g. "-pt readgroup -pt sample -pt library".
  14. select variants from VCF input and output in VCF
    1. http://www.broadinstitute.org/gsa/wiki/index.php/SelectVariants
    2. select by any expression (watch: DP, QD and other attributes are taken from the "INFO" field, not from an individual sample's genotype; per-sample filtering needs a different mechanism).
    3.  java -Xmx2g -jar GenomeAnalysisTK.jar \
         -R ref.fasta \
         -T SelectVariants \
         --variant input.vcf \
         -o output.vcf \
         -se 'SAMPLE.+PARC' \
         -select "QD > 10.0"
    4. select only INDEL
    5.  java -Xmx2g -jar GenomeAnalysisTK.jar \
         -R ref.fasta \
         -T SelectVariants \
         --variant input.vcf \
         -o output.vcf \
         -selectType INDEL
    6. many more
  15. Combine variants from various VCF through GATK's CombineVariants
    1. combine two single-sample VCF into one
    2. java  -Xmx2g -jar /home/crocea/script/gatk/dist/GenomeAnalysisTK.jar -R  /Network/Data/vervet/db/individual_sequence/524_superContigsMinSize2000.fasta -T CombineVariants  -o tmp/CombineSAMtools555_556.vcf -V:foo tmp/SelectSAMtoolsContig0_555_15_1987079_GA_vs_524.vcf -V:foo1 tmp/SelectSAMtoolsContig0_556_16_1985088_GA_vs_524.vcf
    3. more 

variant tools

http://varianttools.sourceforge.net/Main/HomePage

variant tools is a software tool for the annotation, selection, and analysis of variants in the context of next-gen sequencing analysis. Unlike some other tools used for next-gen sequencing analysis, variant tools is project-based and provides a whole set of tools to manipulate and analyze genetic variants.
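To give a concrete flavor of the project-based workflow, here is a minimal sketch; the exact sub-commands and flags should be double-checked against the variant tools documentation, and "myproject" and "my.vcf" are placeholder names:

$ vtools init myproject
$ vtools import my.vcf --build hg19 --format vcf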


GBrowse and biotoolbox - sets of perl programs

1. http://gmod.org/wiki/GBrowse


GBrowse is a combination of database and interactive web pages for manipulating and displaying annotations on genomes. Features include:
  • Simultaneous bird's eye and detailed views of the genome.
  • Scroll, zoom, center.
  • Use a variety of premade glyphs or create your own.
  • Attach arbitrary URLs to any annotation.
  • Order and appearance of tracks are customizable by administrator and end-user.
  • Search by annotation ID, name, or comment.
  • Supports third party annotation using GFF formats.
  • Settings persist across sessions.
  • DNA and GFF dumps.
  • Connectivity to different databases, including BioSQL and Chado.
  • Multi-language support.
  • Third-party feature loading.
  • Customizable plug-in architecture (e.g. run BLAST, dump & import many formats, find oligonucleotides, design primers, create restriction maps, edit features)

2. http://code.google.com/p/biotoolbox/

Lists of programs organized by function.

Data collection

Data analysis

Dataset manipulation

File format conversion

Finding features

Genome annotation

Illumina sequencing, BAM files

Nucleosome Analysis

Microarray

SNP calling

Miscellaneous

Tuesday, September 25, 2012

NCBI2R: NCBI2R-An R package to navigate and annotate genes and SNPs

http://cran.r-project.org/web/packages/NCBI2R/index.html

http://ncbi2r.wordpress.com/

NCBI2R is an R package that annotates lists of SNPs and/or genes, with current information from NCBI, including LD information. Functions are provided that, with one command, will annotate the results from genome wide association studies to provide a broader context for their meaning. Other functions enable comparisons between a user's GWA results and candidate snp/gene lists that are created from keywords, such as specific diseases, phenotypes or gene ontology terms. Commands are simple to follow and designed to work with R objects to integrate into existing workflows. The output produces text fields and weblinks to more information for items such as: gene descriptions, nucleotide positions, OMIM, pathways, phenotypes, and lists of interacting and neighboring genes. Annotation can then be used in R for further analysis, or the objects can be customized for use in spreadsheet programs or web browsers. The NCBI2R package was designed to allow those performing genome analysis to produce output that could easily be understood by a person not familiar with R. Please see the website at http://NCBI2R.wordpress.com for more information. The internet is required for almost all of these functions. Use the function PrintNCBI2RInfo() for information.

pibase - validational and comparative analysis of BAM files

http://www.ikmb.uni-kiel.de/pibase/

Essentials:
pibase_bamref: Extract information from a BAM file and a reference sequence file and table this information into a tab-separated text file.
pibase_consensus: Infer 'best' genotypes and their 'quality' classification, and optionally merge multiple pibase_bamref files (e.g. a control panel or several runs of the same patient) into a single file.
pibase_fisherdiff: Compare two pibase_consensus files using Fisher's exact test on original data (aligned reads) rather than comparing processed data (SNP-calls or genotypes).


Annotate:
pibase_tosnpacts: Works with our unpublished annotation pipeline. Contact us if you wish us to annotate your pibase files with our annotations.
pibase_annot: Works with our unpublished annotation pipeline. Contact us if you wish us to annotate your pibase files with our annotations.
pibase_tag: Add primer region tags to a pibase_bamref, pibase_consensus, pibase_fisherdiff, or pibase_annot file.


Phylogenetics:
pibase_to_rdf: Generate an rdf file (for phylogenetic network analysis) from a set of pibase_fisherdiff files.
pibase_rdf_ref: Generate a reference sample file from a pibase_consensus file, prior to generating the set of pibase_fisherdiff files required for pibase_to_rdf.
pibase_chrm_to_crs: Extract Cambridge Reference Sequence (Anderson 1981) variants from reads mapped to chrM (hg18, hg19, or NCBI36).


Utilities:
pibase_to_vcf: Convert a single-sample pibase file into VCFv4.1 format.
pibase_c_to_contig: Convert pibase contig numbers into contig names (e.g. 25 -> chrM).
pibase_flagsnp: Flag non-reference genotypes in a pibase_consensus file as potential mismatches ("SNPs" in NGS parlance).
pibase_diff: Compare two pibase_consensus files using (BestGen) genotypes and a (BestQual) quality threshold.
pibase_gen_from_snpacts: Works with our unpublished annotation pipeline. Contact us if you wish us to merge SNPs from SNP-callers, SNP-chips, or Sanger sequences into your pibase files for a genotype comparison.
pibase_ref: Get 500 flanking nucleotides of reference sequence around each coordinate (of the input list) and output a pseudo-FASTA file, e.g. for manual Sanger-sequencing primer design.

BamUtil: stats

http://genome.sph.umich.edu/wiki/BamUtil:_stats


Basic (--basic)

Prints summary statistics for the file:
  • TotalReads - # of reads that are in the file
  • MappedReads - # of reads marked mapped in the flag
  • PairedReads - # of reads marked paired in the flag
  • ProperPair - # of reads marked paired AND proper paired in the flag
  • DuplicateReads - # of reads marked duplicate in the flag
  • QCFailureReads - # of reads marked QC failure in the flag
  • MappingRate(%) - # of reads marked mapped in the flag / TotalReads
  • PairedReads(%) - # of reads marked paired in the flag / TotalReads
  • ProperPair(%) - # of reads marked paired AND proper paired in the flag / TotalReads
  • DupRate(%) - # of reads marked duplicate in the flag / TotalReads
  • QCFailRate(%) - # of reads marked QC failure in the flag / TotalReads
  • TotalBases - # of bases in all reads
  • BasesInMappedReads - # of bases in reads marked mapped in the flag

Qual/Phred (--phred and --qual)

Prints a count of the number of times each quality value appears in the file to stderr.
  • phred Displays Quality as phred integers [0-93]
  • qual Displays Quality as non-phred integers (phred + 33) [33-126]
By default, these counts include all qualities in the BAM file.
To exclude unmapped reads and soft clips, use --excludeFlags 4.
To only include records that overlap a set of regions, use --regionList and specify a bed file with the regions. If a read overlaps the region, all qualities will be counted even if those bases do not fall in the region. If you only want to count qualities that fall within the region, also specify --withinRegion. Without excluding unmapped reads, it will include soft clips that overlap the region.
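For example, a typical invocation of these options might look like the following (a sketch only; it assumes the bamUtil executable is named "bam" and that aln.bam and regions.bed are placeholder input files):

bam stats --in aln.bam --basic
bam stats --in aln.bam --phred --excludeFlags 4 --regionList regions.bed --withinRegion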

BaseQC (--pBaseQC and --cBaseQC and --baseSum)

The pBaseQC and cBaseQC options generate per base statistics. Only one of these two options can be specified. They write statistics generated for each position to the file specified after the option. They use the same logic for calculating statistics, but pBaseQC writes the statistics as percentages, while cBaseQC writes them as counts. The order of the statistics is also different.
The baseSum option can be used with either pBaseQC or cBaseQC or on its own. baseSum generates a summary of the per-position statistics and writes it to stderr. It calculates the per-position base statistics even if they will not be written anywhere (i.e. when neither pBaseQC nor cBaseQC is specified).

All three options use the same logic for calculating the statistics:
  • A read spans a position if the read starts at or before the position, ends at or after the position and the position is not a clip. CIGAR operations allowed for the position are M/X/=/D/N. If the CIGAR is '*', only numbers for the specified reference position are incremented.
  • Currently there is no special logic to exclude positions/reads where the reference base is 'N' or the read base is 'N'.
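A corresponding sketch for the per-base options described above (again assuming the executable is named "bam"; perPosStats.txt is a placeholder output file):

bam stats --in aln.bam --cBaseQC perPosStats.txt --baseSum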

FASTX-toolkit - on fasta or fastq files


BAMStats: summarising Next Generation Sequencing alignments

http://bamstats.sourceforge.net/

Mapping is a prerequisite for most next-generation sequencing workflows, and the SAM/BAM file format (1) is the de facto standard for storing such large sequence alignments. BAMStats is a simple software tool built on the Picard Java API (2), which can calculate and graphically display various metrics derived from SAM/BAM files that are of value in QC assessments.

Sunday, September 23, 2012

SNPRelate and gdsfmt - R package for NGS and GWAS

gdsfmt and SNPRelate are high-performance computing R packages for multi-core symmetric multiprocessing computer architectures. They are used to accelerate two key computations in GWAS: principal component analysis (PCA) and relatedness analysis using identity-by-descent (IBD) measures. The kernels of our algorithms are written in C/C++ and have been highly optimized. Benchmarks show the uniprocessor implementations of PCA and IBD are ~10 to 45 times faster than the implementations provided by the popular EIGENSTRAT (v3.0) and PLINK (v1.07) programs respectively, and can be sped up 70- to 250-fold by utilizing eight cores. SNPRelate can analyze tens of thousands of samples, with millions of SNPs.

http://corearray.sourceforge.net/tutorials/SNPRelate/

consensus calls (SNP/Indel/Reference) from a mpileup file


http://varscan.sourceforge.net/using-varscan.html#v2.3_mpileup2cns



mpileup2cns

This command makes consensus calls (SNP/Indel/Reference) from a mpileup file based on user-defined parameters:
 USAGE: java -jar VarScan.jar mpileup2cns [mpileup file] OPTIONS
        mpileup file - The SAMtools mpileup file

 OPTIONS:
 --min-coverage Minimum read depth at a position to make a call [8]
 --min-reads2 Minimum supporting reads at a position to call variants [2]
 --min-avg-qual Minimum base quality at a position to count a read [15]
 --min-var-freq Minimum variant allele frequency threshold [0.01]
 --min-freq-for-hom Minimum frequency to call homozygote [0.75]
 --p-value Default p-value threshold for calling variants [99e-02]
 --strand-filter Ignore variants with >90% support on one strand [1]
 --output-vcf If set to 1, outputs in VCF format
 --variants Report only variant (SNP/indel) positions (mpileup2cns only) [0]

  
 OUTPUT
 Tab-delimited SNP calls with the following columns:
 Chrom  chromosome name
 Position position (1-based)
 Ref   reference allele at this position
 Var   variant allele observed
 PoolCall Cross-sample call using all data (Cons:Cov:Reads1:Reads2:Freq:P-value)
    Cons - consensus genotype in IUPAC format
    Cov - total depth of coverage
    Reads1 - number of reads supporting reference
    Reads2 - number of reads supporting variant
    Freq - the variant allele frequency by read count
    P-value - FET p-value of observed reads vs expected non-variant
 StrandFilt Information to look for strand bias using all reads, format R1+:R1-:R2+:R2-:pval
    R1+ = reference supporting reads on forward strand
    R1- = reference supporting reads on reverse strand
    R2+ = variant supporting reads on forward strand
    R2- = variant supporting reads on reverse strand
    pval = FET p-value for strand distribution, R1 versus R2
 SamplesRef Number of samples called reference (wildtype)
 SamplesHet Number of samples called heterozygous-variant
 SamplesHom Number of samples called homozygous-variant
 SamplesNC Number of samples not covered / not called
 SampleCalls The calls for each sample in the mpileup, space-delimited
       Each sample has six values separated by colons:
   Cons - consensus genotype in IUPAC format
   Cov - total depth of coverage
   Reads1 - number of reads supporting reference
   Reads2 - number of reads supporting variant
   Freq - the variant allele frequency by read count
   P-value - FET p-value of observed reads vs expected non-variant
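A typical end-to-end sketch, piping samtools mpileup directly into VarScan (ref.fasta and the bam files are placeholders; all options come from the list above):

 samtools mpileup -f ref.fasta sample1.bam sample2.bam | java -jar VarScan.jar mpileup2cns --min-coverage 10 --min-var-freq 0.05 --output-vcf 1 --variants 1 > samples.cns.vcf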

Monday, September 17, 2012

tips on installing EggLib on linux

The following notes are from the author of EggLib (http://egglib.sourceforge.net/), Stéphane.

#############################
But first, check if you have the "cmake" program installed on the server. If so, the preferred approach is to issue the following commands in each of bpp-core, bpp-seq and bpp-popgen (in that order, and the other libraries are not needed by egglib).

$ cmake -DCMAKE_INSTALL_PREFIX=~/local .
$ make install

(where ~/local is a directory in your local home folder).

For installing egglib locally, I have managed to do it several times myself, and I found my trick to be good enough at the moment. Please check the following tutorial:

Installation of egglib on a server without administration rights.

1) The server must be POSIX-compliant (as are Linux, MacOS, Solaris).

2) Assume the user home path is: /home/alfred (shortcut as ~).
   Replace it by whatever is the correct path.  Don't use the "~"
   shortcut when invoking "python setup.py".

3) Create a `local` directory for installed software.
   Commonly, one will use ~/.local, but here we will use ~/local.

    $ mkdir ~/local

4) Get the needed source archives within the `local` directory:
    - egglib-cpp-2.1.3.tar.gz
    - eggcoal-2.1.3.tar.gz
    - eggstats-2.1.3.tar.gz
    - egglib-py-2.1.3.tar.gz
    (from http://sourceforge.net/projects/egglib/files/current/)

    and decompress them here.
   
5) Go to egglib-cpp-2.1.3 and type the following commands:

   $ ./configure --prefix=/home/alfred
   $ make
   $ make install

6) If you need them, go to eggcoal and eggstats directories and type
   the following commands:
  
   $ ./configure --prefix=/home/alfred/local LDFLAGS=-L/home/alfred/local/lib CPPFLAGS=-I/home/alfred/local/include
   $ make
   $ make install

7) Go to egglib-py and type the following:

   $ python setup.py build --prefix=/home/alfred/local
   $ python setup.py build_apps --prefix=/home/alfred/local

   The first command should take some time while it is compiling.
  
8) It is not possible to install egglib locally, but you can work around this
   to get the python module operating.

    $ mkdir ~/local/python
    $ cp -R build/lib-linux-*/* ~/local/python
    $ cp -R build/scripts-*/egglib ~/local/python/egglibrun

9) EggLib is now locally installed, but Python is not aware of it. If
   you need to write or run python scripts using egglib, you need to
   start the files with the following:
  
   import sys
   sys.path.append('/home/alfred/local/python')
   import egglib
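An alternative to editing every script (not part of Stéphane's notes): put the directory on PYTHONPATH once, e.g. in ~/.bash_profile, so that "import egglib" works directly:

   export PYTHONPATH="/home/alfred/local/python:$PYTHONPATH"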

tricks to install libsequence and its sub programs in Ubuntu linux

1. the important reference for this operation should be this link from Ross-Ibarra Lab:
http://rilabwiki.ucdavis.edu/doku.php?id=installing_libsequence

2. However, when you install GSL, you will sometimes run into problems. In that case, go to this link to fix the problem by installing some GSL dependencies.

http://beans.seartipy.com/2006/03/15/installing-c-boost-on-gentoo-and-debianubuntu/

######################
# check which version of libboost is available (1.33.1 in this example), and substitute "1.33.1" with the right version code (currently "1.41").

sh# apt-get   install   libboost-date-time-dev libboost-date-time1.33.1   libboost-dev   libboost-doc   libboost-filesystem-dev   libboost-filesystem1.33.1   libboost-graph-dev   libboost-graph1.33.1   libboost-iostreams-dev   libboost-iostreams1.33.1 libboost-program-options-dev   libboost-program-options1.33.1   libboost-python-dev   libboost-python1.33.1   libboost-regex-dev   libboost-regex1.33.1   libboost-signals-dev   libboost-signals1.33.1   libboost-test-dev   libboost-test1.33.1   libboost-thread-dev   libboost-thread1.33.1  
Or,
sh# apt-get install libboost.*-dev libboost-doc libboost.*1.33.1
 
 3. You need to specify the right PATH to run "sudo ./bjam install" when you are installing the boost library.

# add the following two lines to your ".bash_profile" file

export PATH="$PATH:/home/jmao/Downloads/boost_1_51_0"
export PATH="$PATH:/home/jmao/Downloads/boost_1_51_0/stage/lib"

Saturday, September 15, 2012

The UCSC genome browser and associated tools

http://bib.oxfordjournals.org/content/early/2012/08/18/bib.bbs038.full

The UCSC Genome Browser (http://genome.ucsc.edu) is a graphical viewer for genomic data now in its 13th year. Since the early days of the Human Genome Project, it has presented an integrated view of genomic data of many kinds. Now home to assemblies for 58 organisms, the Browser presents visualization of annotations mapped to genomic coordinates. The ability to juxtapose annotations of many types facilitates inquiry-driven data mining. Gene predictions, mRNA alignments, epigenomic data from the ENCODE project, conservation scores from vertebrate whole-genome alignments and variation data may be viewed at any scale from a single base to an entire chromosome. The Browser also includes many other widely used tools, including BLAT, which is useful for alignments from high-throughput sequencing experiments. Private data uploaded as Custom Tracks and Data Hubs in many formats may be displayed alongside the rich compendium of precomputed data in the UCSC database. The Table Browser is a full-featured graphical interface, which allows querying, filtering and intersection of data tables. The Saved Session feature allows users to store and share customized views, enhancing the utility of the system for organizing multiple trains of thought. Binary Alignment/Map (BAM), Variant Call Format and the Personal Genome Single Nucleotide Polymorphisms (SNPs) data formats are useful for visualizing a large sequencing experiment (whole-genome or whole-exome), where the differences between the data set and the reference assembly may be displayed graphically. Support for high-throughput sequencing extends to compact, indexed data formats, such as BAM, bigBed and bigWig, allowing rapid visualization of large datasets from RNA-seq and ChIP-seq experiments via local hosting.


HighSSR: High throughput SSR characterization and locus development from next gen sequencing data

http://bioinformatics.oxfordjournals.org/content/early/2012/09/06/bioinformatics.bts524.abstract

http://code.google.com/p/highssr/

VarB: A Variation Browsing and analysis tool for variants derived from next-generation sequencing data

http://www.pathogenseq.org/varb

http://bioinformatics.oxfordjournals.org/content/early/2012/09/13/bioinformatics.bts557.abstract


Thursday, September 13, 2012

UCSC Source Tree Utilities

1. The utilities include:
  • programs for sorting, splitting, or merging fasta sequences;
  • record parsing and data conversion using GenBank, fasta, nib, and blast data formats;
  • sequence alignment;
  • motif searching;
  • hidden Markov model development; and much more.

Library subroutines are available for everything from managing C data structures such as linked lists, balanced trees, hashes, and directed graphs to developing routines for SQL, HTML, or CGI code. Additional library functions are available for biological sequence and data manipulation tasks such as reverse complementation, codon and amino acid lookup and sequence translation, as well as functions specifically designed for extracting, loading, and manipulating data in the UCSC Genome Browser Databases.

2. install it in Debian/Ubuntu
http://genomewiki.ucsc.edu/index.php/Source_tree_compilation_on_Debian/Ubuntu

3.

Compiling UCSC Source Tree Utilities on OSX


1) MySQL is a prerequisite for this build. Although it is installed on the Rice SUG@R cluster, I cannot find the file "libmysqlclient.a" on SUG@R, which is needed for the UCSC Kent utilities build. So I have to install a copy of MySQL in my home directory. The newest version of MySQL is 5.5.24 and it can be downloaded here. I downloaded the source code version for Generic Linux (Architecture Independent). The current version of MySQL needs CMake to compile, so before we can do anything with MySQL, we need to back up a step and install CMake first.
> tar -xzvf cmake-2.8.8.tar.gz
> cd cmake-2.8.8
> ./configure --prefix=/users/NetID/local/
> gmake
> make install
Then install MySQL using CMake:
> tar xvzf mysql-5.5.24.tar.gz
> cd mysql-5.5.24
> mkdir -p /users/NetID/local/mysql
> cmake -DCMAKE_INSTALL_PREFIX=/users/NetID/local/mysql
> make
> make install
Of course, we still need more work to configure MySQL to make it work well. But since we have MySQL support on SUG@R already and only need one file from this installation, we can take care of those configuration steps later, when we actually need to run our local copy of MySQL.
2) Now we can finally work on building UCSC Kent Utilities. Check the shell environment first. We need to change the value of MACHTYPE to x86_64 for our build and also set this in the .bashrc file under my home directory.
> echo $MACHTYPE
x86_64-redhat-linux-gnu
> MACHTYPE=x86_64
> emacs .bashrc
export MACHTYPE=x86_64   # add this line into the .bashrc file
3) Under the $HOME/bin/ directory, create a directory named x86_64 for the binary files generated during compilation.
> mkdir -p $HOME/bin/${MACHTYPE}
4) Create the MySQL shell environment variables we need:
> MYSQLINC=/usr/include/mysql
> MYSQLLIBS="/users/NetID/local/mysql/lib/libmysqlclient.a -lz"
> export MYSQLINC MYSQLLIBS
5) Now we are ready to compile those utilities. Unzip the UCSC source code we downloaded and we will see a resulting directory named kent.
> unzip jksrc.zip
> cd kent/src
> make libs
> cd utils/
> make
6) Then we are done. All the compiled binary utilities can be found in the $HOME/bin/$MACHTYPE directory. For me, it is /users/NetID/bin/x86_64/. Just try one, say faSplit. After typing faSplit in the shell, we see the usage for this utility. So our build works!
> cd $HOME/bin/$MACHTYPE
> ./faSplit
faSplit - Split an fa file into several files.
usage:
   faSplit how input.fa count outRoot
where how is either 'about' 'byname' 'base' 'gap' 'sequence' or 'size'.
Files split by sequence will be broken at the nearest fa record boundary.
Files split by base will be broken at any base.
Files broken by size will be broken every count bases.
...