Thursday, September 27, 2012

tips on NGS data processing

  1. http://crocea.mednet.ucla.edu/research/variation/next-gen-sequence-how-to
  2. one reference genome in one fasta file
    1. FileFormatExchange.combineMultiFastaIntoOne() in variation.src.misc
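    2. alternatively, if the per-chromosome fasta files are already clean, a plain shell concatenation does the same job (a minimal sketch, assuming the files are named chr*.fa in the current directory):
      • cat chr*.fa > hsRef.fa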
  3. bwa index
    1. bwa index -a bwtsw hsRef.fa
      • "bwtsw" is for genome >2Gb
  4. move all alignment short reads into one directory
    1. for all subspecies
      • for j in aethiops cynosurus pygerythrus sabaeus tantalus; do echo $j; for i in `find /u/home/eeskin/namtran/Experiment/Genome/Freimer/Chlorocebus/INPUT/$j/ -name sequence*.tar.gz -print`; do echo $i; ln -s $i ./$j ; done; done
    2. for sequences stored in bam format on WUSTL server
      1. run shell/wgetGivenURL.sh to fetch sequence bam files from WUSTL
      2. run picard to convert bam to paired-end fastq
        • java -jar ~/script//vervet/bin/picard-tools/SamToFastq.jar INPUT=gerald_62FGFAAXX_3.bam F=gerald_62FGFAAXX_3_1.fastq F2=gerald_62FGFAAXX_3_2.fastq
  5. bwa alignment (aln or bwtsw; single-end or paired-end)
    1. vervetdb.src.MpiBWA.py includes these steps
      1. align
        1. input fastq files can be in plain text or gzipped.
        2. NOTE: For bwtsw, one read can appear in more than one alignment in the final bam file, e.g. 212904 unique reads among 217507 mapped reads (max redundant read count = 4), i.e. ~2% of reads appear in >1 alignment. A quick way to count this is sketched in item 3 below.
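        3. a quick check of unique vs. mapped read counts (a minimal sketch, assuming samtools is on the PATH and alignment.bam is the final bam):
          • samtools view -F 4 alignment.bam | cut -f1 | sort -u | wc -l    # number of unique mapped read names
          • samtools view -c -F 4 alignment.bam    # total number of mapped records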
      2. convert sam to bam
        1. samtools view -F 4 -Sbh alignment.sam > alignment.bam
          • "-F 4" instructs samtools to remove unmapped reads
    2. merge all bam files into one bam (have to be sorted if "bwa sort" is used)
      1. perl -e 'print "\@RG\tID:ga\tSM:hs\tLB:ga\tPL:Illumina\n\@RG\tID:454\tSM:hs\tLB:454\tPL:454\n"' > rg.txt
        • This tags the 454 and Illumina results by read group (RG). Usually not necessary.
      2. samtools merge -rh rg.txt - *.bam | samtools rmdup - - | samtools rmdup -s - aln.bam
        • duplicate removal for single-end reads is not recommended when coverage is very deep, as this step may discard too many good alignments.
        • "-rh rg.txt" is again to add RG header into the final bam file. Without it, the command is: samtools merge merged.bam  *.bam
      3. samtools merge - *.bam | samtools view -bh -F 4 - -o aln.bam
        • merge all bam files and remove unaligned reads
        • WATCH "-" in "samtools merge". It instructs "merge" to output to stdout.
        • If read group (RG) tags are present in original bam files, samtools merge only takes RG info in the header from the first bam file.
      4. picard merge combines the headers from all bam files, rather than taking only the header from the first bam (as samtools merge does).
        • java -jar ~/script//vervet/bin/picard-tools/MergeSamFiles.jar INPUT=ref_illu_vs_1MbBAC_by_aln.RG.bam INPUT=454_vs_ref_1MbBAC.F4.RG.bam INPUT=cynosurus_vs_1MbBAC_by_aln.RG.bam USE_THREADING=true ASSUME_SORTED=true OUTPUT=454_illu_6_sub_vs_1MbBAC.RG.bam
    3. remove unmapped reads from the bam file:
      1. samtools view -F 4 -b -h -o 454_vs_ref_1MbBAC_c30.F4.bam 454_vs_ref_1MbBAC_c30.bam
    4. select one region out of whole-genome bam and sort it as well
      1. crocea@mahogany:/Network/Data/vervet/ref$ samtools view -u 454_vs_hg19_20101230.bam 9606_chr9:124,000,000-124,210,000 | samtools sort - 454_vs_hg19_20101230.chr9_124M_124.2M
        • option -u makes samtools output uncompressed BAM, which reduces computing time.
  6. convert bam into sam for checking
    1. samtools view -h alignment.bam > alignment.sam
      • "-h" : include the header in the output
  7. view alignment through samtools tview
    1. crocea@mahogany:/Network/Data/vervet/ref$ samtools index 454_vs_hg19_20101230.chr9_124M_124.2M.bam
    2. crocea@mahogany:/Network/Data/vervet/ref$ samtools tview 454_vs_hg19_20101230.chr9_124M_124.2M.bam ../../NCBI/hs_genome.fasta
  8. view alignment through IGV
    1. select "Import Genome ..." to import a fasta file and create a GenomeName.genome (i.e. myHG19.genome) file
      1. Fetch the gene annotation file in BED format from the UCSC Table Browser:
        • choose a genome (e.g. hg19) as assembly
        • choose group: Genes and Gene Prediction tracks
        • choose track: UCSC Genes (more isoforms, less accurate) or RefSeq Genes (fewer isoforms, more accurate)
        • choose output format: BED (or any other format you'd like, but BED can be displayed within the genome browser)
        • enter a file name in the text box below and click "get output"
        • you are then referred to another page, where you may choose to get all exons, additional intron coordinates, just coding exons, 5'UTR exons, etc.
    2. index the bam file first (required)
      • crocea@mahogany:/Network/Data/vervet/ref$ samtools index 454_vs_hg19_20101230.chr13_31.67M_32.01M.bam
    3. crocea@banyan:/Network/Data/vervet/ref$ java -jar /home/crocea/script/vervet/bin/IGVTools/igvtools.jar count 454_vs_hg19_20101230.bam 454_vs_hg19_20101230.depth ../../NCBI/myHG19.genome
  9. variant calling through samtools
    1. samtools mpileup -D -d 40 -ugf .../NCBI/hs_genome.fasta      454_vs_hg19_20101230.chr9_124M_124.2M.bam 454_vs_hg19_20101230.chr13_31.67M_32.01M.bam 454_vs_hg19_20101230.chr20_4.085M_4.231M.bam | bcftools view -bvcg -p 1 - > 454_vs_hg19_20101230_3eQTL.raw.bcf
      • "-d 40" instructs mpileup to, at a position, read maximally INT reads per input BAM. [250]
      • "-p 1" : -p FLOAT variant if P(ref|D)
      • "-D" instructs mpileup to output per-sample read depth.
      • If each bam has a read group added, samtools uses it to label that sample's genotype column. If not, the filename is used instead.
      • samtools outputs genotype columns in the order of the input bams.
    2. crocea@banyan:/Network/Data/vervet/ref$ bcftools view 454_vs_hg19_20101230_3eQTL.raw.bcf | vcfutils.pl varFilter -D100 > 454_vs_hg19_20101230_3eQTL_D100.raw.vcf
    3. Note 2011-9-21: samtools assigns 0/0 to zero-read-coverage loci while GATK just assigns ./. (missing).
  10. variant calling through GATK
    1. add a read group to one bam file (if multiple, could use "samtools merge" exemplified above) http://seqanswers.com/forums/showthread.php?t=4180
      • echo -e "@RG\tID:454\tSM:hs\tLB:454\tPL:454" > rg.txt
      • samtools view -h 454_vs_hg19.3eQTL.bam | cat rg.txt - | awk '{ if (substr($1,1,1)=="@") print; else printf "%s\tRG:Z:454\n",$0; }' | samtools view -bhS - > 454_vs_hg19.3eQTL.RG454.bam
      • alternatives include "~/script/vervet/src/addRGToBAM.sh", AddOrReplaceReadGroups.jar from picard
    2. create the fasta sequence dict file:
      • package from picard
       java -jar ~/script/vervet/bin/picard-tools/CreateSequenceDictionary.jar R=1Mb-BAC.fasta O=1Mb-BAC.dict
    3. create the fasta index file
      1. samtools faidx 1Mb-BAC.fasta
    4. index the bam file
      1. samtools index 454_vs_ref_1MbBAC.F4.bam
    5. make sure the header of the bam file contains the same set of "SQ"s (reference sequences) as the fasta file. Otherwise GATK will either produce an empty vcf (when the bam header contains more sequences than the fasta file and the first fasta record is not the first SQ in the header) or complain that some reference sequence is not found (when the fasta file contains more).
      1. make sure reads in this bam file are all aligned to references from this set of "SQ"s
      2. For paired-end reads, also throw away reads whose mates are not aligned to this set of "SQ"s; otherwise the bam file is considered corrupt.
      3. use picard's ValidateSamFile.jar to validate (although not everything reported is fatal)
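        • a minimal invocation sketch (not from the original notes; it assumes the picard-tools path used elsewhere in this post): java -jar ~/script/vervet/bin/picard-tools/ValidateSamFile.jar INPUT=454_vs_ref_1MbBAC.F4.bam MODE=SUMMARY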
    6. user@mahogany:/Network/Data/vervet/ref$ java -jar ~/script/vervet/bin/GenomeAnalysisTK-1.0.4705/GenomeAnalysisTK.jar -I 454_vs_ref_1MbBAC.F4.bam -R ~/script/vervet/data/ref/BAC/1Mb-BAC.fa -T UnifiedGenotyper --out 454_vs_ref_1MbBAC.F4.GATK.vcf -U -S SILENT
      • -S,--validation_strictness : how strict to be with validation (STRICT|LENIENT|SILENT)
      • -U,--unsafe : if set, enables unsafe operations; nothing will be checked at runtime. For expert users only who know what they are doing. (ALLOW_UNINDEXED_BAM|ALLOW_EMPTY_INTERVAL_LIST|ALLOW_UNSET_BAM_SORT_ORDER|NO_READ_ORDER_VERIFICATION|ALLOW_SEQ_DICT_INCOMPATIBILITY|ALL)
      • -nt,--num_threads : how many threads to allocate to this analysis
      • GATK outputs genotype columns in the order of read-groups, rather than order of "-I" when you have multiple input BAM files.
      • The order of genotype columns in the vcf matters because vcf-isect (intersection) from vcftools requires that input VCF files have the same set of genotype columns, in the same order.
  11. base quality score re-calibration
    1. java -Xmx12g -jar ~/script/vervet/bin/GenomeAnalysisTK-1.0.4705/GenomeAnalysisTK.jar -l INFO  -R ../../../../NCBI/hs_genome.fasta -I 454_vs_hg19.3eQTL.minPerBaseAS0.4.minMapQ125.score2.bam -T CountCovariates   -cov ReadGroupCovariate    -cov QualityScoreCovariate    -cov CycleCovariate    -cov DinucCovariate -recalFile  454_vs_hg19.3eQTL.minPerBaseAS0.4.minMapQ125.score2.recal_data.csv -B:mask,VCF 454_vs_hg19.3eQTL.minPerBaseAS0.4.minMapQ125.score2.GATK.vcf
    2. java -Xmx12g -jar ~/script/vervet/bin/GenomeAnalysisTK-1.0.4705/GenomeAnalysisTK.jar -l INFO  -R ../../../../NCBI/hs_genome.fasta -I 454_vs_hg19.3eQTL.minPerBaseAS0.4.minMapQ125.score2.bam -T TableRecalibration  --out 454_vs_hg19.3eQTL.minPerBaseAS0.4.minMapQ125.score2.recal.bam -recalFile 454_vs_hg19.3eQTL.minPerBaseAS0.4.minMapQ125.score2.recal_data.csv
  12. analyze quality covariates by GATK's AnalyzeCovariates.jar
    1. analyze original data
      1. run CountCovariates on the original bam file
      2. java -jar ~/script/vervet/bin/GenomeAnalysisTK-1.0.4705/AnalyzeCovariates.jar  -recalFile 454_vs_hg19.3eQTL.minPerBaseAS0.4.minMapQ125.score2.recal_data.csv -resources ~/script/vervet/bin//GenomeAnalysisTK-1.0.4705/resources -outputDir beforeCalibration/
    2. analyze GATK-calibrated data
      1. run CountCovariates on the table-re-calibrated bam file
      2. run AnalyzeCovariates.jar  as above
  13. Calculate coverage per sample, per library, per readgroup, per interval
    1. java -jar ~/script/gatk/dist/GenomeAnalysisTK.jar -T DepthOfCoverage -R /Network/Data/vervet/db/individual_sequence/524_superContigsMinSize2000.fasta -I /Network/Data/vervet/db/individual_alignment/592_52_vs_524_by_2.bam -I /Network/Data/vervet/db/individual_alignment/637_106_vs_524_by_2.bam -pt sample -o /Network/Data/vervet/vervetPipeline/tmp/aln_592_637_depthOfCoverage --omitDepthOutputAtEachBase -L /tmp/contig.interval_list
    2. the interval file contains one entry like the following per line (positions are one-based; bed format does not work, despite what http://www.broadinstitute.org/gsa/wiki/index.php/Input_files_for_the_GATK#Intervals says):
      1. Contig1:1-13722612
    3. Among the outputs, PREFIX.sample_summary and PREFIX.sample_interval_summary are the important ones. Replace "sample" with readgroup or library if another partition type is selected.
    4. Multiple partition types can be handled by adding more "-pt"s, e.g. "-pt readgroup -pt sample -pt library".
  14. select variants from VCF input and output in VCF
    1. http://www.broadinstitute.org/gsa/wiki/index.php/SelectVariants
    2. select by any expression (watch: DP, QD and other attributes are taken from the "INFO" field, not from an individual sample's genotype; per-sample filtering needs a different mechanism).
    3.  java -Xmx2g -jar GenomeAnalysisTK.jar \
         -R ref.fasta \
         -T SelectVariants \
         --variant input.vcf \
         -o output.vcf \
         -se 'SAMPLE.+PARC' \
         -select "QD > 10.0"
    4. select only INDEL
    5.  java -Xmx2g -jar GenomeAnalysisTK.jar \
         -R ref.fasta \
         -T SelectVariants \
         --variant input.vcf \
         -o output.vcf \
         -selectType INDEL
    6. many more
  15. Combine variants from various VCF through GATK's CombineVariants
    1. combine two single-sample VCF into one
    2. java  -Xmx2g -jar /home/crocea/script/gatk/dist/GenomeAnalysisTK.jar -R  /Network/Data/vervet/db/individual_sequence/524_superContigsMinSize2000.fasta -T CombineVariants  -o tmp/CombineSAMtools555_556.vcf -V:foo tmp/SelectSAMtoolsContig0_555_15_1987079_GA_vs_524.vcf -V:foo1 tmp/SelectSAMtoolsContig0_556_16_1985088_GA_vs_524.vcf
    3. more 

variant tools

http://varianttools.sourceforge.net/Main/HomePage

variant tools is a software tool for the annotation, selection, and analysis of variants in the context of next-gen sequencing analysis. Unlike some other tools used for next-gen sequencing analysis, variant tools is project-based and provides a whole set of tools to manipulate and analyze genetic variants.
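To give a concrete flavor of the project-based workflow, here is a minimal sketch; the exact sub-commands and flags should be double-checked against the variant tools documentation, and "myproject" and "my.vcf" are placeholder names:

$ vtools init myproject
$ vtools import my.vcf --build hg19 --format vcf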


GBrowse and biotoolbox - sets of perl programs

1. http://gmod.org/wiki/GBrowse


GBrowse is a combination of database and interactive web pages for manipulating and displaying annotations on genomes. Features include:
  • Simultaneous bird's eye and detailed views of the genome.
  • Scroll, zoom, center.
  • Use a variety of premade glyphs or create your own.
  • Attach arbitrary URLs to any annotation.
  • Order and appearance of tracks are customizable by administrator and end-user.
  • Search by annotation ID, name, or comment.
  • Supports third party annotation using GFF formats.
  • Settings persist across sessions.
  • DNA and GFF dumps.
  • Connectivity to different databases, including BioSQL and Chado.
  • Multi-language support.
  • Third-party feature loading.
  • Customizable plug-in architecture (e.g. run BLAST, dump & import many formats, find oligonucleotides, design primers, create restriction maps, edit features)

2. http://code.google.com/p/biotoolbox/

Lists of programs organized by function.

Data collection

Data analysis

Dataset manipulation

File format conversion

Finding features

Genome annotation

Illumina sequencing, BAM files

Nucleosome Analysis

Microarray

SNP calling

Miscellaneous

Tuesday, September 25, 2012

NCBI2R: NCBI2R-An R package to navigate and annotate genes and SNPs

http://cran.r-project.org/web/packages/NCBI2R/index.html

http://ncbi2r.wordpress.com/

NCBI2R is an R package that annotates lists of SNPs and/or genes, with current information from NCBI, including LD information. Functions are provided that, with one command, will annotate the results from genome wide association studies to provide a broader context for their meaning. Other functions enable comparisons between a user's GWA results and candidate snp/gene lists that are created from keywords, such as specific diseases, phenotypes or gene ontology terms. Commands are simple to follow and designed to work with R objects to integrate into existing workflows. The output produces text fields and weblinks to more information for items such as: gene descriptions, nucleotide positions, OMIM, pathways, phenotypes, and lists of interacting and neighboring genes. Annotation can then be used in R for further analysis, or the objects can be customized for use in spreadsheet programs or web browsers. The NCBI2R package was designed to allow those performing genome analysis to produce output that could easily be understood by a person not familiar with R. Please see the website at http://NCBI2R.wordpress.com for more information. The internet is required for almost all of these functions. Use the function PrintNCBI2RInfo() for information.

pibase - validational and comparative analysis of BAM files

http://www.ikmb.uni-kiel.de/pibase/

Essentials:
pibase_bamref: Extract information from a BAM file and a reference sequence file and table this information into a tab-separated text file.
pibase_consensus: Infer 'best' genotypes and their 'quality' classification, and optionally merge multiple pibase_bamref files (e.g. a control panel or several runs of the same patient) into a single file.
pibase_fisherdiff: Compare two pibase_consensus files using Fisher's exact test on original data (aligned reads) rather than comparing processed data (SNP-calls or genotypes).


Annotate:
pibase_tosnpacts: Works with our unpublished annotation pipeline. Contact us if you wish us to annotate your pibase files with our annotations.
pibase_annot: Works with our unpublished annotation pipeline. Contact us if you wish us to annotate your pibase files with our annotations.
pibase_tag: Add primer region tags to a pibase_bamref, pibase_consensus, pibase_fisherdiff, or pibase_annot file.


Phylogenetics:
pibase_to_rdf: Generate an rdf file (for phylogenetic network analysis) from a set of pibase_fisherdiff files.
pibase_rdf_ref: Generate a reference sample file from a pibase_consensus file, prior to generating the set of pibase_fisherdiff files required for pibase_to_rdf.
pibase_chrm_to_crs: Extract Cambridge Reference Sequence (Anderson 1981) variants from reads mapped to chrM (hg18, hg19, or NCBI36).


Utilities:
pibase_to_vcf: Convert a single-sample pibase file into VCFv4.1 format.
pibase_c_to_contig: Convert pibase contig numbers into contig names (e.g. 25 -> chrM).
pibase_flagsnp: Flag non-reference genotypes in a pibase_consensus file as potential mismatches ("SNPs" in NGS parlance).
pibase_diff: Compare two pibase_consensus files using (BestGen) genotypes and a (BestQual) quality threshold.
pibase_gen_from_snpacts: Works with our unpublished annotation pipeline. Contact us if you wish us to merge SNPs from SNP-callers, SNP-chips, or Sanger sequences into your pibase files for a genotype comparison.
pibase_ref: Get 500 flanking nucleotides of reference sequence around each coordinate (of the input list) and output a pseudo-FASTA file, e.g. for manual Sanger-sequencing primer design.

BamUtil: stats

http://genome.sph.umich.edu/wiki/BamUtil:_stats


Basic (--basic)

Prints summary statistics for the file:
  • TotalReads - # of reads that are in the file
  • MappedReads - # of reads marked mapped in the flag
  • PairedReads - # of reads marked paired in the flag
  • ProperPair - # of reads marked paired AND proper paired in the flag
  • DuplicateReads - # of reads marked duplicate in the flag
  • QCFailureReads - # of reads marked QC failure in the flag
  • MappingRate(%) - # of reads marked mapped in the flag / TotalReads
  • PairedReads(%) - # of reads marked paired in the flag / TotalReads
  • ProperPair(%) - # of reads marked paired AND proper paired in the flag / TotalReads
  • DupRate(%) - # of reads marked duplicate in the flag / TotalReads
  • QCFailRate(%) - # of reads marked QC failure in the flag / TotalReads
  • TotalBases - # of bases in all reads
  • BasesInMappedReads - # of bases in reads marked mapped in the flag

Qual/Phred (--phred and --qual)

Prints a count of the number of times each quality value appears in the file to stderr.
  • phred Displays Quality as phred integers [0-93]
  • qual Displays Quality as non-phred integers (phred + 33) [33-126]
By default, these counts include all qualities in the BAM file.
To exclude unmapped reads and soft clips, use --excludeFlags 4.
To only include records that overlap a set of regions, use --regionList and specify a bed file with the regions. If a read overlaps the region, all qualities will be counted even if those bases do not fall in the region. If you only want to count qualities that fall within the region, also specify --withinRegion. Without excluding unmapped reads, it will include soft clips that overlap the region.
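For example, a typical invocation of these options might look like the following (a sketch only; it assumes the bamUtil executable is named "bam" and that aln.bam and regions.bed are placeholder input files):

bam stats --in aln.bam --basic
bam stats --in aln.bam --phred --excludeFlags 4 --regionList regions.bed --withinRegion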

BaseQC (--pBaseQC and --cBaseQC and --baseSum)

The pBaseQC and cBaseQC options generate per base statistics. Only one of these two options can be specified. They write statistics generated for each position to the file specified after the option. They use the same logic for calculating statistics, but pBaseQC writes the statistics as percentages, while cBaseQC writes them as counts. The order of the statistics is also different.
The baseSum option can be used with either pBaseQC or cBaseQC or on its own. baseSum generates a summary of the per-position statistics and writes it to stderr. It calculates the per-position base statistics even if they will not be written anywhere (i.e. when neither pBaseQC nor cBaseQC is specified).

All three options use the same logic for calculating the statistics:
  • A read spans a position if the read starts at or before the position, ends at or after the position and the position is not a clip. CIGAR operations allowed for the position are M/X/=/D/N. If the CIGAR is '*', only numbers for the specified reference position are incremented.
  • Currently there is no special logic to exclude positions/reads where the reference base is 'N' or the read base is 'N'.
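A corresponding sketch for the per-base options described above (again assuming the executable is named "bam"; perPosStats.txt is a placeholder output file):

bam stats --in aln.bam --cBaseQC perPosStats.txt --baseSum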

FASTX-toolkit - on fasta or fastq files


BAMStats: summarising Next Generation Sequencing alignments

http://bamstats.sourceforge.net/

Mapping is a prerequisite for most next-generation sequencing workflows, and the SAM/BAM file format (1) is the de facto standard for storing such large sequence alignments. BAMStats is a simple software tool built on the Picard Java API (2), which can calculate and graphically display various metrics derived from SAM/BAM files that are of value in QC assessments.

Sunday, September 23, 2012

SNPRelate and gdsfmt - R package for NGS and GWAS

gdsfmt and SNPRelate are high-performance computing R packages for multi-core symmetric multiprocessing computer architectures. They are used to accelerate two key computations in GWAS: principal component analysis (PCA) and relatedness analysis using identity-by-descent (IBD) measures. The kernels of our algorithms are written in C/C++ and have been highly optimized. Benchmarks show the uniprocessor implementations of PCA and IBD are ~10 to 45 times faster than the implementations provided by the popular EIGENSTRAT (v3.0) and PLINK (v1.07) programs respectively, and can be sped up 70- to 250-fold by utilizing eight cores. SNPRelate can analyze tens of thousands of samples, with millions of SNPs.

http://corearray.sourceforge.net/tutorials/SNPRelate/

consensus calls (SNP/Indel/Reference) from a mpileup file


http://varscan.sourceforge.net/using-varscan.html#v2.3_mpileup2cns



mpileup2cns

This command makes consensus calls (SNP/Indel/Reference) from a mpileup file based on user-defined parameters:
 USAGE: java -jar VarScan.jar mpileup2cns [mpileup file] OPTIONS
        mpileup file - The SAMtools mpileup file

 OPTIONS:
 --min-coverage Minimum read depth at a position to make a call [8]
 --min-reads2 Minimum supporting reads at a position to call variants [2]
 --min-avg-qual Minimum base quality at a position to count a read [15]
 --min-var-freq Minimum variant allele frequency threshold [0.01]
 --min-freq-for-hom Minimum frequency to call homozygote [0.75]
 --p-value Default p-value threshold for calling variants [99e-02]
 --strand-filter Ignore variants with >90% support on one strand [1]
 --output-vcf If set to 1, outputs in VCF format
 --variants Report only variant (SNP/indel) positions (mpileup2cns only) [0]

  
 OUTPUT
 Tab-delimited SNP calls with the following columns:
 Chrom  chromosome name
 Position position (1-based)
 Ref   reference allele at this position
 Var   variant allele observed
 PoolCall Cross-sample call using all data (Cons:Cov:Reads1:Reads2:Freq:P-value)
    Cons - consensus genotype in IUPAC format
    Cov - total depth of coverage
    Reads1 - number of reads supporting reference
    Reads2 - number of reads supporting variant
    Freq - the variant allele frequency by read count
    P-value - FET p-value of observed reads vs expected non-variant
 StrandFilt Information to look for strand bias using all reads, format R1+:R1-:R2+:R2-:pval
    R1+ = reference supporting reads on forward strand
    R1- = reference supporting reads on reverse strand
    R2+ = variant supporting reads on forward strand
    R2- = variant supporting reads on reverse strand
    pval = FET p-value for strand distribution, R1 versus R2
 SamplesRef Number of samples called reference (wildtype)
 SamplesHet Number of samples called heterozygous-variant
 SamplesHom Number of samples called homozygous-variant
 SamplesNC Number of samples not covered / not called
 SampleCalls The calls for each sample in the mpileup, space-delimited
       Each sample has six values separated by colons:
   Cons - consensus genotype in IUPAC format
   Cov - total depth of coverage
   Reads1 - number of reads supporting reference
   Reads2 - number of reads supporting variant
   Freq - the variant allele frequency by read count
   P-value - FET p-value of observed reads vs expected non-variant
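A typical end-to-end sketch, piping samtools mpileup directly into VarScan (ref.fasta and the bam files are placeholders; all options come from the list above):

 samtools mpileup -f ref.fasta sample1.bam sample2.bam | java -jar VarScan.jar mpileup2cns --min-coverage 10 --min-var-freq 0.05 --output-vcf 1 --variants 1 > samples.cns.vcf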

Monday, September 17, 2012

tips on installing EggLib on linux

The following notes are from the author of EggLib (http://egglib.sourceforge.net/), Stéphane.

#############################
But first, check if you have the "cmake" program installed on the server. If so, the preferred approach is to issue the following commands in each of bpp-core, bpp-seq and bpp-popgen (in that order, and the other libraries are not needed by egglib).

$ cmake -DCMAKE_INSTALL_PREFIX=~/local .
$ make install

(where ~/local is a directory in your local home folder).

For installing egglib locally, I have managed to do it several times myself, and I found my trick to be good enough at the moment. Please check the following tutorial:

Installation of egglib on a server without administration rights.

1) The server must be POSIX-compliant (as are Linux, MacOS, Solaris).

2) Assume the user home path is: /home/alfred (shortcut as ~).
   Replace it by whatever is the correct path.  Don't use the "~"
   shortcut when invoking "python setup.py".

3) Create a `local` directory for installed software.
   Commonly, one will use ~/.local, but here we will use ~/local.

    $ mkdir ~/local

4) Get the needed source archives within the `local` directory:
    - egglib-cpp-2.1.3.tar.gz
    - eggcoal-2.1.3.tar.gz
    - eggstats-2.1.3.tar.gz
    - egglib-py-2.1.3.tar.gz
    (from http://sourceforge.net/projects/egglib/files/current/)

    and decompress them here.
   
5) Go to egglib-cpp-2.1.3 and type the following commands:

   $ ./configure --prefix=/home/alfred
   $ make
   $ make install

6) If you need them, go to eggcoal and eggstats directories and type
   the following commands:
  
   $ ./configure --prefix=/home/alfred/local LDFLAGS=-L/home/alfred/local/lib CPPFLAGS=-I/home/alfred/local/include
   $ make
   $ make install

7) Go to egglib-py and type the following:

   $ python setup.py build --prefix=/home/alfred/local
   $ python setup.py build_apps --prefix=/home/alfred/local

   The first command should take some time while it is compiling.
  
8) It is not possible to install egglib locally, but you can work around this
   to get the python module operating.

    $ mkdir ~/local/python
    $ cp -R build/lib-linux-*/* ~/local/python
    $ cp -R build/scripts-*/egglib ~/local/python/egglibrun

9) EggLib is now locally installed, but Python is not aware of it. If
   you need to write or run python scripts using egglib, you need to
   start the files with the following:
  
   import sys
   sys.path.append('/home/alfred/local/python')
   import egglib
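An alternative to editing every script (not part of Stéphane's notes): put the directory on PYTHONPATH once, e.g. in ~/.bash_profile, so that "import egglib" works directly:

   export PYTHONPATH="/home/alfred/local/python:$PYTHONPATH"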

tricks to install libsequence and its sub programs in Ubuntu linux

1. the important reference for this operation should be this link from Ross-Ibarra Lab:
http://rilabwiki.ucdavis.edu/doku.php?id=installing_libsequence

2. However, when you install GSL, you will sometimes run into problems. In that case, go to this link to fix the problem by installing some GSL dependencies.

http://beans.seartipy.com/2006/03/15/installing-c-boost-on-gentoo-and-debianubuntu/

######################
# check which version of libboost is available (1.33.1 in this example), and substitute "1.33.1" with the right version code (currently "1.41").

sh# apt-get   install   libboost-date-time-dev libboost-date-time1.33.1   libboost-dev   libboost-doc   libboost-filesystem-dev   libboost-filesystem1.33.1   libboost-graph-dev   libboost-graph1.33.1   libboost-iostreams-dev   libboost-iostreams1.33.1 libboost-program-options-dev   libboost-program-options1.33.1   libboost-python-dev   libboost-python1.33.1   libboost-regex-dev   libboost-regex1.33.1   libboost-signals-dev   libboost-signals1.33.1   libboost-test-dev   libboost-test1.33.1   libboost-thread-dev   libboost-thread1.33.1  
Or,
sh# apt-get install libboost.*-dev libboost-doc libboost.*1.33.1
 
 3. You need to specify the right PATH to run "sudo ./bjam install" when you are installing the boost library.

# add the following two lines to your ".bash_profile" file

export PATH="$PATH:/home/jmao/Downloads/boost_1_51_0"
export PATH="$PATH:/home/jmao/Downloads/boost_1_51_0/stage/lib"

Saturday, September 15, 2012

The UCSC genome browser and associated tools

http://bib.oxfordjournals.org/content/early/2012/08/18/bib.bbs038.full

The UCSC Genome Browser (http://genome.ucsc.edu) is a graphical viewer for genomic data now in its 13th year. Since the early days of the Human Genome Project, it has presented an integrated view of genomic data of many kinds. Now home to assemblies for 58 organisms, the Browser presents visualization of annotations mapped to genomic coordinates. The ability to juxtapose annotations of many types facilitates inquiry-driven data mining. Gene predictions, mRNA alignments, epigenomic data from the ENCODE project, conservation scores from vertebrate whole-genome alignments and variation data may be viewed at any scale from a single base to an entire chromosome. The Browser also includes many other widely used tools, including BLAT, which is useful for alignments from high-throughput sequencing experiments. Private data uploaded as Custom Tracks and Data Hubs in many formats may be displayed alongside the rich compendium of precomputed data in the UCSC database. The Table Browser is a full-featured graphical interface, which allows querying, filtering and intersection of data tables. The Saved Session feature allows users to store and share customized views, enhancing the utility of the system for organizing multiple trains of thought. Binary Alignment/Map (BAM), Variant Call Format and the Personal Genome Single Nucleotide Polymorphisms (SNPs) data formats are useful for visualizing a large sequencing experiment (whole-genome or whole-exome), where the differences between the data set and the reference assembly may be displayed graphically. Support for high-throughput sequencing extends to compact, indexed data formats, such as BAM, bigBed and bigWig, allowing rapid visualization of large datasets from RNA-seq and ChIP-seq experiments via local hosting.


HighSSR: High throughput SSR characterization and locus development from next gen sequencing data

http://bioinformatics.oxfordjournals.org/content/early/2012/09/06/bioinformatics.bts524.abstract

http://code.google.com/p/highssr/

VarB: A Variation Browsing and analysis tool for variants derived from next-generation sequencing data

http://www.pathogenseq.org/varb

http://bioinformatics.oxfordjournals.org/content/early/2012/09/13/bioinformatics.bts557.abstract


Thursday, September 13, 2012

UCSC Source Tree Utilities

1. The utilities include:
  • programs for sorting, splitting, or merging fasta sequences;
  • record parsing and data conversion using GenBank, fasta, nib, and blast data formats;
  • sequence alignment;
  • motif searching;
  • hidden Markov model development; and much more.

Library subroutines are available for everything from managing C data structures such as linked lists, balanced trees, hashes, and directed graphs to developing routines for SQL, HTML, or CGI code. Additional library functions are available for biological sequence and data manipulation tasks such as reverse complementation, codon and amino acid lookup and sequence translation, as well as functions specifically designed for extracting, loading, and manipulating data in the UCSC Genome Browser Databases.

2. install it in Debian/Ubuntu
http://genomewiki.ucsc.edu/index.php/Source_tree_compilation_on_Debian/Ubuntu

3.

Compiling UCSC Source Tree Utilities on OSX


1) MySQL is a prerequisite for this build. Although it is installed on the Rice SUG@R cluster, I cannot find the file "libmysqlclient.a" on SUG@R, which is needed for the UCSC Kent utilities build. So I have to install a copy of MySQL in my home directory. The newest version of MySQL is 5.5.24 and it can be downloaded here. I downloaded the source code version for Generic Linux (Architecture Independent). The current version of MySQL needs CMake to compile, so before we can do anything with MySQL, we need to back up a step and install CMake first.
> tar -xzvf cmake-2.8.8.tar.gz
> cd cmake-2.8.8
> ./configure --prefix=/users/NetID/local/
> gmake
> make install
Then install MySQL using CMake:
> tar xvzf mysql-5.5.24.tar.gz
> cd mysql-5.5.24
> mkdir -p /users/NetID/local/mysql
> cmake -DCMAKE_INSTALL_PREFIX=/users/NetID/local/mysql
> make
> make install
Of course, we still need more work to configure MySQL to make it work well. But since we have MySQL support on SUG@R already and only need one file from this installation, we can take care of those configuration steps later, when we actually need to run our local copy of MySQL.
2) Now we can finally work on building UCSC Kent Utilities. Check the shell environment first. We need to change the value of MACHTYPE to x86_64 for our build and also set this in the .bashrc file under my home directory.
> echo $MACHTYPE
x86_64-redhat-linux-gnu
> MACHTYPE=x86_64
> emacs .bashrc
export MACHTYPE=x86_64   # add this line into the .bashrc file
3) Under the $HOME/bin/ directory, create a directory named x86_64 for the binary files generated during compilation.
> mkdir -p $HOME/bin/${MACHTYPE}
4) Create the MySQL shell environment variables we need:
> MYSQLINC=/usr/include/mysql
> MYSQLLIBS="/users/NetID/local/mysql/lib/libmysqlclient.a -lz"
> export MYSQLINC MYSQLLIBS
5) Now we are ready to compile those utilities. Unzip the UCSC source code we downloaded and we will see a resulting directory named kent.
> unzip jksrc.zip
> cd kent/src
> make libs
> cd utils/
> make
6) Then we are done. All the compiled binary utilities can be found in the $HOME/bin/$MACHTYPE directory. For me, it is /users/NetID/bin/x86_64/. Just try one, say faSplit. After typing faSplit in the shell, we see the usage for this utility. So our build works!
> cd $HOME/bin/$MACHTYPE
> ./faSplit
faSplit - Split an fa file into several files.
usage:
   faSplit how input.fa count outRoot
where how is either 'about' 'byname' 'base' 'gap' 'sequence' or 'size'.
Files split by sequence will be broken at the nearest fa record boundary.
Files split by base will be broken at any base.
Files broken by size will be broken every count bases.
...