evolving all we are: tips of NGS data processing

2012年9月27日星期四

tips of NGS data processing

http://crocea.mednet.ucla.edu/research/variation/next-gen-sequence-how-to
one referene genome in one fasta file

FileFormatExchange.combineMultiFastaIntoOne() in variation.src.misc

bwa index

bwa index -a bwtsw hsRef.fa

"bwtsw" is for genome >2Gb

move all alignment short reads into one directory

for all subspecies

for j in aethiops cynosurus pygerythrus sabaeus tantalus; do echo $j; for i in `find /u/home/eeskin/namtran/Experiment/Genome/Freimer/Chlorocebus/INPUT/$j/ -name sequence*.tar.gz -print`; do echo $i; ln -s $i ./$j ; done; done

for sequences stored in bam format on WUSTL server

run shell/wgetGivenURL.sh to fetch sequence bam files from WUSTL
run picard to convert bam to paired-end fastq

java -jar ~/script//vervet/bin/picard-tools/SamToFastq.jar INPUT=gerald_62FGFAAXX_3.bam F=gerald_62FGFAAXX_3_1.fastq F2=gerald_62FGFAAXX_3_2.fastq

bwa alignment (aln or bwtsw; single-end or paired-end)

vervetdb.src.MpiBWA.py including these steps

align

input fastq files can be in plain text or gzipped.
NOTE: For bwtsw, one read could be found in >1 alignments in final bam file. i.e. 212904 unique reads among 217507 mapped reads, max redundant read count=4. ~2% are found to be in >1 alignments.

convert sam to bam

samtools view -F 4 -Sbh alignment.sam > alignment.bam

"-F 4" instructs samtools to remove unmapped reads

merge all bam files into one bam (have to be sorted if "bwa sort" is used)

perl -e 'print "\@RG\tID:ga\tSM:hs\tLB:ga\tPL:Illumina\n\@RG\tID:454\tSM:hs\tLB:454\tPL:454\n"' > rg.txt

This is to tag the 454 and illumina results by read group (RG). Usually not necessary.

samtools merge -rh rg.txt - *.bam | samtools rmdup - - | samtools rmdup -s - aln.bam

duplicates removal for single-end reads is not recommended given very deep coverage as this step may discard too many good alignments.
"-rh rg.txt" is again to add RG header into the final bam file. Without it, the command is: samtools merge merged.bam *.bam

samtools merge - *.bam | samtools view -bh -F 4 - -o aln.bam

merge all bam files and remove unaligned reads
WATCH "-" in "samtools merge". It instructs "merge" to output to stdout.
If read group (RG) tags are present in original bam files, samtools merge only takes RG info in the header from the first bam file.

picard merge would combine headers from all bam files, rather than the header from the first bam.

java -jar ~/script//vervet/bin/picard-tools/MergeSamFiles.jar INPUT=ref_illu_vs_1MbBAC_by_aln.RG.bam INPUT=454_vs_ref_1MbBAC.F4.RG.bam INPUT=cynosurus_vs_1MbBAC_by_aln.RG.bam USE_THREADING=true ASSUME_SORTED=true OUTPUT=454_illu_6_sub_vs_1MbBAC.RG.bam

remove unmapped reads from the bam file:

samtools view -F 4 -b -h -o 454_vs_ref_1MbBAC_c30.F4.bam 454_vs_ref_1MbBAC_c30.bam

select one region out of whole-genome bam and sort it as well

crocea@mahogany:/Network/Data/vervet/ref$ samtools view -u 454_vs_hg19_20101230.bam 9606_chr9:124,000,000-124,210,000 | samtools sort - 454_vs_hg19_20101230.chr9_124M_124.2M

option -u renders samtools to output uncompressed BAM to reduce the computing time.

convert bam into sam for checking

samtools view -h alignment.bam > alignment.sam

"-h" : include the header in the output

view alignment through samtools tview

crocea@mahogany:/Network/Data/vervet/ref$ samtools index 454_vs_hg19_20101230.chr9_124M_124.2M.bam
crocea@mahogany:/Network/Data/vervet/ref$ samtools tview 454_vs_hg19_20101230.chr9_124M_124.2M.bam ../../NCBI/hs_genome.fasta

view alignment through IGV

select "Import Genome ..." to import a fasta file and create a GenomeName.genome (i.e. myHG19.genome) file

Fetch the gene annotation file in BED format from the UCSC Browser (table browser). choose a genome (i.e. hg19) as assembly
choose group: Genes and Gene Prediction tracks,
choose track: UCSC Genes (more isoforms, less accurate) or RefSeqGenes (less isoforms, more accurate)
choose: output format: BED (or any other format you'd like, but BED can be displayed within the genome browser)
enter some file name in the text box below and click get output,
you then get referred to another page, where you may choose to get all exons, additional intron coordinates, just coding exons, 5'UTR exons, ...)

index the bam file first (required)

crocea@mahogany:/Network/Data/vervet/ref$ samtools index 454_vs_hg19_20101230.chr13_31.67M_32.01M.bam

crocea@banyan:/Network/Data/vervet/ref$ java -jar /home/crocea/script/vervet/bin/IGVTools/igvtools.jar count 454_vs_hg19_20101230.bam 454_vs_hg19_20101230.depth ../../NCBI/myHG19.genome

variant calling through samtools

samtools mpileup -D -d 40 -ugf .../NCBI/hs_genome.fasta 454_vs_hg19_20101230.chr9_124M_124.2M.bam 454_vs_hg19_20101230.chr13_31.67M_32.01M.bam 454_vs_hg19_20101230.chr20_4.085M_4.231M.bam | bcftools view -bvcg -p 1 - > 454_vs_hg19_20101230_3eQTL.raw.bcf

"-d 40" instructs mpileup to, at a position, read maximally INT reads per input BAM. [250]
"-p 1" : -p FLOAT variant if P(ref|D)
"-D" instructs mpileup to output per-sample read depth.
If each bam has read-group added, samtools would use that to mark genotypes from that sample. If not, filename is used as replacement.
samtools output genotype columns in the order of input bams.

crocea@banyan:/Network/Data/vervet/ref$ bcftools view 454_vs_hg19_20101230_3eQTL.raw.bcf | vcfutils.pl varFilter -D100 > 454_vs_hg19_20101230_3eQTL_D100.raw.vcf
Note 2011-9-21: samtools assigns 0/0 to zero-read-coverage loci while GATK just assigns ./. (missing).

variant calling through GATK

add a read group to one bam file (if multiple, could use "samtools merge" exemplified above) http://seqanswers.com/forums/showthread.php?t=4180

echo -e "@RG\tID:454\tSM:hs\tLB:454\tPL:454" > rg.txt
samtools view -h 454_vs_hg19.3eQTL.bam | cat rg.txt - | awk '{ if (substr($1,1,1)=="@") print; else printf "%s\tRG:Z:454\n",$0; }' | samtools view -bhS - > 454_vs_hg19.3eQTL.RG454.bam
alternatives include "~/script/vervet/src/addRGToBAM.sh", AddOrReplaceReadGroups.jar from picard

create the fasta sequence dict file:

package from picard

 java -jar ~/script/vervet/bin/picard-tools/CreateSequenceDictionary.jar R= 1Mb-BAC.fasta O= 1Mb-BAC.dict

create the fasta index file

samtools faidx 1Mb-BAC.fasta

index the bam file

samtools index GenomeAnalysisTK.jar -I 454_vs_ref_1MbBAC.F4.bam

make sure the header of bam file contains the same set of "SQ"s (reference sequences) as the fasta file. Otherwise GATK will either generate an empty vcf when bam header contains more than fasta file and the 1st fasta record is not the first SQ in header or complain some sequence number (for reference sequence) is not found when fasta file contains more.

make sure reads in this bam file are all aligned to references from this set of "SQ"s
For PE reads only, throw away reads whose mates are not based on this set of "SQ"s, otherwise, this bam file would be corrupted.
use picard's ValidateSamFile.jar to validate (although not everything reported is fatal)

user@mahogany:/Network/Data/vervet/ref$ java -jar ~/script/vervet/bin/GenomeAnalysisTK-1.0.4705/GenomeAnalysisTK.jar -I 454_vs_ref_1MbBAC.F4.bam -R ~/script/vervet/data/ref/BAC/1Mb-BAC.fa -T UnifiedGenotyper --out 454_vs_ref_1MbBAC.F4.GATK.vcf -U -S SILENT

-S,--validation_strictness How strict should we be with validation (STRICT|LENIENT| SILENT)
-U,--unsafe If set, enables unsafe operations: nothing will be checked at runtime. For expert users only who know what they are doing. We do not support usage of this argument. (ALLOW_UNINDEXED_BAM| ALLOW_EMPTY_INTERVAL_LIST| ALLOW_UNSET_BAM_SORT_ORDER|NO_READ_ORDER_VERIFICATION| ALLOW_SEQ_DICT_INCOMPATIBILITY|ALL)
-nt,--num_threads How many threads should be allocated to running this analysis.
GATK outputs genotype columns in the order of read-groups, rather than order of "-I" when you have multiple input BAM files.
The order of genotype columns in vcf matters because vcf-isect (intersection) from vcftools requires input VCF files have the same set of genotype columns and they must be in the same order.

base quality score re-calibration

java -Xmx12g -jar ~/script/vervet/bin/GenomeAnalysisTK-1.0.4705/GenomeAnalysisTK.jar -l INFO -R ../../../../NCBI/hs_genome.fasta -I 454_vs_hg19.3eQTL.minPerBaseAS0.4.minMapQ125.score2.bam -T CountCovariates -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate -recalFile 454_vs_hg19.3eQTL.minPerBaseAS0.4.minMapQ125.score2.recal_data.csv -B:mask,VCF 454_vs_hg19.3eQTL.minPerBaseAS0.4.minMapQ125.score2.GATK.vcf
java -Xmx12g -jar ~/script/vervet/bin/GenomeAnalysisTK-1.0.4705/GenomeAnalysisTK.jar -l INFO -R ../../../../NCBI/hs_genome.fasta -I 454_vs_hg19.3eQTL.minPerBaseAS0.4.minMapQ125.score2.bam -T TableRecalibration --out 454_vs_hg19.3eQTL.minPerBaseAS0.4.minMapQ125.score2.recal.bam -recalFile 454_vs_hg19.3eQTL.minPerBaseAS0.4.minMapQ125.score2.recal_data.csv

analyze quality covariates by GATK's AnalyzeCovariates.jar

analyze original data

run CountCovariates on the orignal bam file
java -jar ~/script/vervet/bin/GenomeAnalysisTK-1.0.4705/AnalyzeCovariates.jar -recalFile 454_vs_hg19.3eQTL.minPerBaseAS0.4.minMapQ125.score2.recal_data.csv -resources ~/script/vervet/bin//GenomeAnalysisTK-1.0.4705/resources -outputDir beforeCalibration/

analyze GATK-calibrated data

run CountCovariates on the table-re-calibrated bam file
run AnalyzeCovariates.jar as above

Calculate coverage per sample, per library, per readgroup , per interval

java -jar ~/script/gatk/dist/GenomeAnalysisTK.jar -T DepthOfCoverage -R /Network/Data/vervet/db/individual_sequence/524_superContigsMinSize2000.fasta -I /Network/Data/vervet/db/individual_alignment/592_52_vs_524_by_2.bam -I /Network/Data/vervet/db/individual_alignment/637_106_vs_524_by_2.bam -pt sample -o /Network/Data/vervet/vervetPipeline/tmp/aln_592_637_depthOfCoverage --omitDepthOutputAtEachBase -L /tmp/contig.interval_list
the interval file contains this on each line (position is one-based. bed format doesn't work. http://www.broadinstitute.org/gsa/wiki/index.php/Input_files_for_the_GATK#Intervals is not true.):

Contig1:1-13722612

Among output, PREFIX.sample_summary, PREFIX.sample_interval_summary are important. Replace "sample" with readgroup or library if other parition type is selected.
Could handle multiple partition types via adding more "-pt"s. Like "-pt readgroup -pt sample -pt library".

select variants from VCF input and output in VCF

http://www.broadinstitute.org/gsa/wiki/index.php/SelectVariants
select by any expression (watch: DP or QD or other attributes are taken from the "INFO" field, not individual sample's genotype. For per-sample filtering, first want ).

 java -Xmx2g -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T SelectVariants \
   --variant input.vcf \
   -o output.vcf \
   -se 'SAMPLE.+PARC'
   -select "QD > 10.0"

select only INDEL

 java -Xmx2g -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T SelectVariants \
   --variant input.vcf \
   -o output.vcf \
   -selectType INDEL

many more

Combine variants from various VCF through GATK's CombineVariants

combine two single-sample VCF into one
java -Xmx2g -jar /home/crocea/script/gatk/dist/GenomeAnalysisTK.jar -R /Network/Data/vervet/db/individual_sequence/524_superContigsMinSize2000.fasta -T CombineVariants -o tmp/CombineSAMtools555_556.vcf -V:foo tmp/SelectSAMtoolsContig0_555_15_1987079_GA_vs_524.vcf -V:foo1 tmp/SelectSAMtoolsContig0_556_16_1985088_GA_vs_524.vcf
more

没有评论:

发表评论

订阅：博文评论 (Atom)