2012年9月27日星期四

tips of NGS data processing

  1. http://crocea.mednet.ucla.edu/research/variation/next-gen-sequence-how-to
  2.  one referene genome in one fasta file
    1. FileFormatExchange.combineMultiFastaIntoOne() in variation.src.misc
  3. bwa index
    1. bwa index -a bwtsw hsRef.fa
      • "bwtsw" is for genome >2Gb
  4. move all alignment short reads into one directory
    1. for all subspecies
      • for j in aethiops cynosurus pygerythrus sabaeus tantalus; do echo $j; for i in `find /u/home/eeskin/namtran/Experiment/Genome/Freimer/Chlorocebus/INPUT/$j/ -name sequence*.tar.gz -print`; do echo $i; ln -s $i ./$j ; done; done
    2. for sequences stored in bam format on WUSTL server
      1. run shell/wgetGivenURL.sh to fetch sequence bam files from WUSTL
      2. run picard to convert bam to paired-end fastq
        • java -jar ~/script//vervet/bin/picard-tools/SamToFastq.jar INPUT=gerald_62FGFAAXX_3.bam F=gerald_62FGFAAXX_3_1.fastq F2=gerald_62FGFAAXX_3_2.fastq
  5. bwa alignment (aln or bwtsw; single-end or paired-end)
    1. vervetdb.src.MpiBWA.py including these steps
      1. align
        1. input fastq files can be in plain text or gzipped.
        2. NOTE: For bwtsw, one read could be found in >1 alignments in final bam file. i.e. 212904 unique reads among 217507 mapped reads, max redundant read count=4. ~2% are found to be in >1 alignments.
      2. convert sam to bam
        1. samtools view -F 4 -Sbh alignment.sam > alignment.bam
          • "-F 4" instructs samtools to remove unmapped reads
    2. merge all bam files into one bam (have to be sorted if "bwa sort" is used)
      1. perl -e 'print "\@RG\tID:ga\tSM:hs\tLB:ga\tPL:Illumina\n\@RG\tID:454\tSM:hs\tLB:454\tPL:454\n"' > rg.txt
        • This is to tag the 454 and illumina results by read group (RG). Usually not necessary.
      2. samtools merge -rh rg.txt - *.bam | samtools rmdup - - | samtools rmdup -s - aln.bam
        • duplicates removal for single-end reads is not recommended given very deep coverage as this step may discard too many good alignments.
        • "-rh rg.txt" is again to add RG header into the final bam file. Without it, the command is: samtools merge merged.bam  *.bam
      3. samtools merge - *.bam | samtools view -bh -F 4 - -o aln.bam
        • merge all bam files and remove unaligned reads
        • WATCH "-" in "samtools merge". It instructs "merge" to output to stdout.
        • If read group (RG) tags are present in original bam files, samtools merge only takes RG info in the header from the first bam file.
      4. picard merge would combine headers from all bam files, rather than the header from the first bam.
        • java -jar ~/script//vervet/bin/picard-tools/MergeSamFiles.jar INPUT=ref_illu_vs_1MbBAC_by_aln.RG.bam INPUT=454_vs_ref_1MbBAC.F4.RG.bam INPUT=cynosurus_vs_1MbBAC_by_aln.RG.bam USE_THREADING=true ASSUME_SORTED=true OUTPUT=454_illu_6_sub_vs_1MbBAC.RG.bam
    3. remove unmapped reads from the bam file:
      1. samtools view -F 4 -b -h -o 454_vs_ref_1MbBAC_c30.F4.bam 454_vs_ref_1MbBAC_c30.bam
    4. select one region out of whole-genome bam and sort it as well
      1. crocea@mahogany:/Network/Data/vervet/ref$ samtools view -u 454_vs_hg19_20101230.bam 9606_chr9:124,000,000-124,210,000 | samtools sort - 454_vs_hg19_20101230.chr9_124M_124.2M
        • option -u renders samtools to output uncompressed BAM to reduce the computing time.
  6. convert bam into sam for checking
    1. samtools view -h alignment.bam > alignment.sam
      • "-h" : include the header in the output
  7. view alignment through samtools tview
    1. crocea@mahogany:/Network/Data/vervet/ref$ samtools index 454_vs_hg19_20101230.chr9_124M_124.2M.bam
    2. crocea@mahogany:/Network/Data/vervet/ref$ samtools tview 454_vs_hg19_20101230.chr9_124M_124.2M.bam ../../NCBI/hs_genome.fasta
  8. view alignment through IGV
    1. select "Import Genome ..." to import a fasta file and create a GenomeName.genome (i.e. myHG19.genome) file
      1. Fetch the gene annotation file in BED format from the UCSC Browser (table browser). choose a genome (i.e. hg19) as assembly
        choose group: Genes and Gene Prediction tracks,
        choose track: UCSC Genes (more isoforms, less accurate) or RefSeqGenes (less isoforms, more accurate)
        choose: output format: BED (or any other format you'd like, but BED can be displayed within the genome browser)
        enter some file name in the text box below and click get output,
        you then get referred to another page, where you may choose to get all exons, additional intron coordinates, just coding exons, 5'UTR exons, ...)
    2. index the bam file first (required)
      • crocea@mahogany:/Network/Data/vervet/ref$ samtools index 454_vs_hg19_20101230.chr13_31.67M_32.01M.bam
    3. crocea@banyan:/Network/Data/vervet/ref$ java -jar /home/crocea/script/vervet/bin/IGVTools/igvtools.jar count 454_vs_hg19_20101230.bam 454_vs_hg19_20101230.depth ../../NCBI/myHG19.genome
  9. variant calling through samtools
    1. samtools mpileup -D -d 40 -ugf .../NCBI/hs_genome.fasta      454_vs_hg19_20101230.chr9_124M_124.2M.bam 454_vs_hg19_20101230.chr13_31.67M_32.01M.bam 454_vs_hg19_20101230.chr20_4.085M_4.231M.bam | bcftools view -bvcg -p 1 - > 454_vs_hg19_20101230_3eQTL.raw.bcf
      • "-d 40" instructs mpileup to, at a position, read maximally INT reads per input BAM. [250]
      • "-p 1" : -p FLOAT variant if P(ref|D)
      • "-D" instructs mpileup to output per-sample read depth.
      • If each bam has read-group added, samtools would use that to mark genotypes from that sample. If not, filename is used as replacement.
      • samtools output genotype columns in the order of input bams.
    2. crocea@banyan:/Network/Data/vervet/ref$ bcftools view 454_vs_hg19_20101230_3eQTL.raw.bcf | vcfutils.pl varFilter -D100 > 454_vs_hg19_20101230_3eQTL_D100.raw.vcf
    3. Note 2011-9-21: samtools assigns 0/0 to zero-read-coverage loci while GATK just assigns ./. (missing).
  10. variant calling through GATK
    1. add a read group to one bam file (if multiple, could use "samtools merge" exemplified above) http://seqanswers.com/forums/showthread.php?t=4180
      • echo -e "@RG\tID:454\tSM:hs\tLB:454\tPL:454" > rg.txt
      • samtools view -h 454_vs_hg19.3eQTL.bam | cat rg.txt - | awk '{ if (substr($1,1,1)=="@") print; else printf "%s\tRG:Z:454\n",$0; }' | samtools view -bhS - > 454_vs_hg19.3eQTL.RG454.bam
      • alternatives include "~/script/vervet/src/addRGToBAM.sh", AddOrReplaceReadGroups.jar from picard
    2. create the fasta sequence dict file:
      • package from picard
       java -jar ~/script/vervet/bin/picard-tools/CreateSequenceDictionary.jar R= 1Mb-BAC.fasta O= 1Mb-BAC.dict
    3. create the fasta index file
      1. samtools faidx 1Mb-BAC.fasta
    4. index the bam file
      1. samtools index GenomeAnalysisTK.jar -I 454_vs_ref_1MbBAC.F4.bam
    5. make sure the header of bam file contains the same set of "SQ"s (reference sequences) as the fasta file. Otherwise GATK will either generate an empty vcf when bam header contains more than fasta file and the 1st fasta record is not the first SQ in header or complain some sequence number (for reference sequence) is not found when fasta file contains more.
      1. make sure reads in this bam file are all aligned to references from this set of "SQ"s
      2. For PE reads only, throw away reads whose mates are not based on this set of "SQ"s, otherwise, this bam file would be corrupted.
      3. use picard's ValidateSamFile.jar to validate (although not everything reported is fatal)
    6. user@mahogany:/Network/Data/vervet/ref$ java -jar ~/script/vervet/bin/GenomeAnalysisTK-1.0.4705/GenomeAnalysisTK.jar -I 454_vs_ref_1MbBAC.F4.bam -R ~/script/vervet/data/ref/BAC/1Mb-BAC.fa -T UnifiedGenotyper --out 454_vs_ref_1MbBAC.F4.GATK.vcf -U -S SILENT
      • -S,--validation_strictness         How strict should we be with validation (STRICT|LENIENT| SILENT)
      • -U,--unsafe                                       If set, enables unsafe operations: nothing will be checked at runtime.  For expert users only who know what they are doing.  We do not support usage of this argument. (ALLOW_UNINDEXED_BAM| ALLOW_EMPTY_INTERVAL_LIST| ALLOW_UNSET_BAM_SORT_ORDER|NO_READ_ORDER_VERIFICATION| ALLOW_SEQ_DICT_INCOMPATIBILITY|ALL)
      • -nt,--num_threads                            How many threads should be allocated to running this analysis.
      • GATK outputs genotype columns in the order of read-groups, rather than order of "-I" when you have multiple input BAM files.
      • The order of genotype columns in vcf matters because vcf-isect (intersection) from vcftools requires input VCF files have the same set of genotype columns and they must be in the same order.
  11. base quality score re-calibration
    1. java -Xmx12g -jar ~/script/vervet/bin/GenomeAnalysisTK-1.0.4705/GenomeAnalysisTK.jar -l INFO  -R ../../../../NCBI/hs_genome.fasta -I 454_vs_hg19.3eQTL.minPerBaseAS0.4.minMapQ125.score2.bam -T CountCovariates   -cov ReadGroupCovariate    -cov QualityScoreCovariate    -cov CycleCovariate    -cov DinucCovariate -recalFile  454_vs_hg19.3eQTL.minPerBaseAS0.4.minMapQ125.score2.recal_data.csv -B:mask,VCF 454_vs_hg19.3eQTL.minPerBaseAS0.4.minMapQ125.score2.GATK.vcf
    2. java -Xmx12g -jar ~/script/vervet/bin/GenomeAnalysisTK-1.0.4705/GenomeAnalysisTK.jar -l INFO  -R ../../../../NCBI/hs_genome.fasta -I 454_vs_hg19.3eQTL.minPerBaseAS0.4.minMapQ125.score2.bam -T TableRecalibration  --out 454_vs_hg19.3eQTL.minPerBaseAS0.4.minMapQ125.score2.recal.bam -recalFile 454_vs_hg19.3eQTL.minPerBaseAS0.4.minMapQ125.score2.recal_data.csv
  12. analyze quality covariates by GATK's AnalyzeCovariates.jar
    1. analyze original data
      1. run CountCovariates on the orignal bam file
      2. java -jar ~/script/vervet/bin/GenomeAnalysisTK-1.0.4705/AnalyzeCovariates.jar  -recalFile 454_vs_hg19.3eQTL.minPerBaseAS0.4.minMapQ125.score2.recal_data.csv -resources ~/script/vervet/bin//GenomeAnalysisTK-1.0.4705/resources -outputDir beforeCalibration/
    2. analyze GATK-calibrated data
      1. run CountCovariates on the table-re-calibrated bam file
      2. run AnalyzeCovariates.jar  as above
  13. Calculate coverage per sample, per library, per readgroup , per interval
    1. java -jar ~/script/gatk/dist/GenomeAnalysisTK.jar -T DepthOfCoverage -R /Network/Data/vervet/db/individual_sequence/524_superContigsMinSize2000.fasta -I /Network/Data/vervet/db/individual_alignment/592_52_vs_524_by_2.bam -I /Network/Data/vervet/db/individual_alignment/637_106_vs_524_by_2.bam -pt sample -o /Network/Data/vervet/vervetPipeline/tmp/aln_592_637_depthOfCoverage --omitDepthOutputAtEachBase -L /tmp/contig.interval_list
    2. the interval file contains this on each line (position is one-based. bed format doesn't work. http://www.broadinstitute.org/gsa/wiki/index.php/Input_files_for_the_GATK#Intervals is not true.):
      1. Contig1:1-13722612
    3. Among output, PREFIX.sample_summary, PREFIX.sample_interval_summary are important. Replace "sample" with readgroup or library if other parition type is selected.
    4. Could handle multiple partition types via adding more "-pt"s. Like "-pt readgroup -pt sample -pt library".
  14. select variants from VCF input and output in VCF
    1. http://www.broadinstitute.org/gsa/wiki/index.php/SelectVariants
    2. select by any expression (watch: DP or QD or other attributes are taken from the "INFO" field, not individual sample's genotype. For per-sample filtering, first want ).
    3.  java -Xmx2g -jar GenomeAnalysisTK.jar \
         -R ref.fasta \
         -T SelectVariants \
         --variant input.vcf \
         -o output.vcf \
         -se 'SAMPLE.+PARC'
         -select "QD > 10.0"
    4. select only INDEL
    5.  java -Xmx2g -jar GenomeAnalysisTK.jar \
         -R ref.fasta \
         -T SelectVariants \
         --variant input.vcf \
         -o output.vcf \
         -selectType INDEL
    6. many more
  15. Combine variants from various VCF through GATK's CombineVariants
    1. combine two single-sample VCF into one
    2. java  -Xmx2g -jar /home/crocea/script/gatk/dist/GenomeAnalysisTK.jar -R  /Network/Data/vervet/db/individual_sequence/524_superContigsMinSize2000.fasta -T CombineVariants  -o tmp/CombineSAMtools555_556.vcf -V:foo tmp/SelectSAMtoolsContig0_555_15_1987079_GA_vs_524.vcf -V:foo1 tmp/SelectSAMtoolsContig0_556_16_1985088_GA_vs_524.vcf
    3. more 

没有评论:

发表评论