Monday, May 16, 2011

Trim VCF files by the number of times a SNP occurs

http://biostar.stackexchange.com/questions/6615/how-to-efficiently-remove-snps-that-are-present-in-all-samples 
 
(1)
cut -f 1-5 *.vcf | # chrom, pos, id, ref, alt (cut splits on tabs by default)
egrep -v "^#" | # remove the header lines
sort |
uniq -c | # -c is for 'count'
egrep "^      5 " | # keep the mutations present 5 times
cut -c 9- > snp.txt # remove the count
You can then use this snp.txt to filter your VCF with
grep -F -f snp.txt -v sample1.vcf > sample1.filtered.vcf
(-F makes grep treat each line of snp.txt as a fixed string rather than a regular expression, so the '.' in an empty ID column is matched literally; this can still be slow for a large number of SNPs) or by using Unix 'join -v' (faster, but you'll need to add an extra column to your VCFs to act as a unique key: chrom/position/ref/alt).
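The key-based filtering mentioned above can be sketched without adding a column to the files. This is only a rough sketch: it uses an awk lookup table in place of 'join -v', the function name filter_shared_snps is made up for illustration, and the key is assumed to be chrom:pos:ref:alt (the ID column is left out). Matching on the whole key also avoids the partial matches that grep can produce, such as chromosome 1 matching inside chromosome 11.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of any tool):
#   filter_shared_snps N a.vcf b.vcf ...
# Drops every variant whose chrom:pos:ref:alt key occurs exactly N times
# across the inputs and writes <name>.filtered.vcf next to each input.
filter_shared_snps() {
    local n="$1"; shift
    local keys
    keys="$(mktemp)"

    # Build the list of keys seen exactly N times across all files.
    awk -F'\t' '!/^#/ { print $1 ":" $2 ":" $4 ":" $5 }' "$@" |
        sort |
        uniq -c |
        awk -v n="$n" '$1 == n { print $2 }' > "$keys"

    local f
    for f in "$@"; do
        awk -F'\t' -v keys="$keys" '
            BEGIN { while ((getline k < keys) > 0) shared[k] = 1 }  # load keys
            /^#/  { print; next }                                   # keep header
            { k = $1 ":" $2 ":" $4 ":" $5; if (!(k in shared)) print }
        ' "$f" > "${f%.vcf}.filtered.vcf"
    done

    rm -f "$keys"
}
```

For five samples this would be called as: filter_shared_snps 5 sample1.vcf sample2.vcf sample3.vcf sample4.vcf sample5.vcf. Because the output names also end in .vcf, a later run with *.vcf would pick up the filtered files too, so list the inputs explicitly.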

(2)
You can use vcftools to achieve this:
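  1. Start by finding the positions that occur in all the files (replace '3' with the actual number of files). vcf-isec expects bgzip-compressed, tabix-indexed inputs, so the intersection also needs an index (tabix -p vcf intersect.vcf.gz) before it is used in step 2.
    vcf-isec -o -n =3 A.vcf.gz B.vcf.gz C.vcf.gz (...) | bgzip -c > intersect.vcf.gz
  2. Exclude from A the positions that appear in the intersection
    vcf-isec -c A.vcf.gz intersect.vcf.gz > newA.vcf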
