http://biostar.stackexchange.com/questions/6615/how-to-efficiently-remove-snps-that-are-present-in-all-samples
(1)
cut -f 1-5 *.vcf |      # chrom,pos,id,ref,alt (VCF is tab-separated, which is cut's default delimiter)
grep -v "^#" |          # remove the header
sort |
uniq -c |               # -c is for 'count'
grep -E "^ *5 " |       # keep the mutations present 5 times (replace 5 by your number of samples)
sed -E 's/^ *[0-9]+ //' > snp.txt   # remove the count
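To make the counting step concrete, here is a minimal sketch on toy data. Everything in it is an assumption for illustration: the vcf_demo directory, the make_vcf helper, the three sample files and their records are hypothetical. It derives the sample count from the file list instead of hard-coding 5, and it filters with an exact awk key lookup rather than grep -f, which is a swapped-in technique, not part of the original answer.

```shell
#!/bin/sh
set -eu

# Hypothetical toy data: three tiny tab-separated VCFs in ./vcf_demo.
mkdir -p vcf_demo
make_vcf() {   # usage: make_vcf FILE "CHROM POS ID REF ALT" ...
  out=$1; shift
  printf '##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > "$out"
  for rec in "$@"; do
    # $rec is deliberately unquoted so that it splits into the five fields.
    printf '%s\t%s\t%s\t%s\t%s\t.\tPASS\t.\n' $rec >> "$out"
  done
}
make_vcf vcf_demo/s1.vcf "chr1 100 . A G" "chr1 200 . C T"
make_vcf vcf_demo/s2.vcf "chr1 100 . A G" "chr1 300 . G A"
make_vcf vcf_demo/s3.vcf "chr1 100 . A G" "chr1 200 . C T"

# The number of samples is the number of input files.
set -- vcf_demo/s1.vcf vcf_demo/s2.vcf vcf_demo/s3.vcf
n=$#

# Count each chrom/pos/id/ref/alt record and keep those seen in every file.
cat "$@" | grep -v '^#' | cut -f 1-5 | sort | uniq -c |
  awk -v n="$n" '$1 == n { sub(/^ *[0-9]+ /, ""); print }' > vcf_demo/snp.txt

# Drop the shared records from one sample, keeping its header lines.
awk -F'\t' 'FILENAME == ARGV[1] { shared[$0]; next }
            /^#/ || !(($1 FS $2 FS $3 FS $4 FS $5) in shared)' \
    vcf_demo/snp.txt vcf_demo/s1.vcf > vcf_demo/s1.filtered.vcf
```

The awk lookup compares the whole five-column key, so a record on chromosome 11 cannot be removed by a pattern written for chromosome 1, which can happen with substring matching.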
You can then use snp.txt to filter your VCFs with grep -F -v -f snp.txt sample1.vcf > sample1.filtered.vcf
(it can be slow for a large number of SNPs; -F makes grep treat the '.' in the ID column literally) or with unix 'join -v'
(faster, but you'll need to add an extra column to your VCFs to serve as a unique key (chrom/position/ref/alt), and both inputs must be sorted on that key).
(2)
You can use vcftools to achieve this:
- Start by finding the positions that occur in all the files (replace '3' with the actual number of files)
vcf-isec -o -n =3 A.vcf.gz B.vcf.gz C.vcf.gz (...) | bgzip -c > intersect.vcf.gz
- Index the intersection, since vcf-isec expects bgzipped, tabix-indexed input
tabix -p vcf intersect.vcf.gz
- Exclude from A the positions which appear in the intersection
vcf-isec -c A.vcf.gz intersect.vcf.gz > newA.vcf
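Where vcftools is not available, the same two steps (intersect, then complement) can be approximated with sort and join, along the lines of the 'join -v' hint in the first answer. This is only a rough sketch, not how vcf-isec works internally. The isec_demo directory, the headerless .tsv files and the chrom:pos:ref:alt key format are all hypothetical, and each key is assumed to occur at most once per file.

```shell
#!/bin/sh
set -eu
LC_ALL=C; export LC_ALL      # sort and join must agree on collation
tab=$(printf '\t')

# Hypothetical toy records (CHROM POS ID REF ALT), tab-separated, no headers.
mkdir -p isec_demo
printf 'chr1\t100\t.\tA\tG\nchr1\t200\t.\tC\tT\n' > isec_demo/A.tsv
printf 'chr1\t100\t.\tA\tG\nchr1\t300\t.\tG\tA\n' > isec_demo/B.tsv
printf 'chr1\t100\t.\tA\tG\nchr1\t200\t.\tC\tT\n' > isec_demo/C.tsv

# Prefix each record with a chrom:pos:ref:alt key and sort on that key.
add_key() {
  awk -F'\t' -v OFS='\t' '{ print $1 ":" $2 ":" $4 ":" $5, $0 }' "$1" |
    sort -t "$tab" -k1,1
}
for s in A B C; do
  add_key "isec_demo/$s.tsv" > "isec_demo/$s.keyed"
  cut -f 1 "isec_demo/$s.keyed" > "isec_demo/$s.keys"
done

# Step 1: keys present in all three files.
join isec_demo/A.keys isec_demo/B.keys | join - isec_demo/C.keys > isec_demo/shared.keys

# Step 2: records of A whose key is not shared
# (join -v 1 prints the lines of the first file that have no match).
join -t "$tab" -v 1 isec_demo/A.keyed isec_demo/shared.keys |
  cut -f 2- > isec_demo/newA.tsv
```

The output comes back in key order (plain string order), so it would need to be re-sorted by chromosome and position, and the header lines re-attached, before being used as a VCF again.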