Thursday, July 26, 2012

BioAwk - fasta, fastq, SAM, BED, GFF aware awk programming



Bioawk, created by Heng Li, is an extension to Brian Kernighan's awk that adds support for several common biological data formats: optionally gzip'ed BED, GFF, SAM, VCF and FASTA/Q, as well as generic TAB-delimited formats with column names.
Code
The source code is on the bioawk GitHub page. Download it and run make to compile it. The examples below assume that the awk command is this version, not the system awk.
Documentation
There is a short manual page in the main distribution and a longer HTML-formatted help page.
Examples
Extract unmapped reads without header:
awk -c sam 'and($flag,4)' aln.sam.gz
Extract mapped reads with header:
awk -c sam -H '!and($flag,4)' aln.sam.gz
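Where bioawk is not available, the same flag test can be approximated in plain POSIX awk. The sketch below assumes uncompressed SAM text on standard input, with FLAG in the second tab-separated column. Because and() is not part of POSIX awk, bit 4 is tested with integer arithmetic instead. The function name sam_unmapped is made up for this illustration.

```shell
# Hypothetical helper: print unmapped SAM records (FLAG bit 4 set), without header lines.
# Assumes plain-text, tab-delimited SAM on standard input.
sam_unmapped() {
    awk -F'\t' '
        /^@/ { next }                     # skip header lines, like bioawk without -H
        int($2 / 4) % 2 == 1 { print }    # bit 4 of FLAG set: read is unmapped
    '
}
```

A gzip'ed file could be fed in with gzip -dc aln.sam.gz | sam_unmapped. Flipping the comparison to == 0 would keep the mapped reads instead.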
Reverse complement FASTA:
awk -c fastx '{ print ">"$name;print revcomp($seq) }' seq.fa.gz
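To show what a reverse complement involves, here is a rough equivalent in plain awk. It is only a sketch under some assumptions: the input is uncompressed FASTA with each sequence on a single line, only A/C/G/T/N (either case) are complemented, and the header line is copied whole rather than trimmed to the name. The function name fasta_revcomp is hypothetical.

```shell
# Hypothetical helper: reverse-complement each sequence of a FASTA stream.
# Assumes one sequence line per record; unknown characters pass through unchanged.
fasta_revcomp() {
    awk '
        BEGIN {
            # Complement lookup table, stored as base/complement pairs.
            split("A T C G G C T A N N a t c g g c t a n n", p, " ")
            for (i = 1; i < 20; i += 2) comp[p[i]] = p[i + 1]
        }
        /^>/ { print; next }              # header line: copy as is
        {
            out = ""
            for (i = length($0); i >= 1; i--) {   # walk the sequence backwards
                b = substr($0, i, 1)
                out = out ((b in comp) ? comp[b] : b)
            }
            print out
        }
    '
}
```

For instance, printf '>s1\nAACGT\n' | fasta_revcomp prints the header followed by ACGTT.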
Create FASTA from SAM (uses revcomp if FLAG & 16):
samtools view aln.bam | \
    awk -c sam '{ s=$seq; if(and($flag, 16)) {s=revcomp($seq) } print ">"$qname"\n"s}'
Get the %GC from FASTA:
awk -c fastx '{ print ">"$name; print gc($seq) }' seq.fa.gz
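The GC calculation itself is simple enough to sketch in plain awk. The version below assumes uncompressed, single-line FASTA records and reports the share of G and C bases as a fraction between 0 and 1 (multiply by 100 for a percentage). It is an illustration of the arithmetic, not bioawk's own implementation, and fasta_gc is a made-up name.

```shell
# Hypothetical helper: print each FASTA header followed by the G+C fraction of its sequence.
# Assumes one sequence line per record.
fasta_gc() {
    awk '
        /^>/ { print; next }               # header line: copy as is
        {
            seq = $0
            n = length(seq)
            gc = gsub(/[GCgc]/, "", seq)   # gsub returns the number of replacements
            print (n > 0 ? gc / n : 0)     # guard against empty lines
        }
    '
}
```

With the sequence GGCCAATT this prints 0.5, since four of the eight bases are G or C.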
Get the mean Phred quality score from FASTQ:
awk -c fastx '{ print ">"$name; print meanqual($qual) }' seq.fq.gz
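For comparison, a mean quality score can be computed by hand in plain awk. This sketch assumes uncompressed four-line FASTQ records and Phred+33 quality encoding (each quality character's ASCII code minus 33). Whether that matches a given file depends on the sequencing platform, so treat the offset as an assumption. The function name fastq_meanqual is hypothetical.

```shell
# Hypothetical helper: print each read name and its mean Phred quality.
# Assumes four-line FASTQ records and Phred+33 encoding.
fastq_meanqual() {
    awk '
        BEGIN {
            # awk has no ord(), so build a character-to-code table for printable ASCII.
            for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
        }
        NR % 4 == 1 { print ">" substr($1, 2) }    # name line: drop the leading @
        NR % 4 == 0 {                              # quality line
            sum = 0
            n = length($0)
            for (i = 1; i <= n; i++) sum += ord[substr($0, i, 1)] - 33
            print (n > 0 ? sum / n : 0)
        }
    '
}
```

A read whose quality string is IIII gets a mean of 40, because I has ASCII code 73.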
Take column name from the first line (where "age" appears in the first line of input.txt):
awk -c header '{ print $age }' input.txt
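The idea of addressing a column by its header name can also be sketched in plain awk. The version below assumes tab-delimited input on standard input, looks up the requested name in the first line, and then prints that column for the remaining lines. It skips the header line itself, which may not match how bioawk treats that line. The function name col_by_name is made up.

```shell
# Hypothetical helper: print the column whose first-line name equals the first argument.
# Assumes tab-delimited input; exits with status 1 if the name is not found.
col_by_name() {
    awk -F'\t' -v want="$1" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == want) col = i   # find the column index
            if (!col) exit 1
            next
        }
        { print $col }
    '
}
```

Usage would look like col_by_name age < input.txt.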
