Saturday, February 25, 2012

Useful Bash commands to handle FASTA files



#####################################################
(1) counting number of sequences in a fasta file:
grep -c "^>" file.fa
Remove header comments (everything after the first whitespace on each header line):
sed -e 's/^\(>[^[:space:]]*\).*/\1/' my.fasta > mymodified.fasta
(2) add something to end of all header lines:
sed 's/>.*/&WHATEVERYOUWANT/' file.fa > outfile.fa
(3) clean up a fasta file so that only the first column of each header is kept:
awk '{print $1}' file.fa > output.fa
(4) To extract ids, just use the following:

grep -o -E "^>\w+" file.fasta | tr -d ">"
(5) A useful step is to linearize your sequences (i.e. remove the sequence wrapping). This is not a perfect solution, as I suspect that a few steps could be avoided, but it works quite fast, even for thousands of sequences.
sed -e 's/\(^>.*$\)/#\1#/' file.fasta | tr -d "\r" | tr -d "\n" | sed -e 's/$/#/' | tr "#" "\n" | sed -e '/^$/d'
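The same linearization can also be done in a single awk pass, which avoids the `#` placeholder (and so does not break if a header happens to contain `#`). This is only a minimal sketch: the function name `linearize_fasta` is made up for illustration, and it assumes the file starts with a header line.

```shell
# Hypothetical helper: print every record as one header line plus one sequence line.
linearize_fasta() {
    awk '
        { sub(/\r$/, "") }                  # strip Windows carriage returns
        /^>/ {
            if (seq != "") print seq        # flush the previous sequence
            print                           # header is printed unchanged
            seq = ""
            next
        }
        { seq = seq $0 }                    # join wrapped sequence lines
        END { if (seq != "") print seq }    # flush the last sequence
    ' "$1"
}
```

Usage: `linearize_fasta file.fasta > linear.fasta`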
(6) Remove duplicated sequences. Pierre Lindenbaum proposed this solution.
sed -e '/^>/s/$/@/' -e 's/^>/#/' file.fasta | tr -d '\n' | tr "#" "\n" | tr "@" "\t" | sort -u -t $'\t' -f -k 2,2  | sed -e 's/^/>/' -e 's/\t/\n/'
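If you would rather keep the original record order (the sort above reorders the output by sequence), an awk sketch along the following lines should do a similar job. It is not Pierre Lindenbaum's method, just an alternative under stated assumptions: sequences are compared case-insensitively, the first record with a given sequence wins, and the output is written linearized. The name `dedup_fasta` is hypothetical.

```shell
# Hypothetical helper: drop records whose sequence was already seen (case-insensitive).
dedup_fasta() {
    awk '
        function emit() {
            # print the pending record only if its sequence is new
            if (hdr != "" && !(toupper(seq) in seen)) {
                seen[toupper(seq)] = 1
                print hdr
                print seq
            }
        }
        /^>/ { emit(); hdr = $0; seq = ""; next }
        { sub(/\r$/, ""); seq = seq $0 }    # join wrapped lines, drop carriage returns
        END { emit() }
    ' "$1"
}
```

Usage: `dedup_fasta file.fasta > unique.fasta`. Note that all distinct sequences are held in memory, so this suits files that fit comfortably in RAM.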

(7) Splitting a FASTA file of multiple sequences into FASTA files of individual sequences

This command will create as many files as there are member sequences, incrementally numbered and with a .fasta extension, in the current working directory (e.g. for an input file with 5 member sequences, such as the Arabidopsis genome, it will output files 1.fasta to 5.fasta).
awk '/^>/{f=++d".fasta"} {print > f}' file.fasta
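With thousands of sequences, some awk implementations can run out of file handles because every output file stays open. A hedged variant that closes each file before opening the next, and writes into a separate directory, might look like this. The function name `split_fasta` and the output-directory argument are assumptions for illustration; the directory must already exist.

```shell
# Hypothetical helper: split a multi-FASTA file into 1.fasta, 2.fasta, ... inside a directory.
# $1: input FASTA file; $2: existing output directory
split_fasta() {
    awk -v dir="$2" '
        /^>/ {
            if (f != "") close(f)           # release the previous output file
            f = dir "/" (++n) ".fasta"      # next numbered file name
        }
        f != "" { print > f }               # skip any lines before the first header
    ' "$1"
}
```

Usage: `mkdir split && split_fasta file.fasta split`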

(8) Joining multiple FASTA files into a single, multi-sequence FASTA file

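This is the reverse of the above, and we will assume a few things. Firstly, you want to combine all fasta files in the current directory and, secondly, they all have the same extension (.fasta). Adapt to your needs if this is not the case! Give the output file a different extension (or put it in another directory) so that it is not picked up by the *.fasta glob itself.
cat *.fasta > combined.fa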

(10) List the sequence headers in a FASTA file

grep "^>" file.fasta

(11) Counting the number of sequence entities in a FASTA file

grep "^>" file.fasta | wc -l

(12) Determining the length of the sequence in a FASTA file

This method will give the TOTAL sequence length of a FASTA file. This means that if your FASTA file has a number of sequence entries, it will return the sum of the lengths of all sequence entries. To get the length of individual entries you would first need to split the file into individual entries, or do it programmatically: either using a homegrown method or a bioinformatics library such as BioPerl.
grep -v ">" file.fasta | tr -d '[:space:]' | wc -c
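For per-entry lengths, a homegrown method does not have to be complicated. The awk sketch below prints each sequence ID (the first word of the header, without the `>`) followed by a tab and the sequence length. The function name `fasta_lengths` is made up for illustration, and only spaces, tabs and carriage returns are stripped from sequence lines before counting.

```shell
# Hypothetical helper: print "<id><TAB><length>" for every record in a FASTA file.
fasta_lengths() {
    awk '
        /^>/ {
            if (seen) print id "\t" len     # report the previous record
            id = substr($1, 2)              # first header word without ">"
            len = 0
            seen = 1
            next
        }
        { gsub(/[ \t\r]/, ""); len += length($0) }   # count residues only
        END { if (seen) print id "\t" len }          # report the last record
    ' "$1"
}
```

Usage: `fasta_lengths file.fasta`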
