Thursday, July 5, 2012

Merge multiple tables into big matrix


http://www.unix.com/shell-programming-scripting/189735-merge-multiple-tables-into-big-matrix.html

I have a complex (beyond my biological expertise) problem at hand.
I need to merge multiple files into 1 big matrix. Please help me with some code.


Inp1

Code:
Ang_0    chr1    98    T    A    
Ang_0    chr1    352    G    A    
Ang_0    chr1    425    C    T    
Ang_0    chr2    471    T    G    
Ang_0    chr2    508    T    -


Inp2

Code:
Bng_0    chr1    98    T    G    
Bng_0    chr1    352    G    A        
Bng_0    chr2    471    T    A    
Bng_0    chr2    508    T    -

Inp3

Code:
Cng_0    chr1    198    T    A    
Cng_0    chr1    352    G    A    
Cng_0    chr1    425    C    T    
Cng_0    chr2    471    T    G


Outp

Code:
            
                Ang_0   Bng_0   Cng_0
chr1    98      A       G       T
chr1    198     T       T       A
chr1    352     A       A       A
chr1    425     T       C       T
chr2    471     G       A       G
chr2    508     -       -       T



Input files have 5 columns: 1 = organism name, 2 = chromosome number, 3 = chromosome position, 4 = reference, 5 = alternate.

Columns 2 and 3 first have to be matched across all the input files. Every file that has a record for a particular column 2 and 3 value
contributes its column 5 value to the output. If an input file does not have a record matching a particular column 2 and 3 value, the column 4
value from any of the input files having that record has to be printed in the output instead. The column names in the output file
will be the organism names (column 1 of the input files).

I have 123 files and ~30,000 rows in each file.
So the output will have 125 columns: columns 1 and 2 are chromosome and position, and columns 3 through 125 are the organism names.


#########
Give this a go:


Code:
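awk '
# First time an organism name (column 1) is seen, give it the next output column.
!($1 in colnum) { Title[++col] = $1; colnum[$1] = col }
{
    i = colnum[$1]
    def[$2,$3] = $4        # reference base: fallback for organisms lacking this position
    Val[i,$2,$3] = $5      # alternate base reported by organism i
}
END {
    # Header row: two empty leading fields, then the organism names.
    $0 = ""
    for (i = 1; i <= col; i++) $(i+2) = Title[i]
    print
    # One row per (chromosome, position); "for (v in def)" visits keys in unspecified order.
    for (v in def) {
        $0 = ""
        split(v, f, SUBSEP)
        $1 = f[1]
        $2 = f[2]
        for (i = 1; i <= col; i++) {
            if ((i,f[1],f[2]) in Val) $(i+2) = Val[i,f[1],f[2]]
            else $(i+2) = def[f[1],f[2]]
        }
        print
    }
}' OFS='\t' inp* > result_file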
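
One thing to watch: awk's `for (v in def)` does not promise any particular order, so the rows in result_file may not come out sorted by chromosome and position the way the Outp example is. A minimal sketch of one way to sort them afterwards, assuming the file is tab-separated with the header on the first line (as the command above writes it). The function name `sort_matrix` is made up for this example, and chromosome names are compared as plain text, so chr10 would land before chr2.

```shell
# sort_matrix FILE  (hypothetical helper, not part of the original answer)
# Prints the header line unchanged, then the data rows sorted by
# chromosome (field 1, plain text order) and position (field 2, numeric).
sort_matrix() {
    # The header is the first line of the file; pass it through untouched.
    head -n 1 "$1"
    # Everything from line 2 onwards is data; sort on tab-separated fields.
    tail -n +2 "$1" | sort -t "$(printf '\t')" -k1,1 -k2,2n
}
```

It could then be run as `sort_matrix result_file > result_sorted`. Splitting header and body with head and tail keeps the header out of the sort, which would otherwise move it among the data rows.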
