Thursday, April 7, 2011

discard rows of one file by comparing the first two columns of two files

(1) the question
For example, I have two comma-delimited files (three columns each):

file_1
1,2,A
1,3,A
1,4,t
1,5,A
2,3,c
2,7,A
2,9,g
3,1,A
3,5,h
3,7,A

file_2
1,1,c
1,3,A
1,4,m
1,5,A
2,6,u
2,7,A
2,9,p
3,1,A
3,5,i
3,7,A


I want to discard the rows of file_1 whose first two columns also
appear in a row of file_2. After discarding, I have

1,2,A
2,3,c

(2) the solution

A.
All the easier methods will assume the files are sorted ...

I would be tempted to turn the second comma into a different delimiter
so the first two columns become a single 'key'

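# turn the second comma into '|' so columns 1 and 2 form one field
sed 's/,/|/2' file_1 > file_11
# extract the keys of both files
cut -f1 -d \| file_11 > file_1key
cut -f1,2 -d ',' file_2 > file_2key
# keys present only in file_1 (comm needs sorted input)
comm -23 file_1key file_2key > file_3key
# pull the full rows for those keys; the output keeps the '|' delimiter
join -t \| file_11 file_3key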
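
Since comm and join both expect sorted input, here is a minimal sketch of the same pipeline with explicit sort steps added. It is not part of the original answer. The sample rows are a deliberately shuffled subset of the data above, the scratch directory and the variable result_a are hypothetical, and LC_ALL=C is set on the assumption that sort, comm and join should all use the same byte-wise ordering.

```shell
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)" || exit 1
export LC_ALL=C   # same collation for sort, comm and join

# Unsorted sample input (subset of the rows from the question).
printf '%s\n' 2,3,c 1,5,A 1,2,A 3,7,A 1,3,A > file_1
printf '%s\n' 3,7,A 1,1,c 1,5,A 1,3,A 2,6,u > file_2

# Second comma becomes '|', then sort on the key field only.
sed 's/,/|/2' file_1 | sort -t '|' -k 1,1 > file_11
cut -d '|' -f 1 file_11 | sort -u > file_1key
cut -d ',' -f 1,2 file_2 | sort -u > file_2key

# Keys found only in file_1, joined back to the full rows.
comm -23 file_1key file_2key > file_3key
# Restore the comma delimiter in the final output.
result_a=$(join -t '|' file_11 file_3key | tr '|' ',')
printf '%s\n' "$result_a"
```

Sorting file_11 with -t '|' -k 1,1 rather than on the whole line keeps the order join sees consistent with the order of the key files, even when one key is a prefix of another (1,2 and 1,20 for instance).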

B.

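# extract the 'keys' of files 'a' and 'b' (keys = columns 1 and 2) and
# place them in new files
$ cut -d "," -f 1,2 a > a12
$ cut -d "," -f 1,2 b > b12

# treat file 'b12' as a list of patterns to use with grep. Ask grep to
# show the lines of 'a12' that *don't* match any of them (-F: fixed
# strings, -x: whole-line match, so key 1,2 cannot match 11,2). Store
# these unique keys in a new file:
$ grep -vxFf b12 a12 > keys_a

# finally turn each key into an anchored pattern (^key,) so it can only
# match the first two columns, and use these patterns with grep to
# extract the equivalent lines from file 'a'
$ sed 's/^/^/; s/$/,/' keys_a > patterns_a
$ grep -f patterns_a a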
1,2,A
2,3,c
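
For comparison, the same filtering can be sketched as a single awk pass that needs neither sorted input nor intermediate key files. This is an alternative technique, not one of the original answers. The file names a and b follow solution B, while the array name seen, the variable result_b and the scratch directory are hypothetical. It assumes b is not empty, because the NR == FNR test only identifies the first file while it has at least one line.

```shell
# Work in a scratch directory and recreate the sample files from the question.
cd "$(mktemp -d)" || exit 1
printf '%s\n' 1,2,A 1,3,A 1,4,t 1,5,A 2,3,c 2,7,A 2,9,g 3,1,A 3,5,h 3,7,A > a
printf '%s\n' 1,1,c 1,3,A 1,4,m 1,5,A 2,6,u 2,7,A 2,9,p 3,1,A 3,5,i 3,7,A > b

# While reading b (NR == FNR): remember every (col1,col2) pair.
# While reading a: print the rows whose pair was never seen in b.
result_b=$(awk -F, '
    NR == FNR { seen[$1 FS $2]; next }
    !(($1 FS $2) in seen)
' b a)
printf '%s\n' "$result_b"
```

Because the keys are compared as whole strings in an array lookup, there is no risk of a key matching part of a longer one, and the rows of a come out in their original order.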
