deharvy
asked on
Unix Script to Remove Duplicate Lines (with one unique column)
A file gets generated for me that provides me with an output similar to this:
#Player#,BB,N,1001,Albert,Cardinals
#Player#,BB,N,1002,Albert,Cardinals
#Player#,BB,N,1003,Albert,Cardinals
#Player#,BB,A,2001,Alex,Yankees
#Player#,BB,A,2002,Alex,Yankees
#Player#,BB,A,2003,Alex,Yankees
#Player#,BB,A,2004,Alex,Yankees
#Player#,BB,A,3001,Adrian,Red Sox
#Player#,BB,A,3002,Adrian,Red Sox
#Player#,BB,A,3003,Adrian,Red Sox
#Player#,BB,A,3004,Adrian,Red Sox
#Player#,BB,A,3005,Adrian,Red Sox
You can see that there are duplicate lines; however, there is one column, the 4th, that contains a unique number.
I need a script smart enough to ignore the unique number in the 4th column and then remove any duplicate lines. It doesn't matter to me whether the first instance is kept or the last, but the output should come out like this:
#Player#,BB,N,1001,Albert,Cardinals
#Player#,BB,A,2001,Alex,Yankees
#Player#,BB,A,3001,Adrian,Red Sox
or
#Player#,BB,N,1003,Albert,Cardinals
#Player#,BB,A,2004,Alex,Yankees
#Player#,BB,A,3005,Adrian,Red Sox
I only care about having one truly unique line.
Any help would be greatly appreciated.
Hi,
which version of uniq are you using?
Can you please run
uniq --help
and send the output to me?
Also, I tested ozo's method here and it works perfectly:
perl -F, -ane 'print unless $seen{"@F[0..2,4..5]"}++' test1.txt
#Player#,BB,N,1001,Albert,Cardinals
#Player#,BB,A,2001,Alex,Yankees
#Player#,BB,A,3001,Adrian,Red Sox
Answer to your question: fields 0..2 and 4..5 are compared for uniqueness, while field 3, which holds the number, is ignored.
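If awk is easier to drop into a script than perl, the same idea can be written as a one-liner (just a sketch, assuming the same test1.txt sample file and six comma-separated fields):
# print a line only the first time its combination of fields 1,2,3,5 and 6 is seen;
# field 4, the unique number, is left out of the key
awk -F, '!seen[$1,$2,$3,$5,$6]++' test1.txt
Like the perl version, this keeps the first instance of each group. Note that awk numbers fields from 1, while perl's @F starts at 0.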
ASKER
You guys are awesome. Just learned something. Thanks so much. uniq -s works as well as the perl script, now that I understand the logic and have updated the code.
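The exact uniq command wasn't posted, but assuming the prefix up to and including the 4-digit number is always 19 characters wide, as it is in the sample ("#Player#,BB,N,1001," is 19 characters), a skip-chars call like this behaves the same way, since the duplicate lines are already adjacent:
# skip the first 19 characters when comparing adjacent lines,
# so only the name and team decide whether a line is a duplicate
uniq -s 19 test1.txt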
ASKER
Thanks so much!
ASKER
uniq: illegal option -- w
Usage: uniq [-c|-d|-u][-f fields][-s char] [input_file [output_file]]
Or: uniq [-c|-d|-u][-n][+m] [input_file [output_file]]
----------------
The perl didn't work either with the output I have.
Can you explain what this means? [0..2,4..5]
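As noted above, those are the field positions perl compares. A tiny standalone sketch, using one line of the sample data as an illustration, shows what the slice picks out; @F is the array of comma-separated fields that perl's -a and -F, switches build for each input line, and 0..2 and 4..5 are ranges of its indices:
use strict;
use warnings;

# One line of the sample data, split into fields the same way
# the one-liner's -F, switch does it.
my @F = split /,/, '#Player#,BB,N,1001,Albert,Cardinals';

# The slice picks elements 0,1,2 and 4,5 (everything except element 3,
# which is the unique number).
my @key = @F[0..2, 4..5];

print "@key\n";   # prints: #Player# BB N Albert Cardinals
Field numbering starts at 0 here, so field 3 is the fourth column, the unique number that gets ignored.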