deharvy

asked on

Unix Script to Remove Duplicate Lines (with one unique column)

A file gets generated for me that provides me with an output similar to this:

#Player#,BB,N,1001,Albert,Cardinals
#Player#,BB,N,1002,Albert,Cardinals
#Player#,BB,N,1003,Albert,Cardinals
#Player#,BB,A,2001,Alex,Yankees
#Player#,BB,A,2002,Alex,Yankees
#Player#,BB,A,2003,Alex,Yankees
#Player#,BB,A,2004,Alex,Yankees
#Player#,BB,A,3001,Adrian,Red Sox
#Player#,BB,A,3002,Adrian,Red Sox
#Player#,BB,A,3003,Adrian,Red Sox
#Player#,BB,A,3004,Adrian,Red Sox
#Player#,BB,A,3005,Adrian,Red Sox

You can see that there are duplicate lines; however, one column, the 4th, has a unique number on every line.

I need a script that would be smart enough to ignore the unique number in the 4th column and then remove any duplicate lines. It doesn't matter to me whether the first instance is kept or the last, but the output should come out like this:

#Player#,BB,N,1001,Albert,Cardinals
#Player#,BB,A,2001,Alex,Yankees
#Player#,BB,A,3001,Adrian,Red Sox

or

#Player#,BB,N,1003,Albert,Cardinals
#Player#,BB,A,2004,Alex,Yankees
#Player#,BB,A,3005,Adrian,Red Sox

I only care about having one truly unique line.

Any help would be greatly appreciated.
SOLUTION
Steve Tempest
ASKER CERTIFIED SOLUTION
deharvy

ASKER

The uniq doesn't seem to work for me:

uniq: illegal option -- w
Usage:  uniq [-c|-d|-u][-f fields][-s char] [input_file [output_file]]
Or:     uniq [-c|-d|-u][-n][+m] [input_file [output_file]]

----------------

The perl didn't work either with the output I have.

Can you explain what this means? [0-2,4-5]
Hi,

Which version of uniq are you using?

Can you please run

uniq --help

and send the output to me?

Also, I tested ozo's method here and it works perfectly:

perl -F, -ane 'print unless $seen{"@F[0-2,4-5]"}++' test1.txt
#Player#,BB,N,1001,Albert,Cardinals
#Player#,BB,A,2001,Alex,Yankees
#Player#,BB,A,3001,Adrian,Red Sox

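To answer your question: the list inside @F[...] picks which comma-separated fields (numbered from 0) are compared for uniqueness. The intent is to compare fields 0 to 2 and 4 to 5 and to ignore field 3, which is the number. Strictly speaking, Perl writes a range as 0..2, so the slice should read @F[0..2,4..5]. As typed, 0-2 and 4-5 are subtractions that give -2 and -1, i.e. the last two fields (name and team), which happens to produce the same result on this data.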
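The key-based approach described above can be sketched as follows. This is only an illustration, not necessarily what the hidden answers contained: the function names dedupe_perl and dedupe_awk are made up here, and the awk version is an alternative that does not appear in the thread. Both keep the first line seen for each combination of the non-numeric columns.

```shell
#!/bin/sh
# Sketch: drop every line that matches an earlier line in all columns
# except the 4th. Function names are illustrative, not from the thread.

# Perl: @F is zero-based and ".." builds the ranges 0..2 and 4..5.
# (Written as 0-2,4-5 the indices are subtractions, -2 and -1,
# which select only the last two fields.)
dedupe_perl() {
    perl -F, -ane 'print unless $seen{"@F[0..2,4..5]"}++' "$@"
}

# awk: fields are one-based, so the key is columns 1,2,3,5,6 and
# column 4 (the unique number) is left out of the comparison.
dedupe_awk() {
    awk -F, '!seen[$1 FS $2 FS $3 FS $5 FS $6]++' "$@"
}
```

Either function reads the named files, or standard input when none are given, for example: dedupe_awk test1.txt > deduped.txt. Because the key is remembered in a hash, the duplicate lines do not need to be adjacent.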

SOLUTION
deharvy

ASKER

You guys are awesome. Just learned something. Thanks so much. Both uniq -s and the perl script work now that I understand the logic and have updated the code.
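The exact uniq -s command that ended up working is not shown in the thread. A minimal sketch, assuming the text up to and including the 4th column always has the same width as in the sample (19 characters, "#Player#,BB,N,1001,") and that duplicate lines are adjacent, might look like this. The function name dedupe_uniq and the offset 19 are assumptions derived from the sample data only.

```shell
#!/bin/sh
# Sketch: skip the first 19 characters ("#Player#,BB,N,1001,") of each
# line and compare only what follows (name and team).
# Caveats: uniq merges adjacent duplicates only, and skipping by
# character count also ignores columns 1 to 3, not just column 4.
dedupe_uniq() {
    uniq -s 19 "$@"
}
```

Usage would be along the lines of dedupe_uniq test1.txt. If the leading columns can vary in width, a field-based key is the safer choice.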
deharvy

ASKER

Thanks so much!