Severcorr (United States of America) asked:

UNIX BASH

$ cat file1

field1                                               field2
ACT_MSTR                                      294705808
ACT_MSTR                                      294705808
CC_PROVIDERS                              2697541582
CC_PROVIDERS                              2697541581
CLOSED_WIN_JOB_CENTERS         3353753290
CLOSED_WIN_JOB_CENTERS         3353753290

If two records have the same field1 value, compare their field2 values; if the field2 values differ, print only those records.

I would like my new file to be as follows:
CC_PROVIDERS                              2697541582
CC_PROVIDERS                              2697541581

THANKS!
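One way to express this requirement is a two-pass awk sketch (my own illustration, not the accepted solution): count how often each field2 value occurs, then print only the lines whose value occurs exactly once. For a duplicated field1 key, that leaves exactly the records whose field2 values differ.

```shell
# Recreate the sample file1 from the question (spacing collapsed to single spaces)
cat > file1 <<'EOF'
field1 field2
ACT_MSTR 294705808
ACT_MSTR 294705808
CC_PROVIDERS 2697541582
CC_PROVIDERS 2697541581
CLOSED_WIN_JOB_CENTERS 3353753290
CLOSED_WIN_JOB_CENTERS 3353753290
EOF

# Pass 1 (NR==FNR): count each field2 value.
# Pass 2: print only lines whose field2 value occurs exactly once;
# FNR>1 skips the header row on the second pass.
awk 'NR==FNR { count[$2]++; next } FNR>1 && count[$2]==1' file1 file1
# -> the two CC_PROVIDERS lines
```

Note this keys on field2 alone rather than on the (field1, field2) pair, which is sufficient for data shaped like the sample above.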
SOLUTION by arnold (United States of America)
Severcorr (ASKER)

For some reason, none of the commands were working. Thanks anyway.
Which Linux are you using?
Which shell do you have?

Check whether you have all the commands available: egrep, awk, sort, uniq.
Here is my Unix version, and those commands are present:

Linux WIAMigrationDB 2.6.18-164.11.1.el5.centos.plus #1 SMP Wed Jan 20 18:49:35 EST 2010 x86_64 x86_64 x86_64 GNU/Linux
You should have all those tools.
Look in /var/log/rpmpkgs.

yum install gawk grep coreutils
They should be available in the /bin/ directory.
Try this then:
/bin/egrep -iv "`/bin/awk ' { print $2 } ' file1| /bin/grep -v field|/bin/sort | /bin/uniq -c | /bin/sort -rn | /bin/awk ' ($1 >1 ) { print $2 } '`" file1
I installed gawk, grep, and coreutils and it still doesn't work. Did you create a test file and test the scripts on your side? Thanks.
Yes, I used a copy of the post you made and named it file1, and when executing the commands, file1 was in the current working directory, i.e. ls file1 returned information about the file.

Could you post what errors you are getting?
I didn't get any error, but it came back with an empty result.
What are the contents of file1?
Post the output of the following:

awk ' { print  NF } ' file1
Do all lines reflect/report 2?
I ran awk ' { print  NF } ' file1. It just printed out one number throughout. By the way, what does awk ' ($1 >1 )' mean? Thanks.
I am sorry. When I tested the example file above it worked, but when I use it on my real file it doesn't work. I don't know why. Sorry to have put you through all this. Thanks.

Real file:

egrep -iv "`awk ' { print $3 } ' file1| grep -v field|sort | uniq -c | sort -rn | awk ' ($1 >1 ) { print $3 } '`" file1
ACT_MSTR                                          130.18.156.193 294705808
ACT_MSTR                                          130.18.156.191 294705808
CC_PROVIDERS                                  130.18.156.193 2697541582
CC_PROVIDERS                                  130.18.156.191 2697541581
CLOSED_WIN_JOB_CENTERS             130.18.156.193 3353753290
CLOSED_WIN_JOB_CENTERS             130.18.156.191 3353753290
CLOSED_WIN_JOB_CENTERS_OLD     130.18.156.193 2817260696
CLOSED_WIN_JOB_CENTERS_OLD     130.18.156.191 2817260696
Which number did awk ' { print NF } ' file1 output? Was it the number 2?

awk ' ($1 >1 ) { print $2 } ' deals with the output of uniq -c | sort -rn.
uniq -c returns a table:
number_of_occurrences   element

sort -rn does a reverse numeric sort on the first column, so the items that occur most frequently are at the top.

The awk ' ($1>1) ' test reads the first column; if the value is greater than 1, the entry/element is not unique and, based on your question, is output as a pattern for egrep -iv, which discards every line that matches.

Here is the logic, from the inside out:

1) Determine all the elements that are not unique. This is done by splitting the data columns of file1 and processing only field2. The items in field2 are sorted (sort), grouped and counted (uniq -c), then reverse-sorted (sort -rn), and the list of non-unique elements is output. At the conclusion of this step, you have a list of:
294705808
3353753290

2) Use the above list as a negation, i.e. any line that matches one of these patterns is discarded, which is what egrep/grep -v does. The -i flag makes the match case-insensitive; -v inverts the match, so only the lines matching none of the patterns are printed.
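The two steps above can be traced stage by stage on the sample data (a sketch of my own; each stage is described in the comment block):

```shell
cat > file1 <<'EOF'
field1 field2
ACT_MSTR 294705808
ACT_MSTR 294705808
CC_PROVIDERS 2697541582
CC_PROVIDERS 2697541581
CLOSED_WIN_JOB_CENTERS 3353753290
CLOSED_WIN_JOB_CENTERS 3353753290
EOF

# awk '{ print $2 }'           : extract field2 only
# grep -v field                : drop the header row
# sort | uniq -c               : count occurrences of each value
# sort -rn                     : most frequent first
# awk '($1 > 1) { print $2 }'  : keep only values seen more than once
awk '{ print $2 }' file1 | grep -v field | sort | uniq -c | sort -rn | awk '($1 > 1) { print $2 }'
# -> the two duplicated values, 294705808 and 3353753290 (order may vary)
```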

The command did not work because you now have three elements, and the element on which you want to trigger is in the third position, not the second position as your first post suggested.
The fix, as you can see, is to use the third element ($3) instead of the second ($2) in the first awk.
I did that as you can see from the below. It's weird that it didn't work.

egrep -iv "`awk ' { print $3 } ' file1| grep -v field|sort | uniq -c | sort -rn | awk ' ($1 >1 ) { print $3 } '`" file1
You changed too many; you only needed to change the first one, the one that deals with building the non-exclusive/unique pattern:

egrep -iv "`awk ' { print $3 } ' file1| grep -v field|sort | uniq -c | sort -rn | awk ' ($1 >1 ) { print $2 } '`" file1
It didn't work, but I accept your solution though.
It works for me.
Running
egrep -iv "`awk ' { print $3 } ' file1| grep -v field|sort | uniq -c | sort -rn | awk ' ($1 >1 ) { print $2 } '`" file1

where file1 is:
ACT_MSTR                                          130.18.156.193 294705808
ACT_MSTR                                          130.18.156.191 294705808
CC_PROVIDERS                                  130.18.156.193 2697541582
CC_PROVIDERS                                  130.18.156.191 2697541581
CLOSED_WIN_JOB_CENTERS             130.18.156.193 3353753290
CLOSED_WIN_JOB_CENTERS             130.18.156.191 3353753290
CLOSED_WIN_JOB_CENTERS_OLD     130.18.156.193 2817260696
CLOSED_WIN_JOB_CENTERS_OLD     130.18.156.191 2817260696

the output is:
CC_PROVIDERS                                  130.18.156.193 2697541582
CC_PROVIDERS                                  130.18.156.191 2697541581


What separates the columns? Are they a mixture of tabs and spaces?

Yes, they are a mixture of spaces.
Are they spaces and tabs, or just spaces?

awk ' { print NF,$3 } ' file1
should return the same number of fields on every line: 3, based on your latest example. If the count varies, this will cause problems. The second item printed should be the element on which you want to filter.
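A quick way to surface only the problem lines, if any (my own sketch, not from the thread):

```shell
# Sample data with one malformed row (only two fields on the last line)
cat > file1 <<'EOF'
ACT_MSTR 130.18.156.193 294705808
ACT_MSTR 130.18.156.191 294705808
CC_PROVIDERS 130.18.156.193
EOF

# Print the line number and content of any line whose field count is not 3;
# no output would mean every line splits into exactly three fields.
awk 'NF != 3 { print FNR": "$0 }' file1
# -> 3: CC_PROVIDERS 130.18.156.193
```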
I ran the awk ' { print NF,$3} ' file1. It returned the below:


3 650947209
3 1667895989
3 1667895989
3 4232622253
3 4232622253
3 3436253646
Running the command should return the two lines that have 650947209 and 3436253646.

Try it one step at a time.
What is the output from running:
awk ' { print $3 } ' file1| grep -v field|sort | uniq -c | sort -rn | awk ' ($1 >1 ) { print $2 } '
Does it return:
1667895989
1667895989
?

Note that the pattern match for 650947209 will also match *650947209 and 650947209*, where * is any combination of letters/numbers/characters, because grep matches substrings anywhere in the line.
$ cat file1
ACT_MSTR                                      130.18.156.193 294705808
ACT_MSTR                                      130.18.156.191 294705808
CC_PROVIDERS                              130.18.156.193 2697541582
CC_PROVIDERS                              130.18.156.191 2697541581
CLOSED_WIN_JOB_CENTERS         130.18.156.193 3353753290
CLOSED_WIN_JOB_CENTERS         130.18.156.191 3353753290
CLOSED_WIN_JOB_CENTERS_OLD 130.18.156.193 2817260696
CLOSED_WIN_JOB_CENTERS_OLD 130.18.156.191 2817260696

$ awk ' { print $3 } ' file1| grep -v field|sort | uniq -c | sort -rn | awk ' ($1 >1 ) { print $2 } '

Returning:
3353753290
294705808
2817260696


Do you see any problem with the returned results? Thanks.



I just found out that if field3 contains zero, it doesn't work: an empty result is returned when executing the command.


$ cat file1

ACT_MSTR                                      130.18.156.193 294705808
ACT_MSTR                                      130.18.156.191 294705808
CC_PROVIDERS                              130.18.156.193 2697541582
CC_PROVIDERS                              130.18.156.191 2697541581
CLOSED_WIN_JOB_CENTERS         130.18.156.193 3353753290
CLOSED_WIN_JOB_CENTERS         130.18.156.191 3353753290
CLOSED_WIN_JOB_CENTER_NEW  130.18.156.193 0
CLOSED_WIN_JOB_CENTER_NEW  130.18.156.191 0
CLOSED_WIN_JOB_CENTERS_OLD 130.18.156.193 2817260696
CLOSED_WIN_JOB_CENTERS_OLD 130.18.156.191 2817260696


$ awk ' { print $3 } ' file1| grep -v field|sort | uniq -c | sort -rn | awk ' ($1 >1 ) { print $2 } '





The problem is that the pattern 0 matches the 0 inside 130.18.156.193 and 130.18.156.191 in the second field of every line; -v then tells egrep to discard every matching line, which is why there is no output.
In my case, is there an alternative way to get it to work, besides deleting lines that have zero in field3? Thanks.
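One possible workaround (my own sketch, not necessarily the accepted fix): skip grep patterns entirely and compare the third field exactly in a two-pass awk, so a value of 0 can never match the 0 inside an IP address:

```shell
cat > file1 <<'EOF'
ACT_MSTR 130.18.156.193 294705808
ACT_MSTR 130.18.156.191 294705808
CC_PROVIDERS 130.18.156.193 2697541582
CC_PROVIDERS 130.18.156.191 2697541581
CLOSED_WIN_JOB_CENTER_NEW 130.18.156.193 0
CLOSED_WIN_JOB_CENTER_NEW 130.18.156.191 0
EOF

# Pass 1: count exact occurrences of each $3 value.
# Pass 2: print only lines whose $3 value occurred exactly once.
awk 'NR==FNR { count[$3]++; next } count[$3]==1' file1 file1
# -> the two CC_PROVIDERS lines; the rows with 0 are correctly suppressed
```

Because awk compares whole-field values rather than substrings, 0 only collides with a third field that is literally 0.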
SOLUTION
ASKER CERTIFIED SOLUTION
Can I combine those commands into one bash script? How? Thanks.


$ awk '{printf "%-45s %-14s %s\n", $2,$4,$7}' test > file1

$ awk ' { print $3 } ' file1| grep -v field|sort | uniq -c | sort -rn | awk ' ($1 >1 ) { print $2 } ' | /opt/mysql/5.1.46/test6.sh file1


You cannot. The problem is that the output of the first, formatting awk is the input to the second command. It is possible to replace the awk-based non-unique data extraction with a Perl script.

I am puzzled: if this is data from a database, why not run the query on the database itself, i.e. use a GROUP BY on the field as a negative condition within a WHERE clause?


What is in the test6.sh script, or is that the name you gave to the Perl script from my prior comment?

I'll look into a single script such that you will get the output by executing the following:
/opt/mysql/5.1.46/test6.sh test 2 4 7

But you can actually use test as the input instead of file1, as long as you adjust the fields, i.e. use $7 instead of $3.

The Perl script will not need to be modified, as it actually looks at the last element in the array, which will be the last column of test if there are only 7 columns.
The script below should take the test file as an argument and output the unique report.

If you want to post a few rows from test, I could modify the script further.
Syntax-wise the script should work, and the processing should be fine as well.

#!/usr/bin/perl

if ( $#ARGV < 0 ) { # make sure a filename argument was provided
   print "Usage: $0 filename\n";
   exit;
}
elsif ( -f $ARGV[0] ) { # check that the argument is a regular file
open (FILE, "<", $ARGV[0]) or die "Failed to open the file $ARGV[0] for reading\n";
while (<FILE>) {
chomp();
@array=split(/\s+/,$_);
# your example deals with columns 2, 4, 7 in awk, which counts from 1;
# Perl's split counts from 0, so the indices become 1, 3, 6
$hash{$array[6]}+=1;
push (@filearray,$_); # keep every line from the file referenced in $ARGV[0]
} # finished building the occurrence-count hash and the line array
close (FILE);
foreach $row (@filearray) { # second pass over the saved file contents
@array=split(/\s+/,$row);
if ( exists $hash{$array[6]} && $hash{$array[6]}>1 ) { # two or more entries: skip

}
else {
       printf "%-45s %-14s %s\n",$array[1],$array[3],$array[6];
} # close if

} # close foreach

} # close if


Arnold is very patient and professional. His Unix skills are awesome.