Severcorr
asked on
UNIX BASH
$ cat file1
field1 field2
ACT_MSTR 294705808
ACT_MSTR 294705808
CC_PROVIDERS 2697541582
CC_PROVIDERS 2697541581
CLOSED_WIN_JOB_CENTERS 3353753290
CLOSED_WIN_JOB_CENTERS 3353753290
If two records have the same field1, compare their field2 values. If the values are not the same, print ONLY those records.
I would like my new file to be as follows:
CC_PROVIDERS 2697541582
CC_PROVIDERS 2697541581
THANKS!
SOLUTION
Which linux are you using?
Which shell do you have?
Check whether you have all the commands available: egrep, awk, sort, uniq.
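A quick way to run that check (a minimal sketch; the tool list is taken from the pipeline discussed below) is `command -v`, which prints the resolved path and fails when a tool is absent:

```shell
# Report any of the pipeline's tools that are missing from PATH.
for tool in egrep awk grep sort uniq; do
  command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
done
```

No output means everything needed is installed.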
ASKER
Here is my unix version, and those commands are there.
Linux WIAMigrationDB 2.6.18-164.11.1.el5.centos .plus #1 SMP Wed Jan 20 18:49:35 EST 2010 x86_64 x86_64 x86_64 GNU/Linux
You should have all those tools.
look in /var/log/rpmpkgs
yum install gawk grep coreutils
they should be available in the /bin/ directory
try this then:
/bin/egrep -iv "`/bin/awk ' { print $2 } ' file1| /bin/grep -v field|/bin/sort | /bin/uniq -c | /bin/sort -rn | /bin/awk ' ($1 >1 ) { print $2 } '`" file1
ASKER
I installed gawk, grep, and coreutils and it still doesn't work. Did you create a test file and test the scripts on your side? Thanks.
Yes, I used a copy of the data you posted and named it file1. When executing the commands, file1 was in the current working directory, i.e. ls file1 returned information about the file.
Could you post what errors you are getting?
ASKER
I didn't get any error, but it came back with an empty value.
what are the contents of file1?
post the following
awk ' { print NF } ' file1
Do all lines reflect/report 2?
ASKER
I ran awk ' { print NF } ' file1. It just printed out one number throughout. By the way, what does awk ' ($1 >1 ) ' mean? Thanks.
ASKER
I am sorry. When I tested the above file (e.g. file) it worked. But when I use it to test my real file, it doesn't work. I don't know why. Sorry to have you go through all this. Thanks.
Real file:
egrep -iv "`awk ' { print $3 } ' file1| grep -v field|sort | uniq -c | sort -rn | awk ' ($1 >1 ) { print $3 } '`" file1
ACT_MSTR 130.18.156.193 294705808
ACT_MSTR 130.18.156.191 294705808
CC_PROVIDERS 130.18.156.193 2697541582
CC_PROVIDERS 130.18.156.191 2697541581
CLOSED_WIN_JOB_CENTERS 130.18.156.193 3353753290
CLOSED_WIN_JOB_CENTERS 130.18.156.191 3353753290
CLOSED_WIN_JOB_CENTERS_OLD 130.18.156.193 2817260696
CLOSED_WIN_JOB_CENTERS_OLD 130.18.156.191 2817260696
Which character/number did the awk ' { print NF } ' file1 output? Was it the number 2?
awk ' ($1 >1 ) { print $2 } ' deals with the output of uniq -c | sort -rn.
uniq -c returns a table:
number_of_occurrences Element
sort -rn does a reverse numeric sort on the first column, so the items that occur most frequently are at the top.
The awk ' ($1>1) ' condition looks at the first column; if the value is greater than 1, the entry/element is not unique and, per your question, is output as a pattern for grep -v, which discards every matching line.
Here is the logic, from the inside out:
1) Determine all the elements that are not unique. This is done by splitting the data columns of file1 and processing only field2.
The items in field2 are sorted (sort), grouped and counted (uniq -c), and reverse sorted (sort -rn). The list of non-unique elements is output. At the conclusion of this step, you have a list of:
294705808
3353753290
2) Use the above list as a negation, i.e. any line that matches one of these patterns is discarded, which is what egrep/grep -v does. The backtick substitution produces one pattern per line, and egrep treats each line as a separate pattern, checking every line of file1 against each of them.
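The stages can be inspected one at a time on the sample data from the question (the file name sample.txt is illustrative):

```shell
# Recreate the sample data from the question (header line omitted).
cat > sample.txt <<'EOF'
ACT_MSTR 294705808
ACT_MSTR 294705808
CC_PROVIDERS 2697541582
CC_PROVIDERS 2697541581
CLOSED_WIN_JOB_CENTERS 3353753290
CLOSED_WIN_JOB_CENTERS 3353753290
EOF

# Step 1: extract field2, sort, and count how often each value occurs;
# 294705808 and 3353753290 are counted twice, the CC_PROVIDERS values once.
awk '{ print $2 }' sample.txt | sort | uniq -c | sort -rn

# Step 2: keep only the values counted more than once (the duplicates).
dups=$(awk '{ print $2 }' sample.txt | sort | uniq -c | sort -rn |
       awk '($1 > 1) { print $2 }')

# Step 3: egrep treats each line of $dups as a separate pattern; -v then
# discards every line containing a duplicated value.
egrep -v "$dups" sample.txt
# CC_PROVIDERS 2697541582
# CC_PROVIDERS 2697541581
```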
The command does not work on the real file because you now have three columns, and the element you want to trigger on is in the third position, not the second as your first post suggested.
The fix, as you can see, is to use the third element ($3) instead of the second ($2) in awk.
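The column shift is easy to see in isolation (a minimal sketch using one line of the real file):

```shell
# With three columns, $2 is now the IP address and $3 is the checksum.
echo "ACT_MSTR 130.18.156.193 294705808" | awk '{ print $2 }'
# 130.18.156.193
echo "ACT_MSTR 130.18.156.193 294705808" | awk '{ print $3 }'
# 294705808
```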
ASKER
I did that as you can see from the below. It's weird that it didn't work.
egrep -iv "`awk ' { print $3 } ' file1| grep -v field|sort | uniq -c | sort -rn | awk ' ($1 >1 ) { print $3 } '`" file1
You changed too many; you only need to change the first one.
The one that deals with building the non-exclusive/unique pattern
egrep -iv "`awk ' { print $3 } ' file1| grep -v field|sort | uniq -c | sort -rn | awk ' ($1 >1 ) { print $2 } '`" file1
ASKER
It didn't work, but I accept your solution though.
It works for me:
Running
egrep -iv "`awk ' { print $3 } ' file1| grep -v field|sort | uniq -c | sort -rn | awk ' ($1 >1 ) { print $2 } '`" file1
where file1 is:
ACT_MSTR 130.18.156.193 294705808
ACT_MSTR 130.18.156.191 294705808
CC_PROVIDERS 130.18.156.193 2697541582
CC_PROVIDERS 130.18.156.191 2697541581
CLOSED_WIN_JOB_CENTERS 130.18.156.193 3353753290
CLOSED_WIN_JOB_CENTERS 130.18.156.191 3353753290
CLOSED_WIN_JOB_CENTERS_OLD 130.18.156.193 2817260696
CLOSED_WIN_JOB_CENTERS_OLD 130.18.156.191 2817260696
the output is:
CC_PROVIDERS 130.18.156.193 2697541582
CC_PROVIDERS 130.18.156.191 2697541581
What separates the columns? are they a mixture of tabs and spaces?
ASKER
Yes, they are a mixture of spaces.
Are they spaces and tabs, or just spaces?
Running awk ' { print NF,$3} ' file1
should return the same number of fields on every line:
3, based on your latest example.
If the count varies, this will cause problems,
and the second item should be the element on which you want to filter.
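Note that awk's default field splitting treats runs of spaces and tabs alike, so mixed separators alone should not change NF (a minimal sketch with hypothetical data):

```shell
# Two lines with mixed tab/space separators; awk's default FS splits on
# any run of whitespace, so both report the same field count.
printf 'A\t1 x\nB 2\ty\n' | awk '{ print NF }'
# 3
# 3
```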
ASKER
I ran the awk ' { print NF,$3} ' file1. It returned the below:
3 650947209
3 1667895989
3 1667895989
3 4232622253
3 4232622253
3 3436253646
Running the command should return the two lines that have 650947209 and 3436253646.
Try it one item at a time.
what is the output from running:
awk ' { print $3 } ' file1| grep -v field|sort | uniq -c | sort -rn | awk ' ($1 >1 ) { print $2 } '
Does it return:
1667895989
4232622253
?
Note the pattern match for 650947209 will match *650947209 as well as 650947209* where * is any combination of letters/numbers/characters
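This substring behaviour is easy to demonstrate; grep's -w flag (whole-word match) is one hedge against it, though it was not part of the commands above:

```shell
# A bare pattern matches anywhere in the line: "260" is found inside the
# checksum 2817260696, so the line counts as a match.
echo "CLOSED_WIN_JOB_CENTERS_OLD 2817260696" | grep -c 260
# 1

# With -w the pattern must match a whole word, so the embedded "260"
# no longer matches and the count is 0.
echo "CLOSED_WIN_JOB_CENTERS_OLD 2817260696" | grep -cw 260
# 0
```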
ASKER
$ cat file1
ACT_MSTR 130.18.156.193 294705808
ACT_MSTR 130.18.156.191 294705808
CC_PROVIDERS 130.18.156.193 2697541582
CC_PROVIDERS 130.18.156.191 2697541581
CLOSED_WIN_JOB_CENTERS 130.18.156.193 3353753290
CLOSED_WIN_JOB_CENTERS 130.18.156.191 3353753290
CLOSED_WIN_JOB_CENTERS_OLD 130.18.156.193 2817260696
CLOSED_WIN_JOB_CENTERS_OLD 130.18.156.191 2817260696
$ awk ' { print $3 } ' file2| grep -v field|sort | uniq -c | sort -rn | awk ' ($1 >1 ) { print $2 } '
Returning:
3353753290
294705808
2817260696
Do you see any problem with the returned results? Thanks.
ASKER
I just found out that if field3 contains zero, it doesn't work. An empty string is returned when executing the command.
$ cat file1
ACT_MSTR 130.18.156.193 294705808
ACT_MSTR 130.18.156.191 294705808
CC_PROVIDERS 130.18.156.193 2697541582
CC_PROVIDERS 130.18.156.191 2697541581
CLOSED_WIN_JOB_CENTERS 130.18.156.193 3353753290
CLOSED_WIN_JOB_CENTERS 130.18.156.191 3353753290
CLOSED_WIN_JOB_CENTER_NEW 130.18.156.193 0
CLOSED_WIN_JOB_CENTER_NEW 130.18.156.191 0
CLOSED_WIN_JOB_CENTERS_OLD 130.18.156.193 2817260696
CLOSED_WIN_JOB_CENTERS_OLD 130.18.156.191 2817260696
$ awk ' { print $3 } ' file1| grep -v field|sort | uniq -c | sort -rn | awk ' ($1 >1 ) { print $2 } '
The problem is that the pattern 0 matches the 0 inside 130 in the second field (and inside most of the checksums, too).
-v tells grep to discard every matching line, which is why there is no output.
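One way around the zero problem (a sketch, not the hidden accepted solution) is to drop the pattern-matching grep entirely and have awk compare the checksum field exactly:

```shell
# Build a file reproducing the zero case (named file1 as in the thread).
cat > file1 <<'EOF'
ACT_MSTR 130.18.156.193 294705808
ACT_MSTR 130.18.156.191 294705808
CC_PROVIDERS 130.18.156.193 2697541582
CC_PROVIDERS 130.18.156.191 2697541581
CLOSED_WIN_JOB_CENTER_NEW 130.18.156.193 0
CLOSED_WIN_JOB_CENTER_NEW 130.18.156.191 0
EOF

# Two-pass awk: the first pass over file1 counts each exact $3 value,
# the second pass prints only lines whose $3 value occurred once.
# No pattern matching, so a checksum of 0 cannot collide with "130".
awk 'NR == FNR { count[$3]++; next } count[$3] == 1' file1 file1
# CC_PROVIDERS 130.18.156.193 2697541582
# CC_PROVIDERS 130.18.156.191 2697541581
```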
ASKER
In my case, is there an alternative way to get it to work, besides deleting lines that have zero in field3? Thanks.
SOLUTION
ASKER CERTIFIED SOLUTION
ASKER
Can I combine those commands in one bash script? How? Thanks.
$ awk '{printf "%-45s %-14s %s\n", $2,$4,$7}' test > file1
$ awk ' { print $3 } ' file1| grep -v field|sort | uniq -c | sort -rn | awk ' ($1 >1 ) { print $2 } ' | /opt/mysql/5.1.46/test6.sh file1
You cannot. The problem is that the output of the first formatting awk is the input to the rest of the pipeline. It is possible to replace the awk non-unique data extraction with a perl script.
I am puzzled: if this is data from a database, why not run the query on the database itself, i.e. a group by on the field used as a negative entry within a where clause?
What is in the test6.sh script or is it the name you gave to the perl script in my prior comment?
I'll look into a single script such that you will have the output by executing the following:
/opt/mysql/5.1.46/test6.sh test 2 4 7
But you can actually use test as the input instead of file1, as long as you adjust the fields, i.e. instead of $3 you would use $7.
The perl script will not need to be modified, as it looks at the last element in the array, which will be the last column of test if there are only 7 columns.
The below should take the test file as an argument and output the unique report.
If you want to post a few rows from test I could modify the script further.
Syntax wise the script should work, the processing should be fine as well.
#!/usr/bin/perl
if ( $#ARGV < 0 ) {   # test to make sure an argument is provided
    print "Usage: $0 filename\n";
    exit;
}
elsif ( -f $ARGV[0] ) {   # check to make sure the argument provided is a regular file
    open FILE, "$ARGV[0]" or die "Failed to open the file $ARGV[0] for reading\n";
    while (<FILE>) {
        chomp();
        @array = split( /\s+/, $_ );
        # your example deals with columns 2,4,7 in awk, which starts counting at 1;
        # Perl/split starts at 0, so those become indices 1,3,6
        $hash{ $array[6] } += 1;
        push( @filearray, $_ );   # collect every line from the file referenced in $ARGV[0]
    }   # finished building the hash and the line array
    close(FILE);
    foreach $row (@filearray) {   # second pass over the lines pushed onto @filearray
        @array = split( /\s+/, $row );
        if ( exists $hash{ $array[6] } && $hash{ $array[6] } > 1 ) {
            # this checksum occurs more than once: skip it
        }
        else {
            printf "%-45s %-14s %s\n", $array[1], $array[3], $array[6];
        }   # close if
    }   # close foreach
}   # close if
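As a sanity check, the script can be exercised end to end with a small 7-column sample. The file names uniqreport.pl and test are illustrative, and the embedded copy reads from the file handle it opens (rather than STDIN) and uses perl's own -f file test:

```shell
# Save a runnable copy of the per-checksum filter as uniqreport.pl.
cat > uniqreport.pl <<'PERL'
#!/usr/bin/perl
if ( $#ARGV < 0 ) { print "Usage: $0 filename\n"; exit; }
elsif ( -f $ARGV[0] ) {
    open FILE, $ARGV[0] or die "Failed to open $ARGV[0]\n";
    while (<FILE>) {
        chomp;
        @array = split /\s+/;
        $hash{ $array[6] } += 1;        # count each checksum (awk's $7)
        push @filearray, $_;
    }
    close FILE;
    foreach $row (@filearray) {
        @array = split /\s+/, $row;
        next if $hash{ $array[6] } > 1; # skip duplicated checksums
        printf "%-45s %-14s %s\n", $array[1], $array[3], $array[6];
    }
}
PERL

# A 7-column sample where only the CC_PROVIDERS checksums differ.
cat > test <<'EOF'
1 ACT_MSTR x 130.18.156.193 a b 294705808
2 ACT_MSTR x 130.18.156.191 a b 294705808
3 CC_PROVIDERS x 130.18.156.193 a b 2697541582
4 CC_PROVIDERS x 130.18.156.191 a b 2697541581
EOF

# Prints only the two CC_PROVIDERS records (table, IP, checksum).
perl uniqreport.pl test
```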
ASKER
Arnold is very patient and professional. His Unix skills are awesome.