Solved

unix sort -u -k PERL UNIX

Posted on 2010-08-17
Last Modified: 2013-12-28
Hi,
I have been using the unix sort -u command within a Perl script.
I have an input file that has the following format:


2010-08-13T01:04:30,npdisconnect,0477203120,0477203120,,MOBM,C4900


The fields are: timestamp, command, number1, number2, a field of 4 characters or fewer (possibly empty), a 4-character field, and a 5-character field.

What I am interested in is the timestamp and number1.

Sometimes there are duplicate lines in terms of number1.


2010-08-13T11:30:54,npdisconnect,0496395646,0496395646,,BEMO,C4700
2010-08-13T11:36:01,npdisconnect,0496395646,0496395646,,BEMO,C4700

In this case, I would like to keep the latest timestamp only, so keep

2010-08-13T11:36:01,npdisconnect,0496395646,0496395646,,BEMO,C4700

I tried various versions of sort -u -k.
The -k option works in terms of a first field, second field, and so on, but since
my fields are separated by commas I am not sure whether I can use -k.

In any case, is there a way to use sort -u -k (or a Perl trick) so that, when
number1 is duplicated, only the line with the latest timestamp is kept?

example of input file:

2010-08-13T01:04:30,npdisconnect,0477203120,0477203120,,MOBM,C4900
2010-08-13T11:30:54,npdisconnect,0496395646,0496395646,,BEMO,C4700
2010-08-13T11:36:01,npdisconnect,0496395646,0496395646,,BEMO,C4700


desired output file:

2010-08-13T01:04:30,npdisconnect,0477203120,0477203120,,MOBM,C4900
2010-08-13T11:36:01,npdisconnect,0496395646,0496395646,,BEMO,C4700
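
On the -k question: sort accepts a field separator through -t, so comma-separated fields can be used with -k. The following is only a minimal sketch, assuming a POSIX sort and awk are available; the temporary file and the variable names are made up for illustration. It sorts by number1, then by timestamp newest first, and lets awk keep the first line seen for each number1. awk does the de-duplication because POSIX only says that sort -u keeps one line per key, not which one.

```shell
# Sample data from the question, written to a temporary file (hypothetical name)
infile=$(mktemp)
cat > "$infile" <<'EOF'
2010-08-13T01:04:30,npdisconnect,0477203120,0477203120,,MOBM,C4900
2010-08-13T11:30:54,npdisconnect,0496395646,0496395646,,BEMO,C4700
2010-08-13T11:36:01,npdisconnect,0496395646,0496395646,,BEMO,C4700
EOF

# -t, sets the field separator; -k3,3 sorts on number1 only;
# -k1,1r breaks ties on the timestamp, newest first.
# awk then prints a line only the first time its number1 ($3) appears.
result=$(LC_ALL=C sort -t, -k3,3 -k1,1r "$infile" | awk -F, '!seen[$3]++')
printf '%s\n' "$result"
rm -f "$infile"
```

The output is ordered by number1, not by time, and the input does not need to be in time order for this to work.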

Thanks.
Question by:Johannne1
LVL 10

Expert Comment

by:jeromee
ID: 33458512
This should work for you:
perl -F/,/ -ane'print unless $s{$F[2]}; $s{$F[2]}++' /your/file/path

Good luck!
0
 

Author Comment

by:Johannne1
ID: 33458554
Hi Jeromee,
Hang on, I shall try this, and let you know in about 10-15 minutes.
0
 

Author Comment

by:Johannne1
ID: 33458599
Hi Jeromee,
Can you please explain this? Is it a special way to compile or run Perl?
If so, I can't use it. Right now my Perl script has:
system "/usr/bin/sort -u $outfile > $outfilesorted";
system "/usr/bin/cp " . $outfilesorted." ". $outputname;
system "/usr/bin/rm " . $outfilesorted;
system "/usr/bin/rm " . $outfile;


I can replace the sort -u with some kind of sort -u -k.
Can you incorporate your solution, which looks like a regular expression substitution,
into the sort -u line above?
0
 
LVL 26

Expert Comment

by:wilcoxon
ID: 33458818
Jeromee's solution would be in place of your perl script and would give you the earliest time rather than the latest.

You can do what you want within a perl script like this.  Let me know if there are any problems (input not sorted by time, output must be sorted by time, etc).
#!/usr/local/bin/perl
use strict;
use warnings;

my $infile = shift; # pass input file name on command line
#my $infile = 'somefile'; # alternately hard-code it

# read the $infile and only keep the latest time row
# assumes input is in time order
open IN, $infile or die "could not open $infile: $!";
my %data = map { chomp; my @arr = split /,/; $arr[2] => $_ } <IN>;
close IN;

# output the file to STDOUT
# this will be unsorted but unique
print $data{$_}, "\n" foreach (keys %data);
# this would sort by number1 field
# print $data{$_}, "\n" foreach (sort keys %data);
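# sorting by timestamp is also simple here: each line starts with an ISO timestamp,
# so a plain string sort of the values is chronological
# print $_, "\n" foreach (sort values %data);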
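
Ordering the de-duplicated output by time is fairly direct for this data, because the timestamp is a fixed-width ISO 8601 string in the first field, so string order and chronological order agree. A small sketch, assuming a POSIX sort; the two lines are taken from the thread and deliberately given out of order, and the variable names are illustrative only.

```shell
# De-duplicated lines in arbitrary (hash) order, as a script might print them
unsorted='2010-08-13T11:36:01,npdisconnect,0496395646,0496395646,,BEMO,C4700
2010-08-13T01:04:30,npdisconnect,0477203120,0477203120,,MOBM,C4900'

# Fixed-width ISO 8601 timestamps compare correctly as plain strings,
# so sorting on field 1 puts the lines in chronological order.
by_time=$(printf '%s\n' "$unsorted" | LC_ALL=C sort -t, -k1,1)
printf '%s\n' "$by_time"
```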


0
 

Author Comment

by:Johannne1
ID: 33458982
Hi Wilcoxon,
I will try this out. it will take me about 30 minutes.
Johanne
0
 
LVL 10

Expert Comment

by:jeromee
ID: 33459090
If you want the latest timestamp, this should work:
    perl -F/,/ -ane'$s{$F[2]}=$_; END{print sort values %s}' /your/file/path
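
For comparison, the same "last line wins" idea can be sketched with awk instead of Perl. This is only an illustration, assuming a POSIX awk and sort and input that is already in time order; the array name last and the temporary file are invented for the example.

```shell
# Sample data from the question, written to a temporary file (hypothetical name)
infile=$(mktemp)
cat > "$infile" <<'EOF'
2010-08-13T01:04:30,npdisconnect,0477203120,0477203120,,MOBM,C4900
2010-08-13T11:30:54,npdisconnect,0496395646,0496395646,,BEMO,C4700
2010-08-13T11:36:01,npdisconnect,0496395646,0496395646,,BEMO,C4700
EOF

# Each line overwrites the entry stored under its number1 ($3), so once the
# whole file is read only the last line per number1 remains.
# awk's for-in order is unspecified, hence the final sort.
latest=$(awk -F, '{ last[$3] = $0 } END { for (k in last) print last[k] }' "$infile" | LC_ALL=C sort)
printf '%s\n' "$latest"
rm -f "$infile"
```

As with the Perl version, the whole-line sort at the end orders the result by timestamp, since the timestamp is the first thing on each line.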

0
 
LVL 10

Expert Comment

by:jeromee
ID: 33459163
If you want to replace this line in your script:
    system "/usr/bin/sort -u $outfile > $outfilesorted"
try this:
    system q(perl -F/,/ -ane'$s{$F[2]}=$_; END{print sort values %s}' ). "$outfile > $outfilesorted";


0
 

Author Comment

by:Johannne1
ID: 33459433
Hi Jeromee,
You are fast! I didn't see the above, so I will try it. I was trying Wilcoxon's, but I will try this out soon.
0
 

Author Comment

by:Johannne1
ID: 33459583
Hi Wilcoxon,

I got this to work. It is just taking me a long time because I am trying to understand how the
key/value pair works in your solution. First, I have read that Perl does not maintain the order
of elements in a hash. Looking at your my %data line: you build the hash with map, remove the
newline, and split each line on commas. I think $arr[2] is number1.
I am not sure how foreach (keys %data) manages to output exactly what I want, but it works.
I understand HashMaps and other map data structures in Java; can you explain how
this foreach works with %data?
I added an output file, and sometimes the 20 in 2010 is chopped... not sure why. The output
is correct:


$ more perl_sort.pl
#!/usr/local/bin/perl
use strict;
use warnings;

my $inputFile;
my $infile = "inputXXX.txt";
my $outputFile;
my $outputname = "outputXXX.txt";


# read the $infile and only keep the latest time row when duplicate numbers occur
# assumes input is in time order

open IN, $infile or die "could not open $infile: $!";
my %data = map { chomp; my @arr = split /,/; $arr[2] => $_ } <IN>;
close IN;

open ($outputFile, ">$outputname") || die "Can not open output file";
# output the file to STDOUT
# this will be unsorted but unique
#print $data{$_}, "\n" foreach (keys %data);

foreach (keys %data) {
    print $data{$_}, "\n";
    print $outputFile "  $data{$_}, \n";
}

Here is the output file; it has a comma in front of the 2010:

$ more outputXXX.txt
, 2010-08-13T14:32:14,npbroadcast,0470544430,0470544430,BEMO,MOBM,C4900
, 2010-08-13T11:42:14,npdisconnect,0494047810,0494047810,,BEMO,C4700

Is there a way to get:

2010-08-13T14:32:14,npbroadcast,0470544430,0470544430,BEMO,MOBM,C4900
2010-08-13T11:42:14,npdisconnect,0494047810,0494047810,,BEMO,C4700

Or is this complicated? If it is, I can live with the extra characters, or strip them out in another script.







0
 
LVL 26

Accepted Solution

by:
wilcoxon earned 500 total points
ID: 33459684
Sure.  I'll go over what the important lines are doing...

my %data = map { chomp; my @arr = split /,/; $arr[2] => $_ } <IN>;

is a short way of doing:

# loop over each line in the input
while (<IN>) {
    # remove the newline
    chomp;
    # split the line on comma and assign each piece to @arr
    my @arr = split /,/;
    # assign the full line to the hash with key of number1 (overwriting previous data)
    $data{$arr[2]} = $_; # $arr[2] = number1
}

foreach (keys %data) returns the keys in no guaranteed order. The order you saw happened to match what you wanted, but do not rely on it; use sort keys %data (or sort values %data) when the order matters.

You can remove the comma and spaces by changing:
print $outputFile "  $data{$_}, \n";
to
# {} added around $outputFile to make it clearer that it is an output file/stream
print {$outputFile} $data{$_}, "\n";

A quick hash primer...  In Perl, a hash can be built from (and flattens to) a list of the form (key1, val1, key2, val2, ..., keyX, valX).  Counting positions from 0, the function "keys" effectively returns the items at the even positions (and "values" returns the items at the odd positions).  This is why "%hash = map { $key => $val } @list" works (map is technically a list/array function).  So, "foreach (keys %data)" will loop over the keys of the %data hash one at a time.  Writing this made me realize that it could have been written more succinctly as "print $_, "\n" foreach (values %data)" and achieved the same thing (though possibly in a different order).

Let me know if you have any more questions...
0
