Solved

unix sort -u -k PERL UNIX

Posted on 2010-08-17
Last Modified: 2013-12-28
Hi,
I have been using the unix sort -u command within a Perl script.
I have an input file that has the following format:


2010-08-13T01:04:30,npdisconnect,0477203120,0477203120,,MOBM,C4900


timestamp,command,number1,number2,4 chars or less, 4 chars, 5 chars.

What I am interested in is the timestamp and number1.

Sometimes there are duplicate lines in terms of number1.


2010-08-13T11:30:54,npdisconnect,0496395646,0496395646,,BEMO,C4700
2010-08-13T11:36:01,npdisconnect,0496395646,0496395646,,BEMO,C4700

In this case, I would like to keep the latest timestamp only, so keep

2010-08-13T11:36:01,npdisconnect,0496395646,0496395646,,BEMO,C4700

I tried various versions of sort -u -k.
The -k option works on a first field, second field, and so on, but since my
fields are separated by commas I am not sure whether I can use -k.

In any case, is there a way to use sort -u -k (or a Perl trick) to keep, for each
duplicated number1, only the line with the latest timestamp?

example of input file:

2010-08-13T01:04:30,npdisconnect,0477203120,0477203120,,MOBM,C4900
2010-08-13T11:30:54,npdisconnect,0496395646,0496395646,,BEMO,C4700
2010-08-13T11:36:01,npdisconnect,0496395646,0496395646,,BEMO,C4700


desired output file:

2010-08-13T01:04:30,npdisconnect,0477203120,0477203120,,MOBM,C4900
2010-08-13T11:36:01,npdisconnect,0496395646,0496395646,,BEMO,C4700

Thanks.
Question by:Johannne1
10 Comments
 
LVL 10

Expert Comment

by:jeromee
ID: 33458512
This should work for you:
perl -F/,/ -ane'print unless $s{$F[2]}; $s{$F[2]}++' /your/file/path

Good luck!
 

Author Comment

by:Johannne1
ID: 33458554
Hi Jeromee,
Hang on, I shall try this, and let you know in about 10-15 minutes.
 

Author Comment

by:Johannne1
ID: 33458599
Hi Jeromee,
Can you please explain this? Is it a special way of running Perl? If so, I can't
use it. Right now my Perl script has:

system "/usr/bin/sort -u $outfile > $outfilesorted";
system "/usr/bin/cp " . $outfilesorted." ". $outputname;
system "/usr/bin/rm " . $outfilesorted;
system "/usr/bin/rm " . $outfile;


I can replace the sort -u with some kind of sort -u -k.
Can you incorporate your solution, which looks like a regular expression substitution, into
the sort -u line above?
 
LVL 26

Expert Comment

by:wilcoxon
ID: 33458818
Jeromee's solution would be in place of your perl script and would give you the earliest time rather than the latest.

You can do what you want within a perl script like this.  Let me know if there are any problems (input not sorted by time, output must be sorted by time, etc).
#!/usr/local/bin/perl
use strict;
use warnings;

my $infile = shift; # pass input file name on command line
#my $infile = 'somefile'; # alternately hard-code it

# read the $infile and only keep the latest time row
# assumes input is in time order
open IN, $infile or die "could not open $infile: $!";
my %data = map { chomp; my @arr = split /,/; $arr[2] => $_ } <IN>;
close IN;

# output the file to STDOUT
# this will be unsorted but unique
print $data{$_}, "\n" foreach (keys %data);
# this would sort by number1 field
# print $data{$_}, "\n" foreach (sort keys %data);
# sorting by timestamp would be possible but *MUCH* more difficult
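
For input that might not arrive in time order, a minimal sketch along the same lines is shown here. It assumes every line starts with a timestamp in the fixed-width format from the question, so timestamps compare correctly as plain strings. The sub name latest_per_number and the inline sample are illustrative, not part of the script above.

```perl
use strict;
use warnings;

# Keep, for each number1 (third field), the line with the latest timestamp.
# This sketch does not assume the input is in time order.
sub latest_per_number {
    my @lines = @_;
    my %latest;    # number1 => full line with the greatest timestamp so far
    for my $line (@lines) {
        chomp $line;
        next unless length $line;    # skip blank lines
        my ($timestamp, undef, $number1) = split /,/, $line;
        # Fixed-width ISO 8601 timestamps compare correctly as strings
        if (!exists $latest{$number1}
            or $timestamp gt (split /,/, $latest{$number1})[0]) {
            $latest{$number1} = $line;
        }
    }
    # Each line starts with its timestamp, so a string sort is a time sort
    return sort values %latest;
}

# Sample rows from the question, deliberately out of time order
my @sample = (
    "2010-08-13T11:36:01,npdisconnect,0496395646,0496395646,,BEMO,C4700\n",
    "2010-08-13T01:04:30,npdisconnect,0477203120,0477203120,,MOBM,C4900\n",
    "2010-08-13T11:30:54,npdisconnect,0496395646,0496395646,,BEMO,C4700\n",
);
print "$_\n" for latest_per_number(@sample);
```

With the sample rows this prints the two lines listed as the desired output in the question.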

 

Author Comment

by:Johannne1
ID: 33458982
Hi Wilcoxon,
I will try this out. It will take me about 30 minutes.
Johanne
 
LVL 10

Expert Comment

by:jeromee
ID: 33459090
If you want the latest timestamp, this should work:
    perl -F/,/ -ane'$s{$F[2]}=$_; END{print sort values %s}' /your/file/path
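
For readers unfamiliar with the switches, here is a rough longhand sketch of what the one-liner does: -n loops over the input lines, -a with -F/,/ splits each line on commas into @F, and the END block runs once after the last line. The sub name keep_last_per_number and the in-memory sample are illustrative assumptions; the one-liner itself reads the file named on the command line.

```perl
use strict;
use warnings;

# Longhand equivalent of: perl -F/,/ -ane'$s{$F[2]}=$_; END{print sort values %s}'
sub keep_last_per_number {
    my ($fh) = @_;
    my %s;    # number1 => last line seen for that number
    while (my $line = <$fh>) {          # what -n does
        my @F = split /,/, $line;       # what -a with -F/,/ does
        $s{ $F[2] } = $line;            # later lines overwrite earlier ones
    }
    # What the END block does: lines start with the timestamp,
    # so sorting the kept lines as strings sorts them by time
    return sort values %s;
}

# Sample input from the question, read through an in-memory filehandle
my $sample = <<'END_SAMPLE';
2010-08-13T01:04:30,npdisconnect,0477203120,0477203120,,MOBM,C4900
2010-08-13T11:30:54,npdisconnect,0496395646,0496395646,,BEMO,C4700
2010-08-13T11:36:01,npdisconnect,0496395646,0496395646,,BEMO,C4700
END_SAMPLE

open my $in, '<', \$sample or die "could not open in-memory sample: $!";
print keep_last_per_number($in);
close $in;
```

Like the one-liner, this keeps the last line seen for each number1, so it gives the latest timestamp only when the input is already in time order.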

 
LVL 10

Expert Comment

by:jeromee
ID: 33459163
if you want to replace this line in your script:
    system "/usr/bin/sort -u $outfile > $outfilesorted"
try this:
    system q(perl -F/,/ -ane'$s{$F[2]}=$_; END{print sort values %s}' ). "$outfile > $outfilesorted";
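
An alternative to shelling out is to do the same work inside the script. The following is only a sketch under the same assumption as the one-liner (input already in time order); dedupe_file is a hypothetical helper name, not something from the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: read $in_path, keep the last line seen for each
# number1 (third field), and write the kept lines, sorted, to $out_path.
sub dedupe_file {
    my ($in_path, $out_path) = @_;

    open my $in, '<', $in_path or die "could not open $in_path: $!";
    my %last;    # number1 => last line seen for that number
    while (my $line = <$in>) {
        my @fields = split /,/, $line;
        next unless defined $fields[2];    # skip blank or malformed lines
        $last{ $fields[2] } = $line;
    }
    close $in;

    open my $out, '>', $out_path or die "could not open $out_path: $!";
    print {$out} sort values %last;
    close $out or die "could not close $out_path: $!";

    return scalar keys %last;    # number of distinct number1 values written
}

# Possible use in place of the sort/cp/rm sequence ($outfile and $outputname
# are the variables from the script quoted earlier in the thread):
#   dedupe_file($outfile, $outputname);
#   unlink $outfile or warn "could not remove $outfile: $!";
```

Writing straight to the final output name avoids the intermediate sorted file and the external cp and rm calls.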


 

Author Comment

by:Johannne1
ID: 33459433
Hi Jeromee,
You are fast! I didn't see the above, so I will try it. I was trying Wilcoxon's but I will try this out soon.
 

Author Comment

by:Johannne1
ID: 33459583
Hi Wilcoxon,

I got this to work. I am just taking a long time because I am trying to understand how the key
value pair works in your solution. First, I have read that Perl does not maintain the order
of elements in a hash. Looking at your my %data line: you declare a map, remove the carriage
return, and split each line on commas. I think $arr[2] is number1.
I am not sure how foreach (keys %data) manages to output exactly what I want, but it works.
I understand hashmaps and other map data structures in Java; can you explain how
this foreach works with %data?
I added an output file, and sometimes the 20 in the 2010 is chopped... not sure why. The output
is correct:


$ more perl_sort.pl
#!/usr/local/bin/perl
use strict;
use warnings;

my $inputFile;
my $infile = "inputXXX.txt";
my $outputFile;
my $outputname = "outputXXX.txt";


# read the $infile and only keep the latest time row when duplicate numbers occur
# assumes input is in time order

open IN, $infile or die "could not open $infile: $!";
my %data = map { chomp; my @arr = split /,/; $arr[2] => $_ } <IN>;
close IN;

open ($outputFile, ">$outputname") || die "Can not open output file";
# output the file to STDOUT
# this will be unsorted but unique
#print $data{$_}, "\n" foreach (keys %data);

foreach (keys %data) {
    print $data{$_}, "\n";
    print $outputFile "  $data{$_}, \n";
}

Here is the output file; it has a comma in front of the 2010:

$ more outputXXX.txt
, 2010-08-13T14:32:14,npbroadcast,0470544430,0470544430,BEMO,MOBM,C4900
, 2010-08-13T11:42:14,npdisconnect,0494047810,0494047810,,BEMO,C4700

is there a way to get:

2010-08-13T14:32:14,npbroadcast,0470544430,0470544430,BEMO,MOBM,C4900
2010-08-13T11:42:14,npdisconnect,0494047810,0494047810,,BEMO,C4700

Or is this complicated? I can live with the space if it is complicated, or I can split them out in another script.







 
LVL 26

Accepted Solution

by:
wilcoxon earned 500 total points
ID: 33459684
Sure.  I'll go over what the important lines are doing...

my %data = map { chomp; my @arr = split /,/; $arr[2] => $_ } <IN>;

is a short way of doing:

# loop over each line in the input
while (<IN>) {
    # remove the newline
    chomp;
    # split the line on comma and assign each piece to @arr
    my @arr = split /,/;
    # assign the full line to the hash with key of number1 (overwriting previous data)
    $data{$arr[2]} = $_; # $arr[2] = number1
}

foreach (keys %data) is technically unordered: Perl makes no guarantee about the order in which keys are returned, so it may happen to give you the ordering you want, but you should not rely on it.

You can remove the comma and space by changing:
print $outputFile " $data{$_}, \n";
to
# {} added around $outputFile to make it clearer that it is an output file/stream
print {$outputFile} $data{$_}, "\n";

A quick hash primer...  In perl, a hash is effectively a list of the form (key1, val1, key2, val2, ..., keyX, valX).  The function "keys" effectively returns the items at even positions, counting from 0 (and "values" returns the items at odd positions).  This implementation is why "%hash = map { $key => $val } @list" works (map is technically a list/array function).  So, "foreach (keys %data)" will loop over the keys of the %data hash one at a time.  Writing this made me realize that it could have been written more succinctly as "print $_, "\n" foreach (values %data)" and achieved the same thing (though possibly in a different order).
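
A small sketch of the points in the primer: a hash built from a flat key/value list, a repeated key being overwritten by its later value, and sorted keys for repeatable output. The placeholder values ('first line' and so on) are made up for illustration.

```perl
use strict;
use warnings;

# A hash can be built from a flat (key, value, key, value, ...) list.
# When a key repeats, the later value replaces the earlier one.
my %data = (
    '0496395646' => 'first line',
    '0477203120' => 'only line',
    '0496395646' => 'second line',    # overwrites 'first line'
);

# keys (and values) come back in no guaranteed order,
# so sort the keys when a repeatable order matters
for my $number (sort keys %data) {
    print "$number => $data{$number}\n";
}
```

The overwrite-on-repeat behaviour is what makes the map-based version keep only the last line read for each number1.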

Let me know if you have any more questions...