Solved

unix sort -u -k PERL UNIX

Posted on 2010-08-17
Last Modified: 2013-12-28
Hi,
I have been using the unix sort -u command within a Perl script.
I have an input file that has the following format:


2010-08-13T01:04:30,npdisconnect,0477203120,0477203120,,MOBM,C4900


The fields are: timestamp, command, number1, number2, a field of 4 characters or fewer, a 4-character field, and a 5-character field.

What I am interested in is the timestamp and number1.

Sometimes several lines have the same number1:


2010-08-13T11:30:54,npdisconnect,0496395646,0496395646,,BEMO,C4700
2010-08-13T11:36:01,npdisconnect,0496395646,0496395646,,BEMO,C4700

In this case, I would like to keep the latest timestamp only, so keep

2010-08-13T11:36:01,npdisconnect,0496395646,0496395646,,BEMO,C4700

I tried various versions of sort -u -k.
The -k option works in terms of a first field, second field, and so on, but since my
fields are separated by commas I am not sure whether I can use -k.

In any case, is there a way to use sort -u -k (or a Perl trick) so that, for lines with a
duplicate number1, only the line with the latest timestamp is kept?

example of input file:

2010-08-13T01:04:30,npdisconnect,0477203120,0477203120,,MOBM,C4900
2010-08-13T11:30:54,npdisconnect,0496395646,0496395646,,BEMO,C4700
2010-08-13T11:36:01,npdisconnect,0496395646,0496395646,,BEMO,C4700


desired output file:

2010-08-13T01:04:30,npdisconnect,0477203120,0477203120,,MOBM,C4900
2010-08-13T11:36:01,npdisconnect,0496395646,0496395646,,BEMO,C4700

Thanks.
Question by:Johannne1
10 Comments
 
LVL 10

Expert Comment

by:jeromee
ID: 33458512
This should work for you:
perl -F/,/ -ane'print unless $s{$F[2]}; $s{$F[2]}++' /your/file/path

Good luck!
 

Author Comment

by:Johannne1
ID: 33458554
Hi Jeromee,
Hang on, I shall try this, and let you know in about 10-15 minutes.
 

Author Comment

by:Johannne1
ID: 33458599
Hi Jeromee,
Can you please explain this? Is it a special way of running Perl? If so, I can't
use it. Right now my Perl script has:

system "/usr/bin/sort -u $outfile > $outfilesorted";
system "/usr/bin/cp " . $outfilesorted." ". $outputname;
system "/usr/bin/rm " . $outfilesorted;
system "/usr/bin/rm " . $outfile;


I can replace the sort -u with some kind of sort -u -k.
Can you incorporate your solution, which looks like a regular expression substitution, into
the sort -u line above?
 
LVL 26

Expert Comment

by:wilcoxon
ID: 33458818
Jeromee's solution would be in place of your perl script and would give you the earliest time rather than the latest.

You can do what you want within a perl script like this.  Let me know if there are any problems (input not sorted by time, output must be sorted by time, etc).
#!/usr/local/bin/perl
use strict;
use warnings;

my $infile = shift; # pass input file name on command line
#my $infile = 'somefile'; # alternately hard-code it

# read the $infile and only keep the latest time row
# assumes input is in time order
open(IN, '<', $infile) or die "could not open $infile: $!";
my %data = map { chomp; my @arr = split /,/; $arr[2] => $_ } <IN>;
close IN;

# output the file to STDOUT
# this will be unsorted but unique
print $data{$_}, "\n" foreach (keys %data);
# this would sort by number1 field
# print $data{$_}, "\n" foreach (sort keys %data);
# this would sort by timestamp, since every line starts with a sortable timestamp
# print $_, "\n" foreach (sort values %data);
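
If the input cannot be assumed to be in time order, or the output should come out ordered by time, one option is to compare timestamps explicitly. A minimal sketch, assuming the timestamps are fixed-width ISO 8601 strings as in the sample data (so plain string comparison matches time order). The name latest_per_number is made up for illustration and is not part of the script above.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return one line per number1,
# keeping the line with the latest timestamp, whatever the input order.
sub latest_per_number {
    my @lines = @_;
    my %latest;    # number1 => [timestamp, full line]
    for my $line (@lines) {
        $line =~ s/\r?\n\z//;               # strip the line ending
        my ($ts, undef, $num) = split /,/, $line;
        next unless defined $num;           # skip blank or short lines
        # fixed-width ISO timestamps compare correctly as strings
        if (!exists $latest{$num} || $ts gt $latest{$num}[0]) {
            $latest{$num} = [$ts, $line];
        }
    }
    # return the surviving lines ordered by timestamp
    return map { $_->[1] } sort { $a->[0] cmp $b->[0] } values %latest;
}

# sample rows from the question, deliberately out of time order
my @input = (
    "2010-08-13T11:36:01,npdisconnect,0496395646,0496395646,,BEMO,C4700\n",
    "2010-08-13T01:04:30,npdisconnect,0477203120,0477203120,,MOBM,C4900\n",
    "2010-08-13T11:30:54,npdisconnect,0496395646,0496395646,,BEMO,C4700\n",
);
print "$_\n" for latest_per_number(@input);
```

With the three sample rows this prints the 01:04:30 line followed by the 11:36:01 line.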

 

Author Comment

by:Johannne1
ID: 33458982
Hi Wilcoxon,
I will try this out. it will take me about 30 minutes.
Johanne
0
 
LVL 10

Expert Comment

by:jeromee
ID: 33459090
If you want the latest timestamp, this should work:
    perl -F/,/ -ane'$s{$F[2]}=$_; END{print sort values %s}' /your/file/path
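
For readers unfamiliar with the switches: -n wraps the code in a loop over the input lines, -a splits each line into the array @F, and -F/,/ makes that split happen on commas. A rough long-hand equivalent is sketched below. The sub name latest_lines and the inline sample data are assumptions for illustration; the one-liner itself reads the file named on the command line.

```perl
use strict;
use warnings;

# Long-hand sketch of:
#   perl -F/,/ -ane'$s{$F[2]}=$_; END{print sort values %s}'
# latest_lines() is a hypothetical name used only for this illustration.
sub latest_lines {
    my %s;
    for (@_) {                 # -n: loop over each input line
        my @F = split /,/;     # -a with -F/,/: split the line on commas
        $s{$F[2]} = $_;        # a later line with the same number1 overwrites
    }
    return sort values %s;     # END block: sort the kept lines as strings
}

print latest_lines(
    "2010-08-13T01:04:30,npdisconnect,0477203120,0477203120,,MOBM,C4900\n",
    "2010-08-13T11:30:54,npdisconnect,0496395646,0496395646,,BEMO,C4700\n",
    "2010-08-13T11:36:01,npdisconnect,0496395646,0496395646,,BEMO,C4700\n",
);
```

Because the last line seen for each number1 wins, this keeps the latest timestamp only when the input is already in time order. Sorting the whole lines as strings orders the output by timestamp, since each line begins with one.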

 
LVL 10

Expert Comment

by:jeromee
ID: 33459163
If you want to replace this line in your script:
    system "/usr/bin/sort -u $outfile > $outfilesorted"
try this:
    system q(perl -F/,/ -ane'$s{$F[2]}=$_; END{print sort values %s}' ). "$outfile > $outfilesorted";
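
On the original question about -k with comma-separated fields: sort accepts -t to set the field separator, so -t, -k3,3 addresses number1. Which of several equal-key lines -u keeps is not something to rely on, so the sketch below only lets sort order the lines by number1 and then by timestamp, and has Perl keep the last line of each group. The sub name latest_via_sort and the temporary file are made up for illustration, and the sketch assumes a sort program is on the PATH.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: order lines by number1 (field 3) and then timestamp
# (field 1) with the system sort, then keep the last line per number1.
sub latest_via_sort {
    my ($file) = @_;
    # -t , sets the field separator; -k3,3 and -k1,1 each select one field
    open(my $sorted, '-|', 'sort', '-t', ',', '-k3,3', '-k1,1', $file)
        or die "could not run sort: $!";
    my (@keep, $prev);
    while (my $line = <$sorted>) {
        my $num = (split /,/, $line)[2];
        if (defined $prev && $num eq $prev) {
            $keep[-1] = $line;      # same number1: the later timestamp wins
        } else {
            push @keep, $line;      # first line seen for this number1
        }
        $prev = $num;
    }
    close $sorted or die "sort failed";
    return @keep;
}

# demo with the sample rows from the question, written to a temporary file
my ($fh, $tmp) = tempfile(UNLINK => 1);
print {$fh} <<'END';
2010-08-13T11:36:01,npdisconnect,0496395646,0496395646,,BEMO,C4700
2010-08-13T01:04:30,npdisconnect,0477203120,0477203120,,MOBM,C4900
2010-08-13T11:30:54,npdisconnect,0496395646,0496395646,,BEMO,C4700
END
close $fh;
print latest_via_sort($tmp);
```

The output here is grouped by number1 rather than ordered by time, so a final sort of the returned lines would be needed if time order matters.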


 

Author Comment

by:Johannne1
ID: 33459433
Hi Jeromee,
You are fast! I didn't see the above, so I will try it. I was trying Wilcoxon's, but I will try this out soon.
 

Author Comment

by:Johannne1
ID: 33459583
Hi Wilcoxon,

I got this to work. I am just taking a long time because I am trying to understand how the key/value
pair works in your solution. First, I have read that Perl does not maintain the order
of elements in a hash. Looking at your my %data line: you use a map, remove the trailing
newline, and split each line on commas. I think $arr[2] is number1.
I am not sure how foreach (keys %data) manages to output exactly what I want, but it works.
I understand hash maps and other map data structures in Java. Can you explain how
this foreach works with %data?
I added an output file, and sometimes the 20 in 2010 is chopped... not sure why. The output
is correct:


$ more perl_sort.pl
#!/usr/local/bin/perl
use strict;
use warnings;

my $inputFile;
my $infile = "inputXXX.txt";
my $outputFile;
my $outputname = "outputXXX.txt";


# read the $infile and only keep the latest time row when duplicate numbers occur
# assumes input is in time order

open IN, $infile or die "could not open $infile: $!";
my %data = map { chomp; my @arr = split /,/; $arr[2] => $_ } <IN>;
close IN;

open ($outputFile, ">$outputname") || die "Can not open output file";
# output the file to STDOUT
# this will be unsorted but unique
#print $data{$_}, "\n" foreach (keys %data);

foreach (keys %data) {
    print $data{$_}, "\n";
    print $outputFile "  $data{$_}, \n";
}

Here is the output file; it has a comma in front of the 2010:

$ more outputXXX.txt
, 2010-08-13T14:32:14,npbroadcast,0470544430,0470544430,BEMO,MOBM,C4900
, 2010-08-13T11:42:14,npdisconnect,0494047810,0494047810,,BEMO,C4700

Is there a way to get:

2010-08-13T14:32:14,npbroadcast,0470544430,0470544430,BEMO,MOBM,C4900
2010-08-13T11:42:14,npdisconnect,0494047810,0494047810,,BEMO,C4700

Or is this complicated? If it is, I can live with the space, or strip it out in another script.
 
LVL 26

Accepted Solution

by:
wilcoxon earned 500 total points
ID: 33459684
Sure.  I'll go over what the important lines are doing...

my %data = map { chomp; my @arr = split /,/; $arr[2] => $_ } <IN>;

is a short way of doing:

# loop over each line in the input
while (<IN>) {
    # remove the newline
    chomp;
    # split the line on comma and assign each piece to @arr
    my @arr = split /,/;
    # assign the full line to the hash with key of number1 (overwriting previous data)
    $data{$arr[2]} = $_; # $arr[2] = number1
}

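foreach (keys %data) is technically unordered: Perl makes no guarantee about the order in which keys returns the keys, so it may happen to give you the ordering you want, but you should not rely on it. Use sort keys %data if the order matters.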

You can remove the comma and space by changing:
print $outputFile "  $data{$_}, \n";
to
# {} added around $outputFile to make it clearer that it is an output file/stream
print {$outputFile} $data{$_}, "\n";
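
The print to {$outputFile} can be tried end to end with a sketch like the following. It writes to an in-memory buffer instead of outputXXX.txt so that it runs as-is, and strip_eol is a made-up helper. One assumption that the thread does not confirm: a ", " appearing in front of (or on top of) the start of each displayed line is what a terminal shows when the stored line still ends in a carriage return, which happens with CRLF input because chomp removes only the newline. The helper strips both characters.

```perl
use strict;
use warnings;

# Hypothetical helper: strip a trailing LF or CRLF (chomp alone leaves the CR).
sub strip_eol {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;
    return $line;
}

# one sample row with a Windows-style line ending (an assumption, see above)
my %data;
for my $raw ("2010-08-13T11:42:14,npdisconnect,0494047810,0494047810,,BEMO,C4700\r\n") {
    my $line = strip_eol($raw);
    $data{ (split /,/, $line)[2] } = $line;    # key on number1
}

# write to a string here so the sketch runs without touching the disk
my $buffer = '';
open(my $outputFile, '>', \$buffer) or die "Can not open output: $!";
print {$outputFile} $data{$_}, "\n" foreach sort keys %data;
close $outputFile;

print $buffer;
```

To write a real file, the open line would take the file name in place of \$buffer, for example open(my $outputFile, '>', $outputname).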

A quick hash primer...  In Perl, a hash is effectively a list of the form (key1, val1, key2, val2, ..., keyX, valX).  Counting from 0, the function "keys" effectively returns the even-numbered items (and "values" returns the odd-numbered items).  This is why "%hash = map { $key => $val } @list" works (map is technically a list/array function).  So, "foreach (keys %data)" will loop over the keys of the %data hash one at a time.  Writing this made me realize that the output loop could have been written more succinctly as "print $_, "\n" foreach (values %data)" and achieved the same thing (though possibly in a different order).
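
The list view of a hash described above can be seen in a few lines. This is a small sketch with made-up sample data (the fruit names are not from the thread):

```perl
use strict;
use warnings;

# A hash built from a flat (key, value, key, value, ...) list
my @pairs = (apple => 1, pear => 2, apple => 3);    # "apple" appears twice
my %h = @pairs;

# The later value for a repeated key replaces the earlier one,
# which is why the last line read for each number1 is the one kept
print "apple => $h{apple}\n";    # prints: apple => 3

# keys come back in no guaranteed order; sort them for repeatable output
print "$_ => $h{$_}\n" foreach sort keys %h;

# map returns a flat list too, so its result can be assigned to a hash
my %len = map { $_ => length($_) } qw(npdisconnect npbroadcast);
print "$_ has $len{$_} characters\n" foreach sort keys %len;
```

Coming from Java, the closest comparison is probably a HashMap: put with an existing key replaces the value, and iteration order is not something to depend on.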

Let me know if you have any more questions...