Solved

Updating records

Posted on 2003-10-28
Last Modified: 2010-03-05
Input:

 key1       record1
 key1       record2
 key2       record1
 key2       record3

Output:

 key2 record1
 key2 record2
 key2 record3


I am trying to update records (removing duplicates) using Perl. If a value in the second column is repeated under a new key, that record takes the new key, and every other record under the old key is updated to the new key as well. In this example, record1 appears under both key1 and key2, so it takes key2, and record2 (the other record under key1) is updated to key2 too. Could anyone help me write the script in Perl?

I tried like this:

while (<FILE>) {
  my ($key, @data) = split;
  foreach $val (@data) {
    $hash{$key}{$val} = 1;
  }
}
Question by:johnperl
LVL 8

Expert Comment

by:inq123
ID: 9632816
Hi johnperl,

You only said that the second column has repetitions, which suggests your data has a single value column, but your code looks as if it handles multi-column data. If it is multi-column, you need to explain more about how you choose the new key and how you decide what counts as a duplicate. Assuming your code wasn't specific to this question, here is code that works for your example and should also give you an idea for the multi-column case:

my (%data, %newdata);
while(<FILE>)
{
  chomp;
  my ($key, $val) = split /\s+/;
  $data{$val} = $key; # this eliminates duplicate values; the last key seen for a value wins, as in your example
}
foreach my $val (keys %data)
{
#  $newdata{$data{$val}} = $val; # now let's get the data back into correct order, this isn't necessary unless you need data for further manipulation
  print "$data{$val} $val\n"; # this is the output you want
}

Cheers!
 

Author Comment

by:johnperl
ID: 9633902

Hi inq123,

Thanks for your quick reply. But I get the following error from the script above:

"Missing $ on loop variable" on the line: foreach my $val (keys %data)

Could you help me resolve this error?

Thanks in advance.

Johnperl
 
LVL 8

Expert Comment

by:inq123
ID: 9634026
Must be a typo on your part.  Please copy and paste the data below:

key1      record1
key1      record2
key2     record1
key2     record3

Save the above as "test.txt".  Then save the code below as "test.pl" and run it with "perl test.pl":

my (%data, %newdata);
open(FILE, "test.txt") or die "Cannot open test.txt: $!";
while(<FILE>)
{
  chomp;
  my ($key, $val) = split /\s+/;
  $data{$val} = $key; # this eliminates duplicate values; the last key seen for a value wins, as in your example
}
foreach my $val (keys %data)
{
#  $newdata{$data{$val}} = $val; # now let's get the data back into correct order, this isn't necessary unless you need data for further manipulation
  print "$data{$val} $val\n"; # this is the output you want
}

Now as I said, launch it by perl test.pl, and the output is:

key2 record1
key1 record2
key2 record3
 

Author Comment

by:johnperl
ID: 9634132
Hi inq123,

Thanks for your reply. Yes, I got the output you mentioned above. But my output should be:

Output:

 key2 record1
 key2 record2 (instead of key1 record2)
 key2 record3

It has to update record2 from key1 to key2 as well. Since record1 is updated from key1 to key2, every record under key1 has to be updated to key2 automatically. I know it's quite complicated. Can you help me?

Thanks in advance,
Johnperl.

I changed your script as below:

my (%data, %newdata);
while(<FILE>)
{
  chomp;
 my ($key, $val) = split /\s+/;
  $data{$val} = $key;
}
foreach $val(%data) {
$newdata{$data{$val}} = $val;
print "$data{$val} $val\n";
}
 
LVL 8

Expert Comment

by:inq123
ID: 9634249
I see.  I didn't understand this part of your requirement earlier.  It is indeed complicated and error-prone.  For example, if your file has:

key1      record1
key1      record2
key2     record1
key3     record2

Then should I update key1 to key2 or key3?  If I choose key2, then should I update key3 with key2 too?

For what you want to do I believe you should consider using a relational database instead, as the update by key would be definitely easier and it's also easier to check against potential problems like I described above.
 

Author Comment

by:johnperl
ID: 9634654
Hi inq123,

For input: ( you have mentioned above)
key1      record1
key1      record2
key2     record1
key3     record2

Output:
key3 record1
key3 record2
--------
Another sample input:

key1     record1
key1     record2
key2     record1
key2     record3
key3     record2

Output:

key3     record1
key3     record2
key3     record3

Both examples follow the same logic. In the second example, record1 and record3 belong to key2. During processing, record1 is repeated under key2, so at that stage the output is:
key2   record1
key2   record2
key2   record3

Moving to the next step (key3), it contains record2. Since record2 already appears in the key2 output above, all records under key2 are updated to key3.

So, the final output is:
key3     record1
key3     record2
key3     record3

Can you help me solve this problem?
thanks in advance,
Johnperl

 
LVL 8

Expert Comment

by:inq123
ID: 9634878
It's no small job, although it's not too complicated. I'll deal with it tonight when I have more time.
 

Author Comment

by:johnperl
ID: 9634911

Hi inq123,

Thanks for your effort.

Johnperl
 
LVL 8

Assisted Solution

by:inq123
inq123 earned 30 total points
ID: 9638406
After some thought, I think this issue is still more complicated than you described.  You might not have thought it through thoroughly.  Are you sure this is what you want to do, and may I ask why you want to process the data this way?  There has to be a better way of doing what you want to do.  The reason I say it's complicated is a situation like the one below:

key1     record1
key3     record1
key2     record1
key3     record2

So first of all, as you described, the processing has to be retrospective in each step, meaning that for each record we process, we have to use the new rule to reprocess everything we processed before (an N^2 algorithm, rather inefficient, and unbearable if your file is very long).  In addition, first key1 and key3 would be replaced by key2, and then later key3 would stay for record2 only?  How about the situation below:

key1     record1
key3     record1
key2     record1
key2     record2
key3     record2

So key1 and key3 get replaced by key2, then key3 replaces everything back again.  It's just awkward how things play out.  I seriously doubt that you want an algorithm like this, and I'm afraid our time could be better spent on improving the algorithm or approach itself, instead of implementing it as is.
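
One alternative to reprocessing everything at each step is to treat this as a grouping problem. The following is only a sketch under assumptions that have not been confirmed in this thread: keys that share any record end up in one group, and each group takes the key that sorts last (as a string) among its members. It uses a disjoint-set structure rather than repeated passes, and the names find_root, merge_keys and relabel are made up for illustration.

```perl
use strict;
use warnings;

my %parent;    # disjoint-set forest over keys: key => parent key

# Follow parent links to the representative of a key's group,
# flattening the path on the way back.
sub find_root {
    my ($key) = @_;
    $parent{$key} = $key unless exists $parent{$key};
    my $root = $key;
    $root = $parent{$root} while $parent{$root} ne $root;
    while ($parent{$key} ne $root) {
        my $next = $parent{$key};
        $parent{$key} = $root;
        $key = $next;
    }
    return $root;
}

# Join two groups; the key that sorts last becomes the representative.
# Assumption: "last key wins", with keys compared as plain strings.
sub merge_keys {
    my ($key_a, $key_b) = @_;
    my ($root_a, $root_b) = (find_root($key_a), find_root($key_b));
    return if $root_a eq $root_b;
    my ($low, $high) = $root_a lt $root_b ? ($root_a, $root_b)
                                          : ($root_b, $root_a);
    $parent{$low} = $high;
}

# Takes [key, record] pairs and returns a hash: record => final key.
sub relabel {
    my (@pairs) = @_;
    %parent = ();
    my %first_key;    # record => first key it appeared under
    for my $pair (@pairs) {
        my ($key, $record) = @$pair;
        find_root($key);    # register the key
        if (exists $first_key{$record}) {
            merge_keys($first_key{$record}, $key);
        } else {
            $first_key{$record} = $key;
        }
    }
    my %final;
    $final{$_} = find_root($first_key{$_}) for keys %first_key;
    return %final;
}

# The four-line example from the original question.
my %result = relabel(
    [key1 => 'record1'], [key1 => 'record2'],
    [key2 => 'record1'], [key2 => 'record3'],
);
print "$result{$_} $_\n" for sort keys %result;
```

For the four-line example from the question this prints key2 for record1, record2 and record3. Because keys are compared as strings, key10 would sort before key2; numeric key suffixes would need a different comparison.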
 

Author Comment

by:johnperl
ID: 9639153
HI inq123,

Thanks. It's a nice discussion with you.
Actually, column 1 (the keys) is sorted in ascending order.

Example:
As you gave it above:            Sorted by the first column:
key1     record1                      key1     record1
key3     record1                      key2     record1
key2     record1                      key2     record2
key2     record2                      key3     record1
key3     record2                      key3     record2

I have already written a script that sorts the data by the first column (key1, key2, key3, ...), so the input won't appear in the order you mentioned above.
Here is the stepwise procedure for a sample input:

INPUT:

key1   record1
key1   record2
key2   record3
key2   record6
key3   record5
key3   record3
key3   record4
key4   record1
key4   record7

Stepwise procedure Output:
step 1:

1) key1  record1 record2
2) key2  record3 record6
3) key3  record5 record3 record4
4) key4  record1 record7

step2:

1) key1  record1 record2
2) key3  record5 record3 record4 record6        (since record3 is in both key2 and key3)
3) key4  record1 record7

step3:

1) key3  record5  record3  record4  record6
2) key4  record1  record7  record2            (since record1 is in both key1 and key4, take key4)

So final output is:

key3  record5
key3  record3
key3  record4
key3  record6
key4  record1
key4  record7
key4  record2
------------------------------
I tried to get this output, but I wasn't able to. Could you help me sort it out?
If you have any questions, kindly let me know.

Thanks for your effort and patience,
Johnperl.
 
LVL 20

Accepted Solution

by:
jmcg earned 65 total points
ID: 9639703
This is certainly a headscratcher.

Here's my first try:

use strict;
# use Data::Dumper;

my (%data, @order, %supercede, %keylist);

# input and discovery pass
while(<>)
{
  my ($Left, $Right) = split /\s+/;
  if( exists $data{$Right} ) {
      $supercede{$data{$Right}} = $Left;
    } else {
      push @order, $Right;
    }
  $data{$Right} = $Left;
}

# relabel phase
foreach my $Right (@order) {
   my $Left = $data{$Right};
   $Left = $supercede{$Left} if exists $supercede{$Left};
   push @{$keylist{$Left}}, $Right;
  }

# print Dumper( \%data, \@order, \%supercede, \%keylist);

# output phase
foreach my $Left ( sort keys %keylist ) {
    print $Left, " ", $_ , "\n" for @{$keylist{$Left}};
  }

======

When applied to your last set of sample data, it gives:

key3 record3
key3 record6
key3 record5
key3 record4
key4 record1
key4 record2
key4 record7

========

which is the correct final association of left-hand sides with right-hand sides, but not the same order you got from working out the example by hand.
 
LVL 20

Expert Comment

by:jmcg
ID: 9639867
For my second attempt, I have:

use strict;
# use Data::Dumper;

my (%data, %supercede, %keeplist);

# input and discovery phase
while(<>)
{
  my ($Left, $Right) = split /\s+/;
  if( exists $data{$Right} ) {
      $supercede{$data{$Right}} = $Left;
    }
  $data{$Right} = $Left;
  push @{$keeplist{$Left}}, $Right;
}

# relabel phase
foreach my $Left (sort keys %supercede) {
   my @movelist = @{$keeplist{$Left}};
   delete $keeplist{$Left};
   $Left = $supercede{$Left} while exists $supercede{$Left};
   foreach my $Right( @movelist) {
      if( $Left ne $data{$Right}) {
          push @{$keeplist{$Left}}, $Right;
          $data{$Right} = $Left;
        }
    }
  }

# print Dumper( \%data, \%supercede, \%keeplist);

# output phase
foreach my $Left ( sort keys %keeplist ) {
    print $Left, " ", $_ , "\n" for @{$keeplist{$Left}};
  }

============

which produces, for your sample data, the same output as you had from working out the example by hand:

key3 record5
key3 record3
key3 record4
key3 record6
key4 record1
key4 record7
key4 record2

=============

I think this version properly handles cases where longer chains of superseded left-hand sides occur. The %data hash at the end contains the final association between left-hand sides and right-hand sides.

Notes: If you are splitting on whitespace, you can skip doing a chomp.
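
The chomp point is easy to check, and the check also shows a difference that matters for input with leading spaces, as in the original question. This is a small sketch, separate from the scripts above; the sample line is made up.

```perl
use strict;
use warnings;

my $line = "  key1   record1\n";    # made-up line with leading spaces

# A pattern split keeps an empty leading field when the line
# starts with whitespace; trailing empty fields are dropped,
# so the newline never needs a chomp.
my @by_pattern = split /\s+/, $line;

# The special one-space string form skips leading whitespace too.
my @by_space = split ' ', $line;

print scalar(@by_pattern), " fields: [", join('|', @by_pattern), "]\n";
print scalar(@by_space),   " fields: [", join('|', @by_space),   "]\n";
```

With the pattern form, the first field is empty and the key lands in the second position, so `my ($Left, $Right) = split /\s+/;` would pick up the wrong columns for indented lines. `split ' '` avoids that.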

I had to force myself to ignore your terminology for "key" since it is the right-hand side that ends up being unique. Whenever you see a relation where one side takes on unique values, you may want to think about a hash. But for preserving order, you need arrays — so this problem ended up with a hash of arrays as its central data structure.
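
As a minimal illustration of that data structure (a sketch with made-up sample pairs, separate from the scripts above): a hash maps each left-hand side to an array reference, and push keeps the right-hand sides in arrival order.

```perl
use strict;
use warnings;

my %keeplist;    # left-hand side => array ref of right-hand sides
my @pairs = (
    [key3 => 'record5'], [key3 => 'record3'],
    [key4 => 'record1'], [key3 => 'record4'],
);

for my $pair (@pairs) {
    my ($left, $right) = @$pair;
    # Autovivification creates the inner array on first use.
    push @{ $keeplist{$left} }, $right;
}

# Hash keys come back in no particular order, so sort them;
# each inner array keeps insertion order.
for my $left (sort keys %keeplist) {
    print "$left: @{ $keeplist{$left} }\n";
}
```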


 

Author Comment

by:johnperl
ID: 9639903
Hi Jmcg,

Thanks for your efforts.
I ran the script, but I am getting this error:

Missing $ on loop variable at duplicate1.pl line 19. (duplicate1.pl is the file name)

Below is the script you sent me:

use strict;
# use Data::Dumper;

my (%data, %supercede, %keeplist);

# input and discovery phase
open(FILE, "input.txt");
while(<FILE>)
{
  my ($Left, $Right) = split /\s+/;
  if( exists $data{$Right} ) {
      $supercede{$data{$Right}} = $Left;
    }
  $data{$Right} = $Left;
  push @{$keeplist{$Left}}, $Right;
}

# relabel phase
foreach my $Left (sort keys %supercede) {
   my @movelist = @{$keeplist{$Left}};
   delete $keeplist{$Left};
   $Left = $supercede{$Left} while exists $supercede{$Left};
   foreach my $Right( @movelist) {
      if( $Left ne $data{$Right}) {
          push @{$keeplist{$Left}}, $Right;
          $data{$Right} = $Left;
        }
    }
  }

# print Dumper( \%data, \%supercede, \%keeplist);

# output phase
foreach my $Left ( sort keys %keeplist ) {
    print $Left, " ", $_ , "\n" for @{$keeplist{$Left}};
  }

Since I am still learning, I can't work out what the error is. Is it because of the "my" variable on line 19? Do I have to delete "my", or add something else before running the script? Could you help me resolve this problem?

Thanks in advance.
Johnperl.
LVL 5

Assisted Solution

by:fantasy1001
fantasy1001 earned 30 total points
ID: 9639975
Make sure you're using a recent Perl, and include the line
#!/usr/bin/perl
or
#!/usr/local/bin/perl

You need a Perl version newer than 5.003 for this. Good job from jmcg. Cheers.
 
LVL 5

Expert Comment

by:fantasy1001
ID: 9640002
To solve this, declare the variables outside the foreach loop:

my ($Left, $Right);

Cheers.
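
One detail worth checking when declaring several variables at once: my binds more tightly than the comma, so the parentheses matter. The sketch below compiles two small snippets with string eval to show the difference under use strict; the variable names are made up.

```perl
use strict;

# Without parentheses only $left is declared; $right is then an
# undeclared global, which "use strict" rejects at compile time.
my $without = q{ use strict; my $left, $right; $right = 1; 1 };

# With parentheses both variables are declared as lexicals.
my $with = q{ use strict; my ($left, $right); $right = 1; 1 };

my $ok_without = eval $without;
my $err        = $@;
my $ok_with    = eval $with;

print "without parentheses: ", ($ok_without ? "compiles\n" : "fails: $err");
print "with parentheses: ",    ($ok_with    ? "compiles"   : "fails"), "\n";
```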
 

Author Comment

by:johnperl
ID: 9640726

Hi fantasy1001 & jmcg,

I added my $Left, $Right; outside the foreach loop, but I still get the same error. Could you run the script on your side and sort out the problem?
Below is the script:

#!/usr/bin/perl
use strict;
# use Data::Dumper;

my (%data, %supercede, %keeplist);

# input and discovery phase
open(FILE, "input.txt");
while(<FILE>)
{
  my ($Left, $Right) = split /\s+/;
  if( exists $data{$Right} ) {
      $supercede{$data{$Right}} = $Left;
    }
  $data{$Right} = $Left;
  push @{$keeplist{$Left}}, $Right;
}

# relabel phase
my $Left, $Right;
foreach my $Left (sort keys %supercede) {
   my @movelist = @{$keeplist{$Left}};
   delete $keeplist{$Left};
   $Left = $supercede{$Left} while exists $supercede{$Left};

my $Left, $Right;
   foreach my $Right( @movelist) {
      if( $Left ne $data{$Right}) {
          push @{$keeplist{$Left}}, $Right;
          $data{$Right} = $Left;
        }
    }
  }

# print Dumper( \%data, \%supercede, \%keeplist);

# output phase
my $Left, $Right;
foreach my $Left ( sort keys %keeplist ) {
    print $Left, " ", $_ , "\n" for @{$keeplist{$Left}};
  }

Thanks
Johnperl
 
LVL 5

Expert Comment

by:fantasy1001
ID: 9640742
No no, don't declare $Left and $Right in the foreach loop. Change

foreach my $Left (sort keys %supercede) {

to

foreach $Left(sort keys %supercede) {

also
foreach my $Right( @movelist) {
to
foreach $Right( @movelist) {

Thanks & Cheers

 

Author Comment

by:johnperl
ID: 9640833
Hi fantasy1001,

Thanks for your help. I made the changes you suggested, but I got the same error. Could you correct the script below and run it?

#!/usr/bin/perl
use strict;
# use Data::Dumper;

my (%data, %supercede, %keeplist);

# input and discovery phase
open(FILE, "input.txt");
while(<FILE>)
{
  my ($Left, $Right) = split /\s+/;
  if( exists $data{$Right} ) {
      $supercede{$data{$Right}} = $Left;
    }
  $data{$Right} = $Left;
  push @{$keeplist{$Left}}, $Right;
}

# relabel phase
my $Left, $Right;
foreach $Left (sort keys %supercede) {
   my @movelist = @{$keeplist{$Left}};
   delete $keeplist{$Left};
   $Left = $supercede{$Left} while exists $supercede{$Left};

my $Left, $Right;
   foreach $Right( @movelist) {
      if( $Left ne $data{$Right}) {
          push @{$keeplist{$Left}}, $Right;
          $data{$Right} = $Left;
        }
    }
  }

# print Dumper( \%data, \%supercede, \%keeplist);

# output phase
my $Left, $Right;
foreach $Left ( sort keys %keeplist ) {
    print $Left, " ", $_ , "\n" for @{$keeplist{$Left}};
  }
Thanks in advance,
Johnperl
 
LVL 5

Expert Comment

by:fantasy1001
ID: 9640919
Hi, what is your Perl version? It has to be newer than 5.005!

Thanks & Cheers
Fantasy
 
LVL 5

Expert Comment

by:fantasy1001
ID: 9640948
Try this also:
use strict;
# use Data::Dumper;

my (%data, %supercede, %keeplist);
my ($Left, $Right);   # declared here because "use strict" is on and the loops below assign to them

# input and discovery phase
open FILE, "input.txt" or die;
while(<FILE>)
{
  my ($Left, $Right) = split /\s+/;
  if( exists $data{$Right} ) {
      $supercede{$data{$Right}} = $Left;
    }
  $data{$Right} = $Left;
  push @{$keeplist{$Left}}, $Right;
}
close FILE;

# relabel phase
foreach (sort keys %supercede) {
   $Left = $_;
   my @movelist = @{$keeplist{$Left}};
   delete $keeplist{$Left};
   $Left = $supercede{$Left} while exists $supercede{$Left};
   foreach ( @movelist) {
      $Right = $_;
      if( $Left ne $data{$Right}) {
          push @{$keeplist{$Left}}, $Right;
          $data{$Right} = $Left;
        }
    }
  }

# print Dumper( \%data, \%supercede, \%keeplist);

# output phase
foreach ( sort keys %keeplist ) {
    $Left = $_;
    print $Left, " ", $_ , "\n" for @{$keeplist{$Left}};
  }


That is strange, because jmcg's code runs successfully on my Unix box as well as on Windows.
Cheers.
 
LVL 20

Expert Comment

by:jmcg
ID: 9641222
Don't you love the Internet? Even if you're asleep or at work, there are people somewhere working away at your problem.

Johnperl, I need to know what version of Perl you are working with. This code works on recent versions of Perl (ever since you could put "my" on a loop variable). If I'm to retrofit the code to an older Perl, I need to know what the target is. You can find out your version with the command

perl -v

Meanwhile, it would be a good idea for you to get an updated version of Perl. It's going to be a continual source of frustration if you cannot run examples with such commonly used constructs as I've used.
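
A script can also report or enforce the interpreter version itself. A minimal sketch, assuming 5.005 (the version number that came up earlier in this thread) is the minimum you want to enforce:

```perl
use strict;

# $] holds the running interpreter's version as a number,
# for example 5.008008 for Perl 5.8.8.
print "Running under Perl $]\n";

# Stop early with a readable message instead of a confusing
# syntax error further down the script.
if ($] < 5.005) {
    die "This script needs Perl 5.005 or later, found $]\n";
}
print "Version check passed\n";
```

Putting `require 5.005;` at the top of the script is a shorter way to get the same early failure.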
 
LVL 8

Expert Comment

by:inq123
ID: 9641568
Just got up and saw so many posts already!

It has nothing to do with perl version or things suggested.  If you guys check the posts above, OP got this error while running my script too.  All I needed to do is to ask OP to copy and paste exactly and then the script worked without any change at all.  It must have something to do with OP's editor having some weird non-print character or some typo.

jmcg, why don't you try to do what I did in a post above that got my script running for OP?  That'd probably work.
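
If stray characters are the suspicion, they can be made visible. The following sketch (the sub name find_odd_chars and the sample text are made up) reports every character other than tab and printable ASCII, line by line:

```perl
use strict;
use warnings;

# Returns a list of "line N: \xHH ..." strings, one per line of
# $text that contains a character other than tab or printable ASCII.
sub find_odd_chars {
    my ($text) = @_;
    my @report;
    my $line_no = 0;
    for my $line (split /\n/, $text, -1) {
        $line_no++;
        my @odd = $line =~ /([^\t\x20-\x7E])/g;
        next unless @odd;
        push @report, "line $line_no: "
            . join(' ', map { sprintf '\\x%02X', ord } @odd);
    }
    return @report;
}

# Made-up sample: a carriage return on line 1, and a no-break
# space (0xA0) in place of the space after "my" on line 2.
my $sample = "use strict;\r\nforeach my\xA0\$val (1) { }\n";
print "$_\n" for find_odd_chars($sample);
```

To check a real script, read its contents into a string and pass that to the sub. Carriage returns from DOS line endings will show up as \x0D on every line, which is usually harmless but worth knowing about.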
 
LVL 20

Expert Comment

by:jmcg
ID: 9642408
Well, here's a version with the troublesome my on loop variables omitted.

use strict;
# use Data::Dumper;

my (%data, %supercede, %keeplist);
my ($Left, $Right);

# input and discovery phase
while(<>)
{
  ($Left, $Right) = split /\s+/;
  if( exists $data{$Right} ) {
      $supercede{$data{$Right}} = $Left;
    }
  $data{$Right} = $Left;
  push @{$keeplist{$Left}}, $Right;
}

# relabel phase
foreach $Left (sort keys %supercede) {
   my @movelist = @{$keeplist{$Left}};
   delete $keeplist{$Left};
   $Left = $supercede{$Left} while exists $supercede{$Left};
   foreach $Right( @movelist) {
      if( $Left ne $data{$Right}) {
          push @{$keeplist{$Left}}, $Right;
          $data{$Right} = $Left;
        }
    }
  }

# print Dumper( \%data, \%supercede, \%keeplist);

# output phase
foreach $Left ( sort keys %keeplist ) {
    print $Left, " ", $_ , "\n" for @{$keeplist{$Left}};
  }

 

Author Comment

by:johnperl
ID: 9647605

Hi jmcg,

Thanks jmcg, and sorry for the belated reply; I don't have a Unix installation, so I ran it on a friend's PC. On Unix it works fine. It's superb work by jmcg.

I am working in perl, version 5.003_07.

When I run the latest script you sent on my PC, I get the following error:

syntax error at duplicate2.pl line 38, near ""\n" for "
syntax error at duplicate2.pl line 38, near "}}"
Execution of duplicate2.pl aborted due to compilation errors.

Anyway, I will install the new version of Perl.

Once again I thank jmcg for the marvelous work and I thank inq123 and fantasy1001 for their help and support.

johnperl
 
LVL 20

Expert Comment

by:jmcg
ID: 9648214
Oh, that's old! It doesn't appear to take 'for' as a statement modifier (yet another convenience). You could replace line 38 with these lines:

for (@{$keeplist{$Left}}) {
       print $Left, " ", $_ , "\n";
    }
 

Author Comment

by:johnperl
ID: 9648276
Hi jmcg,

It's working fine now. Thanks for your effort.

Thanks a lot,
Johnperl
 
LVL 20

Expert Comment

by:jmcg
ID: 10038294
Nothing has happened on this question in over 2 months. It's time for cleanup!

My recommendation, which I will post in the Cleanup topic area, is to
split points between jmcg [65 pts], inq123 [30 pts] and fantasy1001 [30 pts]

PLEASE DO NOT ACCEPT THIS COMMENT AS AN ANSWER!

jmcg
EE Cleanup Volunteer