Store value from text file into array and appending a unique value

Hi all,

I have a list of values stored in a text file as per below:
1BSH0008655|AEO0001|1BSH0008200|CI|SPPNP3040
1BSH0008655|AEO0001|1BSH0008200|CI|SPPNP3040
1BSH0008655|AEO0001|1BSH0008200|CI|SPPNP3024
1BSH0008655|AEO0001|1BSH0008200|CI|SPPNP3024

I am trying to add an extra column with a unique value/number at the end of each line before storing the output into an array:

1BSH0008655|AEO0001|1BSH0008200|CI|SPPNP3040|1
1BSH0008655|AEO0001|1BSH0008200|CI|SPPNP3040|2
1BSH0008655|AEO0001|1BSH0008200|CI|SPPNP3024|3
1BSH0008655|AEO0001|1BSH0008200|CI|SPPNP3024|4

I have tried the following code but when I print the output, only one line is displayed instead of all 4.

      my $i=1;

      while(<GL>) {

      @result = split /\|/;
      push(@result, "|$i");
      $i++;

      }
      close(GL);
      return @results;

Would appreciate some advice. Thank you
Jason

tel2

Hi Jason,

Could you please tell me what this is for? Is it homework?

You have a one-dimensional array, but it sounds as if you want to store two dimensions of data in it. Do you want each row in a separate element of a one-dimensional array, like this:
$result[1] = 1BSH0008655|AEO0001|1BSH0008200|CI|SPPNP3040|1
$result[2] = 1BSH0008655|AEO0001|1BSH0008200|CI|SPPNP3040|2
or do you want to split the fields up and store them in a two-dimensional array, like this:
$result[1][0] = 1BSH0008655
$result[1][1] = AEO0001
$result[1][2] = 1BSH0008200
$result[1][3] = CI
$result[1][4] = SPPNP3040
$result[1][5] = 1
$result[2][0] = 1BSH0008655
...etc...

Right now you're just overwriting the same one-dimensional array with the fields of each row.

Also, where did you open the GL file?
Also, you're trying to return @results (note the "s"), but the data is in @result.
Also, this doesn't look like a subroutine, so you can't "return" from it anyway. Or is this just part of the code?
Also, you should probably do a "chomp;" before the "...split..." line, to remove the newline from the last field.
Also, instead of pushing "|$i", you should just push $i, since you have split the fields based on the "|" separator.
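The points above can be pulled together in a minimal sketch. It assumes the one-row-per-element (1D) reading of the question; the sub name number_lines and the inline sample lines are made up for illustration, and the real script would read from the GL filehandle instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: takes a list of raw lines and returns one array
# element per line, with a running number appended as the final field.
sub number_lines {
    my @lines = @_;
    my @result;
    my $i = 1;
    for my $line (@lines) {
        chomp $line;                        # drop the trailing newline first
        my @fields = split /\|/, $line;
        push @fields, $i++;                 # push the bare number, not "|$i"
        push @result, join('|', @fields);   # append a row; do not overwrite
    }
    return @result;
}

# Stand-in for the contents of the text file.
my @numbered = number_lines(
    "1BSH0008655|AEO0001|1BSH0008200|CI|SPPNP3040\n",
    "1BSH0008655|AEO0001|1BSH0008200|CI|SPPNP3040\n",
    "1BSH0008655|AEO0001|1BSH0008200|CI|SPPNP3024\n",
    "1BSH0008655|AEO0001|1BSH0008200|CI|SPPNP3024\n",
);
print "$_\n" for @numbered;
```

This prints the four input lines with |1 through |4 appended.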

You could detect some of these issues by using the trick ozo gave you in your first question, i.e.:
perl -Mdiagnostics yourscript.pl
and/or you could put this at the beginning of your code:
use strict;
use warnings;
But then you have to declare everything (e.g. with "my").

Awaiting your answers to my questions above...

tel2

oheil

Only the last line is written out, because you overwrite the array @result in each step of the while loop.

      my $i=1;

      @result = ();
      while(<GL>) {

      chomp;
      @tmp_result = split /\|/;
      push(@tmp_result, $i);

      push(@result, [ @tmp_result ]);
      $i++;

      }
      close(GL);
      return @result;

Now @result is an array of arrays. Each array in @result contains the split elements of the corresponding line plus the new integer value.
If you want an array of the lines, change
push(@result, [ @tmp_result ]);
to
$tmp_result = join('|', @tmp_result);
push(@result, $tmp_result);

Oli

FishMonger

Based on the problem statement I would not assume the OP wants a 2D array and based on my read, a simple assignment would suffice.

#!/usr/bin/perl

use strict;  
use warnings;

my $i = 0;
my @result;

while ( <DATA> ) {
    chomp;
    $result[$i++] = "$_|$.\n";
}
print @result;

__DATA__
1BSH0008655|AEO0001|1BSH0008200|CI|SPPNP3040
1BSH0008655|AEO0001|1BSH0008200|CI|SPPNP3040
1BSH0008655|AEO0001|1BSH0008200|CI|SPPNP3024
1BSH0008655|AEO0001|1BSH0008200|CI|SPPNP3024

FishMonger

If a 2D array is what is wanted, then I'd probably do it this way.

#!/usr/bin/perl

use strict;  
use warnings;
use Data::Dumper;

my $i = 0;
my @result;

while ( <DATA> ) {
    chomp;
    $result[$i++] = [ split /\|/ ];
}
print Dumper \@result;

__DATA__
1BSH0008655|AEO0001|1BSH0008200|CI|SPPNP3040
1BSH0008655|AEO0001|1BSH0008200|CI|SPPNP3040
1BSH0008655|AEO0001|1BSH0008200|CI|SPPNP3024
1BSH0008655|AEO0001|1BSH0008200|CI|SPPNP3024

Which outputs:

$VAR1 = [
[
'1BSH0008655',
'AEO0001',
'1BSH0008200',
'CI',
'SPPNP3040'
],
[
'1BSH0008655',
'AEO0001',
'1BSH0008200',
'CI',
'SPPNP3040'
],
[
'1BSH0008655',
'AEO0001',
'1BSH0008200',
'CI',
'SPPNP3024'
],
[
'1BSH0008655',
'AEO0001',
'1BSH0008200',
'CI',
'SPPNP3024'
]
];

tel2

Not bad, FishMonger, except I don't think you can simply do this:
$result[$i++] = [ split /\|/ ];
as Jason's wanting $i on the end, so I guess something like this is needed:
$result[$i] = [ split /\|/, "$_|$i" ];
$i++;

FishMonger

tel2,

But why the 2D array? The OP never indicated that the 2D was needed/wanted. If it isn't wanted, then my first post would be the best option of what's been suggested so far.

For the 2D assignment, the $i array index would need to be initialized to 0 and the value being added to the end, based on the OP's sample, would need to be $i + 1.

I'd use post increment on $i and $. for the value to accomplish that. (assuming all lines in the file are being processed and none are skipped.)

$result[$i++] = [ split(/\|/, $_), $. ];
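
As a rough illustration of using $. (the current input line number) as the appended value, here is a small self-contained sketch. The in-memory filehandle and the two sample lines are assumptions made so that it runs without the real data file, and push is used in place of an explicit index.

```perl
use strict;
use warnings;

# Stand-in for the data file (assumption: two lines are enough to show the idea).
my $text = <<'END';
1BSH0008655|AEO0001|1BSH0008200|CI|SPPNP3040
1BSH0008655|AEO0001|1BSH0008200|CI|SPPNP3024
END

open(my $fh, '<', \$text) or die "can't open in-memory file: $!";

my @result;
while (my $line = <$fh>) {
    chomp $line;
    # One array ref per row: the split fields, then the line number
    # as its own field (no leading "|").
    push @result, [ split(/\|/, $line), $. ];
}
close $fh;

print "row $_->[-1]: last data field is $_->[4]\n" for @result;
```

Rows start at element 0 of @result, while the appended ID starts at 1, which matches the sample output in the question.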

tel2

Hi FishMonger,

> But why the 2D array?
I assume you mean something like: "Good point, but why the 2D array?". We can only guess what Jason wants, and I discussed the 2 possibilities (1D & 2D) in my post, as did you. I guessed 2D is possibly what he's wanting, based on the fact that he had "split" the fields of each line into separate array elements. Maybe he split it without good reason. Maybe he split it so he could easily access the individual fields later. I don't think we're seeing all his code, so who knows what he's gonna do after this section. That's why I covered both possibilities, and asked him the question "Do you want each row...". But yes, 1D is most likely all he needs, though.

> For the 2D assignment, the $i array index would need to be initialized to 0...
Good point, or it could be initialised to 1 (as Jason did), and then the "+ 1" would not be required.

> I'd use post increment on $i and $. for the value to accomplish that.
Nice.

FishMonger

>> For the 2D assignment, the $i array index would need to be initialized to 0...
>Good point, or it could be initialised to 1 (as Jason did), and then the "+ 1" would not be required.

But then you'd need $i - 1 when indexing the array, or later on account for $result[0] being undef.

tel2

Not if I was doing it as I'd mentioned above, which was:
$result[$i] = [ split /\|/, "$_|$i" ];
$i++;
Or have I misunderstood you?

FishMonger

Arrays are zero indexed, so if you initialize $i to 1, then your first assignment will be assigning the second element, not the first.

#!/usr/bin/perl

use strict;  
use warnings;
use Data::Dumper;

my $i = 1;
my @result;

while ( <DATA> ) {
    chomp;
    $result[$i] = [ split /\|/, "$_|$i" ];
    $i++
}
print Dumper \@result;

__DATA__
1BSH0008655|AEO0001|1BSH0008200|CI|SPPNP3040
1BSH0008655|AEO0001|1BSH0008200|CI|SPPNP3040
1BSH0008655|AEO0001|1BSH0008200|CI|SPPNP3024
1BSH0008655|AEO0001|1BSH0008200|CI|SPPNP3024

$VAR1 = [
undef,
[
'1BSH0008655',
'AEO0001',
'1BSH0008200',
'CI',
'SPPNP3040',
'1'
],
[
'1BSH0008655',
'AEO0001',
'1BSH0008200',
'CI',
'SPPNP3040',
'2'
],
[
'1BSH0008655',
'AEO0001',
'1BSH0008200',
'CI',
'SPPNP3024',
'3'
],
[
'1BSH0008655',
'AEO0001',
'1BSH0008200',
'CI',
'SPPNP3024',
'4'
]
];

tel2

I know, but what problem does that cause, FishMonger?

FishMonger

It means that you need to remember to account for that undefined element later in the script. Forgetting to account for it later introduces a common programming bug known as "off by 1". http://en.wikipedia.org/wiki/Off-by-one_error

A better question to ask yourself is, "why intentionally code it in such a way that easily introduces bugs, when the proper solution would be to use the correct indexes?"
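
To make the off-by-one risk concrete, here is a small sketch with made-up stand-in data (four short rows). It fills one array starting at index 1 and another starting at index 0, then compares the element counts.

```perl
use strict;
use warnings;

my @lines = ('a|b', 'c|d', 'e|f', 'g|h');   # stand-in data, 4 rows

my (@from_one, @from_zero);
my $i = 1;
$from_one[$i++] = $_ for @lines;     # first row lands in element 1
push @from_zero, $_ for @lines;      # first row lands in element 0

# scalar(@array) counts the undef placeholder too.
printf "from_one:  %d elements, element 0 is %s\n",
    scalar @from_one, defined $from_one[0] ? 'defined' : 'undef';
printf "from_zero: %d elements\n", scalar @from_zero;
```

The first array reports 5 elements for 4 rows, so any later loop over it, or any row count taken from it, has to skip or subtract the placeholder.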

Jason_Sutiono

ASKER

Hi guys,

Thanks for the feedback. @tel2 yes, I have only displayed a portion of the code. It's for work actually, and I'm obviously new to Perl. Not much of a programmer either. But I do find learning Perl quite fun (if you do get the code working, of course!)

Basically I need to process over 200k lines from a text file. After putting each line into array and assigning the integer, I would need to access the individual fields. Would you guys suggest using the 2d array or 1d array? I'll need time to try out the aforementioned suggestions. I really appreciate the overwhelming support.

FishMonger

Based on the info you've given so far a 1D array would be the proper choice.

tel2

Hi FishMonger,

Fair point about the "off by 1" thing. Depends on the kind of accessing that's going to happen next, I guess, but 0 is safer. It's just that many (human) types prefer 1, I guess, and I was trying to cater to that preference that seemed to be evident in Jason's original post.

Why are you suggesting a 1D array if Jason says he "would need to access the individual fields"?

FishMonger

I'm suggesting the 1D array because it's a little easier to deal with for beginners and in most cases I try to follow the "KISS" principle. However, since we don't have enough details to see the bigger picture, the choice comes down to a flip of the coin. It may turn out that the 2D array is a better choice.

The approaches suggested so far dictate that you loop over the data at least twice. In general, that may not be the best approach. I generally try to loop over the data once, processing each record as I go. However, that is not always possible or the best approach.

The devil is in the details which we don't have.

Jason_Sutiono

ASKER

Hi all,

To make it easier I'll outline all that I am trying to achieve.

I have got the following text file as the input:

Col 0 | Col 1 | Col 2 | 3| 4 |5 | 6 | 7 | 8 |9| 10 | 11
1BSH0008655|AEO0001|1BSH0008200|CI|03-MAY-2004|44|BN03054B|00015K|-381.64|1|07-JUN-2004|SPPNP3040
1BSH0008655|AEO0001|1BSH0008200|CI|03-MAY-2004|44|BN03054B|00015K|381.64|1|07-JUN-2004|SPPNP3040
1BSH0008655|AEO0001|1BSH0008200|CI|03-MAY-2004|425|CM03054G|0001PM|45.00|1|15-JUN-2004|SPPNP3024
1BSH0008655|AEO0001|1BSH0008200|CI|03-MAY-2004|425|CM03054G|0001PM|-45.00|1|15-JUN-2004|SPPNP3024
1BSH0008655|AEO0001|1BSH0008200|CI|03-MAY-2004|424|BN030504|0001PW|107.00|1|15-JUN-2004|SPPNP3044
1BSH0008655|AEO0001|1BSH0008200|CI|03-MAY-2004|424|BN030504|0001PW|207.00|1|15-JUN-2004|SPPNP3044
1BSH0008655|AEO0001|1BSH0008200|CI|03-MAY-2004|423|BN03054C|0001Q1|427.50|1|15-JUN-2004|SPPNP3040
1BSH0008655|AEO0001|1BSH0008200|CI|03-MAY-2004|423|BN03054C|0001Q1|527.50|1|15-JUN-2004|SPPNP3040

I need to sum the amounts in column 8, grouped by the reference column (column 5). If the sum for a reference is 0, exclude its lines; otherwise, include all of its entries in the output file.

The output I am trying to achieve is as per below:

1BSH0008655|AEO0001|1BSH0008200|CI|03-MAY-2004|424|BN030504|0001PW|107.00|1|15-JUN-2004|SPPNP3044
1BSH0008655|AEO0001|1BSH0008200|CI|03-MAY-2004|424|BN030504|0001PW|207.00|1|15-JUN-2004|SPPNP3044
1BSH0008655|AEO0001|1BSH0008200|CI|03-MAY-2004|423|BN03054C|0001Q1|427.50|1|15-JUN-2004|SPPNP3040
1BSH0008655|AEO0001|1BSH0008200|CI|03-MAY-2004|423|BN03054C|0001Q1|527.50|1|15-JUN-2004|SPPNP3040

Because there is no unique 'primary id' for each line, I appended the extra integer to act as a 'dummy id', since a hash would only keep one line per unique key value.

I would later on drop the dummy ID to allow me to display everything that has a difference in the sum value. Otherwise, I will end up with:(which I am trying to avoid)

1BSH0008655|AEO0001|1BSH0008200|CI|03-MAY-2004|424|BN030504|0001PW|307.00|1|15-JUN-2004|SPPNP3044
1BSH0008655|AEO0001|1BSH0008200|CI|03-MAY-2004|423|BN03054C|0001Q1|955|1|15-JUN-2004|SPPNP3040

I have attached my code. I tried to incorporate FishMonger's suggestion, but I am getting the warning 'Use of uninitialized value in concatenation (.) or string at ./reconcile.pl line 27'. Also, I get the following line in the output file:

|ARRAY(0x80e6a10)|ARRAY(0x8063504)|ARRAY(0x80635b8)|ARRAY(0x806366c)|ARRAY(0x8063720)|ARRAY(0x80637d4)|ARRAY(0x8063888)|ARRAY(0x80e8618)|ARRAY(0x80e86cc)|ARRAY(0x80e8780)|ARRAY(0x80e8834)

Hope it's clear enough. Thanks in advance!

#!/usr/bin/perl -s

use strict;
use warnings;
use Data::Dumper;

#-------------------------------------------------------------------------
# Configuration Variables
my $data = "/u/xi6505/pronto/cus/imports/test.txt";
my $file = "/u/xi6505/pronto/cus/imports/sample.csv";
#-------------------------------------------------------------------------

my @cola;
my %price;
my %columns;
my $num=0;

	@cola = process_import_file($data);
	
	#print Dumper \@cola;
	
	for (@cola){
	chomp;
	$price{$cola[5]} += $cola[8];
    $a{$cola[5]}{$cola[12]}="$cola[0]|$cola[1]|$cola[2]|$cola[3]|$cola[4]|$cola[5]|$cola[6]|$cola[7]|$cola[8]|$cola[9]|$cola[10]|$cola[11]";
	
	}
	

open(INFO, ">$file");	# Open for output

foreach my $id (sort keys %a) {
   
   
   foreach my $name (keys %{$a{$id}}) { #for each price(id)
		if ($price{$id} != $num) {

	print INFO "$a{$id}{$name}\n";

	 
	 }
	 }
	 }
	 
	 close INFO;
	 

   
# Takes file name as input and returns
sub process_import_file {
	
	open(GL,$data) or debug(0,"Can't parse po file - $data",1);
	my @result;
	my $i=1;
	
	while(<GL>) {
    chomp;
    $result[$i] = [ split /\|/, "$_|$i" ];
    $i++;
	
}

	
	close(GL);
	return @result;
	
	
}

tel2

Hi Jason,

Based on the info you gave prior to your last post, i.e. you "would need to access the individual fields", I think a 2D array would be the proper choice, as that would allow easy access to every field of every row.

Having seen your last post, it looks as if there's more work left than I have time for, so I'll leave this to FishMonger or anyone else who has time.

tel2

...that is, my 2D array suggestion, is, in my view, in line with the "KISS" principle, because it's generally simpler to access individual fields when they are in individual array elements.

FishMonger

The warning is due to the fact that your sub is building a 2D array, but you then loop over that array as if it were a 1D array.

You also didn't declare the %a hash, which was probably a typo when posting, because that missing declaration would generate compilation errors preventing the script from running.

Unless the data needs to be sorted by "col5", I'd not use either the 2D array or the HoH. Instead, I'd simply process the file line-by-line which is more efficient. If you do need it sorted by that field, I'd use a HoA (Hash-of-Arrays).
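
For reference, a Hash-of-Arrays keyed on col 5 might look roughly like the sketch below. The hash name %by_ref and the three inline sample lines (taken from the posted input) are illustrative assumptions, not part of the original script.

```perl
use strict;
use warnings;

# Stand-in for lines read from the data file.
my @lines = (
    '1BSH0008655|AEO0001|1BSH0008200|CI|03-MAY-2004|44|BN03054B|00015K|-381.64|1|07-JUN-2004|SPPNP3040',
    '1BSH0008655|AEO0001|1BSH0008200|CI|03-MAY-2004|44|BN03054B|00015K|381.64|1|07-JUN-2004|SPPNP3040',
    '1BSH0008655|AEO0001|1BSH0008200|CI|03-MAY-2004|424|BN030504|0001PW|107.00|1|15-JUN-2004|SPPNP3044',
);

my %by_ref;     # reference (col 5) => [ every line sharing that reference ]
for my $line (@lines) {
    my $ref = (split /\|/, $line)[5];
    push @{ $by_ref{$ref} }, $line;    # autovivifies the inner array
}

# Walk the groups in numeric order of the reference.
for my $ref (sort { $a <=> $b } keys %by_ref) {
    printf "%s: %d line(s)\n", $ref, scalar @{ $by_ref{$ref} };
}
```

Because each hash value is an array, duplicate lines under the same reference are all kept, so no dummy ID is needed to stop them overwriting each other.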

Other recommendations:

1) Fix your indentation, it's all over the map.

2) Use the 3 arg form of open and a lexical var for the filehandle.

3) ALWAYS check the return code of your open calls and take proper action if they fail.

If you don't need the sorting, your script could be reduced to this:

#!/usr/bin/perl

use strict;
use warnings;
use Data::Dumper;

# Configuration Variables
my $data = "/u/xi6505/pronto/cus/imports/test.txt";
my $file = "/u/xi6505/pronto/cus/imports/sample.csv";
my $num  = 0;

open(my $gl_fh, '<', $data) or debug(0, "Can't parse po file '$data' <$!>", 1);
open(my $csv_fh, '>', $file) or die "can't open '$file' <$!>";

while (my $line = <$gl_fh> ) {
    my @csv_data = split /\|/, $line;
    next if $csv_data[5] + $csv_data[8] == $num;
    print {$csv_fh} $line;
}

close $gl_fh;
close $csv_fh;

tel2

Nice work, FishMonger.

Keeping you on...

tel2

FishMonger

Since we're only concerned with 2 fields in the decision, an array slice would be better.

    my ($fld5, $fld8) = (split /\|/, $line)[5,8];
    next if $fld5 + $fld8 == $num;

Jason_Sutiono

ASKER

Hi Fishmonger,

Thank you for your help. I just tried your code, but it is printing everything as is to the output instead of summing col 8 per reference and excluding the lines if the sum is 0.

Here is my input

1BSH0008655|AEO0001|1BSH0008200|CI|03-MAY-2004|44|BN03054B|00015K|-381.64|1|07-JUN-2004|SPPNP3040
1BSH0008655|AEO0001|1BSH0008200|CI|03-MAY-2004|44|BN03054B|00015K|381.64|1|07-JUN-2004|SPPNP3040
1BSH0008655|AEO0001|1BSH0008200|CI|03-MAY-2004|425|CM03054G|0001PM|45.00|1|15-JUN-2004|SPPNP3024
1BSH0008655|AEO0001|1BSH0008200|CI|03-MAY-2004|425|CM03054G|0001PM|-45.00|1|15-JUN-2004|SPPNP3024
1BSH0008655|AEO0001|1BSH0008200|CI|03-MAY-2004|424|BN030504|0001PW|107.00|1|15-JUN-2004|SPPNP3044

After I run the code the output is:
1BSH0008655|AEO0001|1BSH0008200|CI|03-MAY-2004|44|BN03054B|00015K|-381.64|1|07-JUN-2004|SPPNP3040
1BSH0008655|AEO0001|1BSH0008200|CI|03-MAY-2004|44|BN03054B|00015K|381.64|1|07-JUN-2004|SPPNP3040
1BSH0008655|AEO0001|1BSH0008200|CI|03-MAY-2004|425|CM03054G|0001PM|45.00|1|15-JUN-2004|SPPNP3024
1BSH0008655|AEO0001|1BSH0008200|CI|03-MAY-2004|425|CM03054G|0001PM|-45.00|1|15-JUN-2004|SPPNP3024
1BSH0008655|AEO0001|1BSH0008200|CI|03-MAY-2004|424|BN030504|0001PW|107.00|1|15-JUN-2004|SPPNP3044

Instead of printing just:

1BSH0008655|AEO0001|1BSH0008200|CI|03-MAY-2004|424|BN030504|0001PW|107.00|1|15-JUN-2004|SPPNP3044

since the 1st 2 lines (-381.64+381.64) = 0 based on reference 44 and 3rd and 4th lines (45+-45) = 0 based on reference 425 and so are excluded.

I guess I wasn't clear in explaining my requirements earlier.

Would appreciate your advice again. Thank you.
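
One possible way to meet the requirement as restated here is a two-pass approach: total col 8 per col 5 reference first, then print only the lines whose reference has a non-zero total. The sketch below is an illustration under assumptions, not necessarily the answer that was accepted in this thread. The sub name filter_nonzero_groups, the inline sample lines, and the 0.005 tolerance are all made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: keeps only the lines whose reference (col 5)
# has a non-zero total in the amount column (col 8).
sub filter_nonzero_groups {
    my @lines = @_;
    my %sum;

    # Pass 1: total the amounts per reference.
    for my $line (@lines) {
        my ($ref, $amount) = (split /\|/, $line)[5, 8];
        $sum{$ref} += $amount;
    }

    # Pass 2: keep lines in input order. The 0.005 tolerance is an
    # assumption to absorb floating-point rounding on 2-decimal amounts.
    return grep {
        my $ref = (split /\|/, $_)[5];
        abs($sum{$ref}) >= 0.005;
    } @lines;
}

# Stand-in for the input file shown above.
my @input = (
    '1BSH0008655|AEO0001|1BSH0008200|CI|03-MAY-2004|44|BN03054B|00015K|-381.64|1|07-JUN-2004|SPPNP3040',
    '1BSH0008655|AEO0001|1BSH0008200|CI|03-MAY-2004|44|BN03054B|00015K|381.64|1|07-JUN-2004|SPPNP3040',
    '1BSH0008655|AEO0001|1BSH0008200|CI|03-MAY-2004|425|CM03054G|0001PM|45.00|1|15-JUN-2004|SPPNP3024',
    '1BSH0008655|AEO0001|1BSH0008200|CI|03-MAY-2004|425|CM03054G|0001PM|-45.00|1|15-JUN-2004|SPPNP3024',
    '1BSH0008655|AEO0001|1BSH0008200|CI|03-MAY-2004|424|BN030504|0001PW|107.00|1|15-JUN-2004|SPPNP3044',
);
print "$_\n" for filter_nonzero_groups(@input);
```

With the five sample lines this prints only the line for reference 424. Every original line is kept as a whole string in the array, so duplicates survive without a dummy ID; for a 200k-line file the same idea would work by reading the file into @input, or by reading the file twice.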

ASKER CERTIFIED SOLUTION

FishMonger
