Parsing data using excel or perl. Finding a percent value for values in a certain range with three columns

libertyforall2
libertyforall2 used Ask the Experts™
on
I will have 2 sets of files. SO2 & SO4. I want to create an output file based on raw data that looks at values in the final column and creates an average for 5 different forecast levels. So for example for all values in 6 categories than outputs a file that has a category number, total values for that category, number of values correctly predicted, and the percentage of correctly predicted values for each category. Actual precent sign is not necessary. The only difference in the SO2 and SO4 values are the category levels. For SO2 Lets say there are 6 categories. Category 1 is all values .1 or less, category 2 would be greater than .1 and equal to or less than .2, category 3 would be .21 to 1.00, 4 would be 1.01 to 3.00, 5 would be 3.01 to 5.00, The script should look at all values that fall in that category (this is the second column in the output file), then find all values that fall in the same category for that data in the second to last column (the observed value column of the input file in the row to the left of the forecast column), This would be the 3 column in the output file. Then then script would find a percent of correctly predicted category levels based on those two values and round to the nearest tenth and place that value in the 4th column of the output file.

Sample input files are attached


The output files would look something like this (this are just bogus values not based on the actual input files.)

1 122 101 82.8%
2 34    13  38.2%
3 21     9   42.9%
4 11     5   45.5%
5  5      2   40.0%  
6  3      1   33.3%

so2location5scatter12.txt
so4location10scatter48.txt
Comment
Watch Question

Do more with

Expert Office
EXPERT OFFICE® is a registered trademark of EXPERTS EXCHANGE®

Commented:
Could you provide a simple example with a limited subset of the 2 files?
I'd love to help but I don't understand the connection between the content of the 2 files.

Thanks.

Author

Commented:
Ok lets say I have this below.

08-15-02 0.01 0.02
08-15-14 0.02 0.03
08-16-02 0.00 0.11
08-16-14 0.01 0.22
08-17-02 0.72 0.53
08-17-14 0.01 2.02
08-18-02 0.02 4.03
08-18-14 0.02 6.03
08-19-02 5.02 7.03
08-19-14 0.02 6.03

The output file would look like this

1 2 2 100
2 1 0     0
3 1 1 100
4 1 0     0
5 1 0     0
6 3 1   33.3

The exact format of the files are provided in the first post files.

Commented:
What's in SO4 and how does it related to the data in the final output file?
Ensure you’re charging the right price for your IT

Do you wonder if your IT business is truly profitable or if you should raise your prices? Learn how to calculate your overhead burden using our free interactive tool and use it to determine the right price for your IT services. Start calculating Now!

Author

Commented:
SO4 would have a different category level for 1-6, but I can modify that from the script provided for the SO2 once where I know where that portion of the code is located.

Commented:
Does this work for you?
Run it as
    perl /my/script.pl so2location5scatter12.txt

FYI, the resulting output file is:
1   0 458   0.0
2   0  18   0.0
3   0   3   0.0
4 255   0   0.0
5   0   0   0.0
6 224   0   0.0

Open in new window


which doesn't seem to make sense.
Hopefully, you will be able to make sense of it.

Good luck!


use strict;

my @observed;
my @forecasted;
while( <> ) {
   chomp;
   my($observed, $forecasted) = (split)[1,2];
   $observed[ valueToCategory($observed) ]++;
   $forecasted[ valueToCategory($forecasted) ]++;
}

foreach my $category(1..6) {
   my $observed   = $observed[$category]  || 0;
   my $forecasted = $forecasted[$category]|| 0;
   printf("$category %3d %3d %5.1f\n", $observed, $forecasted, $observed ? ($forecasted*100)/$observed : 0);
}


sub valueToCategory {
   my($value) = @_;

   my $category;
   if( $value <= .1 )    { $category = 1 }
   elsif( $value <= .2 ) { $category = 2 }
   elsif( $value <=  1 ) { $category = 3 }
   elsif( $value <=  3 ) { $category = 4 }
   elsif( $value <=  5 ) { $category = 5 }
   else                  { $category = 6 }

   return $category;
}

Open in new window

Author

Commented:
That doesn't seem to be correct. If you look at category one, there should be the highest number in column 2 which represents the total number of data points in category 1 in the last column of the input file, the 3 column in the output file represents the number of data points that have data in the same category in both the last and second to last data columns, The output files third column should always be equal to or less than the second columns output file.

Author

Commented:
If you run that script on this below what would you get?


Ok lets say I have this below.

08-15-02 0.01 0.02
08-15-14 0.02 0.03
08-16-02 0.00 0.11
08-16-14 0.01 0.22
08-17-02 0.72 0.53
08-17-14 0.01 2.02
08-18-02 0.02 4.03
08-18-14 0.02 6.03
08-19-02 5.02 7.03
08-19-14 0.02 6.03

The output file would look like this

1 2 2 100
2 1 0     0
3 1 1 100
4 1 0     0
5 1 0     0
6 3 1   33.3

Author

Commented:
When I ran the script on the data file below


08-15-02 0.01 0.02
08-15-14 0.02 0.03
08-16-02 0.00 0.11
08-16-14 0.01 0.22
08-17-02 0.72 0.53
08-17-14 0.01 2.02
08-18-02 0.02 4.03
08-18-14 0.02 6.03
08-19-02 5.02 7.03
08-19-14 0.02 6.03


I got this output

1   8   2  25.0
2   0   1   0.0
3   1   2 200.0
4   0   1   0.0
5   0   1   0.0
6   1   3 300.0

It should look like this

1 2 2 100
2 1 0     0
3 1 1 100
4 1 0     0
5 1 0     0
6 3 1   33.3

It appears that the script is making two errors. The first is that column 2 of the output file and column 3 of the output file are transposed. The second error is that column the third column of the output file should contain the number of data points where both forecast and observation data points are BOTH in the same category. It appears the script is merely stating how many data points are in the specified category for both columns instead of finding matching category totals.

Commented:
Check out the new version.
I still don't understand how you arrived at these numbers.
1 2 2 100
2 1 0     0
3 1 1 100
4 1 0     0
5 1 0     0
6 3 1   33.3

For the category 1 for example for the values <= to 0.1, they are clearly 8 in column #2 and 2 in column # 3.
Please confirm the ranges here:
  
   if( $value <= .1 )    { $category = 1 }
   elsif( $value <= .2 ) { $category = 2 }
   elsif( $value <=  1 ) { $category = 3 }
   elsif( $value <=  3 ) { $category = 4 }
   elsif( $value <=  5 ) { $category = 5 }
   else                  { $category = 6 }

Open in new window






use strict;

my @observed;
my @forecasted;
my @same;
while( <> ) {
   chomp;
   my($observed, $forecasted) = (split)[2,1];
   my $observedCategory = valueToCategory($observed);
   my $forecastedCategory = valueToCategory($forecasted);
   $forecasted[ $forecastedCategory ]++;
   $observed[ $observedCategory ]++;
   $same[ $observedCategory ]++ if $observedCategory == $forecastedCategory;
}

foreach my $category(1..6) {
   my $observed   = $observed[$category]  || 0;
   my $forecasted = $forecasted[$category]|| 0;
   my $same       = $same[$category]      || 0;
   printf("$category %3d %3d %5.1f\n", $observed, $forecasted, $observed ? ($same*100)/$observed : 0);
}


sub valueToCategory {
   my($value) = @_;

   my $category;
   if( $value <= .1 )    { $category = 1 }
   elsif( $value <= .2 ) { $category = 2 }
   elsif( $value <=  1 ) { $category = 3 }
   elsif( $value <=  3 ) { $category = 4 }
   elsif( $value <=  5 ) { $category = 5 }
   else                  { $category = 6 }

   return $category;
}

Open in new window

Author

Commented:
This is the output I got with the new code

1   2   8 100.0
2   1   0   0.0
3   2   1  50.0
4   1   0   0.0
5   1   0   0.0
6   3   1  33.3

The first row and third row are incorrect. The reason they are still incorrect is because the third column should contain values where BOTH the forecast data and observed data are in the SAME category. This allows me to calculate what percent of forecast values fell in the same category as the observed values. For the first row of the updated script you provided, column three has 8. It should be two since only two rows of data contain values where both the forecast and observed column in the input file fall within category 1. It appears what the script is doing is calculating the same value for column three as it is for column two. Looking at the sample input code you see that only the two rows below contain values where both forecast and observed values fell within category one. The 8 value appears to be looking at ALL values where ONLY the observed values fall in category 1.

08-15-02 0.01 0.02
08-15-14 0.02 0.03

Commented:
We are getting there... :-)
The code below generated:
1   2   2 100.0
2   1   0   0.0
3   2   1  50.0
4   1   0   0.0
5   1   0   0.0
6   3   1  33.3

Open in new window


The only discrepancy is about category 3.

use strict;

my @observed;
my @same;
while( <> ) {
   chomp;
   my($observed, $forecasted) = (split)[2,1];
   my $observedCategory = valueToCategory($observed);
   my $forecastedCategory = valueToCategory($forecasted);
   $observed[ $observedCategory ]++;
   $same[ $observedCategory ]++ if $observedCategory == $forecastedCategory;
}

foreach my $category(1..6) {
   my $observed   = $observed[$category]  || 0;
   my $same       = $same[$category]      || 0;
   printf("$category %3d %3d %5.1f\n", $observed, $same, $observed ? ($same*100)/$observed : 0);
}


sub valueToCategory {
   my($value) = @_;

   my $category;
   if( $value <= .1 )    { $category = 1 }
   elsif( $value <= .2 ) { $category = 2 }
   elsif( $value <=  1 ) { $category = 3 }
   elsif( $value <=  3 ) { $category = 4 }
   elsif( $value <=  5 ) { $category = 5 }
   else                  { $category = 6 }

   return $category;
}

Open in new window

Author

Commented:
I think its correct. It looks like category three was an oversight in my manual look at the pretend numbers. Both .22 and .53 fall in category three so 2 1 50.0 is the correct number.

Author

Commented:
It didn't quite work with the actual data. I think because the actual input file contains a time stamp and  four columns instead of just three.

The actual files for SO2 are in this format. It was an oversight in the sample data posted.


10-21-2010 14:00:00 0.00 0.00
10-22-2010 14:00:00 0.00 0.01
10-23-2010 02:00:00 0.11 0.08
10-23-2010 14:00:00 0.00 0.02
10-24-2010 02:00:00 0.05 0.00
10-24-2010 14:00:00 0.00 0.03
10-25-2010 02:00:00 0.06 0.00
10-25-2010 14:00:00 0.01 0.00
10-26-2010 02:00:00 0.08 0.00
10-26-2010 14:00:00 0.01 0.00
10-27-2010 02:00:00 0.02 0.00
10-28-2010 14:00:00 0.01 0.09
10-29-2010 02:00:00 0.04 0.00
10-29-2010 14:00:00 0.00 0.01
10-30-2010 02:00:00 0.02 0.00


The output looked like this

1 458   0   0.0
2   18   0   0.0
3     3   0   0.0
4     0   0   0.0
5     0   0   0.0
6     0   0   0.0
stuff2.txt
Commented:
Change line 7 from:
   my($observed, $forecasted) = (split)[2,1];

Open in new window

to

   my($forecasted, $observed) = (split)[-2,-1];

Open in new window

That yields:

1 456 436  95.6
2  17   1   5.9
3   4   0   0.0
4   2   0   0.0
5   0   0   0.0
6   0   0   0.0

Open in new window

Author

Commented:
Works like a charm!

Commented:
Yeah!
Sorry for the number of back-and-forth...

Happy Perl'ing!

Author

Commented:
I have another questions that is similar but different please look at it below. Its slightly more complicated because it needs to skip some data.

http://www.experts-exchange.com/Programming/Languages/Scripting/Perl/Q_27285538.html

Author

Commented:
Jeromee I have asked a similar question on this thread http://www.experts-exchange.com/Programming/Languages/Scripting/Perl/Q_27291301.html The only difference is that now I want the total number of times column 4 of the input file DOES NOT fall in the category and the total number of times BOTH column 4 & 3 of each row DO NOT fall in the specified category.

Do more with

Expert Office
Submit tech questions to Ask the Experts™ at any time to receive solutions, advice, and new ideas from leading industry professionals.

Start 7-Day Free Trial