Link to home
Start Free TrialLog in
Avatar of libertyforall2
libertyforall2Flag for United States of America

asked on

Parsing data using excel or perl. Finding a percent value for values in a certain range with three columns

I will have 2 sets of files. SO2 & SO4. I want to create an output file based on raw data that looks at values in the final column and creates an average for 5 different forecast levels. So for example for all values in 6 categories than outputs a file that has a category number, total values for that category, number of values correctly predicted, and the percentage of correctly predicted values for each category. Actual precent sign is not necessary. The only difference in the SO2 and SO4 values are the category levels. For SO2 Lets say there are 6 categories. Category 1 is all values .1 or less, category 2 would be greater than .1 and equal to or less than .2, category 3 would be .21 to 1.00, 4 would be 1.01 to 3.00, 5 would be 3.01 to 5.00, The script should look at all values that fall in that category (this is the second column in the output file), then find all values that fall in the same category for that data in the second to last column (the observed value column of the input file in the row to the left of the forecast column), This would be the 3 column in the output file. Then then script would find a percent of correctly predicted category levels based on those two values and round to the nearest tenth and place that value in the 4th column of the output file.

Sample input files are attached


The output files would look something like this (this are just bogus values not based on the actual input files.)

1 122 101 82.8%
2 34    13  38.2%
3 21     9   42.9%
4 11     5   45.5%
5  5      2   40.0%  
6  3      1   33.3%

so2location5scatter12.txt
so4location10scatter48.txt
Avatar of jeromee
jeromee
Flag of United States of America image

Could you provide a simple example with a limited subset of the 2 files?
I'd love to help but I don't understand the connection between the content of the 2 files.

Thanks.
Avatar of libertyforall2

ASKER

Ok lets say I have this below.

08-15-02 0.01 0.02
08-15-14 0.02 0.03
08-16-02 0.00 0.11
08-16-14 0.01 0.22
08-17-02 0.72 0.53
08-17-14 0.01 2.02
08-18-02 0.02 4.03
08-18-14 0.02 6.03
08-19-02 5.02 7.03
08-19-14 0.02 6.03

The output file would look like this

1 2 2 100
2 1 0     0
3 1 1 100
4 1 0     0
5 1 0     0
6 3 1   33.3

The exact format of the files are provided in the first post files.
What's in SO4 and how does it related to the data in the final output file?
SO4 would have a different category level for 1-6, but I can modify that from the script provided for the SO2 once where I know where that portion of the code is located.
Does this work for you?
Run it as
    perl /my/script.pl so2location5scatter12.txt

FYI, the resulting output file is:
1   0 458   0.0
2   0  18   0.0
3   0   3   0.0
4 255   0   0.0
5   0   0   0.0
6 224   0   0.0

Open in new window


which doesn't seem to make sense.
Hopefully, you will be able to make sense of it.

Good luck!


use strict;

my @observed;
my @forecasted;
while( <> ) {
   chomp;
   my($observed, $forecasted) = (split)[1,2];
   $observed[ valueToCategory($observed) ]++;
   $forecasted[ valueToCategory($forecasted) ]++;
}

foreach my $category(1..6) {
   my $observed   = $observed[$category]  || 0;
   my $forecasted = $forecasted[$category]|| 0;
   printf("$category %3d %3d %5.1f\n", $observed, $forecasted, $observed ? ($forecasted*100)/$observed : 0);
}


sub valueToCategory {
   my($value) = @_;

   my $category;
   if( $value <= .1 )    { $category = 1 }
   elsif( $value <= .2 ) { $category = 2 }
   elsif( $value <=  1 ) { $category = 3 }
   elsif( $value <=  3 ) { $category = 4 }
   elsif( $value <=  5 ) { $category = 5 }
   else                  { $category = 6 }

   return $category;
}

Open in new window

That doesn't seem to be correct. If you look at category one, there should be the highest number in column 2 which represents the total number of data points in category 1 in the last column of the input file, the 3 column in the output file represents the number of data points that have data in the same category in both the last and second to last data columns, The output files third column should always be equal to or less than the second columns output file.
If you run that script on this below what would you get?


Ok lets say I have this below.

08-15-02 0.01 0.02
08-15-14 0.02 0.03
08-16-02 0.00 0.11
08-16-14 0.01 0.22
08-17-02 0.72 0.53
08-17-14 0.01 2.02
08-18-02 0.02 4.03
08-18-14 0.02 6.03
08-19-02 5.02 7.03
08-19-14 0.02 6.03

The output file would look like this

1 2 2 100
2 1 0     0
3 1 1 100
4 1 0     0
5 1 0     0
6 3 1   33.3
When I ran the script on the data file below


08-15-02 0.01 0.02
08-15-14 0.02 0.03
08-16-02 0.00 0.11
08-16-14 0.01 0.22
08-17-02 0.72 0.53
08-17-14 0.01 2.02
08-18-02 0.02 4.03
08-18-14 0.02 6.03
08-19-02 5.02 7.03
08-19-14 0.02 6.03


I got this output

1   8   2  25.0
2   0   1   0.0
3   1   2 200.0
4   0   1   0.0
5   0   1   0.0
6   1   3 300.0

It should look like this

1 2 2 100
2 1 0     0
3 1 1 100
4 1 0     0
5 1 0     0
6 3 1   33.3

It appears that the script is making two errors. The first is that column 2 of the output file and column 3 of the output file are transposed. The second error is that column the third column of the output file should contain the number of data points where both forecast and observation data points are BOTH in the same category. It appears the script is merely stating how many data points are in the specified category for both columns instead of finding matching category totals.
Check out the new version.
I still don't understand how you arrived at these numbers.
1 2 2 100
2 1 0     0
3 1 1 100
4 1 0     0
5 1 0     0
6 3 1   33.3

For the category 1 for example for the values <= to 0.1, they are clearly 8 in column #2 and 2 in column # 3.
Please confirm the ranges here:
  
   if( $value <= .1 )    { $category = 1 }
   elsif( $value <= .2 ) { $category = 2 }
   elsif( $value <=  1 ) { $category = 3 }
   elsif( $value <=  3 ) { $category = 4 }
   elsif( $value <=  5 ) { $category = 5 }
   else                  { $category = 6 }

Open in new window






use strict;

my @observed;
my @forecasted;
my @same;
while( <> ) {
   chomp;
   my($observed, $forecasted) = (split)[2,1];
   my $observedCategory = valueToCategory($observed);
   my $forecastedCategory = valueToCategory($forecasted);
   $forecasted[ $forecastedCategory ]++;
   $observed[ $observedCategory ]++;
   $same[ $observedCategory ]++ if $observedCategory == $forecastedCategory;
}

foreach my $category(1..6) {
   my $observed   = $observed[$category]  || 0;
   my $forecasted = $forecasted[$category]|| 0;
   my $same       = $same[$category]      || 0;
   printf("$category %3d %3d %5.1f\n", $observed, $forecasted, $observed ? ($same*100)/$observed : 0);
}


sub valueToCategory {
   my($value) = @_;

   my $category;
   if( $value <= .1 )    { $category = 1 }
   elsif( $value <= .2 ) { $category = 2 }
   elsif( $value <=  1 ) { $category = 3 }
   elsif( $value <=  3 ) { $category = 4 }
   elsif( $value <=  5 ) { $category = 5 }
   else                  { $category = 6 }

   return $category;
}

Open in new window

This is the output I got with the new code

1   2   8 100.0
2   1   0   0.0
3   2   1  50.0
4   1   0   0.0
5   1   0   0.0
6   3   1  33.3

The first row and third row are incorrect. The reason they are still incorrect is because the third column should contain values where BOTH the forecast data and observed data are in the SAME category. This allows me to calculate what percent of forecast values fell in the same category as the observed values. For the first row of the updated script you provided, column three has 8. It should be two since only two rows of data contain values where both the forecast and observed column in the input file fall within category 1. It appears what the script is doing is calculating the same value for column three as it is for column two. Looking at the sample input code you see that only the two rows below contain values where both forecast and observed values fell within category one. The 8 value appears to be looking at ALL values where ONLY the observed values fall in category 1.

08-15-02 0.01 0.02
08-15-14 0.02 0.03
We are getting there... :-)
The code below generated:
1   2   2 100.0
2   1   0   0.0
3   2   1  50.0
4   1   0   0.0
5   1   0   0.0
6   3   1  33.3

Open in new window


The only discrepancy is about category 3.

use strict;

my @observed;
my @same;
while( <> ) {
   chomp;
   my($observed, $forecasted) = (split)[2,1];
   my $observedCategory = valueToCategory($observed);
   my $forecastedCategory = valueToCategory($forecasted);
   $observed[ $observedCategory ]++;
   $same[ $observedCategory ]++ if $observedCategory == $forecastedCategory;
}

foreach my $category(1..6) {
   my $observed   = $observed[$category]  || 0;
   my $same       = $same[$category]      || 0;
   printf("$category %3d %3d %5.1f\n", $observed, $same, $observed ? ($same*100)/$observed : 0);
}


sub valueToCategory {
   my($value) = @_;

   my $category;
   if( $value <= .1 )    { $category = 1 }
   elsif( $value <= .2 ) { $category = 2 }
   elsif( $value <=  1 ) { $category = 3 }
   elsif( $value <=  3 ) { $category = 4 }
   elsif( $value <=  5 ) { $category = 5 }
   else                  { $category = 6 }

   return $category;
}

Open in new window

I think its correct. It looks like category three was an oversight in my manual look at the pretend numbers. Both .22 and .53 fall in category three so 2 1 50.0 is the correct number.
It didn't quite work with the actual data. I think because the actual input file contains a time stamp and  four columns instead of just three.

The actual files for SO2 are in this format. It was an oversight in the sample data posted.


10-21-2010 14:00:00 0.00 0.00
10-22-2010 14:00:00 0.00 0.01
10-23-2010 02:00:00 0.11 0.08
10-23-2010 14:00:00 0.00 0.02
10-24-2010 02:00:00 0.05 0.00
10-24-2010 14:00:00 0.00 0.03
10-25-2010 02:00:00 0.06 0.00
10-25-2010 14:00:00 0.01 0.00
10-26-2010 02:00:00 0.08 0.00
10-26-2010 14:00:00 0.01 0.00
10-27-2010 02:00:00 0.02 0.00
10-28-2010 14:00:00 0.01 0.09
10-29-2010 02:00:00 0.04 0.00
10-29-2010 14:00:00 0.00 0.01
10-30-2010 02:00:00 0.02 0.00


The output looked like this

1 458   0   0.0
2   18   0   0.0
3     3   0   0.0
4     0   0   0.0
5     0   0   0.0
6     0   0   0.0
stuff2.txt
ASKER CERTIFIED SOLUTION
Avatar of jeromee
jeromee
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Works like a charm!
Yeah!
Sorry for the number of back-and-forth...

Happy Perl'ing!
I have another questions that is similar but different please look at it below. Its slightly more complicated because it needs to skip some data.

https://www.experts-exchange.com/questions/27285538/Parsing-data-using-perl-to-find-a-trend-after-eliminating-bad-data.html
Jeromee I have asked a similar question on this thread https://www.experts-exchange.com/questions/27291301/Reversing-a-conditional-statement-in-perl-to-be-everything-except-instead-of-only-if.html The only difference is that now I want the total number of times column 4 of the input file DOES NOT fall in the category and the total number of times BOTH column 4 & 3 of each row DO NOT fall in the specified category.