Solved

# collecting elements of data

Posted on 2009-07-12
205 Views
ok
now the output I am getting is this :
232,356,434,767,786,854,avg1,571.50,0011100210,124,78,333,19,68,avg2,124.40
232,336,454,716,727,844,avg1,551.50,0011100210,104,118,262,11,117,avg2,122.40
234,239,316,768,796,954,avg1,551.17,0021000201,5,77,452,28,158,avg2,144.00
134,256,512,716,717,852,avg1,531.17,0110010210,122,256,204,1,135,avg2,143.6

now I have to group the average values, and count how many of those values fall into each group, and show it as a graph, somehow

for example, the first averages there are 571.50, 551.50, 551.17, and 531.17
so for this set of data the groups would be
500 - 510  0
511 - 520  0
521 - 530  0
531 - 540  1
541 - 550  0
551 - 560  2
561 - 570  0
571 - 580  1
581 - 590  0
591 - 600  0

then if I could show that as a bar chart, that would be great. I could do that in MS excel if it is easier.

the thing is though, my data sets use quite a range of data, so if possible, I would like to be able to change the groupings.
0
Question by:MichaelGlancy
• 22
• 10
• 5
• +1

Author Comment

I have to search through this output data and open a new file and put the new data into that
0

LVL 39

Expert Comment

When your groups are uniformly spaced like this, you can determine which group a number is in by subtracting the minimum, then dividing by the group size, rounding down.

Here is some code showing this calculation.  You would put lines 5-8 in your existing program.  The rest is there to demonstrate how it works.
``````my @averages = qw(571.50 551.50 551.17 531.17);

my @groups = (0)x10;

foreach (@averages) {

my \$group = (\$_ - 501)/10;   #This calculates the group

\$group = 0 if \$group < 0;    #In case you have a number below the expected min

\$group = 9 if \$group > 9;    #In case you have a number above the expected max

\$groups[\$group]++;

}

print join(", ", @groups) . "\n";
``````
0

LVL 84

Expert Comment

#if you don't know the range beforehand, you might do something like
my(\$min,\$max)=@averages[0,0];
my %groups;
for( @averages ){
\$groups{int \$_/10}++;
\$min=\$_ if \$_<\$min;
\$max=\$_ if \$_>\$max;
}
for( \$min/10..\$max/10 ){
printf"%d - %d  %d\n",\$_*10,(\$_+1)*10,\$groups{\$_}
}
0

LVL 7

Expert Comment

As for graphing, look at getting and using the GD::Graph module:

http://search.cpan.org/~bwarfield/GDGraph-1.44/Graph.pm
0

Author Comment

i will know the ranges.

some of them would be 1-1000, 501-550, and probably 1 - 70

I have millions of lines of data to sort through, so if the code could read each line into an array, and I could manually change the range, and output the results to another file  ?
0

LVL 39

Expert Comment

>>if the code could read each line into an array, and I could manually change the range, and output the results to another file
Not really sure what you want the code to do.  Do you want to manually change the range?
Or should the range be selected from one of your predefined ranges (1-1000, 501-550, 1-70) based on the numbers?
Should a range be automatically chosen based on the numbers?
0

Author Comment

well I could choose the range before I start if you make it obvious in the code.

I want the code to read a line of numbers, decide which numbers fall into which groups, and tally the amount of numbers which has fallen into each group.

Does that make better sense ?
0

LVL 39

Expert Comment

Here is some sample code.  You would set \$Group_Min, \$Group_Max and \$Group_Count based on how you want your groups.

For any number, to determine which group it is in:
1) Subtract the \$Group_Min
2) Divide (1) by \$Group_Count
3) Clip (2) to the range 0..\$Group_Count-1

``````my \$Group_Min = 501;

my \$Group_Max = 550;

my \$Group_Count = 10;

my @averages = qw(571.50 551.50 551.17 531.17);

my @groups = (0)x\$Group_Count;

foreach (@averages) {

my \$group = (\$_ - \$Group_Min)/\$Group_Count;   #This calculates the group

\$group = 0 if \$group < 0;    #In case you have a number below the expected min

\$group = \$Group_Count-1 if \$group >= \$Group_Count;    #In case you have a number above the expected max

\$groups[\$group]++;

}

print join(", ", @groups) . "\n";
``````
0

Author Comment

ok

the my @averages = qw(571.50 551.50 551.17 531.17);

would this not be taken from a text file ?
0

Author Comment

so would that look like this in code to open data file and write to another file ?

use strict;
use warnings;

##### Open files

open(my \$IN,"outvim2.vim") or die "Could not open input: \$!\n";
open(my \$OUT,">groupsort.vim") or die "Could not open output: \$!\n";

my \$Group_Min = 501;
my \$Group_Max = 550;
my \$Group_Count = 10;

my @averages = qw(571.50 551.50 551.17 531.17);

my @groups = (0)x\$Group_Count;
foreach (@averages) {
my \$group = (\$_ - \$Group_Min)/\$Group_Count;   #This calculates the group
\$group = 0 if \$group < 0;    #In case you have a number below the expected min
\$group = \$Group_Count-1 if \$group >= \$Group_Count;    #In case you have a number above the expected max
\$groups[\$group]++;
}

print OUT join(", ", @groups) . "\n"
0

LVL 39

Expert Comment

Based on your other posts, I would think the averages would come from other processing.  If you had a variable like \$average that contained the average, and you wanted to count the groups for that:

``````my \$Group_Min = 501;

my \$Group_Max = 550;

my \$Group_Count = 10;

while(<>) {    #This is in your existing program - loops over lines of a file

...

Calculate \$average

##### To count the groups for this:

my \$ThisGroup = (\$average - \$Group_Min) / \$Group_Count;

\$ThisGroup = 0 if \$ThisGroup < 0;

\$ThisGroup = \$Group_Count - 1 if \$ThisGroup >= \$Group_Count;

\$groups[\$ThisGroup]++;

##### You could then print @groups, or do whatever

}
``````
0

Author Comment

Any advice on this code ? it is returning loads of errors

use strict;
use warnings;

##### Open files

open(my \$IN,"outvim2.vim") or die "Could not open input: \$!\n";
open(my \$OUT,">groupsort.vim") or die "Could not open output: \$!\n";

my \$Group_Min = 100;
my \$Group_Max = 600;
my \$Group_Count = 10;

while(<>) {    #This is in your existing program - loops over lines of a file
...
Calculate \$average

##### To count the groups for this

my \$ThisGroup = (\$average - \$Group_Min) / \$Group_Count;
\$ThisGroup = 0 if \$ThisGroup < 0;
\$ThisGroup = \$Group_Count - 1 if \$ThisGroup >= \$Group_Count;
\$groups[\$ThisGroup]++;

##### You could then print @groups, or do whatever

}

print \$OUT (@groups) "\n";

printf (@groups1);
printf \$OUT (@groups1;

0

LVL 39

Expert Comment

Is this going to be it's own program, or a piece of an existing program?  I thought it was going to be part of an existing program.

If it's on it's own, where does it get the average?  What format is that file?

If it's part of another program, post that other program.
0

Author Comment

oh I see, Im sorry, I didnt explain this entirely.

I am using another program to sort data. Then with the sorted data, I want to run this code as a program on its own. The reason is that the first program I am running takes a lot of time, so i thought it would be better to separate out the tasks needing done to different programs. So yes, I need this code to run and process as an individual program.

>If it's on it's own, where does it get the average?
there is a list of 10 million lines, with the average written there. One to each line sorted from the previous data, so I need this code to read each line (which is an average of other data) and group that average value.

>What format is that file?
text files , I am using Vim

I need to know how many of the average values fall into each grouping
0

LVL 39

Expert Comment

Post a sample of the file that is the input to this program.
0

LVL 39

Expert Comment

Is it this, from your original question:
232,356,434,767,786,854,avg1,571.50,0011100210,124,78,333,19,68,avg2,124.40
232,336,454,716,727,844,avg1,551.50,0011100210,104,118,262,11,117,avg2,122.40
234,239,316,768,796,954,avg1,551.17,0021000201,5,77,452,28,158,avg2,144.00
134,256,512,716,717,852,avg1,531.17,0110010210,122,256,204,1,135,avg2,143.6

So what average should be used?  The avg1?  The avg2?  Both?

What format do you want the output?
0

Author Comment

yes, that is the format of the input file.

I am not wanting to calculate an average. I want to count how many averages fall within:
with this format :

group 1 1-100:      (total number of values in this group (avg1)) : (total number of values in this group (avg2))
group 2 101 -200:  (total number of values in this group (avg1)) : (total number of values in this group (avg2))
....
group 10 901 - 1000:  (total number of values in this group (avg1)) : (total number of values in this group (avg2))

I need to see if there is a pattern to the average values
0

Author Comment

thankyou for your patience, describing this is more tricky than I thought
0

Author Comment

is this too much guys ?
0

LVL 39

Expert Comment

No need to delete.  In order to get the code to do what you need, we need to know precisely what you need.  This back and forth is not uncommon.

Save this to a file.  Call with the name of your input on command line.  For example, if you save this to script.pl, and your input in invim.vim, call like:
script.pl invim.vim

``````#!/usr/bin/perl

use strict;

use warnings;

use Data::Dumper;

my \$Group_Min = 501;

my \$Group_Max = 550;

my \$Group_Count = 10;

my (@groups1, @groups2);   #Store number in group of avg1 and avg2

@\$_ = (0)x\$Group_Count foreach (\@groups1, \@groups2);

while(<>) {

next unless /avg1,([\d\.]+).*avg2,([\d\.]+)/;

my (\$avg1, \$avg2) = (\$1, \$2);

print "avg1=\$avg1     avg2=\$avg2\n";

\$groups1[Clip((\$avg1 - \$Group_Min)/\$Group_Count)]++;

\$groups2[Clip((\$avg2 - \$Group_Min)/\$Group_Count)]++;

}

print "Average 1 groups: " , join("  ", @groups1) . "\n";

print "Average 2 groups: " , join("  ", @groups2) . "\n";

sub Clip {

return 0 if \$_[0] < 0;

return \$Group_Count-1 if \$_[0]>=\$Group_Count;

return \$_[0];

}
``````
0

Author Comment

its ok, it doesnt need deleted
0

Author Comment

Ok that last script is heading in the right direction for me. Thankyou very much.

Would it be possible to actually define the groups in the script. By that I mean could it be written so that the groups are defined as in maybe
1-100,101-200, 201-300, and so on, and then count how many values fall into each group.

Then, for example, the output could be like this :
group 1 - 3
group 2 - 5
group 3 - 1
and so on through all the groups ?

If the groups were actually defined like this, then I could change them as required.

for example, the code reads a line from the input file,
the first average is 530, so that increments the group 501-600 by one,
then the code moves to the next line and does the same, but maybe finds the first average to be 304
so the group 301-400 gets incremented by one

I hope this is clearer
0

LVL 84

Expert Comment

\$[=1;
my @groups = split/,/,"1-100,101-200, 201-300";
s/(\d+)\D+(\d+)/sub{\\$_[0]>=\$1&&\\$_[0]<=\$2}/ and \$_=eval for @groups;
my @averages = qw(571.50 551.50 551.17 531.17);
my @group;
for my \$a (@averages){
\$group[\$_]++ for grep{\$groups[\$_](\$a)}\$[..\$#groups;
}
print "group \$_ - \$group[\$_]\n" for \$[..\$#group;
0

Author Comment

we have a problem

this:
232,356,434,767,786,854,avg1,571.50,0011100210,124,78,333,19,68,avg2,124.40
232,336,454,716,727,844,avg1,551.50,0011100210,104,118,262,11,117,avg2,122.40
234,239,316,768,796,954,avg1,551.17,0021000201,5,77,452,28,158,avg2,144.00
134,256,512,716,717,852,avg1,531.17,0110010210,122,256,204,1,135,avg2,143.6
is just an example of 10-20 million lines of data.
this doesnt look right to me: my @averages = qw(571.50 551.50 551.17 531.17); I dont need these defined.

I will gather some more and post
0

Author Comment

right, that last code from Adam was almost there.

this is the data, I have millions of lines of it:

232,356,434,767,786,854,avg1,571.50,0011100210,124,78,333,19,68,avg2,124.40
232,336,454,716,727,844,avg1,551.50,0011100210,104,118,262,11,117,avg2,122.40
234,239,316,768,796,954,avg1,551.17,0021000201,5,77,452,28,158,avg2,144.00
134,256,512,716,717,852,avg1,531.17,0110010210,122,256,204,1,135,avg2,143.6
232,356,434,767,786,854,avg1,571.50,0011100210,124,78,333,19,68,avg2,124.40
232,336,454,716,727,844,avg1,551.50,0011100210,104,118,262,11,117,avg2,122.40
234,239,316,768,796,954,avg1,551.17,0021000201,5,77,452,28,158,avg2,144.00
134,256,512,716,717,852,avg1,531.17,0110010210,122,256,204,1,135,avg2,143.6

I need to group the values and count how many values in that group. the output should look like this:

Group 1: (501-510) : 0
Group 2: (511-520) : 0
Group 3: (521-530) : 2
Group 4: (531-540) : 0
Group 5: (541-550) : 0
Group 6: (551-560) : 4
Group 7: (561-570) : 0
Group 8: (571-580) : 2
Group 9: (571-590) : 0
Group 10: (581-600) : 0

total count : 8

Now as I have several data groups to choose from, having the groups defined in the code would allow me to redefine the groups myself.
0

Author Comment

that was for avg1, the same is needed for avg, maybe side by side or in another file
0

LVL 84

Expert Comment

my \$groups="1-100,101-200, 201-300,500-550,550-600";
my \$g0=0;
\$groups =~ s/(\d+)\D*?-\D*(\d+)\D*/++\\$group{@{[++\$g0]}}if \\$_ >= \$1 && \\$_<=\$2;/g;
\$groups=eval"sub{\$groups}";
while( <DATA> ){
&\$groups for /avg[12],([\d.]+)/g;
}
print "group \$_ - \$group{\$_}\n" for sort keys %group;
__DATA__
232,356,434,767,786,854,avg1,571.50,0011100210,124,78,333,19,68,avg2,124.40
232,336,454,716,727,844,avg1,551.50,0011100210,104,118,262,11,117,avg2,122.40
234,239,316,768,796,954,avg1,551.17,0021000201,5,77,452,28,158,avg2,144.00
134,256,512,716,717,852,avg1,531.17,0110010210,122,256,204,1,135,avg2,143.6
0

Author Comment

ozo, where does that code go in terms of putting it in a file ?
0

LVL 84

Assisted Solution

ozo earned 200 total points
my \$groups="501-510,511-520,521-530,531-540,541-550,551-560,561-5790,571-580,571-590,581-600";
my \$g0=0;
\$groups =~ s/(\d+)\D*?-\D*(\d+)\D*/++\\$group{"@{[++\$g0]}: (\$1-\$2)"}if \\$_ >= \$1 && \\$_<=\$2;/g;
\$groups=eval"sub{\$groups;\\$total++}";
our \$total;
our %group;
while( <DATA> ){
&\$groups for /avg1,([\d.]+)/g;
}
print "Group \$_ \$group{\$_}\n" for sort {\$a<=>\$b} keys %group;
print "\ntotal count : \$total\n"
__DATA__
232,356,434,767,786,854,avg1,571.50,0011100210,124,78,333,19,68,avg2,124.40
232,336,454,716,727,844,avg1,551.50,0011100210,104,118,262,11,117,avg2,122.40
234,239,316,768,796,954,avg1,551.17,0021000201,5,77,452,28,158,avg2,144.00
134,256,512,716,717,852,avg1,531.17,0110010210,122,256,204,1,135,avg2,143.6
232,356,434,767,786,854,avg1,571.50,0011100210,124,78,333,19,68,avg2,124.40
232,336,454,716,727,844,avg1,551.50,0011100210,104,118,262,11,117,avg2,122.40
234,239,316,768,796,954,avg1,551.17,0021000201,5,77,452,28,158,avg2,144.00
134,256,512,716,717,852,avg1,531.17,0110010210,122,256,204,1,135,avg2,143.6
0

Author Comment

I have to read the data from another text file
0

Author Comment

hey guys, can I just say this is not an academic dishonesty issue, this is not an assignment. I am post graduate, doing a completely unrelated project. I have taken the initiative to use and hopefully learn perl. I dont have to use it and this stuff isnt part of any assessment.
0

LVL 84

Expert Comment

to read from @ARGV, change <DATA> to <>
0

LVL 39

Accepted Solution

The grouping logic is based on what ozo gave, but modified to allow for separate groups for avg1 and avg2.  This gets it's data from an input file, like what I posted before.

``````#!/usr/bin/perl

use strict;

use warnings;

use Data::Dumper;

##### Define your groups ranges here

my \$groups1="501-510,511-520,521-530,531-540,541-550,551-560,561-570,571-580,581-590,591-600";

my \$groups2="101-110,111-120,121-130,131-140,141-150,151-160,161-170,171-180,181-190,191-200";

##### Convert string to function

my (%groups1, %groups2, \$total);

my \$g0=0;

\$groups1 =~ s/(\d+)\D*?-\D*(\d+)\D*/++\\$groups1{"@{[++\$g0]}: (\$1-\$2)"}if \\$_[0] >= \$1 && \\$_[0]<=\$2;\n/g;

\$g0=0;

\$groups2 =~ s/(\d+)\D*?-\D*(\d+)\D*/++\\$groups2{"@{[++\$g0]}: (\$1-\$2)"}if \\$_[0] >= \$1 && \\$_[0]<=\$2;\n/g;

{

no strict;

\$groups1=eval"sub{\$groups1;\\$total++}";

\$groups2=eval"sub{\$groups2}";

}

while(<>) {

next unless /avg1,([\d\.]+).*avg2,([\d\.]+)/;

my (\$avg1, \$avg2) = (\$1, \$2);

\$groups1->(\$avg1);

\$groups2->(\$avg2);

}

print "Average1 grouping:\n";

foreach (sort keys %groups1) {

print "  \$_ = \$groups1{\$_}\n";

}

print "Average2 grouping:\n";

foreach (sort keys %groups2) {

print "  \$_ = \$groups2{\$_}\n";

}

print "Total: \$total\n";
``````
0

Author Comment

this is what Im getting
untitled.JPG
0

LVL 39

Expert Comment

That means that there were no averages found in the ranges specified.  Are the ranges in the code (line 7 and 8) appropriate for that file?  Does the file you processed look like your example?  Can you attach a sample of the input file you used.
0

Author Comment

hold it, wrong data group :-) sorry
0

Author Comment

ok, that works fine for my data set.

who gets the points ?
0

Author Closing Comment

thanks again
0

## Featured Post

### Suggested Solutions

Perl Tutorial 6 125
Perl - Mawk 2 69
perl CPAN issue 3 90
Perl Script to read file contents, make changes, and build new file 7 81
I've just discovered very important differences between Windows an Unix formats in Perl,at least 5.xx.. MOST IMPORTANT: Use Unix file format while saving Your script. otherwise it will have ^M s or smth likely weird in the EOL, Then DO NOT use mâ€¦
In the distant past (last year) I hacked together a little toy that would allow a couple of Manager types to query, preview, and extract data from a number of MongoDB instances, to their tool of choice: Excel (http://dilbert.com/strips/comic/2007-08â€¦
Explain concepts important to validation of email addresses with regular expressions. Applies to most languages/tools that uses regular expressions. Consider email address RFCs: Look at HTML5 form input element (with type=email) regex pattern: Tâ€¦
When you create an app prototype with Adobe XD, you can insert system screens -- sharing or Control Center, for example -- with just a few clicks. This video shows you how. You can take the full course on Experts Exchange at http://bit.ly/XDcourse.