errang asked on
Another question about optimization
Hey,
I need some advice about optimizing some code.
The smooth function returns a "smoothed" version of an image by taking the average of the pixels around each pixel.
My question is... since the smooth function calls the avg function repeatedly, would my time be better spent trying to speed up the avg function or the smooth function?
I'm supposed to make it go at least twice as fast by using a technique called blocking... but I'm a little unsure about how to go about that.
Appreciate any help.
static pixel avg(int dim, int i, int j, pixel *src)
{
    int ii, jj;
    pixel_sum sum;
    pixel current_pixel;
    initialize_pixel_sum(&sum);
    for (ii = max(i-1, 0); ii <= min(i+1, dim-1); ii++)
        for (jj = max(j-1, 0); jj <= min(j+1, dim-1); jj++)
            accumulate_sum(&sum, src[RIDX(ii, jj, dim)]);
    assign_sum_to_pixel(&current_pixel, sum);
    return current_pixel;
}
void naive_smooth(int dim, pixel *src, pixel *dst)
{
    int i, j;
    for (i = 0; i < dim; i++)
        for (j = 0; j < dim; j++)
            dst[RIDX(i, j, dim)] = avg(dim, i, j, src);
}
You might do better by restructuring the program so that the smooth function does not call the average function.
ASKER
hm... ok.
I am somewhat confused by that code though, where is the division taking place here?
initialize_pixel_sum(&sum);
for (ii = max(i-1, 0); ii <= min(i+1, dim-1); ii++)
    for (jj = max(j-1, 0); jj <= min(j+1, dim-1); jj++)
        accumulate_sum(&sum, src[RIDX(ii, jj, dim)]);
assign_sum_to_pixel(&current_pixel, sum);
return current_pixel;
ASKER
I'm thinking the accumulate_sum function collects the values so far.
And the assign_sum_to_pixel function does not take the number of items that were added to make the sum... so where is the division taking place to find out the average?
could it be in either accumulate_sum or assign_sum_to_pixel?
ASKER
Ok... the average is computed in assign_sum_to_pixel.
There's no way to speed up assign_sum_to_pixel... is there?
And if we can't speed up anything with assign_sum_to_pixel, the only thing we can change is:
for (ii = max(i-1, 0); ii <= min(i+1, dim-1); ii++)
    for (jj = max(j-1, 0); jj <= min(j+1, dim-1); jj++)
        accumulate_sum(&sum, src[RIDX(ii, jj, dim)]);
correct?
typedef struct {
    int red;
    int green;
    int blue;
    int num;
} pixel_sum;

typedef struct {
    unsigned short red;
    unsigned short green;
    unsigned short blue;
} pixel;
/*
* accumulate_sum - Accumulates field values of p in corresponding
* fields of sum
*/
static void accumulate_sum(pixel_sum *sum, pixel p)
{
    sum->red += (int) p.red;
    sum->green += (int) p.green;
    sum->blue += (int) p.blue;
    sum->num++;
    return;
}
/*
* assign_sum_to_pixel - Computes averaged pixel value in current_pixel
*/
static void assign_sum_to_pixel(pixel *current_pixel, pixel_sum sum)
{
    current_pixel->red = (unsigned short) (sum.red/sum.num);
    current_pixel->green = (unsigned short) (sum.green/sum.num);
    current_pixel->blue = (unsigned short) (sum.blue/sum.num);
    return;
}
If you look at what the code is trying to do, it's averaging all the pixels around any given pixel to produce the new value. So in most cases, you're averaging 9 values, on edges 6 values, and in corners, only 4 values.
Since there is some data re-use, cache blocking will help a bit. The biggest gains would come from collapsing all the functions into one. In that case, the blocking would be more apparent, you wouldn't have all the function calls (a huge win), etc.
If you go the latter route, realize that as you walk the pixels, the former sum contains most (2/3 in most cases) of the sum needed for the current sum. Basically if you use a moving window, you can avoid re-adding the same numbers each time.
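The moving-window idea can be sketched in one dimension like this (plain ints stand in for one color channel, and the function name is mine, not from the assignment): instead of re-reading all three cells at every position, update the previous sum by dropping the cell that left the window and adding the cell that entered.

```c
/* 1-D moving-window sums: sums[j] = row[j-1] + row[j] + row[j+1],
 * clamped at the edges, computed by updating the previous window
 * rather than re-adding the same cells each time. Assumes n >= 3. */
static void window_sums(const int *row, int *sums, int n)
{
    int j;
    sums[0] = row[0] + row[1];             /* left edge: 2 cells  */
    sums[1] = sums[0] + row[2];            /* first full window   */
    for (j = 2; j < n - 1; j++)
        sums[j] = sums[j-1] - row[j-2] + row[j+1];
    sums[n-1] = sums[n-2] - row[n-3];      /* right edge: 2 cells */
}
```

Each interior position now costs one subtraction and one addition instead of two additions over three reads.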
ASKER
hm... so what you are saying is... to get the value of pixel #2, I would have to do:
(pixel #1 + pixel #2 + pixel #3)/3 ?
No. Think of the problem as a 2-D array... each pixel is the avg of all the pixels it touches plus itself.
ASKER
oh... now that... complicates things. So to get the value for pixel[1][2], I would have to do:
(pixel[1][1] + pixel[1][3] + pixel[0][2] + pixel[2][2] + pixel[0][1] + pixel[0][3] + pixel[2][1] + pixel[2][3] + pixel[1][2])/9??
ASKER
Also... could you please explain the concept of cache blocking?
Is it the idea of limiting your statements to the limit of the cache memory? or is it a bit more complicated than that, and involves wording your loops just right?
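Cache blocking is less about the size of your statements and more about the order you touch data: restructure the loops to work on small tiles whose working set fits in cache, so data loaded once gets reused before it is evicted. A standard illustration (not the assignment's smooth code) is a blocked matrix transpose, where the column-order writes would otherwise thrash the cache:

```c
/* Cache blocking sketch: visit a dim x dim array in B x B tiles so
 * each tile's rows and columns stay cache-resident while reused.
 * B is illustrative; in practice it is tuned to the cache size. */
#define B 8

static void blocked_transpose(int dim, const int *src, int *dst)
{
    int bi, bj, i, j;
    for (bi = 0; bi < dim; bi += B)
        for (bj = 0; bj < dim; bj += B)
            for (i = bi; i < bi + B && i < dim; i++)
                for (j = bj; j < bj + B && j < dim; j++)
                    dst[j * dim + i] = src[i * dim + j];
}
```

The inner pair of loops does the same work as an unblocked transpose; only the traversal order changes.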
maybe you can average in one dimension, then average in the other dimension
ASKER
>>maybe you can average in one dimension, then average in the other dimension
hm? could you please explain that?
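The suggestion is that a 3x3 neighborhood sum is separable: a horizontal 3-tap sum followed by a vertical 3-tap sum over the horizontal results gives the same totals, cutting roughly nine reads per pixel down to six. A sketch with plain ints standing in for one color channel (names are mine):

```c
/* Two-pass (separable) 3x3 box sum with clamped edges, matching the
 * neighborhoods the original avg() visits. tmp holds the horizontal
 * partial sums; out receives the full 3x3 sums. */
static int clampi(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

static void box_sum_3x3(int dim, const int *src, int *tmp, int *out)
{
    int i, j, k;
    for (i = 0; i < dim; i++)                 /* pass 1: horizontal */
        for (j = 0; j < dim; j++) {
            int s = 0;
            for (k = clampi(j-1, 0, dim-1); k <= clampi(j+1, 0, dim-1); k++)
                s += src[i*dim + k];
            tmp[i*dim + j] = s;
        }
    for (j = 0; j < dim; j++)                 /* pass 2: vertical   */
        for (i = 0; i < dim; i++) {
            int s = 0;
            for (k = clampi(i-1, 0, dim-1); k <= clampi(i+1, 0, dim-1); k++)
                s += tmp[k*dim + j];
            out[i*dim + j] = s;
        }
}
```

Dividing each entry of `out` by the neighborhood size (9, 6, or 4) then gives the smoothed values.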
ASKER
Hm... so far we've found out that I should try to take out the avg function and combine smooth and avg?
I was just wondering... would it make any difference if we take out procedure calls and replace them with the body of the function?
static pixel avg(int dim, int i, int j, pixel *src)
{
    int ii, jj;
    pixel_sum sum;
    pixel current_pixel;
    initialize_pixel_sum(&sum); <---- Do we need this?
    for (ii = max(i-1, 0); ii <= min(i+1, dim-1); ii++)
        for (jj = max(j-1, 0); jj <= min(j+1, dim-1); jj++)
            accumulate_sum(&sum, src[RIDX(ii, jj, dim)]);
    assign_sum_to_pixel(&current_pixel, sum);
    return current_pixel;
}
It may make a small difference.
How much of your time is spent in procedure calls?
How much of your time is spent in the body of the function?
How much of your time is spent accessing src?
How much of your time is spent updating sum?
How much of your time is spent in cache misses?
How much of your time is spent doing the same thing more often than you need to?
ASKER
I tried looking up cache blocking...
http://www.cs.berkeley.edu/~richie/cs267/mg/report/node35.html
To be honest... after reading that stuff, I think it just confirmed my original idea about cache blocks... which was limiting your statements so they fit in the cache perfectly.
Am I right?
The same source mentions another kind of blocking.
http://www.cs.berkeley.edu/~richie/cs267/mg/report/node8.html
But why not start with the optimizations that are easiest for you to do first?
You may notice that each element of src is read 9 times.
It may be faster to cut that down to 6 or 2 times.
ASKER
>>You may notice that each element of src is read 9 times. It may be faster to cut that down to 6 or 2 times.
Hm... do you mean to say like store the elements in the array in another variable and when we move from each variable, we only take what is needed?
ASKER
Hm... since loop unrolling worked well on the last one, I tried it on this one:
But it doesn't seem to be increasing its performance...
Could I please get some more pointers on this?
void smooth(int dim, pixel *src, pixel *dst)
{
    int i, j;
    for (i = 0; i < dim; i++)
        for (j = 0; j < dim; j += 8) {
            dst[RIDX(i, j, dim)] = avg(dim, i, j, src);
            dst[RIDX(i, (j+1), dim)] = avg(dim, i, (j+1), src);
            dst[RIDX(i, (j+2), dim)] = avg(dim, i, (j+2), src);
            dst[RIDX(i, (j+3), dim)] = avg(dim, i, (j+3), src);
            dst[RIDX(i, (j+4), dim)] = avg(dim, i, (j+4), src);
            dst[RIDX(i, (j+5), dim)] = avg(dim, i, (j+5), src);
            dst[RIDX(i, (j+6), dim)] = avg(dim, i, (j+6), src);
            dst[RIDX(i, (j+7), dim)] = avg(dim, i, (j+7), src);
        }
}
can you make avg faster?
where is your program spending most of its time?
Is there any place where you are making the same computation multiple times when you could be doing it just once?
ASKER
>>Is there any place where you are making the same computation multiple times when you could be doing it just once?
Hm... that would be the avg function, right? if it computes the average of every cell around the given cell... if it is reading 2 cells right next to each other... it would read 6 cells twice... right?
But... if we wanted to save the value of these cells... we would have to incorporate the avg function inside the smooth function, right?
The following code does most of what I talked about earlier. It uses a moving window to calculate series of sums. I checked that it compiles, but it might have logic errors since I didn't run it against any actual images.
Basically, for every row compute the sum of the src plus its neighbors. For the end of the rows, that's only 2 cells, for every other element the sum consists of 3 cells.
Next we need to add the neighbors above and below us to our sum, and compute the average and assign it to the dst. Now, we don't have to compute the partial sums across the entire array before we start computing averages; we can work (in general) 3 rows at a time. The top and bottom rows only require 2 rows of sums, so we special-case the first and last rows again.
What does this have to do w/ blocking? Because we're using a moving window across 3 rows, we only need that to remain in cache, rather than the entire array.
One further optimization would be to interleave the calculation of the partial sums with the dst calculations, so you only need a moving window that is 3x3. The trade-off would be that while you would use less memory, you'll have to do more pointer arithmetic to get the right things to happen at the right time. My guess is the following would fit in cache, even for dim == 1K.
static inline void do_row_sum(pixel_sum *sums, int dim, pixel *row) {
    int i;
    *sums++ = *row++ + *row;
    for (i=1; i<dim-1; i++)
        *sums++ = *(row-1) + *row++ + *row;
    *sums++ = *(row-1) + *row;
}
void smooth(int dim, pixel *src, pixel *dst) {
    pixel_sum *row_sums = malloc(3*dim*sizeof(pixel_sum));
    pixel_sum *r0 = row_sums;
    pixel_sum *r1 = r0+dim;
    pixel_sum *r2;
    int i, j;
    do_row_sum(r0, dim, src);
    do_row_sum(r1, dim, src+dim);
    // calculate row 0
    *dst++ = (*r0++ + *r1++)/4;
    for (i=1; i<dim-1; i++)
        *dst++ = (*r0++ + *r1++)/6;
    *dst++ = (*r0 + *r1)/4;
    // calculate all the middle rows
    for (i=1; i<dim-1; i++) {
        r0 = row_sums+((i-1)%3)*dim;
        r1 = row_sums+(i%3)*dim;
        r2 = row_sums+((i+1)%3)*dim;
        do_row_sum(r2, dim, src+(i+1)*dim);
        *dst++ = (*r0++ + *r1++ + *r2++)/6;
        for (j=1; j<dim-1; j++)
            *dst++ = (*r0++ + *r1++ + *r2++)/9;
        *dst++ = (*r0 + *r1 + *r2)/6;
    }
    // calculate the last row
    r0 = row_sums+((i-1)%3)*dim;
    r1 = row_sums+(i%3)*dim;
    *dst++ = (*r0++ + *r1++)/4;
    for (i=1; i<dim-1; i++)
        *dst++ = (*r0++ + *r1++)/6;
    *dst = (*r0 + *r1)/4;
}
probably want to throw the following in before that last closing } ... otherwise it leaks memory.
free(row_sums);
what is the range of values in pixel?
Does it use less than 1/9th of the range of a short?
It doesn't matter... if the original code ran, this will run as well.
there are two different type_defs.
pixel and pixel_sum.
The original code stuffed the sum of all 9 (at most) pixels into pixel_sum, then computed the avg. My code does exactly the same thing.
I understand the original code (although I'm not sure how sizeof(unsigned short) compares to sizeof(int))
the question was about optimizations.
also, how does CHAR_MAX<<(sizeof(int)-sizeof(unsigned short)) compare to dim or dim*dim?
ASKER
first of all, thanks for the code =D
but... I'm getting a few errors, all of them are error: invalid operands to binary +.
I think its because of (*r0++ + *r1++).
Am I right? and if it is, I'd just have to replace that with a call to accumulate_sum, right?
the code I sent compiles fine under gcc ... can you show your current code?
ASKER
Hm, I just copy pasted the whole thing into my program.
Here it is though.
static inline void do_row_sum(pixel_sum *sums, int dim, pixel *row) {
    int i;
    *sums++ = *row++ + *row;
    for (i=1; i<dim-1; i++)
        *sums++ = *(row-1) + *row++ + *row;
    *sums++ = *(row-1) + *row;
}
void smooth(int dim, pixel *src, pixel *dst)
{
    pixel_sum *row_sums = malloc(3*dim*sizeof(pixel_sum));
    pixel_sum *r0 = row_sums;
    pixel_sum *r1 = r0+dim;
    pixel_sum *r2;
    int i, j;
    do_row_sum(r0, dim, src);
    do_row_sum(r1, dim, src+dim);
    // calculate row 0
    *dst++ = (*r0++ + *r1++)/4;
    for (i=1; i<dim-1; i++)
        *dst++ = (*r0++ + *r1++)/6;
    *dst++ = (*r0 + *r1)/4;
    // calculate all the middle rows
    for (i=1; i<dim-1; i++) {
        r0 = row_sums+((i-1)%3)*dim;
        r1 = row_sums+(i%3)*dim;
        r2 = row_sums+((i+1)%3)*dim;
        do_row_sum(r2, dim, src+(i+1)*dim);
        *dst++ = (*r0++ + *r1++ + *r2++)/6;
        for (j=1; j<dim-1; j++)
            *dst++ = (*r0++ + *r1++ + *r2++)/9;
        *dst++ = (*r0 + *r1 + *r2)/6;
    }
    // calculate the last row
    r0 = row_sums+((i-1)%3)*dim;
    r1 = row_sums+(i%3)*dim;
    *dst++ = (*r0++ + *r1++)/4;
    for (i=1; i<dim-1; i++)
        *dst++ = (*r0++ + *r1++)/6;
    *dst = (*r0 + *r1)/4;
    free(row_sums);
}
ASKER
Oh... I'm using a gcc compiler too. I'm using a makefile to compile it though, here's what I'm using.
CC = gcc
CFLAGS = -Wall -O2
LIBS = -lm
OBJS = driver.o kernels.o fcyc.o clock.o

all: driver

driver: $(OBJS) fcyc.h clock.h defs.h config.h
	$(CC) $(CFLAGS) $(OBJS) $(LIBS) -o driver

handin:
	cp kernels.c $(HANDINDIR)/$(TEAM)-$(VERSION)-kernels.c

clean:
	-rm -f $(OBJS) driver core *~ *.o
Can you show all the gcc error messages? Because somewhere in that file or an included file pixel_sum and pixel have to be declared. Otherwise gcc would be giving more error messages than just the one you showed.
In fact, to test gcc add the following to the top of my code
#include <stdlib.h>
typedef unsigned short pixel;
typedef int pixel_sum;
call the whole thing foo.c
this should then compile cleanly
gcc -c -o foo.o foo.c
If it does, it means you're not including this code w/ the right typedefs and includes
ASKER
>>In fact, to test gcc add the following to the top of my code
#include <stdlib.h>
typedef unsigned short pixel;
typedef int pixel_sum;
ahh... here's the problem.
pixel_sum is not an int, at least not in my program =(
typedef struct {
    int red;
    int green;
    int blue;
    int num;
} pixel_sum;

static void initialize_pixel_sum(pixel_sum *sum)
{
    sum->red = sum->green = sum->blue = 0;
    sum->num = 0;
    return;
}

static void accumulate_sum(pixel_sum *sum, pixel p)
{
    sum->red += (int) p.red;
    sum->green += (int) p.green;
    sum->blue += (int) p.blue;
    sum->num++;
    return;
}
ASKER
>>call the whole thing foo.c
>>this should then compile cleanly
Ok, that does compile =)
>>If it does, it means you're not including this code w/ the right typedefs and includes
But like I said in my previous post, pixel and pixel_sum are structs =(.
>>Can you show all the gcc error messages?
fudge[36] [~/perflab-handout/]> gcc -c -o kernels.o kernels.c
kernels.c: In function `do_row_sum':
kernels.c:657: error: invalid operands to binary +
kernels.c:659: error: invalid operands to binary +
kernels.c:660: error: invalid operands to binary +
kernels.c: In function `smooth':
kernels.c:715: error: invalid operands to binary +
kernels.c:717: error: invalid operands to binary +
kernels.c:718: error: invalid operands to binary +
kernels.c:726: error: invalid operands to binary +
kernels.c:728: error: invalid operands to binary +
kernels.c:729: error: invalid operands to binary +
kernels.c:735: error: invalid operands to binary +
kernels.c:737: error: invalid operands to binary +
kernels.c:738: error: invalid operands to binary +
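The "invalid operands to binary +" errors follow directly from that: C has no operator overloading, so `*r0 + *r1` only compiles for arithmetic types, never for structs like pixel or pixel_sum. The sums have to go through field-by-field helpers along these lines (the name and shape here are illustrative, not necessarily what the accepted solution used):

```c
typedef struct { int red; int green; int blue; int num; } pixel_sum;

/* C won't apply + to struct operands, so combine two partial sums
 * explicitly, field by field. */
static pixel_sum sum_add(pixel_sum a, pixel_sum b)
{
    pixel_sum r;
    r.red   = a.red   + b.red;
    r.green = a.green + b.green;
    r.blue  = a.blue  + b.blue;
    r.num   = a.num   + b.num;
    return r;
}
```

Every `*r0 + *r1` style expression in the posted smooth would then become a helper call, and the per-field division into the dst pixel likewise has to be written out per channel.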
ASKER CERTIFIED SOLUTION
ASKER
Hm... got a few more errors =(
kernels.c:669: warning: operation on `row' may be undefined
kernels.c:671: warning: operation on `row' may be undefined
kernels.c:671: warning: operation on `row' may be undefined
I think those are referring to this function:
static inline void do_row_sum(pixel_sum *sums, int dim, pixel *row) {
    int i;
    add_pixel_2(sums++, row++, row);
    for (i=1; i<dim-1; i++)
        add_pixel_3(sums++, row-1, row++, row);
    add_pixel_2(sums, row-1, row);
}
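Those warnings come from using `row++` and `row` (or `row-1`) in the same argument list: C leaves the evaluation order of function arguments unspecified, so the compiler cannot promise which value of `row` each argument sees, which would also explain the wrong dst values. Indexing explicitly and advancing afterwards removes the ambiguity. A sketch of the pattern, with plain ints standing in for the pixel structs (the accepted replacement may differ):

```c
/* Row of window sums without unsequenced row++ / row reads: every
 * access is an explicit index, so evaluation order cannot matter.
 * Assumes n >= 2. */
static void row_sums_fixed(const int *row, int *sums, int n)
{
    int i;
    sums[0] = row[0] + row[1];                     /* left edge  */
    for (i = 1; i < n - 1; i++)
        sums[i] = row[i-1] + row[i] + row[i+1];
    sums[n-1] = row[n-2] + row[n-1];               /* right edge */
}
```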
And apparently the program thinks there's one error here:
ERROR: Dimension=96, 288 errors
E.g.,
You have dst[2][95].{red,green,blue} = {20835,18223,11591}
It should be dst[2][95].{red,green,blue} = {18245,16789,16176}
Benchmark "smooth: Current working version" failed correctness check for dimension 96.
SOLUTION
SOLUTION
ASKER
>>If you replace the do_row_sum() function with this one, does this make the warning on compilation go away?
yes it does.
It works perfectly now =), thanks a lot for your help.
Initially it shows that it increases the performance by a factor of 2:
Smooth: Version = smooth: Current working version:
Dim 32 64 128 256 512 Mean
Your CPEs 165.4 177.0 180.0 179.5 333.5
Baseline CPEs 379.5 401.6 403.2 398.8 397.2
Speedup 2.3 2.3 2.2 2.2 1.2 2.0
But it can go much higher right? I ask this because of the silly random number generator our program is getting tested by. After 10 or so runs its improved its performance a bit more:
Smooth: Version = smooth: Current working version:
Dim 32 64 128 256 512 Mean
Your CPEs 83.8 160.0 171.7 173.4 326.8
Baseline CPEs 379.5 401.6 403.2 398.8 397.2
Speedup 4.5 2.5 2.3 2.3 1.2 2.4
ASKER
Just wanted to know how high it could go =)
ASKER
See, now it increases performance by 3...
Smooth: Version = smooth: Current working version:
Dim 32 64 128 256 512 Mean
Your CPEs 40.9 150.4 164.4 169.0 219.1
Baseline CPEs 379.5 401.6 403.2 398.8 397.2
Speedup 9.3 2.7 2.5 2.4 1.8 3.0
Just wanted to know how high it was designed to improve performance so I know how long I need to recompile and run it =)
I would guess it's just the test server being more or less busy...or cache effects on loading the program. Re-compiling won't help, but running it multiple times in succession might.
Btw, those numbers don't make a lot of sense... there's something wrong with the 512 and Mean columns. They don't divide properly, and the Mean can't be higher than all of the other CPEs in the row.
ASKER
>>Btw, those numbers don't make a lot of sense... there's something wrong with the 512 and Mean columns. They don't divide properly, and the Mean can't be higher than all of the other CPEs in the row.
Yes, I agree with you there. The mean column doesn't work properly for the rotate function...
>>Re-compiling won't help, but running it multiple times in succession might.
yeah, I meant running it over and over again; so far I've run it about 10 times. I just wanted to get your opinion on how fast it can go =), because I'd probably be running this till midnight to see if it can keep topping itself... lol
But the formatting got messed up here, this is what it looks like:
Smooth: Version = smooth: Current working version:
Dim 32 64 128 256 512 Mean
Your CPEs 40.9 150.4 164.4 169.0 219.1
Baseline CPEs 379.5 401.6 403.2 398.8 397.2
Speedup 9.3 2.7 2.5 2.4 1.8 3.0
ah...ok.. properly formatted, it looks ok. I had the columns misaligned in my head. I don't like the drop-off in performance for larger N. Might look at something else later to try to keep more of it in cache.
ASKER
>>Might look at something else later to try to keep more of it in cache.
Ah, ok, thanks =)