Start Free Trial

asked on

Another question about optimization

Hey,

I need some advice about optimizing some code.

The smooth function, returns a "smoothed" version of an image, by taking the averages of all the pixels.

My question is... since the smooth function is calling the avg function repeatedly, would my time be better spent trying to speed up the avg function? or the smooth function?

I'm supposed to make it go at least twice as fast by using a technique called blocking... but I'm a little unsure about how to go about that.

Appreciate any help.

static pixel avg(int dim, int i, int j, pixel *src) 
{
    int ii, jj;
    pixel_sum sum;
    pixel current_pixel;
 
    initialize_pixel_sum(&sum);
    for(ii = max(i-1, 0); ii <= min(i+1, dim-1); ii++) 
	for(jj = max(j-1, 0); jj <= min(j+1, dim-1); jj++) 
	    accumulate_sum(&sum, src[RIDX(ii, jj, dim)]);
 
    assign_sum_to_pixel(&current_pixel, sum);
    return current_pixel;
}
 
 
void naive_smooth(int dim, pixel *src, pixel *dst) 
{
    int i, j;
 
    for (i = 0; i < dim; i++)
	for (j = 0; j < dim; j++)
	    dst[RIDX(i, j, dim)] = avg(dim, i, j, src);
}

Open in new window

You might do better by restructuring the program so that the smooth function does not call the average function.

ASKER

hm... ok.

I am somewhat confused by that code though, where is the division taking place here?

initialize_pixel_sum(&sum);
for(ii = max(i-1, 0); ii <= min(i+1, dim-1); ii++)
for(jj = max(j-1, 0); jj <= min(j+1, dim-1); jj++)
accumulate_sum(&sum, src[RIDX(ii, jj, dim)]);

assign_sum_to_pixel(&current_pixel, sum);
return current_pixel;

ASKER

I'm thinking the accumulate_sum function collects the values so far.

And the assign_sum_to_pixel function does not take the number of items that were added to make the sum... so where is the division taking place to find out the average?

could it be in either accumulate_sum or assign_sum_to_pixel?

ASKER

Ok... the average is computed in assign_sum_to_pixel.

There's no way to speed up assign_sum_to_pixel... is there?

And if we can't speed up anything with assign_sum_to_pixel, the only thing we can change is:

for(ii = max(i-1, 0); ii <= min(i+1, dim-1); ii++)
for(jj = max(j-1, 0); jj <= min(j+1, dim-1); jj++)
accumulate_sum(&sum, src[RIDX(ii, jj, dim)]);

correct?

typedef struct {
    int red;
    int green;
    int blue;
    int num;
} pixel_sum;
 
typedef struct {
   unsigned short red;
   unsigned short green;
   unsigned short blue;
} pixel
 
/* 
 * accumulate_sum - Accumulates field values of p in corresponding 
 * fields of sum 
 */
static void accumulate_sum(pixel_sum *sum, pixel p) 
{
    sum->red += (int) p.red;
    sum->green += (int) p.green;
    sum->blue += (int) p.blue;
    sum->num++;
    return;
}
 
/* 
 * assign_sum_to_pixel - Computes averaged pixel value in current_pixel 
 */
static void assign_sum_to_pixel(pixel *current_pixel, pixel_sum sum) 
{
    current_pixel->red = (unsigned short) (sum.red/sum.num);
    current_pixel->green = (unsigned short) (sum.green/sum.num);
    current_pixel->blue = (unsigned short) (sum.blue/sum.num);
    return;
}

Open in new window

If you look at what the code is trying to do, it's averaging all the pixels around any given pixel to produce the new value. So in most cases, you're averaging 9 values, on edges 6 values, and in corners, only 4 values.

Since there is some data re-use, cache blocking will help a bit. The biggest gains would come from collapsing all the functions into one. In that case, the blocking would be more apparent, you wouldn't have all the function calls (a huge win), etc.

If you go the latter route, realize that as you walk the pixels, the former sum contains most (2/3 in most cases) of the sum needed for the current sum. Basically if you use a moving window, you can avoid re-adding the same numbers each time.

ASKER

hm... so what you are saying is... to get the value of pixel #2, I would have to do:

(pixel #1 + pixel #2 + pixel #3)/3 ?

No. Think of the problem as a 2-D array... each pixel is the avg of all the pixels it touches plus itself.

ASKER

oh... now that... complicates things. So to get the value for pixel[1][2], I would have to do:

(pixel[1][1] + pixel[1][3] + pixel[0][2] + pixel[2][2] + pixel[0][1] + pixel[0][3] + pixel[2][1] + pixel[2][3] + pixel[1][2])/9??

ASKER

Also... could you please explain the concept of cache blocking?

Is it the idea of limiting your statements to the limit of the cache memory? or is it a bit more complicated than that, and involves wording your loops just right?

maybe you can average in one dimension, then average in the other dimension

ASKER

>>maybe you can average in one dimension, then average in the other dimension

hm? could you please explain that?

ASKER

Hm... so far we've found out that I should try to take out the avg function and combine smooth and avg?

I was just wondering... would it make any difference if we take out procedure calls and replace them with the body of the function?

static pixel avg(int dim, int i, int j, pixel *src) 
{
    int ii, jj;
    pixel_sum sum;
    pixel current_pixel;
 
    initialize_pixel_sum(&sum); <---- Do we need this?
    for(ii = max(i-1, 0); ii <= min(i+1, dim-1); ii++) 
	for(jj = max(j-1, 0); jj <= min(j+1, dim-1); jj++) 
	    accumulate_sum(&sum, src[RIDX(ii, jj, dim)]);
 
    assign_sum_to_pixel(&current_pixel, sum);
    return current_pixel;
}

Open in new window

It may make a small difference.
How much of your time is spent in procedure calls?
How much of your time is spent in the body of the function?
How much of your time is spent accessing src?
How much of your time is spent updating sum?
How much of your time is spent in cache misses?
How much of your time is spent doing the same thing more often than you need to?

ASKER

I tried looking up cache blocking...

http://www.cs.berkeley.edu/~richie/cs267/mg/report/node35.html

To be honest... after reading that stuff, I think it just confirmed my original idea about cache blocks... which was limiting your statements so they fit in the cache perfectly.

Am I right?

The same source mentions another kind of blocking.
http://www.cs.berkeley.edu/~richie/cs267/mg/report/node8.html

But why not start with the optimizations that are easiest for you to do first?

You may notice that each element of src is read 9 times.
It may be faster to cut that down to 6 or 2 times.

ASKER

>>You may notice that each element of src is read 9 times. It may be faster to cut that down to 6 or 2 times.

Hm... do you mean to say like store the elements in the array in another variable and when we move from each variable, we only take what is needed?

ASKER

Hm... I've since loop unrolling worked well on the last one, I tried on this one:

But it doesn't seem to be increasing its performance...

Could I please get some more pointers on this?

void smooth(int dim, pixel *src, pixel *dst)
{
    int i, j;
 
    for (i = 0; i < dim; i++)
        for (j = 0; j < dim; j+=8){
          dst[RIDX(i, j, dim)] = avg(dim, i, j, src);
          dst[RIDX(i, (j+1), dim)] = avg(dim, i, (j+1), src);
          dst[RIDX(i, (j+2), dim)] = avg(dim, i, (j+2), src);
          dst[RIDX(i, (j+3), dim)] = avg(dim, i, (j+3), src);
          dst[RIDX(i, (j+4), dim)] = avg(dim, i, (j+4), src);
          dst[RIDX(i, (j+5), dim)] = avg(dim, i, (j+5), src);
          dst[RIDX(i, (j+6), dim)] = avg(dim, i, (j+6), src);
          dst[RIDX(i, (j+7), dim)] = avg(dim, i, (j+7), src);
         }
}

Open in new window

can you make avg faster?
where is your program spending most of its time?
Is there any place where you are making the same computation multiple times when you could be doing it just once?

ASKER

>>Is there any place where you are making the same computation multiple times when you could be doing it just once?

Hm... that would be the avg function, right? if it computes the average of every cell around the given cell... if it is reading 2 cells right next to each other... it would read 6 cells twice... right?

But... if we wanted to save the value of these cells... we would have to incorporate the avg function inside the smooth function, right?

The following code does most of what I talked about earlier. It uses a moving window to calculate series of sums. I checked that it compiles, but it might have logic errors since I didn't run it against any actual images.

Basically, for every row compute the sum of the src plus its neighbors. For the end of the rows, that's only 2 cells, for every other element the sum consists of 3 cells.

Next we need to add the neighbors above and below us to our sum, and compute the average and assign it to the dst. Now, we don't have to compute the partial sums across the entire array before we start computing averages, we can work (in general) 3 rows at a time. The top and bottom rows only require 2 rows of sums. So, we special case the first and rows again.

What does this have to do w/ blocking? Because we're using a moving window across 3 rows, we only need that to remain in cache, rather than the entire array.

One further optimization would be to interleave the calculation of the partial sums with the dst calculations, so you only need a moving window that is 3x3. The trade-off would be that while you would use less memory, you'll have to do more pointer arithmetic to get the right things to happen at the right time. My guess is the following would fit in cache, even for dim == 1K.

static inline void do_row_sum(pixel_sum *sums, int dim, pixel *row) {
    int i;
    *sums++ = *row++ + *row;
    for (i=1; i<dim-1; i++)
        *sums++ = *(row-1) + *row++ + *row;
    *sums++ = *(row-1) + *row;
}
 
void smooth(int dim, pixel *src, pixel *dst) {
    pixel_sum *row_sums = malloc(3*dim*sizeof(pixel_sum));
    pixel_sum *r0 = row_sums;
    pixel_sum *r1 = r0+dim;
    pixel_sum *r2;
    int i, j;
 
    do_row_sum(r0, dim, src);
    do_row_sum(r1, dim, src+i*dim);
    // calculate row 0
    *dst++ = (*r0++ + *r1++)/4;
    for (i=1; i<dim-1; i++)
        *dst++ = (*r0++ + *r1++)/6;
    *dst++ = (*r0 + *r1)/4;
 
    // calculate all the middle rows
    for (i=1; i<dim-1; i++) {
        r0 = row_sums+((i-1)%3)*dim;
        r1 = row_sums+(i%3)*dim;
        r2 = row_sums+((i+1)%3)*dim;
        do_row_sum(r2, dim, src+(i+1)*dim);
        *dst++ = (*r0++ + *r1++ + *r2++)/6;
        for (j=1; j<dim-1; j++)
            *dst++ = (*r0++ + *r1++ + *r2++)/9;
        *dst++ = (*r0 + *r1 + *r2)/6;
    }
 
    // calculate the last row
    r0 = row_sums+((i-1)%3)*dim;
    r1 = row_sums+(i%3)*dim;
    *dst++ = (*r0++ + *r1++)/4;
    for (i=1; i<dim-1; i++)
        *dst++ = (*r0++ + *r1++)/6;
    *dst = (*r0 + *r1)/4;
}

Open in new window

probably want to throw the following in before that last closing } ... otherwise it leaks memory.

free(row_sums);

what is the range of values in pixel?

What is the range of values in pixel?
Does it use less than 1/9th of the range of a short?

It doesn't matter... if the original code ran, this will run as well.

there are two different type_defs.

pixel and pixel_sum.

The original code stuffed the sum of all 9 (at most) pixels into pixel_sum, then computed the avg. My code does exactly the same thing.

I understand the original code (although I'm not sure how sizeof(unsigned short) compares to sizeof(int))
the question was about optimizations.

also, how does CHAR_MAX<<(sizeof(int)-sizeof(unsigned short)) compare to dim or dim*dim?

ASKER

first of all, thanks for the code =D

but... I'm getting a few errors, all of them are error: invalid operands to binary +.

I think its because of (*r0++ + *r1++).

Am I right? and if it is, I'd just have to replace that with a call to accumulate_sum, right?

the code I sent compiles fine under gcc ... can you show your current code?

ASKER

Hm, I just copy pasted the whole thing into my program.

Here it is though.

static inline void do_row_sum(pixel_sum *sums, int dim, pixel *row) {
    int i;
    *sums++ = *row++ + *row;
    for (i=1; i<dim-1; i++)
        *sums++ = *(row-1) + *row++ + *row;
    *sums++ = *(row-1) + *row;
}
 
void smooth(int dim, pixel *src, pixel *dst) 
{
    pixel_sum *row_sums = malloc(3*dim*sizeof(pixel_sum));
    pixel_sum *r0 = row_sums;
    pixel_sum *r1 = r0+dim;
    pixel_sum *r2;
    int i, j;
 
    do_row_sum(r0, dim, src);
    do_row_sum(r1, dim, src+i*dim);
    // calculate row 0
    *dst++ = (*r0++ + *r1++)/4;
    for (i=1; i<dim-1; i++)
        *dst++ = (*r0++ + *r1++)/6;
    *dst++ = (*r0 + *r1)/4;
 
    // calculate all the middle rows
    for (i=1; i<dim-1; i++) {
        r0 = row_sums+((i-1)%3)*dim;
        r1 = row_sums+(i%3)*dim;
        r2 = row_sums+((i+1)%3)*dim;
        do_row_sum(r2, dim, src+(i+1)*dim);
        *dst++ = (*r0++ + *r1++ + *r2++)/6;
        for (j=1; j<dim-1; j++)
            *dst++ = (*r0++ + *r1++ + *r2++)/9;
        *dst++ = (*r0 + *r1 + *r2)/6;
    }
 
    // calculate the last row
    r0 = row_sums+((i-1)%3)*dim;
    r1 = row_sums+(i%3)*dim;
    *dst++ = (*r0++ + *r1++)/4;
    for (i=1; i<dim-1; i++)
        *dst++ = (*r0++ + *r1++)/6;
    *dst = (*r0 + *r1)/4;
 
    free(row_sums);
}

Open in new window

ASKER

Oh... I'm using a gcc compiler too. I'm using a makefile to compile it though, here's what I'm using.

CC = gcc
CFLAGS = -Wall -O2
LIBS = -lm
 
OBJS = driver.o kernels.o fcyc.o clock.o
 
all: driver
 
driver: $(OBJS) fcyc.h clock.h defs.h config.h
	$(CC) $(CFLAGS) $(OBJS) $(LIBS) -o driver
 
handin:
	cp kernels.c $(HANDINDIR)/$(TEAM)-$(VERSION)-kernels.c
 
clean: 
	-rm -f $(OBJS) driver core *~ *.o

Open in new window

Can you show all the gcc error messages? Because somewhere in that file or an included file pixel_sum and pixel have to be declared. Otherwise gcc would be giving more error messages than just the one you showed.

In fact, to test gcc add the following to the top of my code

#include <stdlib.h>
typedef unsigned short pixel;
typedef int pixel_sum;

call the whole thing foo.c

this should then compile cleanly

gcc -c -o foo.o foo.c

If it does, it means you're not including this code w/ the right typedefs and includes

ASKER

>>In fact, to test gcc add the following to the top of my code

#include <stdlib.h>
typedef unsigned short pixel;
typedef int pixel_sum;

ahh... here's the problem.

pixel_sum is not an int, at least not in my program =(

typedef struct {
    int red;
    int green;
    int blue;
    int num;
} pixel_sum;
 
static void initialize_pixel_sum(pixel_sum *sum) 
{
    sum->red = sum->green = sum->blue = 0;
    sum->num = 0;
    return;
}
 
static void accumulate_sum(pixel_sum *sum, pixel p) 
{
    sum->red += (int) p.red;
    sum->green += (int) p.green;
    sum->blue += (int) p.blue;
    sum->num++;
    return;
}

Open in new window

ASKER

>>call the whole thing foo.c

>>this should then compile cleanly

Ok, that does compile =)

>>If it does, it means you're not including this code w/ the right typedefs and includes

But like I said in my previous post, pixel and pixel_sum are sturcts =(.

>>Can you show all the gcc error messages?
fudge[36] [~/perflab-handout/]> gcc -c -o kernels.o kernels.c
kernels.c: In function `do_row_sum':
kernels.c:657: error: invalid operands to binary +
kernels.c:659: error: invalid operands to binary +
kernels.c:660: error: invalid operands to binary +
kernels.c: In function `smooth':
kernels.c:715: error: invalid operands to binary +
kernels.c:717: error: invalid operands to binary +
kernels.c:718: error: invalid operands to binary +
kernels.c:726: error: invalid operands to binary +
kernels.c:728: error: invalid operands to binary +
kernels.c:729: error: invalid operands to binary +
kernels.c:735: error: invalid operands to binary +
kernels.c:737: error: invalid operands to binary +
kernels.c:738: error: invalid operands to binary +

ASKER CERTIFIED SOLUTION

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

ASKER

Hm... got a few more errors =(

kernels.c:669: warning: operation on `row' may be undefined
kernels.c:671: warning: operation on `row' may be undefined
kernels.c:671: warning: operation on `row' may be undefined

I think those are referring to this function:
static inline void do_row_sum(pixel_sum *sums, int dim, pixel *row) {
int i;
add_pixel_2(sums++, row++, row);
for (i=1; i<dim-1; i++)
add_pixel_3(sums++, row-1, row++, row);
add_pixel_2(sums, row-1, row);
}

And apparently the program thinks there's one error here:
ERROR: Dimension=96, 288 errors
E.g.,
You have dst[2][95].{red,green,blue} = {20835,18223,11591}
It should be dst[2][95].{red,green,blue} = {18245,16789,16176}
Benchmark "smooth: Current working version" failed correctness check for dimension 96.

SOLUTION

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

SOLUTION

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

ASKER

>>If you replace the do_row_sum() function with this one, does this make the warning on compilation go away?

yes it does.

It works perfectly now =), thanks a lot for your help.

Initially it shows that it increases the performance by a factor of 2:

Smooth: Version = smooth: Current working version:
Dim 32 64 128 256 512 Mean
Your CPEs 165.4 177.0 180.0 179.5 333.5
Baseline CPEs 379.5 401.6 403.2 398.8 397.2
Speedup 2.3 2.3 2.2 2.2 1.2 2.0

But it can go much higher right? I ask this because of the silly random number generator our program is getting tested by. After 10 or so runs its improved its performance a bit more:

Smooth: Version = smooth: Current working version:
Dim 32 64 128 256 512 Mean
Your CPEs 83.8 160.0 171.7 173.4 326.8
Baseline CPEs 379.5 401.6 403.2 398.8 397.2
Speedup 4.5 2.5 2.3 2.3 1.2 2.4

ASKER

Just wanted to know how high it could go =)

ASKER

See, now it increases performance by 3...

Smooth: Version = smooth: Current working version:
Dim 32 64 128 256 512 Mean
Your CPEs 40.9 150.4 164.4 169.0 219.1
Baseline CPEs 379.5 401.6 403.2 398.8 397.2
Speedup 9.3 2.7 2.5 2.4 1.8 3.0

Just wanted to know how high it was designed to improve performance so I know how long I need to recompile and run it =)

I would guess it's just the test server being more or less busy...or cache effects on loading the program. Re-compiling won't help, but running it multiple times in succession might.

Btw, those number don't make a lot of sense... there's something wrong about the 512 and Mean columns. They don't divide properly, and the Mean can't be higher than all of the other CPEs in the row.

ASKER

>>Btw, those number don't make a lot of sense... there's something wrong about the 512 and Mean columns. They don't divide properly, and the Mean can't be higher than all of the other CPEs in the row.

Yes, I agree with you there. The mean column doesn't work properly for the rotate function...

>>Re-compiling won't help, but running it multiple times in succession might.
yeah, I meant running it over and over again, so far I've ran it bout 10 times. I just wanted to get your opinion on how fast it can go =), because I'd probably be running this till midnight to see if it can keep topping itself... lol

But the formatting got messed up here, this is what it looks like:

Smooth: Version = smooth: Current working version:
Dim             32      64      128     256     512     Mean
Your CPEs       40.9    150.4   164.4   169.0   219.1
Baseline CPEs   379.5   401.6   403.2   398.8   397.2
Speedup         9.3     2.7     2.5     2.4     1.8     3.0

Open in new window

ah...ok.. properly formatted, it looks ok. I had the columns misaligned in my head. I don't like the drop-off in performance for larger N. Might look at something else later to try to keep more of it in cache.

ASKER

>>Might look at something else later to try to keep more of it in cache.

Ah, ok, thanks =)