Traveling this summer?Check out our on-demand webinar to learn about the importance of Wi-Fi security and 3 easy measures you can start taking immediately to protect your private data while using public Wi-Fi. Follow us today to learn more!

Hey,

I need some advice about optimizing some code.

The smooth function, returns a "smoothed" version of an image, by taking the averages of all the pixels.

My question is... since the smooth function is calling the avg function repeatedly, would my time be better spent trying to speed up the avg function? or the smooth function?

I'm supposed to make it go at least twice as fast by using a technique called blocking... but I'm a little unsure about how to go about that.

Appreciate any help.

I need some advice about optimizing some code.

The smooth function, returns a "smoothed" version of an image, by taking the averages of all the pixels.

My question is... since the smooth function is calling the avg function repeatedly, would my time be better spent trying to speed up the avg function? or the smooth function?

I'm supposed to make it go at least twice as fast by using a technique called blocking... but I'm a little unsure about how to go about that.

Appreciate any help.

```
static pixel avg(int dim, int i, int j, pixel *src)
{
int ii, jj;
pixel_sum sum;
pixel current_pixel;
initialize_pixel_sum(&sum);
for(ii = max(i-1, 0); ii <= min(i+1, dim-1); ii++)
for(jj = max(j-1, 0); jj <= min(j+1, dim-1); jj++)
accumulate_sum(&sum, src[RIDX(ii, jj, dim)]);
assign_sum_to_pixel(¤t_pixel, sum);
return current_pixel;
}
void naive_smooth(int dim, pixel *src, pixel *dst)
{
int i, j;
for (i = 0; i < dim; i++)
for (j = 0; j < dim; j++)
dst[RIDX(i, j, dim)] = avg(dim, i, j, src);
}
```

I am somewhat confused by that code though, where is the division taking place here?

initialize_pixel_sum(&sum)

for(ii = max(i-1, 0); ii <= min(i+1, dim-1); ii++)

for(jj = max(j-1, 0); jj <= min(j+1, dim-1); jj++)

accumulate_sum(&sum, src[RIDX(ii, jj, dim)]);

assign_sum_to_pixel(&curre

return current_pixel;

And the assign_sum_to_pixel function does not take the number of items that were added to make the sum... so where is the division taking place to find out the average?

There's no way to speed up assign_sum_to_pixel... is there?

And if we can't speed up anything with assign_sum_to_pixel, the only thing we can change is:

for(ii = max(i-1, 0); ii <= min(i+1, dim-1); ii++)

for(jj = max(j-1, 0); jj <= min(j+1, dim-1); jj++)

accumulate_sum(&sum, src[RIDX(ii, jj, dim)]);

correct?

```
typedef struct {
int red;
int green;
int blue;
int num;
} pixel_sum;
typedef struct {
unsigned short red;
unsigned short green;
unsigned short blue;
} pixel
/*
* accumulate_sum - Accumulates field values of p in corresponding
* fields of sum
*/
static void accumulate_sum(pixel_sum *sum, pixel p)
{
sum->red += (int) p.red;
sum->green += (int) p.green;
sum->blue += (int) p.blue;
sum->num++;
return;
}
/*
* assign_sum_to_pixel - Computes averaged pixel value in current_pixel
*/
static void assign_sum_to_pixel(pixel *current_pixel, pixel_sum sum)
{
current_pixel->red = (unsigned short) (sum.red/sum.num);
current_pixel->green = (unsigned short) (sum.green/sum.num);
current_pixel->blue = (unsigned short) (sum.blue/sum.num);
return;
}
```

Since there is some data re-use, cache blocking will help a bit. The biggest gains would come from collapsing all the functions into one. In that case, the blocking would be more apparent, you wouldn't have all the function calls (a huge win), etc.

If you go the latter route, realize that as you walk the pixels, the former sum contains most (2/3 in most cases) of the sum needed for the current sum. Basically if you use a moving window, you can avoid re-adding the same numbers each time.

(pixel #1 + pixel #2 + pixel #3)/3 ?

No. Think of the problem as a 2-D array... each pixel is the avg of all the pixels it touches plus itself.

(pixel[1][1] + pixel[1][3] + pixel[0][2] + pixel[2][2] + pixel[0][1] + pixel[0][3] + pixel[2][1] + pixel[2][3] + pixel[1][2])/9??

Is it the idea of limiting your statements to the limit of the cache memory? or is it a bit more complicated than that, and involves wording your loops just right?

hm? could you please explain that?

I was just wondering... would it make any difference if we take out procedure calls and replace them with the body of the function?

```
static pixel avg(int dim, int i, int j, pixel *src)
{
int ii, jj;
pixel_sum sum;
pixel current_pixel;
initialize_pixel_sum(&sum); <---- Do we need this?
for(ii = max(i-1, 0); ii <= min(i+1, dim-1); ii++)
for(jj = max(j-1, 0); jj <= min(j+1, dim-1); jj++)
accumulate_sum(&sum, src[RIDX(ii, jj, dim)]);
assign_sum_to_pixel(¤t_pixel, sum);
return current_pixel;
}
```

How much of your time is spent in procedure calls?

How much of your time is spent in the body of the function?

How much of your time is spent accessing src?

How much of your time is spent updating sum?

How much of your time is spent in cache misses?

How much of your time is spent doing the same thing more often than you need to?

http://www.cs.berkeley.edu/~richie/cs267/mg/report/node35.html

To be honest... after reading that stuff, I think it just confirmed my original idea about cache blocks... which was limiting your statements so they fit in the cache perfectly.

Am I right?

http://www.cs.berkeley.edu/~richie/cs267/mg/report/node8.html

But why not start with the optimizations that are easiest for you to do first?

It may be faster to cut that down to 6 or 2 times.

Hm... do you mean to say like store the elements in the array in another variable and when we move from each variable, we only take what is needed?

But it doesn't seem to be increasing its performance...

Could I please get some more pointers on this?

```
void smooth(int dim, pixel *src, pixel *dst)
{
int i, j;
for (i = 0; i < dim; i++)
for (j = 0; j < dim; j+=8){
dst[RIDX(i, j, dim)] = avg(dim, i, j, src);
dst[RIDX(i, (j+1), dim)] = avg(dim, i, (j+1), src);
dst[RIDX(i, (j+2), dim)] = avg(dim, i, (j+2), src);
dst[RIDX(i, (j+3), dim)] = avg(dim, i, (j+3), src);
dst[RIDX(i, (j+4), dim)] = avg(dim, i, (j+4), src);
dst[RIDX(i, (j+5), dim)] = avg(dim, i, (j+5), src);
dst[RIDX(i, (j+6), dim)] = avg(dim, i, (j+6), src);
dst[RIDX(i, (j+7), dim)] = avg(dim, i, (j+7), src);
}
}
```

where is your program spending most of its time?

Is there any place where you are making the same computation multiple times when you could be doing it just once?

Hm... that would be the avg function, right? if it computes the average of every cell around the given cell... if it is reading 2 cells right next to each other... it would read 6 cells twice... right?

But... if we wanted to save the value of these cells... we would have to incorporate the avg function inside the smooth function, right?

The following code does most of what I talked about earlier. It uses a moving window to calculate series of sums. I checked that it compiles, but it might have logic errors since I didn't run it against any actual images.

Basically, for every row compute the sum of the src plus its neighbors. For the end of the rows, that's only 2 cells, for every other element the sum consists of 3 cells.

Next we need to add the neighbors above and below us to our sum, and compute the average and assign it to the dst. Now, we don't have to compute the partial sums across the entire array before we start computing averages, we can work (in general) 3 rows at a time. The top and bottom rows only require 2 rows of sums. So, we special case the first and rows again.

What does this have to do w/ blocking? Because we're using a moving window across 3 rows, we only need that to remain in cache, rather than the entire array.

One further optimization would be to interleave the calculation of the partial sums with the dst calculations, so you only need a moving window that is 3x3. The trade-off would be that while you would use less memory, you'll have to do more pointer arithmetic to get the right things to happen at the right time. My guess is the following would fit in cache, even for dim == 1K.

```
static inline void do_row_sum(pixel_sum *sums, int dim, pixel *row) {
int i;
*sums++ = *row++ + *row;
for (i=1; i<dim-1; i++)
*sums++ = *(row-1) + *row++ + *row;
*sums++ = *(row-1) + *row;
}
void smooth(int dim, pixel *src, pixel *dst) {
pixel_sum *row_sums = malloc(3*dim*sizeof(pixel_sum));
pixel_sum *r0 = row_sums;
pixel_sum *r1 = r0+dim;
pixel_sum *r2;
int i, j;
do_row_sum(r0, dim, src);
do_row_sum(r1, dim, src+i*dim);
// calculate row 0
*dst++ = (*r0++ + *r1++)/4;
for (i=1; i<dim-1; i++)
*dst++ = (*r0++ + *r1++)/6;
*dst++ = (*r0 + *r1)/4;
// calculate all the middle rows
for (i=1; i<dim-1; i++) {
r0 = row_sums+((i-1)%3)*dim;
r1 = row_sums+(i%3)*dim;
r2 = row_sums+((i+1)%3)*dim;
do_row_sum(r2, dim, src+(i+1)*dim);
*dst++ = (*r0++ + *r1++ + *r2++)/6;
for (j=1; j<dim-1; j++)
*dst++ = (*r0++ + *r1++ + *r2++)/9;
*dst++ = (*r0 + *r1 + *r2)/6;
}
// calculate the last row
r0 = row_sums+((i-1)%3)*dim;
r1 = row_sums+(i%3)*dim;
*dst++ = (*r0++ + *r1++)/4;
for (i=1; i<dim-1; i++)
*dst++ = (*r0++ + *r1++)/6;
*dst = (*r0 + *r1)/4;
}
```

probably want to throw the following in before that last closing } ... otherwise it leaks memory.

free(row_sums);

It doesn't matter... if the original code ran, this will run as well.

there are two different type_defs.

pixel and pixel_sum.

The original code stuffed the sum of all 9 (at most) pixels into pixel_sum, then computed the avg. My code does exactly the same thing.

the question was about optimizations.

but... I'm getting a few errors, all of them are error: invalid operands to binary +.

I think its because of (*r0++ + *r1++).

Am I right? and if it is, I'd just have to replace that with a call to accumulate_sum, right?

Here it is though.

```
static inline void do_row_sum(pixel_sum *sums, int dim, pixel *row) {
int i;
*sums++ = *row++ + *row;
for (i=1; i<dim-1; i++)
*sums++ = *(row-1) + *row++ + *row;
*sums++ = *(row-1) + *row;
}
void smooth(int dim, pixel *src, pixel *dst)
{
pixel_sum *row_sums = malloc(3*dim*sizeof(pixel_sum));
pixel_sum *r0 = row_sums;
pixel_sum *r1 = r0+dim;
pixel_sum *r2;
int i, j;
do_row_sum(r0, dim, src);
do_row_sum(r1, dim, src+i*dim);
// calculate row 0
*dst++ = (*r0++ + *r1++)/4;
for (i=1; i<dim-1; i++)
*dst++ = (*r0++ + *r1++)/6;
*dst++ = (*r0 + *r1)/4;
// calculate all the middle rows
for (i=1; i<dim-1; i++) {
r0 = row_sums+((i-1)%3)*dim;
r1 = row_sums+(i%3)*dim;
r2 = row_sums+((i+1)%3)*dim;
do_row_sum(r2, dim, src+(i+1)*dim);
*dst++ = (*r0++ + *r1++ + *r2++)/6;
for (j=1; j<dim-1; j++)
*dst++ = (*r0++ + *r1++ + *r2++)/9;
*dst++ = (*r0 + *r1 + *r2)/6;
}
// calculate the last row
r0 = row_sums+((i-1)%3)*dim;
r1 = row_sums+(i%3)*dim;
*dst++ = (*r0++ + *r1++)/4;
for (i=1; i<dim-1; i++)
*dst++ = (*r0++ + *r1++)/6;
*dst = (*r0 + *r1)/4;
free(row_sums);
}
```

```
CC = gcc
CFLAGS = -Wall -O2
LIBS = -lm
OBJS = driver.o kernels.o fcyc.o clock.o
all: driver
driver: $(OBJS) fcyc.h clock.h defs.h config.h
$(CC) $(CFLAGS) $(OBJS) $(LIBS) -o driver
handin:
cp kernels.c $(HANDINDIR)/$(TEAM)-$(VERSION)-kernels.c
clean:
-rm -f $(OBJS) driver core *~ *.o
```

Can you show all the gcc error messages? Because somewhere in that file or an included file pixel_sum and pixel have to be declared. Otherwise gcc would be giving more error messages than just the one you showed.

In fact, to test gcc add the following to the top of my code

#include <stdlib.h>

typedef unsigned short pixel;

typedef int pixel_sum;

call the whole thing foo.c

this should then compile cleanly

gcc -c -o foo.o foo.c

If it does, it means you're not including this code w/ the right typedefs and includes

#include <stdlib.h>

typedef unsigned short pixel;

typedef int pixel_sum;

ahh... here's the problem.

pixel_sum is not an int, at least not in my program =(

```
typedef struct {
int red;
int green;
int blue;
int num;
} pixel_sum;
static void initialize_pixel_sum(pixel_sum *sum)
{
sum->red = sum->green = sum->blue = 0;
sum->num = 0;
return;
}
static void accumulate_sum(pixel_sum *sum, pixel p)
{
sum->red += (int) p.red;
sum->green += (int) p.green;
sum->blue += (int) p.blue;
sum->num++;
return;
}
```

>>this should then compile cleanly

Ok, that does compile =)

>>If it does, it means you're not including this code w/ the right typedefs and includes

But like I said in my previous post, pixel and pixel_sum are sturcts =(.

>>Can you show all the gcc error messages?

fudge[36] [~/perflab-handout/]> gcc -c -o kernels.o kernels.c

kernels.c: In function `do_row_sum':

kernels.c:657: error: invalid operands to binary +

kernels.c:659: error: invalid operands to binary +

kernels.c:660: error: invalid operands to binary +

kernels.c: In function `smooth':

kernels.c:715: error: invalid operands to binary +

kernels.c:717: error: invalid operands to binary +

kernels.c:718: error: invalid operands to binary +

kernels.c:726: error: invalid operands to binary +

kernels.c:728: error: invalid operands to binary +

kernels.c:729: error: invalid operands to binary +

kernels.c:735: error: invalid operands to binary +

kernels.c:737: error: invalid operands to binary +

kernels.c:738: error: invalid operands to binary +

kernels.c:669: warning: operation on `row' may be undefined

kernels.c:671: warning: operation on `row' may be undefined

kernels.c:671: warning: operation on `row' may be undefined

I think those are referring to this function:

static inline void do_row_sum(pixel_sum *sums, int dim, pixel *row) {

int i;

add_pixel_2(sums++, row++, row);

for (i=1; i<dim-1; i++)

add_pixel_3(sums++, row-1, row++, row);

add_pixel_2(sums, row-1, row);

}

And apparently the program thinks there's one error here:

ERROR: Dimension=96, 288 errors

E.g.,

You have dst[2][95].{red,green,blue

It should be dst[2][95].{red,green,blue

Benchmark "smooth: Current working version" failed correctness check for dimension 96.

If you replace the do_row_sum() function with this one, does this make the warning on compilation go away?

```
static inline void do_row_sum(pixel_sum *sums, int dim, pixel *row) {
int i;
add_pixel_2(sums++, row, row+1);
row++;
for (i=1; i<dim-1; i++) {
add_pixel_3(sums++, row-1, row, row+1);
row++;
}
add_pixel_2(sums, row-1, row);
}
```

```
void smooth(int dim, pixel *src, pixel *dst) {
pixel_sum *row_sums = malloc(3*dim*sizeof(pixel_sum));
pixel_sum *r0 = row_sums;
pixel_sum *r1 = r0+dim;
pixel_sum *r2;
int i, j;
do_row_sum(r0, dim, src);
do_row_sum(r1, dim, src+dim);
```

yes it does.

It works perfectly now =), thanks a lot for your help.

Initially it shows that it increases the performance by a factor of 2:

Smooth: Version = smooth: Current working version:

Dim 32 64 128 256 512 Mean

Your CPEs 165.4 177.0 180.0 179.5 333.5

Baseline CPEs 379.5 401.6 403.2 398.8 397.2

Speedup 2.3 2.3 2.2 2.2 1.2 2.0

But it can go much higher right? I ask this because of the silly random number generator our program is getting tested by. After 10 or so runs its improved its performance a bit more:

Smooth: Version = smooth: Current working version:

Dim 32 64 128 256 512 Mean

Your CPEs 83.8 160.0 171.7 173.4 326.8

Baseline CPEs 379.5 401.6 403.2 398.8 397.2

Speedup 4.5 2.5 2.3 2.3 1.2 2.4

Smooth: Version = smooth: Current working version:

Dim 32 64 128 256 512 Mean

Your CPEs 40.9 150.4 164.4 169.0 219.1

Baseline CPEs 379.5 401.6 403.2 398.8 397.2

Speedup 9.3 2.7 2.5 2.4 1.8 3.0

Just wanted to know how high it was designed to improve performance so I know how long I need to recompile and run it =)

I would guess it's just the test server being more or less busy...or cache effects on loading the program. Re-compiling won't help, but running it multiple times in succession might.

Btw, those number don't make a lot of sense... there's something wrong about the 512 and Mean columns. They don't divide properly, and the Mean can't be higher than all of the other CPEs in the row.

Yes, I agree with you there. The mean column doesn't work properly for the rotate function...

>>Re-compiling won't help, but running it multiple times in succession might.

yeah, I meant running it over and over again, so far I've ran it bout 10 times. I just wanted to get your opinion on how fast it can go =), because I'd probably be running this till midnight to see if it can keep topping itself... lol

But the formatting got messed up here, this is what it looks like:

```
Smooth: Version = smooth: Current working version:
Dim 32 64 128 256 512 Mean
Your CPEs 40.9 150.4 164.4 169.0 219.1
Baseline CPEs 379.5 401.6 403.2 398.8 397.2
Speedup 9.3 2.7 2.5 2.4 1.8 3.0
```

Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.

this increases the memory used by a factor of 3.

Open in new window