The smooth function returns a "smoothed" version of an image by taking the average of the pixels around each pixel.

My question is... since the smooth function calls the avg function repeatedly, would my time be better spent trying to speed up the avg function, or the smooth function?

I'm supposed to make it go at least twice as fast by using a technique called blocking... but I'm a little unsure about how to go about that.

Appreciate any help.

static pixel avg(int dim, int i, int j, pixel *src) {
    int ii, jj;
    pixel_sum sum;
    pixel current_pixel;

    initialize_pixel_sum(&sum);
    for (ii = max(i-1, 0); ii <= min(i+1, dim-1); ii++)
        for (jj = max(j-1, 0); jj <= min(j+1, dim-1); jj++)
            accumulate_sum(&sum, src[RIDX(ii, jj, dim)]);
    assign_sum_to_pixel(&current_pixel, sum);
    return current_pixel;
}

void naive_smooth(int dim, pixel *src, pixel *dst) {
    int i, j;

    for (i = 0; i < dim; i++)
        for (j = 0; j < dim; j++)
            dst[RIDX(i, j, dim)] = avg(dim, i, j, src);
}

I'm thinking the accumulate_sum function collects the values so far.

And the assign_sum_to_pixel function does not take the number of items that were added to make the sum... so where is the division taking place to find out the average?



If you look at what the code is trying to do, it's averaging all the pixels around any given pixel to produce the new value. So in most cases, you're averaging 9 values, on edges 6 values, and in corners, only 4 values.

Since there is some data re-use, cache blocking will help a bit. The biggest gains would come from collapsing all the functions into one. In that case, the blocking would be more apparent, you wouldn't have all the function calls (a huge win), etc.

If you go the latter route, realize that as you walk the pixels, the former sum contains most (2/3 in most cases) of the sum needed for the current sum. Basically if you use a moving window, you can avoid re-adding the same numbers each time.
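To illustrate the moving-window idea in isolation (a sketch in plain C on int values, not the lab's pixel type): each step adds the element entering the window and subtracts the one that left, instead of re-summing all 3 neighbors from scratch.

```c
/* Sketch: 3-wide moving-window sums over one row of ints.
   out[j] = row[j-1] + row[j] + row[j+1], clipped at the edges.
   Only one add and one subtract per interior element. */
static void window3_sums(const int *row, int n, int *out) {
    int sum = row[0] + row[1];      /* left edge: only 2 cells */
    out[0] = sum;
    for (int j = 1; j < n - 1; j++) {
        sum += row[j + 1];          /* element entering the window */
        out[j] = sum;
        sum -= row[j - 1];          /* element leaving the window */
    }
    out[n - 1] = sum;               /* right edge: only 2 cells */
}
```

For row {1, 2, 3, 4, 5} this produces {3, 6, 9, 12, 9} with two fewer additions per interior element than recomputing each sum.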

Also... could you please explain the concept of cache blocking?

Is it the idea of limiting your statements to the size of the cache? Or is it a bit more complicated than that, and involves structuring your loops just right?
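For context, the textbook form of cache blocking is loop tiling: restructure the loops so each pass works on a tile small enough to stay resident in cache. A generic sketch (matrix transpose, not this lab's code; B is an assumed tile size you would tune for your cache):

```c
#define B 8  /* assumed tile size; tune so a BxB tile of each array fits in cache */

/* Generic cache-blocking example: transpose a dim x dim matrix in
   B x B tiles, so the tile of src being read and the tile of dst
   being written both stay in cache while they're worked on. */
static void blocked_transpose(int dim, const int *src, int *dst) {
    for (int ii = 0; ii < dim; ii += B)
        for (int jj = 0; jj < dim; jj += B)
            for (int i = ii; i < ii + B && i < dim; i++)
                for (int j = jj; j < jj + B && j < dim; j++)
                    dst[j * dim + i] = src[i * dim + j];
}
```

The outer two loops pick a tile; the inner two walk only within it, so the working set per pass is 2 * B * B elements instead of whole rows and columns of the full array.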

It may make a small difference.
How much of your time is spent in procedure calls?
How much of your time is spent in the body of the function?
How much of your time is spent accessing src?
How much of your time is spent updating sum?
How much of your time is spent in cache misses?
How much of your time is spent doing the same thing more often than you need to?

To be honest... after reading that stuff, I think it just confirmed my original idea about cache blocks... which was limiting your statements so they fit in the cache perfectly.

Can you make avg faster?
Where is your program spending most of its time?
Is there any place where you are making the same computation multiple times when you could be doing it just once?

>>Is there any place where you are making the same computation multiple times when you could be doing it just once?

Hm... that would be the avg function, right? If it computes the average of every cell around the given cell, then for 2 cells right next to each other it would read 6 cells twice... right?

But... if we wanted to save the value of these cells... we would have to incorporate the avg function inside the smooth function, right?

The following code does most of what I talked about earlier. It uses a moving window to calculate a series of sums. I checked that it compiles, but it might have logic errors since I didn't run it against any actual images.

Basically, for every row compute the sum of each src element plus its neighbors. At the ends of the rows that's only 2 cells; for every other element the sum consists of 3 cells.

Next we need to add the neighbors above and below us to our sum, compute the average, and assign it to dst. Now, we don't have to compute the partial sums across the entire array before we start computing averages; we can work (in general) 3 rows at a time. The top and bottom rows only require 2 rows of sums, so we special-case the first and last rows again.
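A minimal sketch of that two-step scheme, using plain ints for pixels (the real code uses a pixel struct and the thread's helpers, so the types and names here are assumptions):

```c
#include <stdlib.h>

/* Step 1 (assumed int pixels): horizontal 3-wide partial sums per row. */
static void row_sums(const int *row, int dim, int *sums) {
    sums[0] = row[0] + row[1];                        /* left edge */
    for (int j = 1; j < dim - 1; j++)
        sums[j] = row[j - 1] + row[j] + row[j + 1];
    sums[dim - 1] = row[dim - 2] + row[dim - 1];      /* right edge */
}

/* Step 2: combine the partial sums of the row above, the row itself,
   and the row below, then divide by the number of contributing cells. */
static void smooth_sketch(int dim, const int *src, int *dst) {
    /* Only 3 rows of partial sums need to stay hot in cache at once;
       for simplicity this sketch allocates them all. */
    int *sums = malloc(sizeof(int) * dim * dim);
    for (int i = 0; i < dim; i++)
        row_sums(src + i * dim, dim, sums + i * dim);

    for (int i = 0; i < dim; i++) {
        for (int j = 0; j < dim; j++) {
            int total = sums[i * dim + j];
            int w = (j == 0 || j == dim - 1) ? 2 : 3; /* window width */
            int h = 1;                                 /* window height */
            if (i > 0)       { total += sums[(i - 1) * dim + j]; h++; }
            if (i < dim - 1) { total += sums[(i + 1) * dim + j]; h++; }
            dst[i * dim + j] = total / (w * h);        /* w*h cells averaged */
        }
    }
    free(sums);
}
```

Note how the divisor falls out naturally: 9 in the interior, 6 on edges, 4 in corners, matching the neighbor counts described earlier.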

What does this have to do w/ blocking? Because we're using a moving window across 3 rows, only those 3 rows need to remain in cache, rather than the entire array.

One further optimization would be to interleave the calculation of the partial sums with the dst calculations, so you only need a moving window that is 3x3. The trade-off would be that while you would use less memory, you'll have to do more pointer arithmetic to get the right things to happen at the right time. My guess is the following would fit in cache, even for dim == 1K.

Can you show all the gcc error messages? Because somewhere in that file or an included file pixel_sum and pixel have to be declared. Otherwise gcc would be giving more error messages than just the one you showed.

In fact, to test gcc add the following to the top of my code

#include <stdlib.h>
typedef unsigned short pixel;
typedef int pixel_sum;

call the whole thing foo.c

this should then compile cleanly

gcc -c -o foo.o foo.c

If it does, it means you're not including this code w/ the right typedefs and includes

kernels.c:669: warning: operation on `row' may be undefined
kernels.c:671: warning: operation on `row' may be undefined
kernels.c:671: warning: operation on `row' may be undefined

I think those are referring to this function:
static inline void do_row_sum(pixel_sum *sums, int dim, pixel *row) {
int i;
add_pixel_2(sums++, row++, row);
for (i=1; i<dim-1; i++)
add_pixel_3(sums++, row-1, row++, row);
add_pixel_2(sums, row-1, row);
}

And apparently the program thinks there's one error here:
ERROR: Dimension=96, 288 errors
E.g.,
You have dst[2][95].{red,green,blue} = {20835,18223,11591}
It should be dst[2][95].{red,green,blue} = {18245,16789,16176}
Benchmark "smooth: Current working version" failed correctness check for dimension 96.

>>If you replace the do_row_sum() function with this one, does this make the warning on compilation go away?

yes it does.

It works perfectly now =), thanks a lot for your help.
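The replacement function itself doesn't appear in the thread; one warning-free way to write do_row_sum is plain indexing, which removes the unsequenced row++ side effects the compiler warned about. The add_pixel_2/add_pixel_3 bodies below are stand-ins, since the thread doesn't show them:

```c
typedef unsigned short pixel;   /* typedefs suggested earlier in the thread */
typedef int pixel_sum;

/* Minimal stand-ins for the thread's helpers (assumptions -- their
   real bodies aren't shown): store the sum of 2 or 3 pixels. */
static void add_pixel_2(pixel_sum *s, pixel *a, pixel *b) {
    *s = *a + *b;
}
static void add_pixel_3(pixel_sum *s, pixel *a, pixel *b, pixel *c) {
    *s = *a + *b + *c;
}

/* Indexing makes the evaluation order explicit: no variable is both
   modified and read between sequence points in one argument list. */
static void do_row_sum(pixel_sum *sums, int dim, pixel *row) {
    int i;
    add_pixel_2(&sums[0], &row[0], &row[1]);
    for (i = 1; i < dim - 1; i++)
        add_pixel_3(&sums[i], &row[i - 1], &row[i], &row[i + 1]);
    add_pixel_2(&sums[dim - 1], &row[dim - 2], &row[dim - 1]);
}
```

The original `add_pixel_3(sums++, row-1, row++, row)` modifies `row` and reads it in the same argument list with no sequence point in between, which is why gcc flagged "operation on `row' may be undefined."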

Initially it shows that it increases the performance by a factor of 2:

Smooth: Version = smooth: Current working version:
Dim              32     64    128    256    512   Mean
Your CPEs     165.4  177.0  180.0  179.5  333.5
Baseline CPEs 379.5  401.6  403.2  398.8  397.2
Speedup         2.3    2.3    2.2    2.2    1.2    2.0

But it can go much higher, right? I ask this because of the silly random number generator our program is getting tested by. After 10 or so runs it's improved its performance a bit more:

Smooth: Version = smooth: Current working version:
Dim              32     64    128    256    512   Mean
Your CPEs      83.8  160.0  171.7  173.4  326.8
Baseline CPEs 379.5  401.6  403.2  398.8  397.2
Speedup         4.5    2.5    2.3    2.3    1.2    2.4

I would guess it's just the test server being more or less busy...or cache effects on loading the program. Re-compiling won't help, but running it multiple times in succession might.

Btw, those numbers don't make a lot of sense... there's something wrong with the 512 and Mean columns. They don't divide properly, and the Mean can't be higher than all of the other CPEs in the row.

>>Btw, those numbers don't make a lot of sense... there's something wrong with the 512 and Mean columns. They don't divide properly, and the Mean can't be higher than all of the other CPEs in the row.

Yes, I agree with you there. The mean column doesn't work properly for the rotate function...

>>Re-compiling won't help, but running it multiple times in succession might.
Yeah, I meant running it over and over again; so far I've run it about 10 times. I just wanted to get your opinion on how fast it can go =), because I'd probably be running this till midnight to see if it can keep topping itself... lol

But the formatting got messed up here; this is what it looks like:

Ah... ok... properly formatted, it looks fine. I had the columns misaligned in my head. I don't like the drop-off in performance for larger N, though. Might look at something else later to try to keep more of it in cache.

>>Might look at something else later to try to keep more of it in cache.

Ah, ok, thanks =)

