Solved

A question about optimization

Posted on 2009-05-14
850 Views
Last Modified: 2012-05-07
Hey,

       I was wondering if I could get some help making this code run faster.  I have a few ideas, which all involve taking out the RIDX macro.

#define RIDX(i,j,n) ((i)*(n)+(j))

I think I can reduce the computation in this statement:

src[RIDX(i, j, dim)];

When i is 0, it's basically just counting j... so I was thinking, what if we changed the bounds of the loop to:
for (i = 0; i < (dim * dim); i++)

And just used a j++ instead of RIDX(i, j, dim)?

I'm not so sure what to do about dst[RIDX(dim-1-j, i, dim)] though... could you please help me out?

Thanks in advance!
void naive_rotate(int dim, pixel *src, pixel *dst) 
{
    int i, j;

    for (i = 0; i < dim; i++)
        for (j = 0; j < dim; j++)
            dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
}

Question by:errang
45 Comments
 
LVL 84

Expert Comment

by:ozo
ID: 24392596
  int i, j,k,l;

    k=0;
    for (i = 0; i < dim; i++){
        l = RIDX(dim-1-dim, i, dim);
        for (j = RIDX(dim-1-j, i, dim);; j>l < dim; j-=dim){
            dst[j] = src[k];
            k++;
       }
   }
 
LVL 7

Assisted Solution

by:johnnash1180
johnnash1180 earned 40 total points
ID: 24392606
First, I don't think you can use "for (i = 0; i < (dim * dim); i++)", because what if the src or dst pointer's bound exceeds the dim value? I think that dim is the total number of items in the src & dst pointers.

As you said, you can reduce the computation for the i = 0 case.
--------
for (j = 0, i=0; j < dim; j++, i++)
dst[(dim-1-j)*dim] = src[j];
--------

In the above code, you save 1+1 add operations, 2 subtract operations and 1 multiply operation.

Then you can use the following statement...
--------
for(i=1; i<dim; i++)
 for(j=0; j<dim;j++)
 {
  if(j==0)
   {
    dst[((dim - 1)*dim)+i]=src[i*dim];
   }
   else
   {
     dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
   }
 }
-----
In the above statement, you save 2 add operations when j=0, so (dim-1)*2 add operations are eliminated by the above code.

Try that!


 
LVL 84

Expert Comment

by:ozo
ID: 24392656
>         for (j = RIDX(dim-1-j, i, dim);; j>l < dim; j-=dim){
Sorry, that should have been
        for (j = RIDX(dim-1-0, i, dim); j>l ; j-=dim){
 
LVL 11

Assisted Solution

by:climbgunks
climbgunks earned 80 total points
ID: 24392677

Basically, this function is rotating an array.  To calculate an actual pixel position, it uses RIDX() to compute row*dimension+column.

So it is possible to just walk the src in order, and figure out where the destination pixels go.

Note that when j changes, the destination index changes by the dimension.  Also, note that on each pass the destination column stays the same.

int i, j, dst_idx;

for (i=0; i<dim; i++) {
    dst_idx = (dim-1)*dim + i;
    for (j=0; j<dim; j++) {
         dst[dst_idx] = *src++;
         dst_idx -= dim;
     }
}
 
LVL 53

Expert Comment

by:Infinity08
ID: 24393116
>> #define RIDX(i,j,n) ((i)*(n)+(j))
>>
>> src[RIDX(i, j, dim)];

Why not simply use a 2D array ? And let the compiler worry about optimizing it :
src[i][j]

 
LVL 84

Expert Comment

by:ozo
ID: 24393431
You could even give the compiler a little help:

pixel *sj, *dj;

for (i = 0; i < dim; i++){
  for( sj=&src[i][0],dj=&dst[dim-1][i];sj<&src[i][dim];sj++,dj-=dim){
    *dj=*sj;
  }
}
You might fold a little more, depending on how dst is declared, and how dst[0] and dst[0][0] are aligned.

Or a smart enough compiler might optimize even the dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)] loop without any help.
 

Author Comment

by:errang
ID: 24393505
Believe me... I'd love to leave it to the compiler, but my professor wants me to optimize that.
 
LVL 53

Expert Comment

by:Infinity08
ID: 24393594
These kinds of optimizations are compiler dependent, i.e. they might give better performance on one specific compiler/platform, but worse performance on a different compiler/platform.

The only optimizations you should worry about are algorithmic optimizations - leave all the rest (code reordering and the like) to the compiler.
Only think about overriding the compiler's optimizations (or lack thereof) when an actual performance problem is discovered which absolutely needs to be fixed (for a certain compiler/platform).


What reason for optimizing it did your professor give ? Did he point out which part he thinks is slow ? Did he give any clue as to the kind of optimization that he wants ? Does this have to be optimized for memory caching ? For fastest time ? Least data accesses ?
 
LVL 11

Expert Comment

by:climbgunks
ID: 24395529

The primary reason for a function like this one to use this type of access pattern is when it's being called for images of different dimensions.   The 2-d approaches are only useful if one of the dimensions (usually columns) is fixed in advance.

 

Author Comment

by:errang
ID: 24404342
>>What reason for optimizing it did your professor give ? Did he point out which part he thinks is slow ? Did he give any clue as to the kind of optimization that he wants ? Does this have to be optimized for memory caching ? For fastest time ? Least data accesses ?

As far as the reason goes, he just wants to see the best we can do, and it has to be optimized for the best processing time.
 

Author Comment

by:errang
ID: 24404367
So far, the best one's been: =)

Rotate: Version = rotate: Current working version:
Dim             64      128     256     512     1024    Mean
Your CPEs       9.5     3.7     16.7    24.2    121.5
Baseline CPEs   10.0    10.2    17.9    25.3    127.1
Speedup         1.0     2.7     1.1     1.0     1.0     1.3


int i, j, dst_idx;

for (i=0; i<dim; i++) {
    dst_idx = (dim-1)*dim + i;
    for (j=0; j<dim; j++) {
         dst[dst_idx] = *src++;
         dst_idx -= dim;
     }
}

 

Author Comment

by:errang
ID: 24404404
I'm trying to make:

int i, j, dst_idx;
 
for (i=0; i<dim; i++) {
    dst_idx = (dim-1)*dim + i;
    for (j=0; j<dim; j++) {
         dst[dst_idx] = *src++;
         dst_idx -= dim;
     }
}

go a bit faster... but I was wondering if I'm understanding the code correctly:

for (i=0; i<dim; i++) {            <=======this loop controls the array indexes, correct?
    dst_idx = (dim-1)*dim + i; <=======this statement is a replacement for RIDX(dim-1-j, i, dim), right?
    for (j=0; j<dim; j++) {        <=======this loop is copying the source to the destination?
         dst[dst_idx] = *src++; <=======I'm a little confused about the *src++ statement, it kept giving me errors when I tried to say something like *src + 2; is this a regular increment operator?
         dst_idx -= dim;            <=======since this is subtracting stuff, I'm thinking you are working backwards through the loop, right?
     }
}

Thanks again for your help so far =)
 
LVL 84

Expert Comment

by:ozo
ID: 24404423
//how about
dstp=&dst[dim*dim-1];
for (i=dim; i>0; --i) {
    dst=++dstp;
    for (j=dim; j>0; --j) {
         *(dst-=dim)= *src++;
     }
}
//but note that if you have a dim*dim array object, creating pointers more than one past the end of the object is undefined behaviour, even if those pointers are not dereferenced.
 

Author Comment

by:errang
ID: 24404437
yeah, that one is a bit better =)

Rotate: Version = rotate: Current working version:
Dim             64      128     256     512     1024    Mean
Your CPEs       5.0     5.7     11.0    21.6    138.9
Baseline CPEs   10.0    10.2    17.9    25.3    127.1
Speedup         2.0     1.8     1.6     1.2     0.9     1.4

Although I think it's supposed to be much faster... I think there's something wrong with the program our professor gave us to benchmark the code:

Rotate: Version = naive_rotate: Naive baseline implementation:
Dim             64      128     256     512     1024    Mean
Your CPEs       9.8     10.0    -11.5   -4.1    1.3
Baseline CPEs   10.0    10.2    17.9    25.3    127.1
Speedup         1.0     1.0Fatal Error: Non-positive CPE value...

That's what I got for the unchanged version of rotate, and that's supposed to work perfectly, so I'm pretty sure something's wrong somewhere.
 

Author Comment

by:errang
ID: 24405095
I'm supposed to use a technique called blocking to speed up this process.

http://en.wikipedia.org/wiki/Blocking_(computing)

But according to that, Blocking occurs when a function does not return until it either completes its task or results in an error.

How would that speed up the program? And is it some kind of system call? Or another algorithm, like semaphores used for atomic operations?


 
LVL 84

Expert Comment

by:ozo
ID: 24405113
Do you mean cache blocking?
Which would involve dividing the array into smaller sub-arrays that fit into the cache to improve access time.
 

Author Comment

by:errang
ID: 24405120
yeah, that's what it says in my book.
 

Author Comment

by:errang
ID: 24405136
>>which would involve dividing the array into smaller sub-arrays that fit into the cache to improve access time

So wait... are you saying the loops would go something like this?

for (j=dim; j>(dim/2); --j) {
-----
}

for (j=(dim/2); j>0; --j){
-----
}
 
LVL 84

Expert Comment

by:ozo
ID: 24405143
Maybe, it depends on how much of the arrays fits in your cache.
 
LVL 53

Expert Comment

by:Infinity08
ID: 24405145
What CPU are you working on ? Will your professor use the same CPU when he's testing your program ? Cache optimization is definitely platform specific, and good performance on your system might result in bad performance on his.

Also, will you deliver the binary ? Or the code ? In the latter case, will you be using the same compiler ?
 

Author Comment

by:errang
ID: 24405488
>>What CPU are you working on ? Will your professor use the same CPU when he's testing your program ?

 I'm not exactly sure what type of CPU it is; we are using a university server, so he will be using the same CPU.  If the CPU on my computer has any impact on this, I'm using an AMD 64 X2 Dual 4200+ 2.2 GHz processor, and he will take the result I get on my computer into consideration.

The cache size of my CPU is 512 KB (http://www.newegg.com/product/product.aspx?Item=N82E16819103747)

>>Also, will you deliver the binary ? Or the code ? In the latter case, will you be using the same compiler ?

I'm going to have to give the C code, and he will be using the same compiler.  We are using the gcc compiler.
 
LVL 53

Expert Comment

by:Infinity08
ID: 24405874
So, you're using different CPU's, the same compiler (but possibly different versions of the compiler ?), but not necessarily the same compiler flags.

That's an impossible scenario for this kind of assignment. You need to perform platform dependent optimizations, and test it in a different environment than the one it's actually being run on.

Unless you're doing your development on the university server. If so, if you want to optimize for that server, you need to know some more details about the hardware. If you don't know the hardware, can you at least tell what OS it's running ?

Author Comment

by:errang
ID: 24407617
>>So, you're using different CPU's, the same compiler (but possibly different versions of the compiler ?), but not necessarily the same compiler flags.

No, we're using the same version of the compiler and the same flags; I've got a makefile for this assignment.

>>Unless you're doing your development on the university server. If so, if you want to optimize for that server, you need to know some more details about the hardware. If you don't know the hardware, can you at least tell what OS it's running ?

I don't know what the university's hardware is like, but I know it's running a Sun Solaris operating system.
 

Author Comment

by:errang
ID: 24407680
>>That's an impossible scenario for this kind of assignment. You need to perform platform dependent optimizations, and test it in a different environment than the one it's actually being run on.

That's... not quite encouraging me =( lol.  My professor is going to take the result I get on my computer into consideration.
 

Author Comment

by:errang
ID: 24408676
Hm... could someone please give a small example of how loop unrolling works?

The way I'm understanding it is like this:

for(i = 0; i < total/4; i++){
   destination[i] = source[i];
   destination[i+1] = source[i+1];
   destination[i+2] = source[i+2];
   destination[i+3] = source[i+3];
}

But it's not quite working out...
 
LVL 84

Expert Comment

by:ozo
ID: 24408715
for(i = 0; i < total; i+=4){
   destination[i] = source[i];
   destination[i+1] = source[i+1];
   destination[i+2] = source[i+2];
   destination[i+3] = source[i+3];
}
 

Author Comment

by:errang
ID: 24408820
Sweet, thanks =D

Could you please explain the concept of blocking too? My book isn't very clear =(
 

Author Comment

by:errang
ID: 24408887
Hm... I tried loop unrolling, I saw that the test cases were all multiples of 32, so this is what I got:

How would I apply the cache blocking concept to this? I'm still not sure what it is... but I did remember that my book said something about instructions finishing in one go? So does that mean I need to plan out how I word my statements so they fit the cache size?
int i, j;
 

  for (i = 0; i < dim; i++)
   for (j = 0; j < dim; j+=32){
    dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
    dst[RIDX(dim-1-(j+1), i, dim)] = src[RIDX(i, (j+1), dim)];
    dst[RIDX(dim-1-(j+2), i, dim)] = src[RIDX(i, (j+2), dim)];
    dst[RIDX(dim-1-(j+3), i, dim)] = src[RIDX(i, (j+3), dim)];
    dst[RIDX(dim-1-(j+4), i, dim)] = src[RIDX(i, (j+4), dim)];
    dst[RIDX(dim-1-(j+5), i, dim)] = src[RIDX(i, (j+5), dim)];
    dst[RIDX(dim-1-(j+6), i, dim)] = src[RIDX(i, (j+6), dim)];
    dst[RIDX(dim-1-(j+7), i, dim)] = src[RIDX(i, (j+7), dim)];
    dst[RIDX(dim-1-(j+8), i, dim)] = src[RIDX(i, (j+8), dim)];
    dst[RIDX(dim-1-(j+9), i, dim)] = src[RIDX(i, (j+9), dim)];
    dst[RIDX(dim-1-(j+10), i, dim)] = src[RIDX(i, (j+10), dim)];
    dst[RIDX(dim-1-(j+11), i, dim)] = src[RIDX(i, (j+11), dim)];
    dst[RIDX(dim-1-(j+12), i, dim)] = src[RIDX(i, (j+12), dim)];
    dst[RIDX(dim-1-(j+13), i, dim)] = src[RIDX(i, (j+13), dim)];
    dst[RIDX(dim-1-(j+14), i, dim)] = src[RIDX(i, (j+14), dim)];
    dst[RIDX(dim-1-(j+15), i, dim)] = src[RIDX(i, (j+15), dim)];
    dst[RIDX(dim-1-(j+16), i, dim)] = src[RIDX(i, (j+16), dim)];
    dst[RIDX(dim-1-(j+17), i, dim)] = src[RIDX(i, (j+17), dim)];
    dst[RIDX(dim-1-(j+18), i, dim)] = src[RIDX(i, (j+18), dim)];
    dst[RIDX(dim-1-(j+19), i, dim)] = src[RIDX(i, (j+19), dim)];
    dst[RIDX(dim-1-(j+20), i, dim)] = src[RIDX(i, (j+20), dim)];
    dst[RIDX(dim-1-(j+21), i, dim)] = src[RIDX(i, (j+21), dim)];
    dst[RIDX(dim-1-(j+22), i, dim)] = src[RIDX(i, (j+22), dim)];
    dst[RIDX(dim-1-(j+23), i, dim)] = src[RIDX(i, (j+23), dim)];
    dst[RIDX(dim-1-(j+24), i, dim)] = src[RIDX(i, (j+24), dim)];
    dst[RIDX(dim-1-(j+25), i, dim)] = src[RIDX(i, (j+25), dim)];
    dst[RIDX(dim-1-(j+26), i, dim)] = src[RIDX(i, (j+26), dim)];
    dst[RIDX(dim-1-(j+27), i, dim)] = src[RIDX(i, (j+27), dim)];
    dst[RIDX(dim-1-(j+28), i, dim)] = src[RIDX(i, (j+28), dim)];
    dst[RIDX(dim-1-(j+29), i, dim)] = src[RIDX(i, (j+29), dim)];
    dst[RIDX(dim-1-(j+30), i, dim)] = src[RIDX(i, (j+30), dim)];
    dst[RIDX(dim-1-(j+31), i, dim)] = src[RIDX(i, (j+31), dim)];
   }

 
LVL 11

Assisted Solution

by:climbgunks
climbgunks earned 80 total points
ID: 24408948

Neither of those loop unrolling blocks will work if the image dimension isn't a multiple of 4.

You should break it up into one piece that is a multiple of N and another that takes care of the remainder.   Otherwise, you've taken a function that works for all images and made it work only for those that are divisible by your unrolling factor.
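For example, the usual shape is a main loop that handles a multiple of the unrolling factor, plus a cleanup loop for whatever is left over. A rough sketch, reusing the RIDX macro from the question (just to show the structure, not tuned):

int i, j;
for (i = 0; i < dim; i++) {
    /* main loop: handle columns in groups of 4 */
    for (j = 0; j + 3 < dim; j += 4) {
        dst[RIDX(dim-1-j, i, dim)]     = src[RIDX(i, j, dim)];
        dst[RIDX(dim-1-(j+1), i, dim)] = src[RIDX(i, j+1, dim)];
        dst[RIDX(dim-1-(j+2), i, dim)] = src[RIDX(i, j+2, dim)];
        dst[RIDX(dim-1-(j+3), i, dim)] = src[RIDX(i, j+3, dim)];
    }
    /* cleanup loop: handle the remaining 0-3 columns */
    for (; j < dim; j++)
        dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
}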

Search for "cache blocking"... your computer has a small amount of fast cache memory, so recently accessed memory locations can be read again quickly.  You want to avoid cache misses, so you should try to keep accessing the same addresses multiple times if necessary, rather than jumping around amongst addresses.  Cache sizes vary, but it's expensive memory on the processor, so typical ranges may be 256k to 4M <that's a bit of a guess>.



 

Author Comment

by:errang
ID: 24408954
>>neither of those loop unrolling blocks will work if the image isn't a multiple of 4.

Yes, but this is one of the things we are allowed to assume: all the test cases will be multiples of 32.
 
LVL 11

Expert Comment

by:climbgunks
ID: 24408958

I don't see how cache blocking applies in this case.  It does in the smoothing algorithm -- because each pixel ends up contributing to the average of itself and all its neighbors.

Here, you're using each src and dst element only once.  There are other cache effects, such as disk block caching, but since you can access either the src or the dst array in order, but not both at the same time, as long as you're doing one of them you can't do any better.
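
To make that concrete, here is the same rotate written so that dst (instead of src) is walked sequentially; whichever way you write it, one of the two arrays ends up being accessed with a stride of dim (just a sketch, same RIDX macro as in the question):

for (j = dim-1; j >= 0; j--)         /* fills row dim-1-j of dst, left to right */
    for (i = 0; i < dim; i++)
        dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];   /* src is now the strided access */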

 

Author Comment

by:errang
ID: 24408960
ah... ok.
 

Author Comment

by:errang
ID: 24408980
hm... this goes against everything I've been taught... but I'm just brute forcing it right now...

The best general algorithm was the one by ozo:

dstp=&dst[dim*dim-1];
for (i=dim; i>0; --i) {
    dst=++dstp;
    for (j=dim; j>0; --j) {
         *(dst-=dim)= *src++;
     }
}

it gave this result:
Rotate: Version = rotate: Current working version:
Dim             64      128     256     512     1024    Mean
Your CPEs       5.0     4.6     15.9    25.6    165.5
Baseline CPEs   10.0    10.2    17.9    25.3    127.1
Speedup         2.0     2.2     1.1     1.0     0.8     1.3

But... with the brute force approach...
Rotate: Version = rotate: Current working version:
Dim             64      128     256     512     1024    Mean
Your CPEs       5.5     0.5     14.9    25.0    153.7
Baseline CPEs   10.0    10.2    17.9    25.3    127.1
Speedup         1.8     20.4    1.2     1.0     0.8     2.1


int i, j;
 

  for (i = 0; i < dim; i++){

  if((dim%64) == 0)
   for (j = 0; j < dim; j+=64){
    dst[RIDX(dim-1-(j), i, dim)] = src[RIDX(i, (j), dim)];
    dst[RIDX(dim-1-(j+1), i, dim)] = src[RIDX(i, (j+1), dim)];
    dst[RIDX(dim-1-(j+2), i, dim)] = src[RIDX(i, (j+2), dim)];
    dst[RIDX(dim-1-(j+3), i, dim)] = src[RIDX(i, (j+3), dim)];
    dst[RIDX(dim-1-(j+4), i, dim)] = src[RIDX(i, (j+4), dim)];
    dst[RIDX(dim-1-(j+5), i, dim)] = src[RIDX(i, (j+5), dim)];
    dst[RIDX(dim-1-(j+6), i, dim)] = src[RIDX(i, (j+6), dim)];
    dst[RIDX(dim-1-(j+7), i, dim)] = src[RIDX(i, (j+7), dim)];
    dst[RIDX(dim-1-(j+8), i, dim)] = src[RIDX(i, (j+8), dim)];
    dst[RIDX(dim-1-(j+9), i, dim)] = src[RIDX(i, (j+9), dim)];
    dst[RIDX(dim-1-(j+10), i, dim)] = src[RIDX(i, (j+10), dim)];
    dst[RIDX(dim-1-(j+11), i, dim)] = src[RIDX(i, (j+11), dim)];
    dst[RIDX(dim-1-(j+12), i, dim)] = src[RIDX(i, (j+12), dim)];
    dst[RIDX(dim-1-(j+13), i, dim)] = src[RIDX(i, (j+13), dim)];
    dst[RIDX(dim-1-(j+14), i, dim)] = src[RIDX(i, (j+14), dim)];
    dst[RIDX(dim-1-(j+15), i, dim)] = src[RIDX(i, (j+15), dim)];
    dst[RIDX(dim-1-(j+16), i, dim)] = src[RIDX(i, (j+16), dim)];
    dst[RIDX(dim-1-(j+17), i, dim)] = src[RIDX(i, (j+17), dim)];
    dst[RIDX(dim-1-(j+18), i, dim)] = src[RIDX(i, (j+18), dim)];
    dst[RIDX(dim-1-(j+19), i, dim)] = src[RIDX(i, (j+19), dim)];
    dst[RIDX(dim-1-(j+20), i, dim)] = src[RIDX(i, (j+20), dim)];
    dst[RIDX(dim-1-(j+21), i, dim)] = src[RIDX(i, (j+21), dim)];
    dst[RIDX(dim-1-(j+22), i, dim)] = src[RIDX(i, (j+22), dim)];
    dst[RIDX(dim-1-(j+23), i, dim)] = src[RIDX(i, (j+23), dim)];
    dst[RIDX(dim-1-(j+24), i, dim)] = src[RIDX(i, (j+24), dim)];
    dst[RIDX(dim-1-(j+25), i, dim)] = src[RIDX(i, (j+25), dim)];
    dst[RIDX(dim-1-(j+26), i, dim)] = src[RIDX(i, (j+26), dim)];
    dst[RIDX(dim-1-(j+27), i, dim)] = src[RIDX(i, (j+27), dim)];
    dst[RIDX(dim-1-(j+28), i, dim)] = src[RIDX(i, (j+28), dim)];
    dst[RIDX(dim-1-(j+29), i, dim)] = src[RIDX(i, (j+29), dim)];
    dst[RIDX(dim-1-(j+30), i, dim)] = src[RIDX(i, (j+30), dim)];
    dst[RIDX(dim-1-(j+31), i, dim)] = src[RIDX(i, (j+31), dim)];
    dst[RIDX(dim-1-(j+32), i, dim)] = src[RIDX(i, (j+32), dim)];
    dst[RIDX(dim-1-(j+33), i, dim)] = src[RIDX(i, (j+33), dim)];
    dst[RIDX(dim-1-(j+34), i, dim)] = src[RIDX(i, (j+34), dim)];
    dst[RIDX(dim-1-(j+35), i, dim)] = src[RIDX(i, (j+35), dim)];
    dst[RIDX(dim-1-(j+36), i, dim)] = src[RIDX(i, (j+36), dim)];
    dst[RIDX(dim-1-(j+37), i, dim)] = src[RIDX(i, (j+37), dim)];
    dst[RIDX(dim-1-(j+38), i, dim)] = src[RIDX(i, (j+38), dim)];
    dst[RIDX(dim-1-(j+39), i, dim)] = src[RIDX(i, (j+39), dim)];
    dst[RIDX(dim-1-(j+40), i, dim)] = src[RIDX(i, (j+40), dim)];
    dst[RIDX(dim-1-(j+41), i, dim)] = src[RIDX(i, (j+41), dim)];
    dst[RIDX(dim-1-(j+42), i, dim)] = src[RIDX(i, (j+42), dim)];
    dst[RIDX(dim-1-(j+43), i, dim)] = src[RIDX(i, (j+43), dim)];
    dst[RIDX(dim-1-(j+44), i, dim)] = src[RIDX(i, (j+44), dim)];
    dst[RIDX(dim-1-(j+45), i, dim)] = src[RIDX(i, (j+45), dim)];
    dst[RIDX(dim-1-(j+46), i, dim)] = src[RIDX(i, (j+46), dim)];
    dst[RIDX(dim-1-(j+47), i, dim)] = src[RIDX(i, (j+47), dim)];
    dst[RIDX(dim-1-(j+48), i, dim)] = src[RIDX(i, (j+48), dim)];
    dst[RIDX(dim-1-(j+49), i, dim)] = src[RIDX(i, (j+49), dim)];
    dst[RIDX(dim-1-(j+50), i, dim)] = src[RIDX(i, (j+50), dim)];
    dst[RIDX(dim-1-(j+51), i, dim)] = src[RIDX(i, (j+51), dim)];
    dst[RIDX(dim-1-(j+52), i, dim)] = src[RIDX(i, (j+52), dim)];
    dst[RIDX(dim-1-(j+53), i, dim)] = src[RIDX(i, (j+53), dim)];
    dst[RIDX(dim-1-(j+54), i, dim)] = src[RIDX(i, (j+54), dim)];
    dst[RIDX(dim-1-(j+55), i, dim)] = src[RIDX(i, (j+55), dim)];
    dst[RIDX(dim-1-(j+56), i, dim)] = src[RIDX(i, (j+56), dim)];
    dst[RIDX(dim-1-(j+57), i, dim)] = src[RIDX(i, (j+57), dim)];
    dst[RIDX(dim-1-(j+58), i, dim)] = src[RIDX(i, (j+58), dim)];
    dst[RIDX(dim-1-(j+59), i, dim)] = src[RIDX(i, (j+59), dim)];
    dst[RIDX(dim-1-(j+60), i, dim)] = src[RIDX(i, (j+60), dim)];
    dst[RIDX(dim-1-(j+61), i, dim)] = src[RIDX(i, (j+61), dim)];
    dst[RIDX(dim-1-(j+62), i, dim)] = src[RIDX(i, (j+62), dim)];
    dst[RIDX(dim-1-(j+63), i, dim)] = src[RIDX(i, (j+63), dim)];
   }
  else if((dim%32) == 0)
   for (j = 0; j < dim; j+=32){
    dst[RIDX(dim-1-(j), i, dim)] = src[RIDX(i, (j), dim)];
    dst[RIDX(dim-1-(j+1), i, dim)] = src[RIDX(i, (j+1), dim)];
    dst[RIDX(dim-1-(j+2), i, dim)] = src[RIDX(i, (j+2), dim)];
    dst[RIDX(dim-1-(j+3), i, dim)] = src[RIDX(i, (j+3), dim)];
    dst[RIDX(dim-1-(j+4), i, dim)] = src[RIDX(i, (j+4), dim)];
    dst[RIDX(dim-1-(j+5), i, dim)] = src[RIDX(i, (j+5), dim)];
    dst[RIDX(dim-1-(j+6), i, dim)] = src[RIDX(i, (j+6), dim)];
    dst[RIDX(dim-1-(j+7), i, dim)] = src[RIDX(i, (j+7), dim)];
    dst[RIDX(dim-1-(j+8), i, dim)] = src[RIDX(i, (j+8), dim)];
    dst[RIDX(dim-1-(j+9), i, dim)] = src[RIDX(i, (j+9), dim)];
    dst[RIDX(dim-1-(j+10), i, dim)] = src[RIDX(i, (j+10), dim)];
    dst[RIDX(dim-1-(j+11), i, dim)] = src[RIDX(i, (j+11), dim)];
    dst[RIDX(dim-1-(j+12), i, dim)] = src[RIDX(i, (j+12), dim)];
    dst[RIDX(dim-1-(j+13), i, dim)] = src[RIDX(i, (j+13), dim)];
    dst[RIDX(dim-1-(j+14), i, dim)] = src[RIDX(i, (j+14), dim)];
    dst[RIDX(dim-1-(j+15), i, dim)] = src[RIDX(i, (j+15), dim)];
    dst[RIDX(dim-1-(j+16), i, dim)] = src[RIDX(i, (j+16), dim)];
    dst[RIDX(dim-1-(j+17), i, dim)] = src[RIDX(i, (j+17), dim)];
    dst[RIDX(dim-1-(j+18), i, dim)] = src[RIDX(i, (j+18), dim)];
    dst[RIDX(dim-1-(j+19), i, dim)] = src[RIDX(i, (j+19), dim)];
    dst[RIDX(dim-1-(j+20), i, dim)] = src[RIDX(i, (j+20), dim)];
    dst[RIDX(dim-1-(j+21), i, dim)] = src[RIDX(i, (j+21), dim)];
    dst[RIDX(dim-1-(j+22), i, dim)] = src[RIDX(i, (j+22), dim)];
    dst[RIDX(dim-1-(j+23), i, dim)] = src[RIDX(i, (j+23), dim)];
    dst[RIDX(dim-1-(j+24), i, dim)] = src[RIDX(i, (j+24), dim)];
    dst[RIDX(dim-1-(j+25), i, dim)] = src[RIDX(i, (j+25), dim)];
    dst[RIDX(dim-1-(j+26), i, dim)] = src[RIDX(i, (j+26), dim)];
    dst[RIDX(dim-1-(j+27), i, dim)] = src[RIDX(i, (j+27), dim)];
    dst[RIDX(dim-1-(j+28), i, dim)] = src[RIDX(i, (j+28), dim)];
    dst[RIDX(dim-1-(j+29), i, dim)] = src[RIDX(i, (j+29), dim)];
    dst[RIDX(dim-1-(j+30), i, dim)] = src[RIDX(i, (j+30), dim)];
    dst[RIDX(dim-1-(j+31), i, dim)] = src[RIDX(i, (j+31), dim)];
   }
}

 
LVL 84

Expert Comment

by:ozo
ID: 24408993
If src[RIDX(i, j, dim)] and src[RIDX(i, (j+1), dim)] are in the same cache block,
then the access can come out of fast memory.
If dst[RIDX(dim-1-j, i, dim)] and dst[RIDX(dim-1-(j+1), i, dim)] are in different cache blocks,
then a cache miss may mean slower access.

Reordering the way you access memory to minimize this may be a more sophisticated operation than might be expected of someone with the level of programming experience that you seem to be exhibiting.
 
LVL 84

Expert Comment

by:ozo
ID: 24409002
what are CPEs?
 

Author Comment

by:errang
ID: 24409008
>>reordering the way you access memory to minimize this may be a more sophisticated operation than might be expected of someone with the level of programming experience that you seem to be exhibiting.

Yes... I know, like I said "this goes against everything I've been taught", but this is due in 2 days and I have to also optimize the smooth function =(. Time is not my friend.
 

Author Comment

by:errang
ID: 24409011
>>what are CPEs?

CPE = Cycles per Element
 
LVL 84

Accepted Solution

by:
ozo earned 240 total points
ID: 24409353
Why is there a dip in CPEs at dim 128?


maybe you can combine two different optimizations:

dstp=&dst[dim*dim-1];
for (i=dim; i>0; i--) {
    dst=++dstp;
    for (j=dim; j>0; j-=4) {
         *(dst-=dim)= *src++;
         *(dst-=dim)= *src++;
         *(dst-=dim)= *src++;
         *(dst-=dim)= *src++;
     }
}
 
LVL 53

Expert Comment

by:Infinity08
ID: 24409753
>> I don't know what the universty's hardware is like, but I know its running a sun solaris operating system.

'uname -a' gives some basic information about the platform and CPU.
'prtdiag' gives some more specific information about the CPU's in the system.

You can then check the Sun site for the specifications of the CPU (assuming it's a Sun CPU). For example for the UltraSPARC-T1 :

    http://www.sun.com/processors/UltraSPARC-T1/specs.xml

you find that it has an L1 cache of 8kB per core and an L2 cache of 3MB.


>> That's... not quite encouraging me =( lol.  My professor is going to take the result I get on my computer into consideration.

OK. My advice would be to test and optimize it on the university server.


>> we're using the same version of the compiler and the same flags

I would doubt that, unless you're also running a Solaris system, on the same architecture, with the same exact version of the compiler.


>> That's... not quite encouraging me =( lol.

I'm sorry about that ;) All I want to make clear is that these kind of optimizations are very platform dependent, and might decrease the performance on one platform, while increasing performance on another. So, keep that in mind.
 

Author Comment

by:errang
ID: 24410818
>>you find that it has an L1 cache of 8kB per core and an L2 cache of 3MB.

Ok... say I have that much, I still haven't had my question answered about how this works out in code, and I've asked it about 5 or 6 times by now... =(

But I'm trying to get a definite yes or no for this question...

"The first idea behind register blocking is to reorganize the matrix data structure to expose small dense blocks whose size corresponds roughly with the number of machine registers. This is shown schematically in fig. 5 (top). As implemented in SPARSity, we create   register blocks by imposing a ``mask'' of   tiles on top of the matrix; any tile with at least one non-zero in it is stored as a block in the new matrix. We may need to add extra zeros to make the blocks dense--this is referred to as fill.

The second idea is to keep little blocks of y and x in registers, and to access the blocks of A with unit-stride. We note that an element of y need only be referenced twice if it can be allocated to a register, instead of twice per column (one reference to load, and one to store, so that we can accumulate into the block). For an   block, we save 2(c-1) accesses per "

And according to http://www.cs.berkeley.edu/~richie/cs267/mg/report/node35.html , cache blocking is an extension of register blocking... so is it the same principle, just with caches?

We are supposed to break up a matrix data structure into small blocks that correspond to the size of the cache??

And could you please give a small example of how this works out in code? Like if
dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)]; was the unmodified thing that wasn't optimized for cache blocking... what would be a way to use cache blocking on that statement?

a = RIDX(dim-1-j, i, dim);
b = RIDX(i, j, dim);
dst[a] = src[b];

Is this the basic idea? split up the computation so it can all get loaded into cache memory at once and get completed??
 
LVL 53

Assisted Solution

by:Infinity08
Infinity08 earned 140 total points
ID: 24411033
>> Ok... say I have that much

You have to find out how much you ACTUALLY have ;)


>> I still haven't had my question answered about how this works out in code, and I've asked it about 5 or 6 times by now... =(

All the responses made by the other experts were dealing with that, no ?


>> Is this the basic idea? split up the computation so it can all get loaded into cache memory at once and get completed??

The idea is to minimize the amount of cache misses. Make sure that each element in the matrix is loaded in the cache as little as possible (preferably only once), and more importantly, that optimal advantage is taken of the neighboring data that is loaded in the cache at the same time.
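
As a sketch of what that looks like for this particular loop (B is only a guess at a reasonable tile size and would have to be tuned to the actual cache; this also assumes dim is a multiple of B):

#define B 32                 /* tile size: a guess, tune to the cache */
int i, j, ii, jj;
for (ii = 0; ii < dim; ii += B)
    for (jj = 0; jj < dim; jj += B)
        for (i = ii; i < ii + B; i++)        /* work on one B x B tile at a time */
            for (j = jj; j < jj + B; j++)
                dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];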

But I can't repeat it enough : until you find out what CPU you're working with, and how its caches are managed, you can only guess at what the best strategy would be (you can make an educated guess, but it's a guess nonetheless).

I'm also not too sure that you'll be able to get a big speed-up for this code.
 

Author Comment

by:errang
ID: 24412347
>>You have to find out how much you ACTUALLY have ;)

Yeah, I know, I was just trying to get a general idea of it.

>>All the responses made by the other experts were dealing with that, no ?

Well, not quite all of them. I asked the question once or twice, got ignored, asked it a few more times; on one of those occasions I was told to search for cache blocking, I did that, didn't understand what it was, and got my question ignored a few more times.  Basically I just wanted a yes or no answer to my question =(.  Or maybe I've been asking in many places... I can't remember anymore... it's all a big painful blur...
 

Author Comment

by:errang
ID: 24412363
>>The idea is to minimize the amount of cache misses. Make sure that each element in the matrix is loaded in the cache as little as possible (preferably only once), and more importantly, that optimal advantage is taken of the neighboring data that is loaded in the cache at the same time.

That does translate to what I've kinda been blabbering about... right? limit the size of the statements so they fit in the cache?
 
LVL 53

Expert Comment

by:Infinity08
ID: 24412548
>> limit the size of the statements so they fit in the cache?

That's one part of it, yes : split up your data in blocks that fit in the cache and can be processed separately.

But given the nature of the algorithm (rotation of a matrix), it's not exactly as simple as that ;)
 

Author Comment

by:errang
ID: 24413168
>>But given the nature of the algorithm (rotation of a matrix), it's not exactly as simple as that ;)

when is it ever simple...? lol.  I'm just trying to make sure I understand the basics of what I need to do.
