Solved

Code optimization

Posted on 2004-10-12
21
245 Views
Last Modified: 2010-05-18
Dear all,

I have been trying to optimize the following code in assembly or sse/mmx.
Can anyone help?
THanks

With Beat Regards
HCK


typedef struct PslRbpFeature {
      int noOfBlocks;  
      int* points;          
      int* specs;  
        int** sumPtr;
      double** sqSumPtr;
      float invArea;

      int refWeight;
      int id;  
      int winWidth;
      int winHeight;
 
      //precalculation
      float normStFactor;
}PslRbpFeature;



int evalRbpFeature(
   PslRbpFeature* rb,
   int offset,
   float* strength
){
      *strength=0.0f;

      float strength_plus=0.0f;
      float strength_minus=0.0f;
      int count_plus=0;
      int count_minus=0;

      int** sumPtr0=rb->sumPtr ;
      int** sumPtr1=sumPtr0+1;
      int** sumPtr2=sumPtr0+2;
      int** sumPtr3=sumPtr0+3;
      int* p0 = *sumPtr0 +offset;
      int* p1 = *sumPtr1 +offset;
      int* p2 = *sumPtr2 +offset;
      int* p3 = *sumPtr3 +offset;
      int sRef =   *p0  -  *p1   -  *p2  +  *p3 ;

      int N=rb->noOfBlocks-1;
      strength_plus=0.0;//sRef;
      count_plus=N;
      sRef/=N;  

      int key=0;      
      for(int n=0; n<N; n++){
      
             sumPtr0 +=4;
             sumPtr1 +=4;
             sumPtr2 +=4;
             sumPtr3 +=4;
             p0=*sumPtr0 +offset;
             p1=*sumPtr1 +offset;
             p2=*sumPtr2 +offset;
               p3=*sumPtr3 +offset;
             int s = *p0 -  *p1   -  *p2  +  *p3 ;
             s -= sRef;
             if(s<0) {
                  strength_minus += s;
                  count_minus++;
                  key |=   1  << n ;
             }
             
      }
      if(!key) return key;
 
        strength_plus  -= strength_minus;
      count_plus -= count_minus;
      
      *strength=strength_plus/count_plus-strength_minus/count_minus;
      *strength *= rb->normStFactor;
      return key;
}
0
Comment
Question by:hengck23
  • 8
  • 7
  • 6
21 Comments
 
LVL 45

Assisted Solution

by:Kdo
Kdo earned 250 total points
Comment Utility
Hi hengck23,

Several pointers.

The C compiler optimizers generally do a very good job of generating fast object code.  It's unlikely that you're going to beat this by more than just a few percent.

Indirect references chew up cycles.  Worse, they chew up cycles waiting for memory which are generally among the slowest instructions.  (There is a lot of difference in the number of instructions that AMD 850 and an AMD 2800 can execute in a fixed time interval as long as the operands are within the stack.  Reference memory and you'll find that they all wait at the same speed.)

Don't create stack variables within the function.  Every pass through the for() loop you modify the stack parameters by adding and deleting variables.  This chews up more cycles.  Create them as function variables instead of stack variables.


Good Luck!
Kent
0
 
LVL 16

Accepted Solution

by:
PaulCaswell earned 250 total points
Comment Utility
I have a few suggestions:

1. Use as few local variables as possible. Loads of extra parameters stops the compiler using registers efficiently.

Try to remove all those sumPtrs and just use one and [].
Try to remove the p0/1/2/3 stuff. If you can get it down to offsets from sumPtrs even better.

2. Try to avoid floating point maths, especially inside loops.

You've got
>>strength_minus += s;

in the loop and the only reason its a float is because 'strength' is a float. If you make 'strength_minus' an int you can work out the maths at the end.

3. Break the process up into smaller parts. Often, the overhead of a function call is insignificant compared to the extra speed the compiler can gain by using just registers.

I'd suggest you understand exactly what the code is doing. It looks like it is running through an array of values and whenever a bit of maths generates a negative result you keep track of it. You then generate a final result called strength using some simple maths.

Try coding the loop in a separate function using as few variables as possible, then perform the maths.

4. Evaluate the current speed of the function before you start optimising. This will give you a good indication of when to stop and teach you what works best in your environment.

Paul


0
 

Author Comment

by:hengck23
Comment Utility
Dear all,

Based on the comments, I have modified my code to the below. There is an improvement of speed from 1.03 sec to 0.93 sec. However, it is still not fast enough. Is there anymore suggestions? Thanks.


static float pslReciprocalTable[] = {
      0.0f,
      1.0f/1.0f,             1.0f/2.0f,             1.0f/3.0f,
      1.0f/4.0f,             1.0f/5.0f,             1.0f/6.0f,
      1.0f/7.0f,             1.0f/8.0f,             1.0f/9.0f,
      1.0f/10.0f,       1.0f/11.0f,       1.0f/12.0f,
};

int evalRbpFeature2(
   PslRbpFeature* rb,
   int offset,
   float* strength
){
      *strength=0.0f;
      //..............................................................
 
      float strength_minus=0.0f;
      int count_minus=0;

      int** sumPtr=rb->sumPtr;
       int sRef =    sumPtr[0][offset]
                   -  sumPtr[1][offset]  
                     -  sumPtr[2][offset]  
                     +  sumPtr[3][offset];
      
 
    int N=rb->noOfBlocks-1;       
    sRef/=N;       

      int key=0;
      int s;
      for(int n=0; n<N; n++){
             sumPtr +=4;
           s =    sumPtr[0][offset]
                   -  sumPtr[1][offset]  
                   -  sumPtr[2][offset]  
                   +  sumPtr[3][offset];
             s -= sRef;
             if(s<0) {
                  strength_minus += s;
                  count_minus++;
                  key |=   1  << n ;
             }
             
      }
      if(!key) return key;
 
      *strength = strength_minus*(pslReciprocalTable[N-count_minus]+pslReciprocalTable[count_minus]);
    *strength *= rb->normStFactor;
      return key;
      
}
0
 
LVL 16

Assisted Solution

by:PaulCaswell
PaulCaswell earned 250 total points
Comment Utility
What range of values does offset and rb->noOfBlocks have? This might give us an idea of what is slowing it down.

Paul
0
 
LVL 16

Assisted Solution

by:PaulCaswell
PaulCaswell earned 250 total points
Comment Utility
I assume you didnt make 'strenght_minus' an integer for a good reason such as it can go out of range.

What is the idea behind 'key'?

Paul
0
 

Author Comment

by:hengck23
Comment Utility
Sorry, there is a slight typing error in my last comment. It should have been:

     int  strength_minus=0;
     int count_minus=0;
0
 

Author Comment

by:hengck23
Comment Utility
rb->noOfBlocks should be between 3 to 9
(for future code, i might increase that to 25).

The function evalRbpFeature2() will be called many times, with different PslRbpFeature* rb and int offset. The output of the function is key and strength.

Offset actually is an image postition. If (x,y) is a point in the iomage, then offset = x+y*width

'strenght_minus' would be unlikely to go  out of range. The largest  value it can get is -20*20*255.

0
 
LVL 16

Assisted Solution

by:PaulCaswell
PaulCaswell earned 250 total points
Comment Utility
This may be a silly one but see if you get any time improvement with:

        //    sumPtr +=4;
        s =    sumPtr[4*n][offset]
                -  sumPtr[4*n+1][offset]  
                -  sumPtr[4*n+2][offset]  
                +  sumPtr[4*n+3][offset];


Would it be possible to reorganise sumPtr so you can use:

        s =    sumPtr[offset][4*n]
                -  sumPtr[offset][4*n+1]  
                -  sumPtr[offset][4*n+2]  
                +  sumPtr[offset][4*n+3];

this would ensure that the four memory accesses are sequential in memory and would therefor increase the chance of cache hits.

Paul

0
 

Author Comment

by:hengck23
Comment Utility
>Would it be possible to reorganise sumPtr so you can use:

        s =    sumPtr[offset][4*n]
                -  sumPtr[offset][4*n+1]  
                -  sumPtr[offset][4*n+2]  
                +  sumPtr[offset][4*n+3];

I think it would be difficult.

There is an image data:
int* data;

sumPtr is actually a list of pointers that points to the key postitions in data, when the application point is(0,0) .eg)
sumPtr[0]=&data[ .... coordinte(0,0) ...]
sumPtr[1]=&data[ .... coordinte(5,6) ...]
sumPtr[2]=&data[ .... coordinte(8,9) ...]
....

When I change the application point to (x,y), it would be
new_sumPtr[0]=&data[ .... coordinte(0+x,0+y) ...] =sumPtr[0][offset]
new_sumPtr[1]=&data[ .... coordinte(5+x,6+y) ...] =sumPtr[1][offset]
new_sumPtr[2]=&data[ .... coordinte(8+x,9+y) ...] =sumPtr[2][offset]
0
 
LVL 16

Assisted Solution

by:PaulCaswell
PaulCaswell earned 250 total points
Comment Utility
>>Offset actually is an image postition. If (x,y) is a point in the iomage, then offset = x+y*width
Is there an outer loop then that's calling this function width*eight times? If so, you might shave a little off that.

Try to track your timings further into the code. Take a time at the start of the function and record how long it takes to get to the end of the loop and then how long the float maths takes. This would allow us to focus on the slowest part.

Is there any pre-calculation you can do while you are building the image? Perhaps you can evaluate 's' and 'key' elswhere and store it in the structure.

Are you sure you have switched on the most appropriate optimisations?

We are getting near the limit here. Have you got a specific figure in mind?

Paul
0
How to run any project with ease

Manage projects of all sizes how you want. Great for personal to-do lists, project milestones, team priorities and launch plans.
- Combine task lists, docs, spreadsheets, and chat in one
- View and edit from mobile/offline
- Cut down on emails

 
LVL 45

Assisted Solution

by:Kdo
Kdo earned 250 total points
Comment Utility
Hi hengck23,

Since rb->noOfBlocks can be between 3 and 25, you might get a pretty good performance boost by gathering the sumPtr data into a "local" array and basing your calculations on the local array.  If necessary, you can always scatter the changed data in the array back via the pointers in sumPtr, but you don't seem to modify the contents of the array(s).  (I used to work on vector systems where the scatter/gather functions were built into the hardware.  A truly nice feature for this kind of work!)  

This should generate a significant performance increase that only gets better as rb->noOfBlocks increases.  (You'll pay the "penalty" of the indirect reference only once, when you build the temporary array.  Thereafter, you require only one reference to get to the data.)

One thing in the code that really concerns me is the definition of sumPtr.  It's defined as a int**, but you commonly reference it as a two-dimensional array.  This is certainly within the rules of C, but I've never seen it done.  Are you sure that you're evaluating the data that you expect?



Kent
0
 
LVL 16

Expert Comment

by:PaulCaswell
Comment Utility
Kent,

Perhaps you arent as old as you used to be ;).

int** used to be the only reliable way of defining two-dimensional data because early C compilers had problems with int*[] and int[]. You will see this in the definition of the argv parameter to main in older compilers. This was especially true when both the row address array and the rows themselves are created using malloc. Nowadays, [] is handled much better and is even allowed to appear in structs in some compilers so long as it is at the end.

Strangely, I cant find any evidence to back me up on this.

Paul

0
 
LVL 45

Assisted Solution

by:Kdo
Kdo earned 250 total points
Comment Utility

Hi Paul,

The int** designation is quite common, but I've never seen anything defined as int** dereferenced as array[x][y].  The classic rules for two dimensional arrays state that the upper limit of *y* must be known so that the offset can be calculated whenever x is non-zero.

In the usage above, I assume that sizeof() the *y* dimension is simply sizeof (int*).  If so, the gather technique that I described should work quite well.

>  Strangely, I cant find any evidence to back me up on this.

My wife says this to me a lot, too.  ;)

0
 

Author Comment

by:hengck23
Comment Utility
Hi,

int** sumPtr is an arrays of pointers.
The actual data is int* data, which is not shown hewre in the code.
(Although an image is two-dimensional, I am treating it as a one demensional array in int* data)

Each element of sumPtr is actually a pointer that points to int* data.

The code is correct. It works the way I want. This is actually an object detector in image.It is a car detector.
0
 
LVL 16

Expert Comment

by:PaulCaswell
Comment Utility
Ahhh!! I see where youre going! How arrogant of me to assume it was something simple.

>>sumPtr[0][offset]

You're right, it does look a bit odd. Perhaps it would be clearer coding to use:
>>(sumPtr[0])[offset]

>>My wife says this to me a lot, too.  ;)
Consider yourself lucky. If my wife had no evidence to back her up she'd prove she didnt need any first. ;)

Paul
0
 

Author Comment

by:hengck23
Comment Utility
(sumPtr[0])[offset] is correct and this is what i am doing!

As an example, cosider:

raw image:
x x x x x
x x x x x
x x x x x

Assume that some kind of processing (eg filtering), i end up with the following data:

2 3 5 6 8
7 8 3 7 0
1 2 0 9 0

Then int* data will be
data[0]=2;
data[1]=3;
data[2]=5;
data[3]=6;
data[4]=8;
data[5]=7;
....
int* data will be fixed.

However, i will need to extract useful information (feature) of data.
This would be in the form of
feature = data[point0] - data[point1]-data[point2]+data[point3];

point0,point1,point2,point3 actually defines a rectangle (we call RECT), within a window.

Assuming
point0=(0,0)
point1=(2,0)
point2=(0,1)
point3=(2,1)

then,
 sumPtr[0]=&data[0] //point0
 sumPtr[1]=&data[2] //point1
 sumPtr[2]=&data[0+1*width] //point2
 sumPtr[3]=&data[2+1*width] //point3

(with offset=0)
feature =sumPtr[0][0]-sumPtr[1][0]-sumPtr[0][0]+sumPtr[2][0];
           =2-5-7+3;

Now imagine RECT is shifted by one unit to the right, then
new_point0=(0+1,0)
new_point1=(2+1,0)
new_point2=(0+1,1)
new_point3=(2+1,1)

(with offset=1)
new_feature =sumPtr[0][1]-sumPtr[1][1]-sumPtr[0][1]+sumPtr[2][1];
           =3-6-8+7;
0
 
LVL 45

Assisted Solution

by:Kdo
Kdo earned 250 total points
Comment Utility


>sumPtr is actually a list of pointers that points to the key postitions in data, when the application point is(0,0) .eg)
>sumPtr[0]=&data[ .... coordinte(0,0) ...]
>sumPtr[1]=&data[ .... coordinte(5,6) ...]
>sumPtr[2]=&data[ .... coordinte(8,9) ...]

The statement is inaccurate.  sumPtr (defined as int**) is a list of pointer to pointers to the key positions in the data.

Because *offset* doesn't change within the function, you should be able to gather the target data into an int* array.  You pay the double reference penalty only once.  As *N* increases, the savings should become more pronounced.

int *newPtr;

  newPtr = malloc (sizeof (int*) * n->noOfBlocks);
  for (idx = 0; idx < n->noOfBlocks; idx++)
    newPtr[idx] = sumPtr[idx][offset];

/*  perform calculations with newPtr[idx] instead of sumptr[idx][offset]  */

  free (newPtr);



Kent
0
 
LVL 16

Expert Comment

by:PaulCaswell
Comment Utility
I'd see that as a very worthwhile method to try. You've got the timer to check the results. I wouldnt use malloc/free however, I'd use a static array because noOfBlocks is only likely to go up to about 9.

Paul
0
 
LVL 45

Assisted Solution

by:Kdo
Kdo earned 250 total points
Comment Utility

True.  I'd thought about putting the newPtr array on the stack, but I think that you're right about it needing to be in static memory.  Since the stated limit of N is 25, the buffer is significantly less than 1K.  Well worth the "wasted" storage for the reduced references.


Kent
0
 

Author Comment

by:hengck23
Comment Utility
Hi,

Thank you for the responses. I will try to create a static memory and let you know the results maybe tomorrow.
0
 
LVL 45

Expert Comment

by:Kdo
Comment Utility
Hi hengck23,

If you're going to create a static array, its type should be 'int'.

Kent
0

Featured Post

IT, Stop Being Called Into Every Meeting

Highfive is so simple that setting up every meeting room takes just minutes and every employee will be able to start or join a call from any room with ease. Never be called into a meeting just to get it started again. This is how video conferencing should work!

Join & Write a Comment

Preface I don't like visual development tools that are supposed to write a program for me. Even if it is Xcode and I can use Interface Builder. Yes, it is a perfect tool and has helped me a lot, mainly, in the beginning, when my programs were small…
This tutorial is posted by Aaron Wojnowski, administrator at SDKExpert.net.  To view more iPhone tutorials, visit www.sdkexpert.net. This is a very simple tutorial on finding the user's current location easily. In this tutorial, you will learn ho…
The goal of this video is to provide viewers with basic examples to understand and use structures in the C programming language.
Video by: Grant
The goal of this video is to provide viewers with basic examples to understand and use nested-loops in the C programming language.

763 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

10 Experts available now in Live!

Get 1:1 Help Now