Inline SIMD instructions in C/C++ code

Hi,

Let's say I have the following C++ code:

sad = 0;
for (int i = 0; i < 16; i++) {
  for (int j = 0; j < 16; j++) {
    sad += abs(a[i][j] - b[i][j]);
  }
}
// a and b have dimension height*width

How do I go about replacing the above with inline SSE/SSE2 instructions (such as PSADBW, packed sum of absolute differences)?

Or will GCC automatically translate the above code into the optimised form? I'm using GCC 3.2.3 on a 64-bit AMD Opteron processor.
pixitronAsked:

Infinity08Commented:
>> Or will GCC automatically translate the above code to the optimised format?

That you can easily check by looking at the generated assembler (using the -S switch).

You can help the compiler by specifying the architecture you use with the -march and -mtune flags (using opteron as CPU type) :

        http://gcc.gnu.org/onlinedocs/gcc/i386-and-x86_002d64-Options.html#i386-and-x86_002d64-Options

and maybe even use -O3 optimization.
Infinity08Commented:
(btw, consider upgrading your gcc if that's an option)

>> How do I go about replacing the above and use inline SSE/SSE2 type instructions

You can use inline assembly using __asm__. Something like this maybe (untested) :
__asm__ __volatile__ (
                       "movups %0, %%xmm0\n\t"
                       "psadbw %1, %%xmm0\n\t"
                       "movups %%mm0, %0"
                       : "=m" (dest)
                       : "m" (src)
                     );

pixitronAuthor Commented:
Thanks,

I compiled the file (gcc -S -o3)  and checked the output assembly. It doesn't contain any packed SAD instructions. I also explored -march and -mtune but when I try opteron options I get an "invalid option" error, which I suspect is due to the old version of GCC I'm using.

pixitronAuthor Commented:
Using a newer version of GCC is not an option, I'm afraid. But I like the assembly that you've suggested above. In my case dest and src is what exactly? Do I need to cast parts of arrays a and b into a suitable format? Sorry for the newbie questions.
Infinity08Commented:
>> but when I try opteron options I get an "invalid option" error, which I suspect is due to the old version of GCC I'm using.

I think so, yes. (that's the reason I suggested upgrading ;) )

This is the same help page for your version :

        http://gcc.gnu.org/onlinedocs/gcc-3.2.3/gcc/i386-and-x86-64-Options.html#i386%20and%20x86-64%20Options

It does support the opteron CPU type.


>> In my case dest and src is what exactly?

That would be the C variables that will contain the destination and source buffers for the psadbw instruction.

Take a look at this post for example code :

        http://lists.xiph.org/pipermail/theora-dev/2004-August/002347.html

(especially the "Original (hand-written) assembly version")


See below :

uint64 dest;
uint64 src;
 
__asm__ __volatile__ (
                       "movups %0, %%xmm0\n\t"
                       "psadbw %1, %%xmm0\n\t"
                       "movups %%mm0, %0"
                       : "=m" (dest)
                       : "m" (src)
                     );

Infinity08Commented:
>> It does support the opteron CPU type.

Sorry :

It does NOT support the opteron CPU type.
pixitronAuthor Commented:
Thanks for the update, I'll need to spend a little time digesting the link. The example code in one of the links also does loop unrolling (I think), which is very nice, but I should probably walk before trying to run :-) So if I were to just focus on the sum of absolute differences operation...

What's the fastest way of pulling out and packing 8 chars from a char array that I can then use in a uint64 type?

I'm afraid this will eat heavily into any speedup gained from the psadbw.




Infinity08Commented:
>> What's the fastest way of pulling out and packing 8 chars from a char array that I can then use in a uint64 type?

The uint64 type I used was just to state explicitly that we need 8 bytes of data. You can simply do something like the below code (again untested).

It passes the address of the first of the 8 bytes, which the assembler interprets correctly as the address of 8 consecutive bytes.
unsigned char dest[8];
unsigned char src[8];
 
__asm__ __volatile__ (
                       "movups (%0), %%xmm0\n\t"
                       "psadbw (%1), %%xmm0\n\t"
                       "movups %%mm0, (%0)"
                       : "=m" (dest)
                       : "m" (src)
                     );

Infinity08Commented:
Note that this code puts the result in dest, which is probably not what you want ... You can easily provide a different output variable though :
unsigned char buf1[8];
unsigned char buf2[8];
uint64 out;
 
__asm__ __volatile__ (
                       "movups (%1), %%xmm0\n\t"
                       "psadbw (%2), %%xmm0\n\t"
                       "movups %%mm0, %0"
                       : "=m" (out)
                       : "m" (buf1),
                         "m" (buf2)
                     );

pixitronAuthor Commented:
Ok that seems reasonable, though when I use the above code I get a compiler error on the uint64 type (in both c and c++). Do I need an additional library?
Infinity08Commented:
>> I get a compiler error on the uint64 type (in both c and c++)

It is just a conceptual type that isn't actually defined. You can define it yourself though :

        typedef unsigned long uint64;

assuming that an unsigned long is 64 bits wide (check that with sizeof).

Or you can use an entirely different type (even one that is only 16 bits wide since that's sufficient to hold the result of the psadbw, but then you'll have to change the asm to suit that new type)
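
As a hedged alternative to checking sizeof by hand, here is a minimal sketch that takes the 64-bit type from <stdint.h> and lets the compiler reject a wrong width. It assumes your toolchain ships the C99 header <stdint.h>, which I have not verified for GCC 3.2.3; the names uint64_must_be_64_bits and uint64_bits are made up for this illustration.

```c
#include <limits.h>
#include <stdint.h>   /* C99 fixed-width types; assumed to be available */

/* Same name as in the snippets above, but with a guaranteed width. */
typedef uint64_t uint64;

/* Compile-time check (hypothetical name): the array size becomes -1,
   which is a compile error, if uint64 is not exactly 64 bits wide. */
typedef char uint64_must_be_64_bits[(sizeof(uint64) * CHAR_BIT == 64) ? 1 : -1];

/* Hypothetical helper: returns the width in bits, e.g. for printing. */
static unsigned int uint64_bits(void) {
  return (unsigned int)(sizeof(uint64) * CHAR_BIT);
}
```

If <stdint.h> is not available, the typedef of unsigned long together with the same compile-time check should give the same protection on a platform where long is 64 bits.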
pixitronAuthor Commented:
Oh OK, thanks for the explanation. Although I'm still getting compilation errors, it's probably something small:

/tmp/ccOj1ZHm.s: Assembler messages:
/tmp/ccOj1ZHm.s:19: Error: missing ')'
/tmp/ccOj1ZHm.s:19: Error: junk `(%rbp))' after expression
/tmp/ccOj1ZHm.s:20: Error: missing ')'
/tmp/ccOj1ZHm.s:20: Error: junk `(%rbp))' after expression
/tmp/ccOj1ZHm.s:21: Error: suffix or operands invalid for `movups'

int main(int argc, char *argv[]) {
 
 
typedef unsigned long uint64;
unsigned char buf1[8];
unsigned char buf2[8];
uint64 out;
 
__asm__ __volatile__ (
                       "movups (%1), %%xmm0\n\t"
                       "psadbw (%2), %%xmm0\n\t"
                       "movups %%mm0, %0"
                       : "=m" (out)
                       : "m" (buf1),
                         "m" (buf2)
                     );
 
 
}

Infinity08Commented:
Try the following (it's tested this time ;) ).

It should display :

        out = 00000040 (64)


#include <stdio.h>
 
int main(int argc, char *argv[]) {
  unsigned char buf1[8] = { 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F };
  unsigned char buf2[8] = { 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07 };
  unsigned char *buf1_ptr = buf1;
  unsigned char *buf2_ptr = buf2;
  unsigned int out = 0;   /* <--- 32 bit integer */
 
  __asm__ __volatile__ (
                         "movq (%1), %%mm0     \n\t"
                         "movq (%2), %%mm1     \n\t"
                         "psadbw %%mm1, %%mm0  \n\t"
                         "movd %%mm0, %0       \n\t"
                         : "=m" (out)
                         : "r" (buf1_ptr),
                           "r" (buf2_ptr)
                       );
 
  fprintf(stdout, "out = %08x (%d)\n", out, out);
 
  return 0; /* <--- in C this has to be there !! */
}

Infinity08Commented:
There's actually no reason for the extra buf1_ptr and buf2_ptr, but I added them to show how you can use a pointer that iterates over an existing buffer longer than 8 bytes.
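
To connect this back to the 16x16 loop in the question, here is a rough sketch of how the pointer could step through a larger image, 8 bytes at a time. It reuses the same movq/psadbw/movd sequence as the tested example; the function names, the stride parameter (bytes between the starts of consecutive rows), the clobber list and the trailing emms are my own additions and assumptions, so treat it as a starting point rather than a tuned implementation. A plain C version is included so the two can be compared.

```c
#include <stdlib.h>

/* Plain C reference: SAD over a 16x16 block. "stride" (an assumed parameter)
   is the number of bytes between the starts of consecutive image rows. */
static unsigned int sad16x16_scalar(const unsigned char *a,
                                    const unsigned char *b, int stride) {
  unsigned int sad = 0;
  int i, j;
  for (i = 0; i < 16; i++)
    for (j = 0; j < 16; j++)
      sad += (unsigned int)abs((int)a[i * stride + j] - (int)b[i * stride + j]);
  return sad;
}

/* Hypothetical helper: same result, using psadbw on 8 bytes per step. */
static unsigned int sad16x16_mmx(const unsigned char *a,
                                 const unsigned char *b, int stride) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
  unsigned int sad = 0;
  int i, j;
  for (i = 0; i < 16; i++) {
    for (j = 0; j < 16; j += 8) {
      const unsigned char *pa = a + i * stride + j;  /* 8 bytes of row i in a */
      const unsigned char *pb = b + i * stride + j;  /* matching 8 bytes in b */
      unsigned int part = 0;
      __asm__ __volatile__ (
        "movq (%1), %%mm0     \n\t"
        "movq (%2), %%mm1     \n\t"
        "psadbw %%mm1, %%mm0  \n\t"
        "movd %%mm0, %0       \n\t"
        : "=m" (part)
        : "r" (pa), "r" (pb)
        : "mm0", "mm1", "memory"
      );
      sad += part;  /* accumulate the partial sums in C */
    }
  }
  /* Leave MMX state so that later x87 floating-point code works. */
  __asm__ __volatile__ ("emms");
  return sad;
#else
  /* Not GCC on x86: fall back to the plain C loop. */
  return sad16x16_scalar(a, b, stride);
#endif
}
```

For a 16x16 block at row y, column x of an image that is width bytes per row, the call would look like sad16x16_mmx(img_a + y * width + x, img_b + y * width + x, width), where img_a and img_b are placeholders for your own buffers. Accumulating in C keeps the asm identical to the tested snippet; keeping the running sum in an MMX register would likely be faster, but I have not measured it.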
pixitronAuthor Commented:
Thanks very much for your help!