Inline SIMD instructions in C/C++ code

Hi,

Let's say I have the following C++ code:

sad = 0;
for (int i = 0; i < 16; i++) {
    for (int j = 0; j < 16; j++) {
        sad += abs(a[i][j] - b[i][j]);
    }
}
// a and b have dimension height*width

How do I go about replacing the above and use inline SSE/SSE2 type instructions (such as PSADBW, packed sum of absolute differences)?

Or will GCC automatically translate the above code to the optimised format? I'm using GCC 3.2.3 on a 64-bit AMD Opteron processor.

Infinity08

>> Or will GCC automatically translate the above code to the optimised format?

You can easily check that by looking at the generated assembly (using the -S switch).

You can help the compiler by specifying the architecture you use with the -march and -mtune flags (using opteron as CPU type) :

http://gcc.gnu.org/onlinedocs/gcc/i386-and-x86_002d64-Options.html#i386-and-x86_002d64-Options

and maybe even use -O3 optimization.
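
One way to run this check is sketched below. It assumes a and b are 16x16 arrays of unsigned char (the question does not state the element type); the file name sad.c and the function name sad16x16 are made up for illustration.

```c
/* sad.c (hypothetical file name)
 *
 * Generate assembly and look for the packed-SAD instruction:
 *   gcc -O3 -S -o sad.s sad.c
 *   grep -i psadbw sad.s
 * If grep prints nothing, the loop was not turned into psadbw.
 */
#include <stdlib.h>

/* Assumption: 16x16 blocks of unsigned char; the original post does not
 * give the element type of a and b. */
int sad16x16(unsigned char a[16][16], unsigned char b[16][16])
{
    int sad = 0;
    int i, j;

    for (i = 0; i < 16; i++) {
        for (j = 0; j < 16; j++) {
            sad += abs(a[i][j] - b[i][j]);
        }
    }
    return sad;
}
```

Keeping the loop in a small function of its own makes the relevant part of the generated .s file easy to find.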

Infinity08

(btw, consider upgrading your gcc if that's an option)

>> How do I go about replacing the above and use inline SSE/SSE2 type instructions

You can use inline assembly using __asm__. Something like this maybe (untested) :

__asm__ __volatile__ (
                       "movups %0, %%xmm0\n\t"
                       "psadbw %1, %%xmm0\n\t"
                       "movups %%mm0, %0"
                       : "=m" (dest)
                       : "m" (src)
                     );

pixitron

ASKER

Thanks,

I compiled the file (gcc -S -o3) and checked the output assembly. It doesn't contain any packed SAD instructions. I also explored -march and -mtune but when I try opteron options I get an "invalid option" error, which I suspect is due to the old version of GCC I'm using.

pixitron

ASKER

Using a newer version of GCC is not an option, I'm afraid. But I like the assembly that you've suggested above. What exactly are dest and src in my case? Do I need to cast parts of arrays a and b into a suitable format? Sorry for the newbie questions.

Infinity08

>> but when I try opteron options I get an "invalid option" error, which I suspect is due to the old version of GCC I'm using.

I think so, yes. (that's the reason I suggested upgrading ;) )

This is the same help page for your version :

http://gcc.gnu.org/onlinedocs/gcc-3.2.3/gcc/i386-and-x86-64-Options.html#i386%20and%20x86-64%20Options

It does support the opteron CPU type.

>> In my case dest and src is what exactly?

That would be the C variables that will contain the destination and source buffers for the psadbw instruction.

Take a look at this post for example code :

http://lists.xiph.org/pipermail/theora-dev/2004-August/002347.html

(especially the "Original (hand-written) assembly version")

See below :

uint64 dest;
uint64 src;
 
__asm__ __volatile__ (
                       "movups %0, %%xmm0\n\t"
                       "psadbw %1, %%xmm0\n\t"
                       "movups %%mm0, %0"
                       : "=m" (dest)
                       : "m" (src)
                     );

Infinity08

>> It does support the opteron CPU type.

Sorry :

It does NOT support the opteron CPU type.

pixitron

ASKER

Thanks for the update, I'll need to spend a little time digesting the link. The example code in one of the links also does loop unrolling (I think), which is very nice, but I should probably walk before trying to run :-) So if I were to just focus on the sum of absolute differences operation...

What's the fastest way of pulling out and packing 8 chars from a char array that I can then use in a uint64 type?

I'm afraid this will eat heavily into any speedup gained from psadbw.

Infinity08

>> What's the fastest way of pulling out and packing 8 chars from a char array that I can then use in a uint64 type?

The uint64 type I used was just to state explicitly that we need 8 bytes of data. You can simply do something like the below code (again untested).

It passes the address of the first of the 8 bytes, which the assembler interprets correctly as the address of 8 consecutive bytes.

unsigned char dest[8];
unsigned char src[8];
 
__asm__ __volatile__ (
                       "movups (%0), %%xmm0\n\t"
                       "psadbw (%1), %%xmm0\n\t"
                       "movups %%mm0, (%0)"
                       : "=m" (dest)
                       : "m" (src)
                     );

Infinity08

Note that this code puts the result in dest, which is probably not what you want ... You can easily provide a different output variable though :

unsigned char buf1[8];
unsigned char buf2[8];
uint64 out;
 
__asm__ __volatile__ (
                       "movups (%1), %%xmm0\n\t"
                       "psadbw (%2), %%xmm0\n\t"
                       "movups %%mm0, %0"
                       : "=m" (out)
                       : "m" (buf1),
                         "m" (buf2)
                     );

pixitron

ASKER

Ok that seems reasonable, though when I use the above code I get a compiler error on the uint64 type (in both c and c++). Do I need an additional library?

Infinity08

>> I get a compiler error on the uint64 type (in both c and c++)

It is just a conceptual type that isn't actually defined. You can define it yourself though :

typedef unsigned long uint64;

assuming that an unsigned long is 64 bits wide (check that with sizeof).

Or you can use an entirely different type (even one that is only 16 bits wide since that's sufficient to hold the result of the psadbw, but then you'll have to change the asm to suit that new type)
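
If the C library provides <stdint.h> (an assumption; it is a C99 header), the guess about the width of unsigned long can be avoided. The sketch below also shows why 16 bits are enough for the result: eight byte differences of at most 255 each add up to at most 2040. The names uint64_is_8_bytes and MAX_SAD8 are made up for illustration.

```c
#include <stdint.h>

/* Exactly 64 bits wide, regardless of how wide unsigned long is. */
typedef uint64_t uint64;

/* Compile-time check: the array size becomes -1 (a compile error)
 * if the type is not 8 bytes wide. */
typedef char uint64_is_8_bytes[(sizeof(uint64) == 8) ? 1 : -1];

/* Largest value psadbw can produce for one group of 8 byte pairs. */
enum { MAX_SAD8 = 8 * 255 };
```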

pixitron

ASKER

Oh ok, thanks for the explanation. Although I'm still getting compilation errors, it's probably something small:

/tmp/ccOj1ZHm.s: Assembler messages:
/tmp/ccOj1ZHm.s:19: Error: missing ')'
/tmp/ccOj1ZHm.s:19: Error: junk `(%rbp))' after expression
/tmp/ccOj1ZHm.s:20: Error: missing ')'
/tmp/ccOj1ZHm.s:20: Error: junk `(%rbp))' after expression
/tmp/ccOj1ZHm.s:21: Error: suffix or operands invalid for `movups'

int main(int argc, char *argv[]) {
 
 
typedef unsigned long uint64;
unsigned char buf1[8];
unsigned char buf2[8];
uint64 out;
 
__asm__ __volatile__ (
                       "movups (%1), %%xmm0\n\t"
                       "psadbw (%2), %%xmm0\n\t"
                       "movups %%mm0, %0"
                       : "=m" (out)
                       : "m" (buf1),
                         "m" (buf2)
                     );
 
 
}
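
The assembler messages are consistent with two problems in the asm text. With an "m" constraint GCC substitutes a complete memory reference (something like -16(%rbp)), so writing (%1) yields an operand of the form (-16(%rbp)), which the assembler rejects. And movups cannot take the MMX register %mm0. A minimal sketch that avoids both is given here, assuming an x86-64 target with SSE2 (which the Opteron has) and using movq so that only 8 bytes are read and written. The function name sad8 is hypothetical, and on other targets the sketch falls back to a plain loop.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: sum of absolute differences of two 8-byte blocks. */
unsigned int sad8(const unsigned char *p1, const unsigned char *p2)
{
    unsigned char buf1[8];
    unsigned char buf2[8];
    uint64_t out = 0;

    memcpy(buf1, p1, 8);
    memcpy(buf2, p2, 8);

#if defined(__x86_64__)
    __asm__ __volatile__ (
        "movq   %1, %%xmm0\n\t"      /* low 8 bytes of xmm0 = buf1, rest zeroed */
        "movq   %2, %%xmm1\n\t"      /* low 8 bytes of xmm1 = buf2, rest zeroed */
        "psadbw %%xmm1, %%xmm0\n\t"  /* SAD of the low 8 bytes lands in bits 0-15 */
        "movq   %%xmm0, %0"          /* store the low 64 bits into out */
        : "=m" (out)
        : "m" (buf1), "m" (buf2)     /* no parentheses around %1 and %2 */
        : "xmm0", "xmm1"
    );
#else
    {
        /* Portable fallback with the same result. */
        int i;
        for (i = 0; i < 8; i++) {
            int d = (int)buf1[i] - (int)buf2[i];
            out += (uint64_t)(d < 0 ? -d : d);
        }
    }
#endif
    return (unsigned int)out;
}
```

Listing xmm0 and xmm1 as clobbers tells the compiler not to keep its own values in those registers across the asm statement.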

ASKER CERTIFIED SOLUTION

Infinity08

This solution is only available to members.

Infinity08

There's actually no reason for the extra buf1_ptr and buf2_ptr, but I added them to show how you can use a pointer that iterates over an existing buffer longer than 8 bytes.
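
The accepted solution's code is not shown in this thread, so the following is only a sketch of the idea described, not the original code. It assumes an x86-64 target with SSE2 and rows of 16 bytes, as in the 16x16 loop from the question. The names buf1_ptr and buf2_ptr come from the comment above; sad16_rows, rows and stride are made up for illustration, and on other targets the sketch falls back to a plain loop.

```c
#include <stddef.h>

/* Hypothetical helper: SAD over `rows` rows of 16 bytes each, where
 * consecutive rows are `stride` bytes apart in both buffers. */
unsigned int sad16_rows(const unsigned char *buf1, const unsigned char *buf2,
                        int rows, size_t stride)
{
    const unsigned char *buf1_ptr = buf1;  /* walks through buf1 row by row */
    const unsigned char *buf2_ptr = buf2;  /* walks through buf2 row by row */
    unsigned int total = 0;
    int r;

    for (r = 0; r < rows; r++) {
        unsigned int row_sad;
#if defined(__x86_64__)
        __asm__ __volatile__ (
            "movdqu  (%1), %%xmm0\n\t"      /* 16 unaligned bytes from buf1_ptr */
            "movdqu  (%2), %%xmm1\n\t"      /* 16 unaligned bytes from buf2_ptr */
            "psadbw  %%xmm1, %%xmm0\n\t"    /* two partial sums: low and high 64 bits */
            "movhlps %%xmm0, %%xmm1\n\t"    /* high partial sum -> low half of xmm1 */
            "paddd   %%xmm1, %%xmm0\n\t"    /* add the two partial sums */
            "movd    %%xmm0, %0"            /* low 32 bits hold the row total */
            : "=r" (row_sad)
            : "r" (buf1_ptr), "r" (buf2_ptr) /* pointers in registers, so (%1) is valid */
            : "xmm0", "xmm1", "memory"
        );
#else
        unsigned int k;
        row_sad = 0;
        for (k = 0; k < 16; k++) {
            int d = (int)buf1_ptr[k] - (int)buf2_ptr[k];
            row_sad += (unsigned int)(d < 0 ? -d : d);
        }
#endif
        total += row_sad;
        buf1_ptr += stride;
        buf2_ptr += stride;
    }
    return total;
}
```

For the 16x16 case in the question this would be called as sad16_rows(&a[0][0], &b[0][0], 16, width), assuming a and b hold unsigned char and width is the row length in bytes.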

pixitron

ASKER

Thanks very much for your help!