milosr
asked on
Fast Array Bitwise OR?
I've got one array holding a bit pattern, and another array that I would like to bitwise OR with that pattern.
Are there instructions (MMX, SSE, ...) that can do this fast, the way REP MOVSD does for array copies?
This is the part of the C code that I want to optimize:
for(i=0; i<iSize; i++) {
    pArray[i] |= pattern[i % iPatternSize];
}
Thanks,
Milos
right, so we want to minimize memory access.
why read 1 byte at a time when we can read 32 bytes... ?
Most PC memory buses are only 4 or 8 bytes wide. So there's not much gain in reading more than that at one gulp.
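To make the "more than one byte per access" idea concrete, here is a minimal sketch in plain C that ORs 8 bytes per step. It assumes the arrays hold bytes (the thread never gives the element type), and the function name or_pattern_words and the tiled scratch buffer are made up for illustration. The pattern is first repeated 8 times so that its length is a multiple of 8, which means an 8-byte block never straddles the point where the pattern wraps around.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not from the thread: ORs a repeating byte pattern
   into array, 8 bytes per step. Returns 0 on success, -1 on bad arguments
   or allocation failure. Assumes byte-sized elements. */
int or_pattern_words(unsigned char *array, size_t size,
                     const unsigned char *pattern, size_t pattern_size)
{
    if (pattern_size == 0)
        return -1;

    /* Repeat the pattern 8 times so the tiled length is a multiple of 8. */
    size_t tiled_size = pattern_size * 8;
    unsigned char *tiled = malloc(tiled_size);
    if (tiled == NULL)
        return -1;
    for (size_t k = 0; k < 8; k++)
        memcpy(tiled + k * pattern_size, pattern, pattern_size);

    size_t i = 0; /* position in array */
    size_t p = 0; /* position in tiled pattern, always a multiple of 8 */
    for (; i + 8 <= size; i += 8) {
        uint64_t a, b;
        /* memcpy keeps the loads and stores safe for unaligned data. */
        memcpy(&a, array + i, 8);
        memcpy(&b, tiled + p, 8);
        a |= b;
        memcpy(array + i, &a, 8);
        p += 8;
        if (p == tiled_size)
            p = 0;
    }

    /* Leftover bytes (fewer than 8) use the original per-byte loop. */
    for (; i < size; i++)
        array[i] |= pattern[i % pattern_size];

    free(tiled);
    return 0;
}
```

This also takes the modulo out of the main loop. Whether it is actually faster than the byte loop depends on the compiler and on how much of the time is spent waiting for memory, so it is worth measuring.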
It is cool to think of reading 128 bits at a time! Wow!
well, they didn't invent SSE2 for nothing... 8-)
processors that support SSE2 have sufficient FSB width (athlon 64 -> 128bit, some pentium 4 -> 256 bit).
there's even the newer SSE3.
on systems with "only" 64 bit wide buses you can use MMX, SSE or 3DNow!.
It all depends on which processor he wants to optimize for. obviously, it may not run on older processors.
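For reference, a rough sketch of what the SSE2 version could look like with compiler intrinsics instead of hand-written assembly. The intrinsics _mm_loadu_si128, _mm_or_si128 and _mm_storeu_si128 come from <emmintrin.h>; the function name or_tiled_sse2 and the idea of passing in a pre-tiled pattern are assumptions made for this example, and it again assumes byte arrays. The caller has to supply the pattern already repeated so that its length is a nonzero multiple of 16 bytes.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical helper, not from the thread. tiled must contain the pattern
   repeated so that tiled_size is a nonzero multiple of 16 bytes. */
void or_tiled_sse2(unsigned char *array, size_t size,
                   const unsigned char *tiled, size_t tiled_size)
{
    size_t i = 0; /* position in array */
    size_t p = 0; /* position in tiled pattern, always a multiple of 16 */

    for (; i + 16 <= size; i += 16) {
        /* Unaligned loads and stores, so no alignment is required. */
        __m128i a = _mm_loadu_si128((const __m128i *)(array + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(tiled + p));
        _mm_storeu_si128((__m128i *)(array + i), _mm_or_si128(a, b));
        p += 16;
        if (p == tiled_size)
            p = 0;
    }

    /* Fewer than 16 bytes remain, so p cannot run past the tiled buffer. */
    for (; i < size; i++, p++)
        array[i] |= tiled[p];
}
```

This only builds for x86 targets with SSE2 enabled, and as noted above it will not run on older processors.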
ASKER
mzvika is probably right: one bitwise OR on 128 bits is faster than four bitwise ORs on 32 bits.
For small arrays (smaller than the processor cache) that would be a 4x speedup. (I'm doing this operation many times.)
However, pArray is more than 10 MBytes long, so the processor spends a lot of time waiting
for data to load into the cache. So the speedup probably won't be significant.
I will try to optimize the code in another way. I have several different small patterns that I apply to the array,
so maybe I can precalculate all the patterns and then apply them in parallel, with one iteration through pArray.
Thanks for spending your time.
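A minimal sketch of that precalculation idea, assuming byte arrays and two patterns (the helper names gcd_size and combine_patterns are invented here). Because OR is associative, applying pattern 1 and then pattern 2 gives the same result as applying a single combined pattern whose length is the least common multiple of the two pattern lengths, so the big array only has to be walked once.

```c
#include <stdlib.h>

/* Greatest common divisor, used to compute the combined period. */
static size_t gcd_size(size_t a, size_t b)
{
    while (b != 0) {
        size_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Hypothetical helper, not from the thread: builds one pattern equivalent
   to ORing p1 and p2 one after the other. The result has length
   lcm(n1, n2), is allocated with malloc, and must be freed by the caller.
   Returns NULL on bad arguments or allocation failure. */
unsigned char *combine_patterns(const unsigned char *p1, size_t n1,
                                const unsigned char *p2, size_t n2,
                                size_t *out_size)
{
    if (n1 == 0 || n2 == 0)
        return NULL;

    size_t n = n1 / gcd_size(n1, n2) * n2; /* least common multiple */
    unsigned char *combined = malloc(n);
    if (combined == NULL)
        return NULL;

    for (size_t i = 0; i < n; i++)
        combined[i] = (unsigned char)(p1[i % n1] | p2[i % n2]);

    *out_size = n;
    return combined;
}
```

The combined pattern can then go through the original loop once in place of the separate passes. The combined length can get large when the pattern lengths share no common factors, so this pays off mainly when it still fits comfortably in cache.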
There's no point in speeding up code that's already many times faster than the memory bus can handle.