• Status: Solved

Fast Array Bitwise OR?

I have one array holding a bit pattern and another array that I would like to bitwise OR with that pattern.
Are there instructions (MMX, SSE, ...) that can do this quickly, the way REP MOVSD does a fast array copy?

This is the part of my C code that I want to optimize:

                  for(i=0; i<iSize; i++) {
                        pArray[i] |= pattern[i % iPatternSize];
                  }

1 Solution
You could use the SSE2 instructions, which operate on 128-bit registers.

MOVDQU                loads a double quadword (128 bits) from memory into a 128-bit register
db 066h POR          performs a bitwise OR between its operands

(NOTE: the db 066h prefix is part of the instruction. Plain POR operates on the 64-bit MMX registers; we want 128 bits.)
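
For anyone who would rather not hand-assemble this, here is a rough sketch of the same idea using the compiler's SSE2 intrinsics (`_mm_loadu_si128` corresponds to MOVDQU, `_mm_or_si128` to the 128-bit POR). It assumes the arrays hold bytes (`unsigned char`), which the question does not state. The function name `or_pattern_sse2` and the tiled scratch buffer are illustrative choices, not something from the original code. The pattern is copied 16 times so that every 16-byte chunk of the array lines up with a contiguous span of pattern bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* OR a repeating byte pattern into array, 16 bytes per step where SSE2 exists.
   Hypothetical helper: assumes byte elements.
   Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int or_pattern_sse2(unsigned char *array, size_t size,
                    const unsigned char *pattern, size_t pattern_size)
{
    assert(pattern_size > 0);

    /* Tile the pattern so its length is a multiple of both 16 and
       pattern_size; each 16-byte chunk then maps to a contiguous span. */
    size_t tiled_size = pattern_size * 16;
    unsigned char *tiled = malloc(tiled_size);
    if (tiled == NULL)
        return -1;
    for (size_t k = 0; k < 16; k++)
        memcpy(tiled + k * pattern_size, pattern, pattern_size);

    size_t i = 0;    /* position in array */
    size_t off = 0;  /* position in tiled pattern, always a multiple of 16 */
#if defined(__SSE2__)
    for (; i + 16 <= size; i += 16) {
        /* Unaligned 128-bit loads (MOVDQU). */
        __m128i a = _mm_loadu_si128((const __m128i *)(array + i));
        __m128i p = _mm_loadu_si128((const __m128i *)(tiled + off));
        /* 128-bit bitwise OR (POR), stored back unaligned. */
        _mm_storeu_si128((__m128i *)(array + i), _mm_or_si128(a, p));
        off += 16;
        if (off == tiled_size)
            off = 0;
    }
#endif
    /* Scalar tail (or the whole array when SSE2 is not available). */
    for (; i < size; i++)
        array[i] |= pattern[i % pattern_size];

    free(tiled);
    return 0;
}
```

The call would look like `or_pattern_sse2(pArray, iSize, pattern, iPatternSize)`. Note that this also removes the `%` from the inner loop, which may matter as much as the wider OR.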
It's not going to make much difference. ORing is something CPUs do very quickly, much quicker than typical RAM can supply the data.

There's no point in speeding up code that's already many times faster than the memory bus can handle.

Right, so we want to minimize memory accesses.
Why read 1 byte at a time when we can read 32 bytes?
Most PC memory buses are only 4 or 8 bytes wide. So there's not much gain in reading more than that at one gulp.

It is cool to think of reading 128 bits at a time! Wow!

Well, they didn't invent SSE2 for nothing... 8-)
Processors that support SSE2 have sufficient FSB width (Athlon 64 -> 128 bit, some Pentium 4 -> 256 bit).
There's even the newer SSE3.

On systems with "only" 64-bit-wide buses you can use MMX, SSE or 3DNow!.
It all depends on which processor he wants to optimize for. Obviously, the code may not run on older processors.
milosr (Author) commented:

mzvika is probably right: one bitwise OR on 128 bits is faster than four bitwise ORs on 32 bits.
For small arrays (smaller than the processor cache) that would be a 4x speedup. (I'm doing this operation many times.)

However, pArray is more than 10 MBytes long, so the processor spends a lot of time waiting
for data to load into the cache. So the speedup probably won't be very significant.

I will try to optimize the code in another way. I have several different small patterns that I apply to the array,
so maybe I can precalculate all the patterns and then apply them in parallel, with one iteration through pArray.
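
One way this precalculation idea could look, as a sketch only: since OR is associative and commutative, several patterns that are all applied to the whole array can be ORed together once into a single combined pattern whose length is the least common multiple of the individual lengths. The large array is then traversed only once. The names `combine_patterns` and `or_pattern` are made up for illustration, and byte-sized elements are assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

static size_t gcd_size(size_t a, size_t b)
{
    while (b != 0) { size_t t = a % b; a = b; b = t; }
    return a;
}

/* Build one pattern equal to the OR of all input patterns.
   Its length is the least common multiple of the input lengths.
   Caller frees the result; returns NULL on allocation failure. */
unsigned char *combine_patterns(const unsigned char *const *patterns,
                                const size_t *sizes, size_t count,
                                size_t *combined_size)
{
    size_t len = 1;
    for (size_t p = 0; p < count; p++) {
        assert(sizes[p] > 0);
        len = len / gcd_size(len, sizes[p]) * sizes[p];  /* running LCM */
    }
    unsigned char *combined = calloc(len, 1);  /* starts as all zero bits */
    if (combined == NULL)
        return NULL;
    for (size_t p = 0; p < count; p++)
        for (size_t j = 0; j < len; j++)
            combined[j] |= patterns[p][j % sizes[p]];
    *combined_size = len;
    return combined;
}

/* Single pass over the large array with the precomputed pattern.
   A wrapping counter replaces the per-element modulo. */
void or_pattern(unsigned char *array, size_t size,
                const unsigned char *pattern, size_t pattern_size)
{
    assert(pattern_size > 0);
    size_t k = 0;
    for (size_t i = 0; i < size; i++) {
        array[i] |= pattern[k];
        if (++k == pattern_size)
            k = 0;
    }
}
```

The combined length can grow quickly when the pattern sizes share no common factors (sizes 7, 11 and 13 give 1001 bytes), so this only pays off while the combined pattern still fits comfortably in cache.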

Thanks for spending your time.
Question has a verified solution.
