
SIMD - Advantages/Disadvantages and the way to go ....

ikework asked
Last Modified: 2013-12-26
hi all :)

recently i got a cpu that supports sse/sse2. since my projects are mainly
3D & physics simulations, i started playing around with it a little to
see what advantages/disadvantages come from implementing sse in my basic
layers. i did lots of benchmarking and found that basically only vector normalization
and matrix transformation of vector arrays show a real time improvement. admittedly, both
are very important for my kind of libraries ..

so i have an important decision to make .. it affects all my libs and apps, since data
must be prepared for those functions and there would no longer be separate vector3 & vector4 types:
each must be replaced with a homogeneous vector4, and 3x3 rotation matrices must be replaced
with 4x4 matrices

here are the pros & cons i see so far:

pros:
* prepared for the future ?!?
* time improvement

cons:
* data must be aligned to 16 bytes and must fit into a 128-bit register
* to have a consistent library, i always have to use vectors with 4 components, even
   if i only need three components; the same goes for 3x3 matrices
* code maintenance is more complex at the lower layer, since some functions are
   implemented in 2 ways
* the library needs a runtime check for sse and must set a function pointer to decide which function
   to use, with or without sse
* pure c/c++ code seems likely to stay valid longer and is cpu-independent, and fpus are
   getting faster
* increased memory size, but that's not really an issue for me these days ..
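
The alignment and runtime-check points above can be sketched roughly as follows. This is only an illustration, not the poster's code: `Vec4`, `add_scalar`, `add_sse`, `cpu_has_sse`, `select_add` and the `VEC_HAVE_SSE` macro are hypothetical names, and the sketch assumes a C++11 compiler (`alignas`; older MSVC versions used `__declspec(align(16))` instead).

```cpp
#if defined(_MSC_VER)
#include <intrin.h>      // __cpuid (MSVC)
#endif

// Assumption: the SSE path is only built when the compiler targets SSE.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>   // SSE intrinsics
#define VEC_HAVE_SSE 1
#else
#define VEC_HAVE_SSE 0
#endif

// Homogeneous 4-component vector (x, y, z, w), aligned to 16 bytes so that
// it can be loaded into a 128-bit register with an aligned load.
struct alignas(16) Vec4 { float v[4]; };

// Plain C++ implementation, valid on any CPU.
void add_scalar(const Vec4& a, const Vec4& b, Vec4& out) {
    for (int i = 0; i < 4; ++i) out.v[i] = a.v[i] + b.v[i];
}

#if VEC_HAVE_SSE
// SSE implementation: one packed add for all four components.
void add_sse(const Vec4& a, const Vec4& b, Vec4& out) {
    _mm_store_ps(out.v, _mm_add_ps(_mm_load_ps(a.v), _mm_load_ps(b.v)));
}
#endif

// Runtime check: CPUID leaf 1, EDX bit 25 reports SSE support.
bool cpu_has_sse() {
#if !VEC_HAVE_SSE
    return false;
#elif defined(_MSC_VER)
    int info[4];
    __cpuid(info, 1);
    return (info[3] & (1 << 25)) != 0;
#elif defined(__GNUC__)
    return __builtin_cpu_supports("sse") != 0;
#else
    return false;        // unknown compiler: stay on the scalar path
#endif
}

using AddFn = void (*)(const Vec4&, const Vec4&, Vec4&);

// Called once at library start-up to pick the implementation.
AddFn select_add() {
#if VEC_HAVE_SSE
    if (cpu_has_sse()) return add_sse;
#endif
    return add_scalar;
}
```

A caller would store the result once, for example `AddFn vec_add = select_add();`, and then call `vec_add(a, b, result)` everywhere. Every operation that gets an sse variant needs such a pair of functions, which is the maintenance cost mentioned in the list.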

here are my benchmarks on win with vc71 and pentium4

V3_NORMALIZE      23%
V3_LENGTH_SQR    -11%
V3_LENGTH          5%
V3_ADD            -3%
V3_SUB            -1%
V3_MUL            -1%
V3_DOTPRODUCT     -3%
V4_DOTPRODUCT     -0%
M_MUL_V            3%
M_BATCH_MUL_V     22%

M_MUL_V       -> vector4    = matrix44 * vector4
M_BATCH_MUL_V -> vector4[n] = matrix44 * vector4[n]
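
For readers who want to see what a batch operation like M_BATCH_MUL_V might look like, here is a minimal sketch. It is not the benchmarked code: `Vec4`, `Mat44`, `batch_mul` and `VEC_HAVE_SSE` are hypothetical names, and a column-major matrix layout is assumed. One possible reason the batch variant gains more than the single multiply is that the four matrix columns are loaded into registers only once for all n vectors.

```cpp
#include <cstddef>

// Assumption: the SSE path is only built when the compiler targets SSE.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define VEC_HAVE_SSE 1
#else
#define VEC_HAVE_SSE 0
#endif

struct alignas(16) Vec4  { float v[4]; };
// Column-major 4x4 matrix: col[j][i] is the element in row i, column j.
struct alignas(16) Mat44 { float col[4][4]; };

// out[k] = m * in[k] for k = 0 .. n-1
void batch_mul(const Mat44& m, const Vec4* in, Vec4* out, std::size_t n) {
#if VEC_HAVE_SSE
    // Load the matrix columns once, outside the loop.
    const __m128 c0 = _mm_load_ps(m.col[0]);
    const __m128 c1 = _mm_load_ps(m.col[1]);
    const __m128 c2 = _mm_load_ps(m.col[2]);
    const __m128 c3 = _mm_load_ps(m.col[3]);
    for (std::size_t k = 0; k < n; ++k) {
        // Broadcast each component and form x*c0 + y*c1 + z*c2 + w*c3.
        const __m128 x = _mm_set1_ps(in[k].v[0]);
        const __m128 y = _mm_set1_ps(in[k].v[1]);
        const __m128 z = _mm_set1_ps(in[k].v[2]);
        const __m128 w = _mm_set1_ps(in[k].v[3]);
        const __m128 r = _mm_add_ps(
            _mm_add_ps(_mm_mul_ps(c0, x), _mm_mul_ps(c1, y)),
            _mm_add_ps(_mm_mul_ps(c2, z), _mm_mul_ps(c3, w)));
        _mm_store_ps(out[k].v, r);
    }
#else
    // Scalar fallback with the same result layout.
    for (std::size_t k = 0; k < n; ++k) {
        for (int i = 0; i < 4; ++i) {
            out[k].v[i] = m.col[0][i] * in[k].v[0] + m.col[1][i] * in[k].v[1]
                        + m.col[2][i] * in[k].v[2] + m.col[3][i] * in[k].v[3];
        }
    }
#endif
}
```

Both `in` and `out` must point to 16-byte aligned storage here; an array of `Vec4` declared with the `alignas(16)` struct above satisfies that.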

the target processors are mainly intel & athlon, 32-bit & 64-bit
the platforms are win & linux

so my questions are:

1. did you face the same question, and how did you decide? what were your pros & cons?

2. i'd like to have a discussion, to see some aspects i haven't seen yet or haven't seen that way,
    not only concerning time improvement

actually my intuition tells me the cost is too high. but i think it's an important decision,
so i'd like to have as much input as i can

so thanks for input in advance :)


Try playing with the compiler optimization settings. Compilers can optimize for specific processor types and can generate SSE code, which can improve program performance.
If you want to use SSE, use compiler intrinsics instead of assembly if they are available in your compiler.
In my experience, using assembly gives minimal advantage over optimized C++ code. I think using SSE and other low-level technologies is important for library developers (like OpenGL or Intel's IPL and PPL), and not so important for application developers.
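
As an illustration of the intrinsics approach, here is a minimal sketch of a vector normalization written with SSE intrinsics rather than assembly. `Vec4`, `normalize3` and `VEC_HAVE_SSE` are hypothetical names, and the sketch assumes a homogeneous direction vector with w == 0 and non-zero length (a zero-length input would produce NaN).

```cpp
#include <cmath>

// Assumption: the SSE path is only built when the compiler targets SSE.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define VEC_HAVE_SSE 1
#else
#define VEC_HAVE_SSE 0
#endif

struct alignas(16) Vec4 { float v[4]; };

// Returns a / |a|. Assumes a.v[3] == 0, so w does not affect the length.
Vec4 normalize3(const Vec4& a) {
    Vec4 out;
#if VEC_HAVE_SSE
    const __m128 v  = _mm_load_ps(a.v);
    const __m128 sq = _mm_mul_ps(v, v);                  // x*x, y*y, z*z, w*w
    // Horizontal sum using SSE1 shuffles only.
    __m128 shuf = _mm_shuffle_ps(sq, sq, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 sums = _mm_add_ps(sq, shuf);                  // pairwise sums
    shuf = _mm_shuffle_ps(sums, sums, _MM_SHUFFLE(1, 0, 3, 2));
    sums = _mm_add_ps(sums, shuf);                       // total in every lane
    _mm_store_ps(out.v, _mm_div_ps(v, _mm_sqrt_ps(sums)));
#else
    // Scalar fallback with the same semantics.
    const float len = std::sqrt(a.v[0] * a.v[0] + a.v[1] * a.v[1]
                              + a.v[2] * a.v[2] + a.v[3] * a.v[3]);
    for (int i = 0; i < 4; ++i) out.v[i] = a.v[i] / len;
#endif
    return out;
}
```

The sketch uses a full-precision square root and division. SSE also offers an approximate reciprocal square root (`_mm_rsqrt_ps`), which is faster but less accurate, so whether it is acceptable depends on the application.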
