Implementing horizontal add for SSE2?

Dear all,
 
I know that SSE3 has haddps. However, I am currently implementing a horizontal add for an SSE2-only CPU. My code is:
 
  __asm pshufd  xmm6, xmm7, 00110001b;
  __asm addps   xmm7, xmm6;
  __asm pshufd  xmm6, xmm7, 00000010b;
  __asm addss   xmm6, xmm7;
 
However, addps and addss are slow instructions, each with a latency of 5 cycles.
Inserting the above code makes my program about 15% slower.
Is there a better way to code a horizontal add?
 
Thank you.
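
For readers who prefer intrinsics to inline asm, here is a minimal sketch that mirrors the four instructions in the question lane by lane. The function name hsum_sse2 is made up for illustration, and this is only a transcription of the sequence above (useful for checking the shuffle immediates), not a claim that it runs any faster.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in the SSE ones) */

/* Hypothetical helper: sums the four floats in v using SSE2 only,
   following the instruction sequence from the question. */
static float hsum_sse2(__m128 v)
{
    /* pshufd xmm6, xmm7, 00110001b: lane 0 <- v[1], lane 2 <- v[3] */
    __m128 t = _mm_castsi128_ps(_mm_shuffle_epi32(_mm_castps_si128(v), 0x31));

    /* addps xmm7, xmm6: lane 0 = v0+v1, lane 2 = v2+v3 */
    v = _mm_add_ps(v, t);

    /* pshufd xmm6, xmm7, 00000010b: lane 0 <- lane 2, i.e. v2+v3 */
    t = _mm_castsi128_ps(_mm_shuffle_epi32(_mm_castps_si128(v), 0x02));

    /* addss xmm6, xmm7: lane 0 = (v2+v3) + (v0+v1) */
    t = _mm_add_ss(t, v);

    /* Only lane 0 holds the full sum; extract it as a scalar. */
    return _mm_cvtss_f32(t);
}
```

Only lane 0 of the result is meaningful here, which matches the final addss in the asm version.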
hengck23 Asked:
dimitry Commented:
In the following document, Intel engineers wrote:
http://www.intel.com/technology/itj/2004/volume08issue01/art01_microarchitecture/vol8iss1_art01.pdf

The most common operation performed in a vertex shader is the scalar product, where 3 (or 4) pairs of
single-precision data elements are multiplied and the 3 (or 4) results summed. Due to the AOS organization of
the vertex database, evaluating the scalar product can be challenging with SSE because of the lack of horizontal
instructions. We have added horizontal floating-point addition/subtraction instructions to speed up the evaluation of scalar products.

Code with SSE3:
mulps xmm0, xmm1
haddps xmm0, xmm0
haddps xmm0, xmm0

Code without SSE3:
mulps xmm0, xmm1
movaps xmm1, xmm0
shufps xmm0, xmm1, 0xb1
addps xmm0, xmm1
movaps xmm1, xmm0
shufps xmm0, xmm0, 0x0a
addps xmm0, xmm1

Hope it helps...
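
The non-SSE3 sequence above can also be written with SSE intrinsics. This is a minimal sketch, assuming a compiler that ships <xmmintrin.h>; dot4_sse is a hypothetical name, and the compiler may pick different registers or moves than the listing.

```c
#include <xmmintrin.h>  /* SSE intrinsics; no SSE3 required */

/* Hypothetical helper: 4-element dot product of a and b,
   with the result replicated in all four lanes. */
static __m128 dot4_sse(__m128 a, __m128 b)
{
    __m128 p = _mm_mul_ps(a, b);            /* mulps: [p0 p1 p2 p3] */
    __m128 s = _mm_shuffle_ps(p, p, 0xB1);  /* shufps 0xb1: [p1 p0 p3 p2] */
    p = _mm_add_ps(p, s);                   /* [p0+p1 p0+p1 p2+p3 p2+p3] */
    s = _mm_shuffle_ps(p, p, 0x0A);         /* shufps 0x0a: [p2+p3 p2+p3 p0+p1 p0+p1] */
    return _mm_add_ps(p, s);                /* every lane = p0+p1+p2+p3 */
}
```

If only a scalar is needed, _mm_cvtss_f32 on the returned vector reads lane 0.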
mbizup Commented:
No comment has been added to this question in more than 21 days, so it is now classified as abandoned.

I will leave the following recommendation for this question in the Cleanup topic area:
    Accept: dimitry {http:#12552123}

Any objections should be posted here in the next 4 days. After that time, the question will be closed.

mbizup
EE Cleanup Volunteer