I know that SSE3 has haddps. However, I am currently implementing a horizontal add for SSE2-only CPUs. My code (input vector in xmm7, sum left in the low float of xmm6) is:
__asm pshufd xmm6, xmm7, 00110001b;
__asm addps xmm7, xmm6;
__asm pshufd xmm6, xmm7, 00000010b;
__asm addss xmm6, xmm7;
However, addps and addss are slow instructions, each with a latency of 5 cycles.
Inserting the above code makes my program noticeably slower, by about 15%.
Is there a better way to code a horizontal add?
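
For reference, here is a sketch of the same four-instruction sequence as a standalone C function using SSE2 intrinsics, in case that is easier to test in isolation. The name hsum_sse2 is only illustrative (it is not from my real code), and I am assuming the compiler emits roughly the same pshufd/addps/pshufd/addss sequence, which I have not verified for every compiler.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in the SSE ones) */

/* Hypothetical helper: returns v[0] + v[1] + v[2] + v[3]. */
static float hsum_sse2(__m128 v)
{
    /* pshufd xmm6, xmm7, 00110001b  ->  t = { v1, v0, v3, v0 } */
    __m128 t = _mm_castsi128_ps(
        _mm_shuffle_epi32(_mm_castps_si128(v), 0x31));

    /* addps xmm7, xmm6  ->  v = { v0+v1, v1+v0, v2+v3, v3+v0 } */
    v = _mm_add_ps(v, t);

    /* pshufd xmm6, xmm7, 00000010b  ->  low element of t = v2+v3 */
    t = _mm_castsi128_ps(
        _mm_shuffle_epi32(_mm_castps_si128(v), 0x02));

    /* addss xmm6, xmm7  ->  low element of t = (v2+v3) + (v0+v1) */
    t = _mm_add_ss(t, v);

    /* Extract the low float, which holds the full sum. */
    return _mm_cvtss_f32(t);
}
```

Calling hsum_sse2(_mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f)) should return 10.0f. Note that the two adds depend on each other, so their latencies add up rather than overlap.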