about _mm_load_ps


In the SSE intrinsics,
__m128 _mm_set_ps (float x,float y,float z,float s)
x, y, z and s are first moved to an aligned memory location.
Then MOVAPS is used to move the values from aligned memory to the xmm register.

Is there a direct way to do this?
move float x to [0  ... 31] of xmm register
move float y to [32  ... 63] of xmm register
move float z to [64  ... 95] of xmm register
move float s to [96  ... 127] of xmm register

Thank you.

You can write your own function, but you can also use:
__m128 _mm_load_ps(float * p); - Loads four single-precision, floating-point values. The address must be 16-byte aligned.
However, you will need to be sure that your float *p array is 16-byte aligned.
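A minimal sketch of the aligned load, assuming a C11 compiler (for _Alignas) and an x86 target with SSE. The function names sum4_aligned and demo_aligned_load are made up for illustration only.

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_load_ps, _mm_storeu_ps */

/* Hypothetical helper: load four floats with an aligned load and add the lanes. */
static float sum4_aligned(const float *p)
{
    __m128 v = _mm_load_ps(p);   /* p must be 16-byte aligned, otherwise this faults */
    float out[4];
    _mm_storeu_ps(out, v);       /* unaligned store, so out needs no special alignment */
    return out[0] + out[1] + out[2] + out[3];
}

/* Hypothetical demo: the array is forced onto a 16-byte boundary with _Alignas. */
static float demo_aligned_load(void)
{
    _Alignas(16) float data[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    return sum4_aligned(data);
}
```

With MSVC, __declspec(align(16)) plays the role of _Alignas(16) here.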

Also take a look here; it is a good example of how to use SSE:
hengck23Author Commented:
Hi, thank you for the answer. I know about the intrinsic _mm_load_ps.
However, my data is scattered here and there. Which data is used depends on the
online calculation, so I cannot align it in advance.

Hence I am looking for a solution that can load 32 bits from memory straight into a 32-bit lane of an xmm register.
E.g. xmm0[32...63] <---- [memLocation(32-bit aligned)].

There is movlpd, but it handles 64 bits at a time.

Thank you.
hengck23Author Commented:
Currently, my implementation is very very slow!
For example, it takes 15 instructions just to load my data into xmm1! (see below)

I want to do:

__asm      mov      edi,      DWORD PTR [ecx+_$CAD_P0S]; //(int**)cascade->p0s
__asm      mov      eax,      DWORD PTR [edi+ebx*8];     //(int*) cascade->p0s[0]
__asm      mov      esi,      [edx];
__asm      mov      edi,      DWORD PTR [eax+esi*4];     //(int)  cascade->p0s[0][offsets[0]]
__asm      mov      [ebp+_alignedMem$$] , edi;              //_alignedMem[0]

__asm      mov      esi,      [edx+4];  
__asm      mov      edi,      DWORD PTR [eax+esi*4];       //(int)  cascade->p0s[0][offsets[1]]
__asm      mov      DWORD PTR [ebp+_alignedMem_1$$  ],  edi;  //_alignedMem[1]

__asm      mov         esi,      [edx+8];
__asm      mov      edi,      DWORD PTR [eax+esi*4];
__asm      mov      DWORD PTR [ebp+_alignedMem_2$$  ], edi;

__asm      mov      esi,      [edx+12];
__asm      mov      edi,      DWORD PTR [eax+esi*4];
__asm      mov      DWORD PTR [ebp+_alignedMem_3$$ ], edi;

__asm      movdqa      xmm1,      XMMWORD PTR [ebp+_alignedMem$$ ];  //load all 16 bytes of _alignedMem

1. You can try to make your p0s array aligned.
2. There are the
   MOVSD xmm1, xmm2/m64 Move scalar double-precision floating-point value from xmm2/m64 to xmm1
   MOVSD xmm2/m64, xmm1 Move scalar double-precision floating-point value from xmm1 to xmm2/m64
 instructions, which move 64 bits into an XMM register (but only into its low 64 bits).
3. You can try the 'rep movsd' string instruction to copy p0s[0] to aligned memory.
  However, some optimization manuals say that several 32-bit mov instructions are more efficient than 'rep movsd'.

So option (1) seems to be the best. Try to allocate the memory 16-byte aligned.
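One possible way to get 16-byte aligned memory is _mm_malloc / _mm_free, which GCC, Clang and MSVC make available through xmmintrin.h. This is only a sketch; alloc_floats_aligned16 and demo_aligned_alloc are hypothetical names, not anything from the code above.

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_malloc, _mm_free, _mm_load_ps, _mm_storeu_ps */

/* Hypothetical helper: allocate n floats on a 16-byte boundary.
   The result must be released with _mm_free, not free. */
static float *alloc_floats_aligned16(size_t n)
{
    return (float *)_mm_malloc(n * sizeof(float), 16);
}

/* Hypothetical demo: returns 1 if the pointer is aligned and an aligned
   load reads back what was written, 0 otherwise. */
static int demo_aligned_alloc(void)
{
    float *p = alloc_floats_aligned16(4);
    if (p == NULL)
        return 0;

    int ok = ((uintptr_t)p % 16) == 0;   /* check the alignment we asked for */
    for (int i = 0; i < 4; ++i)
        p[i] = (float)i;

    __m128 v = _mm_load_ps(p);           /* safe: p is 16-byte aligned */
    float out[4];
    _mm_storeu_ps(out, v);
    ok = ok && out[0] == 0.0f && out[1] == 1.0f && out[2] == 2.0f && out[3] == 3.0f;

    _mm_free(p);
    return ok;
}
```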

hengck23Author Commented:

Thank you for your reply. I managed to find a faster way to move the floats into the xmm register using a gather/scatter method (data swizzling). Surprisingly, this method is even faster than the unaligned move movdqu.
(I have posted a similar question at the Intel forum.)
data swizzling code:
   __asm mov  eax, DWORD PTR [edi+ebx*8];     //int*data
   __asm mov  esi, [edx]; // offsets[0]
   __asm mov  edi, [edx+4]; //offsets[1]
   __asm movss xmm1, DWORD PTR [eax+esi*4]; // 0 0 0 data[offset[0]]
   __asm movss xmm5, DWORD PTR [eax+edi*4]; // 0 0 0 data[offset[1]]
   __asm mov  edi, [edx+8]; //offsets[2]
   __asm movss xmm2, DWORD PTR [eax+edi*4]; // 0 0 0 data[offset[2]]
   __asm mov  edi, [edx+12]; //offsets[3]
   __asm movss xmm3, DWORD PTR [eax+edi*4]; // 0 0 0 data[offset[3]]

   __asm movlhps xmm1, xmm2; // 0 data2 0 data0
   __asm shufps  xmm5, xmm3, 00010001b ;// data3 0 data1 0
   __asm xorps xmm1, xmm5; // data3 data2  data1 data0
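For reference, the same swizzle can be sketched with intrinsics instead of inline asm. This is an assumption-laden rewrite, not the original code: gather4_ps and demo_gather are hypothetical names, and the data is taken as float here, whereas the asm above reads an int array.

```c
#include <xmmintrin.h>  /* _mm_load_ss, _mm_movelh_ps, _mm_shuffle_ps, _mm_xor_ps */

/* Hypothetical helper: gather data[offsets[0..3]] into one register.
   Lane 0 gets data[offsets[0]], lane 3 gets data[offsets[3]]. */
static __m128 gather4_ps(const float *data, const int *offsets)
{
    /* movss: each load fills the low lane and zeroes the upper three */
    __m128 d0 = _mm_load_ss(&data[offsets[0]]);  /* 0 0 0 d0 */
    __m128 d1 = _mm_load_ss(&data[offsets[1]]);  /* 0 0 0 d1 */
    __m128 d2 = _mm_load_ss(&data[offsets[2]]);  /* 0 0 0 d2 */
    __m128 d3 = _mm_load_ss(&data[offsets[3]]);  /* 0 0 0 d3 */

    __m128 even = _mm_movelh_ps(d0, d2);                          /* movlhps: 0 d2 0 d0 */
    __m128 odd  = _mm_shuffle_ps(d1, d3, _MM_SHUFFLE(0, 1, 0, 1)); /* shufps 0x11: d3 0 d1 0 */
    return _mm_xor_ps(even, odd);                                 /* xorps: d3 d2 d1 d0 */
}

/* Hypothetical demo: returns 1 if every lane holds the expected element. */
static int demo_gather(void)
{
    const float data[8] = {10.0f, 11.0f, 12.0f, 13.0f, 14.0f, 15.0f, 16.0f, 17.0f};
    const int offsets[4] = {7, 2, 5, 0};
    float out[4];
    _mm_storeu_ps(out, gather4_ps(data, offsets));
    return out[0] == 17.0f && out[1] == 12.0f && out[2] == 15.0f && out[3] == 10.0f;
}
```

The xor works as a merge only because the two halves have zeros in complementary lanes; orps would give the same result.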
Wow, the code looks shorter...
By the way, I would suggest you check which code is quicker by running the
'rdtsc' (0F 31) instruction before and after the code.
The difference gives you the number of clocks the code took (repeat the measurement a few times, since a single reading is noisy).
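A rough sketch of such a measurement using the __rdtsc intrinsic (from intrin.h on MSVC, x86intrin.h on GCC/Clang). The names min_cycles, work and sink are hypothetical, and taking the minimum over several runs is just one common way to reduce noise.

```c
#include <stdint.h>
#ifdef _MSC_VER
#include <intrin.h>      /* __rdtsc on MSVC */
#else
#include <x86intrin.h>   /* __rdtsc on GCC/Clang */
#endif

/* Hypothetical helper: run fn several times and return the smallest
   time-stamp-counter difference seen around a single call. */
static uint64_t min_cycles(void (*fn)(void), int runs)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < runs; ++i) {
        uint64_t t0 = __rdtsc();   /* read the counter before the code */
        fn();
        uint64_t t1 = __rdtsc();   /* and again after it */
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}

/* Hypothetical workload; the volatile sink keeps the loop from being optimized away. */
static volatile float sink;

static void work(void)
{
    float s = 0.0f;
    for (int i = 0; i < 1000; ++i)
        s += (float)i;
    sink = s;
}
```

Calling min_cycles(work, 100) returns a tick count for one call of work. Note that rdtsc does not serialize the pipeline, so very short sequences are hard to time precisely this way.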