
  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 1676

about _mm_load_ps


In the SSE intrinsics, with
__m128 _mm_set_ps (float x,float y,float z,float s)
x, y, z and s are first moved to an aligned memory location.
Then MOVAPS is used to move the values from aligned memory to the xmm register.

Is there a direct way to do this?
move float x to bits [0 ... 31] of the xmm register
move float y to bits [32 ... 63] of the xmm register
move float z to bits [64 ... 95] of the xmm register
move float s to bits [96 ... 127] of the xmm register

Thank you.
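
For reference, a minimal sketch of how the set intrinsics map arguments to lanes, assuming a C compiler that provides <xmmintrin.h>. _mm_set_ps takes its arguments from the highest lane down, while _mm_setr_ps takes them in memory order, so the layout asked for above (x in bits 0 ... 31) corresponds to _mm_setr_ps(x, y, z, s). The helper names pack_xyzs and unpack are made up for illustration, and whether the compiler builds the vector through memory or with register shuffles is up to the compiler.

```c
#include <xmmintrin.h>

/* Hypothetical helper: x -> bits 0..31, y -> 32..63, z -> 64..95, s -> 96..127. */
static __m128 pack_xyzs(float x, float y, float z, float s)
{
    return _mm_setr_ps(x, y, z, s);   /* same lanes as _mm_set_ps(s, z, y, x) */
}

/* Copy the four lanes out to memory, lowest lane first. */
static void unpack(__m128 v, float out[4])
{
    _mm_storeu_ps(out, v);
}
```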
1 Solution
You can write your own function, but you can also use:
__m128 _mm_load_ps(float * p); - Loads four single-precision, floating-point values. The address must be 16-byte aligned.
However, you will need to be sure that your float *p array is 16-byte aligned.
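
A small sketch of the aligned versus unaligned load, assuming a C11 compiler (for _Alignas) and <xmmintrin.h>. The array example_data and the helper names are illustrative only. _mm_load_ps requires a 16-byte-aligned address, while _mm_loadu_ps accepts any address.

```c
#include <xmmintrin.h>

/* Hypothetical test data, forced onto a 16-byte boundary. */
static _Alignas(16) float example_data[8] = {1, 2, 3, 4, 5, 6, 7, 8};

/* Add the four lanes of v together. */
static float hsum(__m128 v)
{
    float t[4];
    _mm_storeu_ps(t, v);
    return t[0] + t[1] + t[2] + t[3];
}

/* p must be 16-byte aligned (aligned load, MOVAPS). */
static float sum4_aligned(const float *p)
{
    return hsum(_mm_load_ps(p));
}

/* p may be any valid float address (unaligned load, MOVUPS). */
static float sum4_unaligned(const float *p)
{
    return hsum(_mm_loadu_ps(p));
}
```

Here sum4_aligned(example_data) is safe, but example_data + 1 is only 4-byte aligned and has to go through sum4_unaligned.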

Also take a look here; it is a good example of how to use SSE:
hengck23 (Author) commented:
Hi, thank you for the answer. I know about the intrinsic _mm_load_ps.
However, my data is scattered here and there. Which data is used depends on the
online calculation, and I cannot pre-align it.

Hence I am looking for a solution that can load 32 bits from memory straight into a 32-bit lane of an xmm register.
E.g. xmm0[32...63] <---- [memLocation(32-bit aligned)].

There is movlpd, but it handles 64 bits at a time.

Thank you.
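
As a sketch of what is available at the intrinsic level, assuming <xmmintrin.h>: _mm_load_ss loads a single float into the lowest lane and zeroes the other three, and a shuffle can then move the value to another lane. The helper names load_lane0 and load_lane1 are made up for illustration; this takes two steps rather than a single load into bits 32...63.

```c
#include <xmmintrin.h>

/* Load one float into lane 0; lanes 1..3 are zeroed. */
static __m128 load_lane0(const float *p)
{
    return _mm_load_ss(p);                                 /* [val, 0, 0, 0] */
}

/* Hypothetical helper: load one float and move it to lane 1 (bits 32..63).
   Lane contents are listed lowest lane first. */
static __m128 load_lane1(const float *p)
{
    __m128 v = _mm_load_ss(p);                             /* [val, 0, 0, 0] */
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 0, 1));  /* [0, val, 0, 0] */
}
```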
hengck23 (Author) commented:
Currently, my implementation is very very slow!
For example, it takes 15 instructions just to load my data into xmm1! (see below)

I want to do:

__asm mov edi, DWORD PTR [ecx+_$CAD_P0S];           // (int**) cascade->p0s
__asm mov eax, DWORD PTR [edi+ebx*8];               // (int*)  cascade->p0s[0]
__asm mov esi, [edx];                               // offsets[0]
__asm mov edi, DWORD PTR [eax+esi*4];               // (int)   cascade->p0s[0][offsets[0]]
__asm mov DWORD PTR [ebp+_alignedMem$$], edi;       // _alignedMem[0]

__asm mov esi, [edx+4];                             // offsets[1]
__asm mov edi, DWORD PTR [eax+esi*4];               // (int)   cascade->p0s[0][offsets[1]]
__asm mov DWORD PTR [ebp+_alignedMem_1$$], edi;     // _alignedMem[1]

__asm mov esi, [edx+8];                             // offsets[2]
__asm mov edi, DWORD PTR [eax+esi*4];               // (int)   cascade->p0s[0][offsets[2]]
__asm mov DWORD PTR [ebp+_alignedMem_2$$], edi;     // _alignedMem[2]

__asm mov esi, [edx+12];                            // offsets[3]
__asm mov edi, DWORD PTR [eax+esi*4];               // (int)   cascade->p0s[0][offsets[3]]
__asm mov DWORD PTR [ebp+_alignedMem_3$$], edi;     // _alignedMem[3]

__asm movdqa xmm1, XMMWORD PTR [ebp+_alignedMem$$]; // load all four values (16-byte aligned)

1. You can try to make your p0s array aligned.
2. There are the instructions
   MOVSD xmm1, xmm2/m64   Move scalar double-precision floating-point value from xmm2/m64 to xmm1
   MOVSD xmm2/m64, xmm1   Move scalar double-precision floating-point value from xmm1 to xmm2/m64
   which move 64 bits into an XMM register (but only into the low 64 bits).
3. You can try to use the 'rep movsd' instruction to copy p0s[0] to aligned memory.
   However, some optimization manuals say that several 32-bit mov instructions are more efficient than 'rep movsd'.

So option (1) seems to be the best. Try to allocate the memory 16-byte aligned.
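
One way option (1) might look, as a sketch only: _mm_malloc and _mm_free are declared through <xmmintrin.h> on GCC and Clang (MSVC declares them in <malloc.h>), and the helper names below are hypothetical. With count a multiple of 4, every group of four floats then starts on a 16-byte boundary and can be read with _mm_load_ps.

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>   /* _mm_malloc / _mm_free on GCC and Clang */

/* Hypothetical helper: allocate count floats on a 16-byte boundary.
   Release the result with _mm_free, not free. */
static float *alloc_floats_aligned16(size_t count)
{
    return (float *)_mm_malloc(count * sizeof(float), 16);
}

/* Returns 1 when p is 16-byte aligned, 0 otherwise. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```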
hengck23 (Author) commented:

Thank you for your reply. I managed to find a faster way to move the floats into the xmm register using a gather/scatter method (data swizzling). Surprisingly, this method is even faster than the unaligned move movdqu.
(I have posted a similar question on the Intel forum.)
Data swizzling code:
   __asm mov  eax, DWORD PTR [edi+ebx*8];     //int*data
   __asm mov  esi, [edx]; // offsets[0]
   __asm mov  edi, [edx+4]; //offsets[1]
   __asm movss xmm1, DWORD PTR [eax+esi*4]; // 0 0 0 data[offset[0]]
   __asm movss xmm5, DWORD PTR [eax+edi*4]; // 0 0 0 data[offset[1]]
   __asm mov  edi, [edx+8]; //offsets[2]
   __asm movss xmm2, DWORD PTR [eax+edi*4]; // 0 0 0 data[offset[2]]
   __asm mov  edi, [edx+12]; //offsets[3]
   __asm movss xmm3, DWORD PTR [eax+edi*4]; // 0 0 0 data[offset[3]]

   __asm movlhps xmm1, xmm2; // 0 data2 0 data0
   __asm shufps  xmm5, xmm3, 00010001b ;// data3 0 data1 0
   __asm xorps xmm1, xmm5; // data3 data2  data1 data0
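
The same swizzle can be sketched with intrinsics, assuming <xmmintrin.h> and float data rather than the int data of the assembly above. The function name gather4 is made up; the instruction sequence (four scalar loads, MOVLHPS, SHUFPS with 00010001b, XORPS) mirrors the posted code, but the compiler is free to schedule or encode it differently.

```c
#include <xmmintrin.h>

/* Gather data[offsets[0..3]] so that lane i holds data[offsets[i]].
   Lane contents in the comments are listed lowest lane first. */
static __m128 gather4(const float *data, const int *offsets)
{
    __m128 d0 = _mm_load_ss(&data[offsets[0]]);   /* [d0, 0, 0, 0] */
    __m128 d1 = _mm_load_ss(&data[offsets[1]]);   /* [d1, 0, 0, 0] */
    __m128 d2 = _mm_load_ss(&data[offsets[2]]);   /* [d2, 0, 0, 0] */
    __m128 d3 = _mm_load_ss(&data[offsets[3]]);   /* [d3, 0, 0, 0] */

    __m128 even = _mm_movelh_ps(d0, d2);          /* [d0, 0, d2, 0] */
    __m128 odd  = _mm_shuffle_ps(d1, d3, 0x11);   /* [0, d1, 0, d3] */

    /* The non-zero lanes do not overlap, so XOR simply merges them. */
    return _mm_xor_ps(even, odd);                 /* [d0, d1, d2, d3] */
}
```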
Wow, the code looks shorter...
By the way, I would suggest checking which code is quicker by executing the
'rdtsc' (0F 31) instruction before and after the code.
It will give you the EXACT number of clocks the code takes...
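
A rough sketch of such a measurement in C, assuming GCC or Clang, where <x86intrin.h> provides __rdtsc() (MSVC provides it in <intrin.h>). The functions work and cycles_for_work are placeholders for the code under test. Readings tend to be noisy in practice, since rdtsc does not wait for earlier instructions to finish, so repeating the measurement and keeping the smallest value is a common approach.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() on GCC and Clang */

/* Placeholder workload: sum n floats. */
static float work(const float *p, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += p[i];
    return sum;
}

/* Read the time-stamp counter before and after the workload and
   return the difference; the workload result goes to *result. */
static uint64_t cycles_for_work(const float *p, int n, float *result)
{
    uint64_t start = __rdtsc();
    *result = work(p, n);
    uint64_t end = __rdtsc();
    return end - start;
}
```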
