about _mm_load_ps

Posted on 2004-11-06
Last Modified: 2008-01-09

In the SSE intrinsics,
__m128 _mm_set_ps (float x,float y,float z,float s)
x, y, z, and s are first moved to an aligned memory location.
Then MOVAPS is used to move the values from aligned memory into an xmm register.

Is there a direct way to do this?
move float x to [0  ... 31] of xmm register
move float y to [32  ... 63] of xmm register
move float z to [64  ... 95] of xmm register
move float s to [96  ... 127] of xmm register

Thank you.
Question by:hengck23
    LVL 11

    Expert Comment

    You can write your own function, but you can also use:
    __m128 _mm_load_ps(float * p); - loads four single-precision floating-point values. The address must be 16-byte aligned.
    However, you will need to be sure that your float *p array is 16-byte aligned.

    Also, take a look here; it is a good example of how to use SSE:

    Author Comment

    Hi, thank you for the answer. I know about the intrinsic _mm_load_ps.
    However, my data is scattered here and there. Which data is used depends on an
    online calculation, so I cannot pre-align it.

    Hence, I am looking for a solution that can load 32 bits from memory straight into 32 bits of an xmm register.
    E.g. xmm0[32...63] <---- [memLocation (32-bit aligned)].

    There is movlpd, but it handles 64 bits at a time.

    Thank you.

    Author Comment

    Currently, my implementation is very slow!
    For example, it takes 15 instructions just to load my data into xmm1! (see below)

    Here is my current code:

    __asm mov    edi, DWORD PTR [ecx+_$CAD_P0S];          // (int**) cascade->p0s
    __asm mov    eax, DWORD PTR [edi+ebx*8];              // (int*)  cascade->p0s[0]
    __asm mov    esi, [edx];                              // offsets[0]
    __asm mov    edi, DWORD PTR [eax+esi*4];              // (int)   cascade->p0s[0][offsets[0]]
    __asm mov    [ebp+_alignedMem$$], edi;                // _alignedMem[0]

    __asm mov    esi, [edx+4];                            // offsets[1]
    __asm mov    edi, DWORD PTR [eax+esi*4];              // (int)   cascade->p0s[0][offsets[1]]
    __asm mov    DWORD PTR [ebp+_alignedMem_1$$], edi;    // _alignedMem[1]

    __asm mov    esi, [edx+8];                            // offsets[2]
    __asm mov    edi, DWORD PTR [eax+esi*4];
    __asm mov    DWORD PTR [ebp+_alignedMem_2$$], edi;    // _alignedMem[2]

    __asm mov    esi, [edx+12];                           // offsets[3]
    __asm mov    edi, DWORD PTR [eax+esi*4];
    __asm mov    DWORD PTR [ebp+_alignedMem_3$$], edi;    // _alignedMem[3]

    __asm movdqa xmm1, XMMWORD PTR [ebp+_alignedMem$$];   // one aligned 128-bit load
    LVL 11

    Expert Comment

    1. You can try to make your p0s array aligned.
    2. There are
       MOVSD xmm1, xmm2/m64 - move a scalar double-precision floating-point value to XMM
       MOVSD xmm2/m64, xmm1 - move a scalar double-precision floating-point value from XMM
     instructions that move 64 bits into an XMM register (but only the low 64 bits).
    3. You can try the 'rep movsd' instruction to copy p0s[0] to aligned memory.
      However, some optimization manuals say that several 32-bit mov instructions are more efficient than 'rep movsd'.

    So option (1) seems to be the best. Try to allocate the memory 16-byte aligned.
    LVL 11

    Accepted Solution

    Author Comment

    Thank you for your reply. I managed to find a faster way to move the floats into the xmm register using a gather/scatter method (data swizzle). Surprisingly, this method is even faster than the unaligned move movdqu.
    (I have posted a similar question at the Intel forum.)
    data swizzling code:
       __asm mov   eax, DWORD PTR [edi+ebx*8];    // int* data
       __asm mov   esi, [edx];                    // offsets[0]
       __asm mov   edi, [edx+4];                  // offsets[1]
       __asm movss xmm1, DWORD PTR [eax+esi*4];   // 0 0 0 data[offsets[0]]
       __asm movss xmm5, DWORD PTR [eax+edi*4];   // 0 0 0 data[offsets[1]]
       __asm mov   edi, [edx+8];                  // offsets[2]
       __asm movss xmm2, DWORD PTR [eax+edi*4];   // 0 0 0 data[offsets[2]]
       __asm mov   edi, [edx+12];                 // offsets[3]
       __asm movss xmm3, DWORD PTR [eax+edi*4];   // 0 0 0 data[offsets[3]]

       __asm movlhps xmm1, xmm2;                  // 0 data2 0 data0
       __asm shufps  xmm5, xmm3, 00010001b;       // data3 0 data1 0
       __asm xorps   xmm1, xmm5;                  // data3 data2 data1 data0
    LVL 11

    Expert Comment

    Wow, the code looks shorter...
    By the way, I would suggest checking which version is quicker by executing
    the 'rdtsc' (0F 31) instruction before and after the code.
    It gives you the exact number of clocks the code takes...
