?
Solved

Want to improve on the speed of copy memory API

Posted on 2005-02-27
27
Medium Priority
?
481 Views
Last Modified: 2011-04-14
VB & VB.Net are a bit slow at memory copies as theyhave to use CopyMemory in the WINAPI.

1) I intend to create a C++ DLL and use an __ASM code block. When I create the C++ project I suspect I should use a completely blank project and just set project type DLL. Any ideas?

2) Can anybody see any problems with the ASM code below? Like interrtup disable, etc.

3) Do I need to push Si, DI & CX to the stack first?

4) So that the DLL can operate using a VB declare function what do I need to do?

Private Declare Function CopyTopDown Lib "kernel32" Alias "copyMemoryTopDown" (ByRef Destination As intPtr, ByRef Source As String, Length as Long) As Long


Thanks in advance for your advice.

I used to code Assembly for the old IBM AT. So my next question will be should I convert to 32bit and copy 4 bytes at a time.  The problem will be handling the odd bits on the end.


/* Topdown copy

Copies memory starting at the top and working down

*/

int copyMemoryTopDown(int destination, int source, int length){

      __ASM{
      
        // Do I need to PUSH SI,DI & CX here?

      PUSH SI
      PUSH DI
      PUSH CX

      MOV SI, source
      MOV DI, desctination
      MOV CX, length

        // Disable interrupts because of STD ?
      CLI     // Disable interrupts
      STD      // set direction flag to decrement
      REPNZ      // Repeate following until CX=0
      MOVS    // Move memory pointed to by SI to DI pointer
      CLD     // Clear direction Flag
      STI     // Enable interrupts

        // Do I need todo this?

      POP CX  // restor registers
      POP DI
      POP SI

}
return 0;
}


int copyMemoryBottomUp(int destination, int source, int length){

      __ASM{
      
        // Do I need to PUSH SI,DI & CX here?

      PUSH SI
      PUSH DI
      PUSH CX

      MOV SI, source
      MOV DI, desctination
      MOV CX, length


      CLD      // set direction flag to increment
      REPNZ      // Repeate following until CX=0
      MOVS    // Move memory pointed to by SI to DI pointer

        // Do I need restore registers?

      POP CX  // restore registers
      POP DI
      POP SI

}
return 0;
}
0
Comment
Question by:inthedark
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 8
  • 8
  • 8
  • +1
27 Comments
 
LVL 7

Expert Comment

by:aib_42
ID: 13417259
The PUSH/POPs (which are necessary, btw) won't affect the speed much since they will be executed once per copy. The CLI will probably cause a 'privileged instruction' exception and/or be ignored.

The MOVS instruction (without any operands) is one I haven't seen. Are you sure you don't mean MOVSB?
In any case, dividing the length by 4 and using MOVSD, which moves 4 bytes at a time will be much faster.

There are many "fast memory copy" methods, and some will work better/worse than others on different configurations. Googling for them returns a few examples, one that is striking out being:

http://www.cyborg.ne.jp/~xelf/developer/MemoryCopy.html

(Though the page is in Japanese, the benchmarks and the source code are there.)
0
 
LVL 10

Expert Comment

by:_Katka_
ID: 13418945
Hi, here is my suggestion:

int copyMemory(int destination, int source, int length) {
  __ASM {
    mov esi,source
    mov edi,destination
    mov ecx,length
    rep movsb
  }
  return 0;
}

regards,
Kate
0
 
LVL 10

Expert Comment

by:_Katka_
ID: 13419093
Or another one with use of movsd (4x faster) as aib_42 suggested:

int copyMemory(int destination, int source, int length) {
  __ASM {
    mov esi,source // setup source offset
    mov edi,destination // setup destiny offset
    mov ecx,length // setup length of data
    mov edx,ecx // store length
    shr ecx,2 // divide length by 4
    mov ebx,ecx // store divided length
    shl ebx,2 // multiply it with 4 to get modulo
    sub edx,ebx // calculate module
    rep movsd // copy length div 4 dwords
    mov ecx,edx // setup length of data to length mod 4
    rep movsb // copy length mod 4 bytes
  }
  return 0;
}

regards,
Kate
0
On Demand Webinar: Networking for the Cloud Era

Did you know SD-WANs can improve network connectivity? Check out this webinar to learn how an SD-WAN simplified, one-click tool can help you migrate and manage data in the cloud.

 
LVL 10

Expert Comment

by:_Katka_
ID: 13419123
Oh and btw. push and pops are not needed only
in old 16-bit times when segmenting registers Es,Ds
was used for memory location it was essential
to store those but esi,edi and ecx aren't to be saved.

regards,
Kate
0
 
LVL 17

Author Comment

by:inthedark
ID: 13427819
Thanks _Katka_ , I found the bit which in the documentation whih tells us which registers have to be preserved. (I don't think I will need to mes with the ones that don't.  It also says I can use STI and CLD.
0
 
LVL 17

Author Comment

by:inthedark
ID: 13427824
ps __ASM  through me for a while. __asm works fine.
0
 
LVL 10

Expert Comment

by:_Katka_
ID: 13427926
You're welcome,

STI and CLD may be used almost without limitations
unless you're messing with interrupts themself.

I hope you noticed the "rep movsb/w/d" which repeats
until cx=0 without need of determining the direction.

regards,
Kate
0
 
LVL 22

Expert Comment

by:grg99
ID: 13428613
Why oh why are you doing a CLI/STI?

(1)  It doesnt save any time, the interrupts will have to be handled after the STI

(2) It doesnt make the copy any faster.

(3) If any of the memory you're copying isnt paged into RAM, KABOOM!

(4) You can't really turn off interrupts in any modern OS in user mode, the kernel traps all those special op codes and either ignores or tries to "emulate" them. There's not much the kernel can do to emulate a CLI, except perhaps not reflect interrupts into DOS boxes.

(5)  All that emulation takes a round-trip through the kernel, and many many clock cycles, almost certainly more than your memory copy.


... so don't bother with that.

------------

Also on the Pentiums, it's always faster to do a explicit copy loop:  mov  eax,[esi]; mov [edi],eax; dec ecx; jnz *-8

That's because there's only one movs unit, while Pentiums and Athlons have many functional units that can do those simple instructions.



0
 
LVL 10

Expert Comment

by:_Katka_
ID: 13429297
Just 2 notices

1) what is supposed to do:

mov  eax,[esi]
mov [edi],eax
dec ecx
jnz *-8

2) processor is optimized for move operations to "rep movs?"
    it's determined for it by design

regards,
Kate
0
 
LVL 10

Expert Comment

by:_Katka_
ID: 13429611
I didn't mean it offensive just if you'll take (my fastest approximation of your opinion):

asm {
  mov esi,source
  mov edi,destination
  mov ecx,length
@loop:
  mov al,[esi]
  mov [edi],al
  inc esi
  inc edi
  loop @loop
}

against:

asm {
  mov esi,source
  mov edi,destination
  mov ecx,length
  rep movsb
}

the benchmarks are clear so I was curious
about your real suggested solution, but I
definitely have to agree with you on CLI/STI
only in the case that some timer was accessing
and modifying the same copyied memory
location it will suitable but I don't think it's
properiet. In case the basal copy arguments
will be optimized for multiplier of 4 then:

asm {
  mov esi,source
  mov edi,destination
  mov ecx,length
  rep movsd
}
 
this algorithm is ultimate speed winner for all
processor types.

best regards,
Kate
0
 
LVL 22

Expert Comment

by:grg99
ID: 13429822
asm {
  mov esi,source
  mov edi,destination
  mov ecx,length
  rep movsd
}
>this algorithm is ultimate speed winner for all
processor types.

Not true.  On most CPU's made since 1995, a simple mov eax,[esi], ... style loop goes faster.  As does a move using the floating-point registers or mmx instructions.   I've timed them.

Note that it probably makes no difference in any program running under an interpreter like .net.  The interpreter is going to be scads slower than any move loop.





0
 
LVL 10

Assisted Solution

by:_Katka_
_Katka_ earned 400 total points
ID: 13430156
Well there's error it has to be

asm {
  mov esi,source
  mov edi,destination
  mov ecx,length
  shr ecx,2
  rep movsd
}

but otherwise what's you're solution you
compare it with ? I tried 100M copy for:

asm {
  mov esi,source
  mov edi,destination
  mov ecx,length
@loop:
  mov al,[esi]
  mov [edi],al
  inc esi
  inc edi
  loop @loop
}

against:

asm {
  mov esi,source
  mov edi,destination
  mov ecx,length
  rep movsb
}

and second one beated the first one 5:6 in speed on
my Athlon XP 3GHz - Winbond 512RAM - Windows XP SP2

I'm just asking because I maybe badly
interpreted your post suggestion.

regards,
Kate
0
 
LVL 22

Accepted Solution

by:
grg99 earned 1200 total points
ID: 13430326
move 4 bytes at a time in the loop, not just one!

even better, move 8 with:

   sub  ecx,7   ; round down to multiple of 8
   jle    @2     ; iff less than 8 left

@1:
   mov    eax,[esi+00]
   mov    ebx,[esi+04]
   mov    [edi+00],eax
   mov    [edi+04],ebx
   lea      esi,[esi+8]
   lea      edi,[edi+8]
   sub     ecx,8
   ja        @1

@2:
    cld
   add ecx,7
   rep   movsb
0
 
LVL 10

Expert Comment

by:_Katka_
ID: 13430654
Pretty one :)
0
 
LVL 7

Assisted Solution

by:aib_42
aib_42 earned 400 total points
ID: 13433343
Hmm, 'hack'ing a DMA controller or two will probably beat any CPU-intensive copying algorithm by having a few ticks of running time independent of the copy size, though I don't think it will beat them in speed :). Remapping virtual memory is also faster than any kind of copying, but would require extreme segmentation (as in "work involving segments", not literal "segmentation" :).

grg99: How about more than 4 bytes at a time?

mov eax,[esi+00]
mov ebx,[esi+04]
mov edx,[esi+08]
...

Since Pentiums (Pro+?) have a register pool AFAIK, it shouldn't slow things down, and would certainly decrease the loop overhead. Actually, there's also something similar:

mov eax,[esi+00]
mov [edi+00],eax
mov eax,[esi+04]
mov [edi+04],eax
mov eax,[esi+08]
mov [edi+08],eax
loop...

(Hinting the loop to be a branching one could also help)

Since the move pairs are independent of eachother, they should be scheduled to run on different units/pipelines, using different 'eax'es from within the "register pool".

There should be lots of articles on the subject; it's weird how a search on "fast memory copy" returns so few.
0
 
LVL 22

Expert Comment

by:grg99
ID: 13433587
>grg99: How about more than 4 bytes at a time?

>mov eax,[esi+00]
>mov ebx,[esi+04]
>mov edx,[esi+08]

For very short moves, and only if the source and destination addresses are all in the cache.
Recall that main memory is like 3 to 5 nsec away, and a 3 GHz  CPU can issue like 45 instructions in that time.
The memory bus saturates long before the CPU runs out of steam, so it's not that important to make this loop run really fast.

DMA transfers are very slow, topping out at 100MBPS, and only for very long blocks that don't cross addressing boundaries.  And all the addresses have to be looked up thru the VM manager into absolute addresses, and VM has to be locked down for the DMA duration.  Very slow and ugly.

Virtual memory doesnt do anything like moving bytes.  It can remap stuff around in the address space, but it can't copy data.



0
 
LVL 17

Author Comment

by:inthedark
ID: 13436660
Thanks everybody for these very interesting comments.  The last time I delved into Assembler was about 1985, things have moved on a pace.  grg99 in the old days you needed STI & STD as changes to direction may affect badly writen device drivers, so before setting CLD you needed STI.  So I think that now you don't need to use STI & STD as the registers used by devices are within different virtual machine.

And you ask why would I want to use CLD.  The answer is that is the very point of this as you can make a very fast sort.  If you have say an array of 4000 elements and you want to insert one record in the middle you can ripple the top elements upward using just 2 machine instructions REP & MOV?. (The question is which MOV?.)

I have been using the windows api copymemory but that makes a copy of the source memory first so using a topdown copy will work twice as fast.  Using this concept with a "binary chop find" and also "binary chop merge" concepts it is possible so sort a million string keys in less than 1 second.

I am having trouble locating a good reference forasm intruction set supported by c++ __asm. Can anybody suggest where this information can be found?

0
 
LVL 17

Author Comment

by:inthedark
ID: 13436723
So it would seem that REP & MOV may be slower than manual memory handling as suggested by aib_42  & grg99.

When I work out whats going on (in the other examples) I will probably use both types of copy and have the "application install" work out which is fastest and simply save the result which can be used for future memory copies.  Knowing the speed of the memory move  also helps to optimize the sort as it defines how many keys you can sort before starting a new batch of keys which will then be merge in with earlier batches.
0
 
LVL 22

Expert Comment

by:grg99
ID: 13437917
It's a poor idea to move a lot of data when sorting.  Much faster and better to move 16-bit indices around.  

0
 
LVL 17

Author Comment

by:inthedark
ID: 13438347
grg99 thats exaclty what I intend to do.  The speed is obtained from moving just pointers to data. But it stars to slow down when your pointers array starts to exceeds 100K bytes and then at 400K it realy becomes slow. So its quicker to open a new batch of keys and merge all your batches at the end.
0
 
LVL 22

Expert Comment

by:grg99
ID: 13438391
You can avoid all that data movement too, by making it a linked list of pointers.  Then you can insert an intem into the list by just setting three pointers,  not by having to move many kbytes.  You may need an auxiliary array if you still need random-access to the data.

0
 
LVL 7

Expert Comment

by:aib_42
ID: 13451482
>Virtual memory doesnt do anything like moving bytes.  It can remap stuff around in the address space, but it can't copy data.

Hency why I said "faster than any kind of copying"... Well, moving/sorting indices or pointers is not very different from this, methodically, but certainly much easier to implement.

>...and only for very long blocks that don't cross addressing boundaries.

Thanks, this is new for me. What kind of boundaries?
0
 
LVL 22

Expert Comment

by:grg99
ID: 13452233
>Thanks, this is new for me. What kind of boundaries?

The original DMA controllers only had 16 bit counters, and fixed bank selects, so DMA transfers could not exceed 64K in length or cross 64K boundaries.  And the DMA controller only knew about physical addressing.  And only had a 8-bit data path, and could only do 1 byte per cycle.  That's why nobody used them for moving data around.

They've probably gotten better in more recent support chips, but only 100 times better I'd estimate.  Still not up to CPU move speeds.

0
 
LVL 22

Expert Comment

by:grg99
ID: 13452331
> in the old days you needed STI & STD as changes to direction may affect badly writen device drivers

Sure, it's quite possible to have an interrupt handler that doesnt properly handle the direction flag.  Done it myself several times.

But typical programs and the OS are already setting and clearing the direction flag, so any problem should show up in milliseconds, and all the time.  A lot of string operations have to move data from the top down, so there's a lot of CPU time extant with the direction flag set "backwards", and things don't crash.   So I don't quite see why the typical user program has to worry about this.  Especially in modern OS's that handle interrupts in kernel mode, whth a whole different set of flags.

Regards,

grg99

0
 
LVL 17

Author Comment

by:inthedark
ID: 13459067
Thanks for the help! I will award points later next week, as I am working away from my development pc at present and can't test the suggestions.
0
 
LVL 17

Author Comment

by:inthedark
ID: 13465266
PS Any suggestions regarding __ASM instruction reference?
0
 
LVL 17

Author Comment

by:inthedark
ID: 14008996
Thnaks for the help.  I learnt something from all of the comments.  In the end I coded both types of loop and did a test to see which was quickest for the first run on each workstation.
0

Featured Post

New benefit for Premium Members - Upgrade now!

Ready to get started with anonymous questions today? It's easy! Learn more.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

An introduction to the wonderful sport of Scam Baiting.  Learn how to help fight scammers by beating them at their own game. This great pass time helps the world, while providing an endless source of entertainment. Enjoy!
If you're a modern-day technology professional, you may be wondering if certifications are really necessary. They are. Here's why.
Do you want to know how to make a graph with Microsoft Access? First, create a query with the data for the chart. Then make a blank form and add a chart control. This video also shows how to change what data is displayed on the graph as well as form…
In this video, Percona Solution Engineer Dimitri Vanoverbeke discusses why you want to use at least three nodes in a database cluster. To discuss how Percona Consulting can help with your design and architecture needs for your database and infras…

765 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question