Very Fast Function For Checking If Two Memory Blocks(~100bytes) Are Equal?

magellanLS
magellanLS used Ask the Experts™
on
Hello Experts! :)

Is There A Very Fast Function For Checking If Two Memory Blocks(~100bytes) Are Equal?

Thanks Alot!
Comment
Watch Question

Do more with

Expert Office
EXPERT OFFICE® is a registered trademark of EXPERTS EXCHANGE®
I think a simple downto loop is good enough. See if you can compare DWORDs.
Any assembler function will vary in speed depending on the CPU. It should not be that much faster than the loop anyway.
The loop should even be faster because the overhead of calling a function may hamper the assembler function.
Top Expert 2004

Commented:
look at CompareMem

from the delphi help

Performs a binary comparison of two memory images.

Unit

SysUtils

Category

comparison routines

Delphi syntax:

function CompareMem(P1, P2: Pointer; Length: Integer): Boolean; assembler;

Description

CompareMem performs a binary compare of Length bytes of memory referenced by P1 to that of P2.  CompareMem returns true if the memory referenced by P1 is identical to that of P2.

mewikl ;-)
VGR

Commented:
yes.
You may also use a type Record of Byte whose Addr is set to your memory block(s) via absolute, and just compare equality using '='

Or use an array (vector) of Byte for the same purpose
Rowby Goren Makes an Impact on Screen and Online

Learn about longtime user Rowby Goren and his great contributions to the site. We explore his method for posing questions that are likely to yield a solution, and take a look at how his career transformed from a Hollywood writer to a website entrepreneur.

VGR, in the end something like CompareMem is called.
VGR

Commented:
yes, I know 8-))

Author

Commented:
Hello!

Robert ,Could you show some code, how you would do it with a downto loop.  

Mewikl ,'CompareMem' is not fast enough.

Thanks!

P.S: Forgot to say that I need to compare two small blocks for equality 1 Million times per second or even more.

Top Expert 2004

Commented:
damn, where comes the w from,
not mewikl -> meikl
I am not sure if it is at all possible. Comparing 100 bytes 1 million times a second means 100.000.000 million instructions at least. If it is always a new array to compare then the memory interface throughput is not fast enough.

This only works if the size of the data is divisible by four.

untested code:

type
  PCompMem = ^TCompMem;
  TCompMem = array [0..255] of DWORD;
var
  I: Integer;
  Cmp1, Cmp2: PCompMem;
begin
  Cmp1 := PCompMem(SomeMem1);
  Cmp2 := PCompMem(SomeMem2);

  Result := False;
  for I := SizeOfAreaInDWORDs downto 0 do
    if Cmp1[I] <> Cmp2[I] then
     begin
       Result := True;
       Break;
     end;
end;

The main speed improvements are
1. the code is inlined so no overhead of calling a function
2. the compares are handled 4 bytes at once
there is another option: to use MMX registers and ops, you can do 64 byte compare in a single instruction

but I am not sure if you can reach a speed you want - there could be some overheads... where do you get these blocks from ? are they all loaded to memory ? give us a large scaled algorythm please

by the way, this is a source of CompareMem()  ;-)
as you can see it is dwords compare op: cmpsd

function CompareMem(P1, P2: Pointer; Length: Integer): Boolean; assembler;
asm
        PUSH    ESI
        PUSH    EDI
        MOV     ESI,P1
        MOV     EDI,P2
        MOV     EDX,ECX
        XOR     EAX,EAX
        AND     EDX,3
        SHR     ECX,1
        SHR     ECX,1
        REPE    CMPSD
        JNE     @@2
        MOV     ECX,EDX
        REPE    CMPSB
        JNE     @@2
@@1:    INC     EAX
@@2:    POP     EDI
        POP     ESI
end;

Author

Commented:
Hello,

Alisher_N ,

>where do you get these blocks from ?
From a big bitmap image

>are they all loaded to memory ?
Yes

 

so it is just a byte sequence, why 1xx byte chunks then ?

Commented:
you only get 70ms overhead for every 1,000,000 function calls so if CompareMem is not fast enough you can only gain less than 1/10 of a second by implementing it inline instead of calling it.  You need a better algorithm I think.

Author

Commented:
Alisher_N ,
>so it is just a byte sequence, why 1xx byte chunks then ?

Becuase, I have 2 Bitmap Images in memory ,instead of comparing 1 pixel(if pixel value dont match, append to a memory buffer) at a time it should be faster comparing by 60(or more) pixels  

Commented:
I peak at about 2.5M transactions per second on an 800 MHz machine for a thread safe object lock and release (separate function calls).  If you use a three processor box and your algorithm is thread safe you should _just_ be able to manage this task, as long as nothing else is going on.

Everything up to the REPE CMPSD is setup.  REPE CMPSD performs all of its work internally (except the actual DWord fetches) in the CPU, so you are running as fast as the CPU will allow.  Have you confirmed that your data falls on DWord boundaries?  This can have significant impact.
never use '60' or '100' numbers when deailng with PC ;-))
use 64 and 128 ;-)

ok, now my point of view to this problem is that you can try to implement this algorythm

0. check for MMX feature in your CPU
1. set up two pointers to both images (byte sequences) as p1 p2, make sure they are aligned to 64 bytes and padded so (X mod 64)=0
2. load 64 bytes in to mm1 from p1
3. compare mm1 to p2^
4. if mistmatch break this loop, go step 7.
5. advance pointers by 64 bytes, break if image is over, goto 8
6. go to step 2

7. images don't match, do something

8. images match, do something else

after implementation you have to spend some time in thorough benchmarking....

p.s.  by the way, if you are trying to do a motion detection from your web camera this way, it maybe not the best approach ;-)

Commented:
Alisher raises a good point.

What is the "end user" version of this problem ... there may be a better approach than BIBF comparisons between massive arrays, or you may need to adjust your expectations.  I don't think PC busses are capable of moving 100,000,000 bytes per second (or even 25,000,000 DWORDS) without special hardware, let alone doing work on them.  

The CPU might handle it, and multiple CPU's will handle it, but the bus won't.

To get the performance you need, you might need to go to a higher performance platform like a PowerPC or a SPARC.  On the intel platform it might be possible horizontally scale the problem across multiple parallel machines, or to queue the problem serially between machines to get adequate performance, depending on the end-user expectations.
magellanLS:
This old question needs to be finalized -- accept an answer, split points, or get a refund.  For information on your options, please click here-> http:/help/closing.jsp#1 
EXPERTS:
Post your closing recommendations!  No comment means you don't care.

Do more with

Expert Office
Submit tech questions to Ask the Experts™ at any time to receive solutions, advice, and new ideas from leading industry professionals.

Start 7-Day Free Trial