Solved

ASM addressing models

Posted on 2000-01-02
19
Medium Priority
494 Views
Last Modified: 2012-06-21
I have some asm code in a VC6 function that needs to access a table of data. If I use a static block of data I get a different answer than if I access the data through a pointer.

To get around this I copy the data into a local buffer, but this seems silly.

In short, how do I remove the memcpy calls in the following code?

unsigned long cCRCEngine::UpdCustom(unsigned char *pb_data, long lCnt, unsigned char* pabTable)
{
      unsigned long paltable[256];
      unsigned short pastable[256];

      if(32 bit)
      {
            memcpy(paltable,pabTable,1024);
            _asm
            {
                  arrayloop:                        // loop come back to here!
                        //   table[x] ^ dw_accum;
                        XOR     EAX, paltable[EDX]
                  LOOP    arrayloop            // if(--ECX != 0) goto arrayloop!
            }
      }
      else /* 16 bit */
      {
            memcpy(pastable,pabTable,512);
            _asm
            {
                  arrayloopx:                        // loop come back to here!
                        XOR     AX, pastable[EDX]      // ^ crcccittf[?]
                  loop    arrayloopx            // if(--ECX != 0) goto arrayloop!
            }
      }
}

Question by:chris_a
19 Comments
 
LVL 32

Expert Comment

by:jhance
ID: 2319043
You said:

"code in a VC6"

but in your example, you are implying both 16 and 32 bit code:

if(32 bit)
else /* 16 bit */


VC6 is 32-bit only and has only the one (i.e. 32-bit flat) memory model.  Is there more to this than you've let on?  Are you using something other than VC6?
0
 
LVL 2

Author Comment

by:chris_a
ID: 2319060
The 16 and 32 bit in this refer to the width of the CRC precalculated tables ie CRC32 or CRC16.
0
 
 
LVL 1

Accepted Solution

by:
TheMadManiac earned 400 total points
ID: 2320179
The difference between accessing a pointer and static data is just that..

you can access through pointers like this:

mov eax, pointer_variable // pointer to data
mov [eax],something // put something in pointer_variable[0]

would do the same as:

mov variable[0],something // static data

The brackets around eax mean that it is not used as a plain register (i.e. storing into or reading from eax itself) but as a pointer: the instruction reads or modifies whatever eax points at.

What you would get if you do:

mov pointer_variable[0],eax

is just that: eax is stored in the pointer variable itself, overwriting the pointer, as if pointer_variable were the first entry of a list of pointers.

Changing your code would not make a lot of sense, as you have obviously pasted only what was needed to show the problem (you use uninitialised variables and the incrementing is removed).
However, the above should explain what you did wrong and how to fix it. The memcpy is then not needed.

Floris
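
The distinction described above can also be shown in plain C++. This is only a sketch, not code from the thread: `load32_at` and `same_element_via_array_and_pointer` are hypothetical helper names, and the table contents are made up. For an array, the variable's own address is where the data lives, so "variable + offset" finds the element. For a pointer, the variable only holds the address of the data, so the pointer value has to be loaded first and the offset added to that.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: read a 32-bit value located byte_offset bytes past base.
std::uint32_t load32_at(const void* base, std::size_t byte_offset) {
    std::uint32_t value;
    std::memcpy(&value, static_cast<const unsigned char*>(base) + byte_offset, sizeof value);
    return value;
}

// Returns true when both access paths reach the same table element.
bool same_element_via_array_and_pointer() {
    static const std::uint32_t table[4] = {0x11111111u, 0x22222222u, 0x33333333u, 0x44444444u};
    const std::uint32_t* ptr = table;

    // Array: address of the variable + offset (what `table[EDX]` encodes).
    std::uint32_t via_array = load32_at(&table, 8);

    // Pointer: value held in the variable + offset (the `[reg + EDX]` form,
    // after the pointer has been loaded into a register).
    std::uint32_t via_pointer = load32_at(ptr, 8);

    return via_array == 0x33333333u && via_pointer == via_array;
}
```

Using the address of `ptr` itself as the base would read whatever happens to sit next to the pointer variable, which matches the "different answer" described in the question.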
0
 
LVL 22

Expert Comment

by:nietod
ID: 2320621
if(32 bit)
{
   memcpy(paltable,pabTable,1024);
   _asm
   {
       XOR EAX,EAX
       MOV ECX,256
       LEA  EBX,paltable

arrayloop:
       XOR     EAX,DWORD PTR [EBX]
       ADD     EBX,4
       LOOP    arrayloop // if(--ECX != 0) goto arrayloop!
   }
}
else /* 16 bit */
{
   memcpy(pastable,pabTable,512);
   _asm
    {
       XOR AX,AX
       MOV ECX,256
       LEA  EBX,pastable

arrayloopx: // loop come back to here!
       XOR     AX,WORD PTR [EBX]
       ADD     EBX,2
       loop    arrayloopx // if(--ECX != 0) goto arrayloop!
    }
}
0
 
LVL 2

Author Comment

by:chris_a
ID: 2321387
The answers have gone over my head here; I've still got concussion from the fireworks along the Thames in London.

If I post the whole routine you can all have a laugh, and maybe optimize it for me.

I only did a little assembly [6809] at college, and one commercial program in 8048 so I can't even claim to be rusty!

###########################################################

unsigned long cCRCEngine::UpdCustom(
      unsigned char *pb_data, long lCnt,
      unsigned char* pabTable, unsigned char bWidth)
{
      unsigned long paltable[256];
      unsigned short pastable[256];

      if(! (((bWidth==2) || (bWidth==4)) && (pabTable != 0)) )
      {
            return 0;
      }

      // get xor value off the end of the table
      long* pXOR = (long*)(((unsigned long)(void*)pabTable) + (bWidth << 8));
      unsigned long dw_CustXOROP = *pXOR;
            
      if(bWidth==4)
      {
            unsigned long oldcrc;
            memcpy(paltable,pabTable,1024);
            
            oldcrc = dw_accum ^ dw_CustXOROP;
            
            _asm
            {
                  // copy data from c to asm
                  mov     EBX, pb_data      // Load address of buffer
                  mov     ECX, lCnt            // Loop limit
                  mov     EAX, oldcrc            // current value
                  
                  // sanity check
                  jecxz   getout                  // no data, then leave
                  or      ebx, ebx            // set flag if this is a null ptr...
                  jz      getout                  // .. and then get out!
                        
                  arrayloop:                        // loop come back to here!
                  
                        // move from buffer to register
                        MOV     DL, [EBX]
                        INC     EBX
                        
                        // calculate offset into table, store in edx
                        //   x = (dw_accum ^ pb_data[k]) & 0xFFL
                        XOR     DL, AL                  
                        MOVZX   EDX, DL            

                        //   dw_accum >> 8
                        SHR     EAX, 8          

                        //   table[x] ^ dw_accum;
                        SHL            EDX, 2      // long offset into table
                        XOR     EAX, paltable[EDX]
                        
                        // now eax has the current value
                  
                  LOOP    arrayloop            // if(--ECX != 0) goto arrayloop!
                  
                  MOV     oldcrc, EAX            // save current value
                        
                  getout:                              // jump target to leave!
            }
            
            dw_accum = oldcrc ^ dw_CustXOROP;
      }
      else
      {
            unsigned short oldcrc = (unsigned short) (dw_accum ^ dw_CustXOROP);
            memcpy(pastable,pabTable,512);

            _asm
            {
                  // copy data from c to asm
                  mov     EBX, pb_data      // Load address of buffer
                  mov     ECX, lCnt            // Loop limit
                  mov     AX, oldcrc            // current value

                  // sanity check
                  jecxz   getoutx                  // no data, then leave
                  or      ebx, ebx            // set flag if this is a null ptr...
                  jz      getoutx                  // .. and then get out!
                        
                  arrayloopx:                        // loop come back to here!

                        // calculate offset into table, store in edx
                        MOV            EDX, EAX                  // (w_accum >> 8)
                        SHR            EDX, 8
                        XOR            EDX, [EBX]                  // ? ^ pb_data[k];
                        INC     EBX
                        AND            EDX, 0xFF
                        SHL     EAX, 8                        // (w_accum << 8)
                        SHL            EDX, 1                        // *2
                        XOR     AX, pastable[EDX]      // ^ crcccittf[?]

                  loop    arrayloopx            // if(--ECX != 0) goto arrayloop!

                  mov     oldcrc, AX            // save current value
                        
                  getoutx:                        // jump target to leave!
            }

            dw_accum = (unsigned long) oldcrc ^ dw_CustXOROP;
      }

      return dw_accum;
}

###########################################################
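
For comparison, the 16-bit loop above appears to compute the following, written here as a portable C++ sketch that reads the table through a pointer. The names `make_crc16_table` and `crc16_update` are hypothetical, and the polynomial 0x1021 is an assumption chosen only so the result can be checked; the thread never says which polynomial the custom table uses.

```cpp
#include <cstddef>
#include <cstdint>

// Build a 256-entry MSB-first CRC-16 table.
// Assumption: polynomial 0x1021 (CCITT); the thread does not name one.
void make_crc16_table(std::uint16_t table[256]) {
    for (unsigned i = 0; i < 256; ++i) {
        std::uint16_t r = static_cast<std::uint16_t>(i << 8);
        for (int bit = 0; bit < 8; ++bit) {
            r = static_cast<std::uint16_t>((r & 0x8000u) ? ((r << 1) ^ 0x1021u) : (r << 1));
        }
        table[i] = r;
    }
}

// Same steps as the arrayloopx body:
//   index = ((crc >> 8) ^ data[k]) & 0xFF;  crc = (crc << 8) ^ table[index]
std::uint16_t crc16_update(std::uint16_t crc, const unsigned char* data,
                           std::size_t count, const std::uint16_t* table) {
    for (std::size_t k = 0; k < count; ++k) {
        unsigned index = ((crc >> 8) ^ data[k]) & 0xFFu;
        crc = static_cast<std::uint16_t>((crc << 8) ^ table[index]);
    }
    return crc;
}
```

A version like this is handy as a reference to compare the assembly output against while changing the addressing.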
0
 
LVL 22

Expert Comment

by:nietod
ID: 2321452
That code looks okay to me.  It isn't using the data you memcpy(); it's using the data passed to the function.

Why use assembly in the first place?  This hardly seems worth it, especially if you don't know assembly.
0
 
LVL 2

Author Comment

by:chris_a
ID: 2322392
I am fairly sure it is using the copied data: pabTable is copied into paltable for 32 bit or pastable for 16 bit.

The reason for the ASM is just speed and my education. This is a fragment of the code in my crcocx, which is typically used to speed up such calculations in VB. It has been downloaded a few thousand times and about a hundred users have sent me emails, so I guess it is quite widely used.

I translated all the standard CRC calculations to ASM and it doubled the speed. I am adding custom CRCs now, so it seemed best to do that in ASM too.

Once this is done I think the OCX will be finished: it CRCs strings, arrays, variants and files on disk, synchronously or asynchronously.
0
 
LVL 22

Expert Comment

by:nietod
ID: 2322520
The code you posted does

mov     EBX, pb_data // Load address of buffer

at the start of the assembly.  pb_data is the parameter to the procedure, right?  (this is a little hard to read.)  So it is using the data passed to the procedure, not the data you copied to the local array.

Or am I missing something?

Here are some assembly tips:

for

or      ebx, ebx // set flag if this is a null ptr...

This never changes ebx, just the flags, but the processor doesn't "know" that, because OR usually can change the destination register.  Because of this, the OR instruction may cause other pipes to stall until the result of the OR is available.  So for this case always use TEST instead of OR (TEST EBX,EBX here).  TEST never changes the destination, so the processor doesn't ever stall.

MOVZX   EDX, DL

can be replaced with

AND EDX,0FFh

to save two clock cycles.

However, since you load DL a few instructions before, you would be best off to do

MOVZX EDX,BYTE PTR [EBX]

This clears the high three bytes and loads the low byte in one 3-cycle step.
0
 
LVL 2

Author Comment

by:chris_a
ID: 2322557
Ah, now I see the confusion. Yes, pb_data is the data being processed; my problem is pabTable, which holds the precalculated values of a custom CRC algorithm.

When I do the standard (PKZip etc) CRCs, I use a global (static) buffer/table, but if I use the same code to access a dynamic buffer (with identical contents in the VC memory window) I get different results.

To get around that I declared a local buffer and copy the custom CRC table in each time. This seems inefficient to me, especially as the total CRC calculation may be split into sections to allow a progress bar to be updated. If this happens I may end up copying this buffer hundreds of times.
0
 
LVL 1

Expert Comment

by:TheMadManiac
ID: 2322573
You can make a pointer to the static buffer, in effect making the static buffer accessible in the same way as a dynamically allocated buffer.

int blah[80];

int *ptr=blah;

Floris
0
 
LVL 22

Expert Comment

by:nietod
ID: 2322609
I see.

XOR     EAX, paltable[EDX]

Should use the table specified in the parameters.  The difference is that paltable is an array located on the stack and pabTable is a pointer to an array.  So given the pointer, you need to use it to access the array.  To do that, load the pointer to the table into a register such as ESI before the loop, then use base+index addressing to get the data from the table, like

XOR EAX,DWORD PTR [ESI+EDX]
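
In C++ terms, the suggestion amounts to indexing the table through the pointer instead of through a local copy. The following is a sketch under assumptions: `make_crc32_table` and `crc32_update` are hypothetical names, the standard CRC-32 polynomial 0xEDB88320 is used only so the result can be checked against the well-known check value, and the buffer behind `pabTable` is assumed to really hold aligned 32-bit entries.

```cpp
#include <cstddef>
#include <cstdint>

// Standard reflected CRC-32 table (polynomial 0xEDB88320), used here
// only as a stand-in for the custom table discussed in the thread.
void make_crc32_table(std::uint32_t table[256]) {
    for (std::uint32_t i = 0; i < 256; ++i) {
        std::uint32_t r = i;
        for (int bit = 0; bit < 8; ++bit) {
            r = (r & 1u) ? ((r >> 1) ^ 0xEDB88320u) : (r >> 1);
        }
        table[i] = r;
    }
}

// 32-bit update loop that reads the table through the pointer; no memcpy.
std::uint32_t crc32_update(std::uint32_t crc, const unsigned char* pb_data,
                           std::size_t lCnt, const unsigned char* pabTable) {
    // Counterpart of loading the pointer into ESI before the loop.
    // Assumption: pabTable points at aligned 32-bit table entries.
    const std::uint32_t* table = reinterpret_cast<const std::uint32_t*>(pabTable);
    for (std::size_t k = 0; k < lCnt; ++k) {
        std::uint32_t x = (crc ^ pb_data[k]) & 0xFFu;  // index into the table
        crc = (crc >> 8) ^ table[x];                   // base + index lookup
    }
    return crc;
}
```

With an initial value of 0xFFFFFFFF and a final XOR of 0xFFFFFFFF, this loop gives the usual CRC-32 of a buffer, which makes it easy to compare against the assembly version.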

0
 
LVL 22

Expert Comment

by:nietod
ID: 2322620
TheMadManiac, (Or may I call you mad?) that is not enough, the problem is the addressing mode.
0
 
LVL 1

Expert Comment

by:TheMadManiac
ID: 2322648
Call me anything you want :) I would change my nick to Tamama but I can't.

My first post already addressed the addressing part of accessing pointers, although it did not cover index addressing with extra registers etc.

The register used for the pointer does not matter in 32 bit mode, although a 32 bit register would be advised ;-) I usually use eax for the most used register as it usually makes instructions shorter.

Floris
0
 
LVL 2

Author Comment

by:chris_a
ID: 2322868
Can I just add the pointer to EDX and use these two?

32 bit
XOR EAX,DWORD PTR [EDX]

16 bit
XOR AX,WORD PTR [EDX]
0
 
LVL 22

Expert Comment

by:nietod
ID: 2322917
>> I usually use eax for the most used register as it
>> usually makes instructions shorter
That is true for integer operations, but not for address operations.  In 16-bit code, using EAX as a base requires the extended addressing modes (386+), which add an address-size prefix, and sometimes a SIB byte, to the instruction.  There, only BX and BP can act as base registers and SI and DI as index registers without the extended modes.  In 32-bit code nearly any general-purpose register can be the base at no extra cost (ESP and EBP have slightly longer encodings); adding an index register costs one SIB byte whichever registers are used.

>> already addressed the addressing part of
>> accessing pointers
It's confusing because he's not accessing static data.  Currently the data is on the stack.  The instruction

XOR     EAX, paltable[EDX]

looks like it is accessing static data, and would have to in pure assembly.  But in the C++ compiler the assembler/compiler "knows" that paltable is on the stack, so it "alters" the instruction to generate base+index addressing using EBP, i.e. it does

XOR EAX,[EBP+EDX+Off]

where Off is an offset it calculates to the start of the array in the locals.

>> Can I just add the pointer to EDX and use these two?
yes, but you will need to do so each time through the loop.  That is each time in the loop you will have to calculate EDX and then add on the value stored in the pointer, then use the result.  This is likely to be a little longer and slower (due to stalls) than what I proposed.  

Furthermore, since you will be using one of the extended addressing modes (EDX is being used as a base/index), you might consider using a scaled index; then you wouldn't have to do the SHL instruction, like

XOR EAX,DWORD PTR [ESI+4*EDX]

This doesn't cost extra in time or space over the non-scaled version and saves you the time (and stalls) and space associated with performing that SHL.
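
The two forms address the same element, which a small C++ sketch can make concrete. The helper names `load_shift_then_add` and `load_scaled_index` are hypothetical and exist only for this illustration: the first computes the byte offset by hand, as SHL followed by base+index does, and the second lets the element index be scaled by the element size, as the scaled-index form does.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Byte offset computed by hand: index << 2, then base + offset.
std::uint32_t load_shift_then_add(const unsigned char* base, std::size_t index) {
    std::size_t byte_offset = index << 2;  // 4 bytes per table entry
    std::uint32_t value;
    std::memcpy(&value, base + byte_offset, sizeof value);
    return value;
}

// Element index scaled by the element size as part of the address itself.
std::uint32_t load_scaled_index(const std::uint32_t* table, std::size_t index) {
    return table[index];  // address is table + 4 * index
}
```

Both functions return the same value for any valid index; the scaled form simply leaves the multiplication to the address calculation instead of a separate instruction.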
0
 
LVL 2

Author Comment

by:chris_a
ID: 2323018
So in short, I should load it into ESI and offset from that.

I don't see why
  XOR EAX,DWORD PTR [ESI+4*EDX]
is faster than
  SHL EDX, 2
  XOR EAX,DWORD PTR [ESI+EDX]
I thought shifts were way quicker than multiplies.

plus
Use
  TEST EBX, EBX
and
  MOVZX EDX,BYTE PTR [EBX]
to get a byte from the buffer.

I shall test this lot tonight, and post a second question so you both get the points.

0
 
LVL 22

Expert Comment

by:nietod
ID: 2323207
>> I thought shifts were way quicker than multiplies
True, but this doesn't use a "generic" multiply.  This EA (effective address) mode scales (multiplies) the index register by 1, 2, 4, or 8, all of which are accomplished by shifts.  More importantly, this scaling is done by the dedicated EA calculation hardware, so the entire EA ([ESI+4*EDX]) is calculated in one clock cycle.  And with luck it is calculated before it is actually needed, so the XOR instruction will not require any time to calculate the EA.

In your case, you will probably adjust EDX right before the XOR instruction, so the EA will have to be calculated as part of the XOR instruction, and you may be penalized 1 clock cycle for this.  But the alternative is to do a shift and an add as two separate instructions, each of which takes a clock cycle (they can't run in separate pipes, as the add will stall until the shift is finished).  When these are done, you do the XOR, but again you need to calculate an EA, and again the registers used in the EA were just changed, so you may still be penalized a clock cycle in the XOR instruction.  So using the scaled EA I suggested should save you 2 clock cycles.

FYI any EA (no matter how simple, like [EBX], or how complex, like [EAX + 8*EBX+52]) takes 1 clock cycle on a 386 or later processor.  On a 486 or later, however, the processor may be able to do that calculation before the EA is needed, so often that clock cycle isn't really needed.  However, if the registers used in the EA are changed right before the instruction that needs the EA, then you might have to wait that extra cycle (sometimes not).
0
 
LVL 2

Author Comment

by:chris_a
ID: 2343469
Thanks for the help chaps - you can see the results in http://www.preface.co.uk/crcocxb.zip if you like

I will post another Q for nietod
0
