Solved

How fast is your ASM hex converter (3)

Posted on 2002-03-31
32
1,019 Views
Last Modified: 2009-04-14
Thread summary:
See http:Q.20280946.html for the original challenge

o *abel* dropped by, said "hmmmmm...." then dropped off the end of the earth.
o *AvonWyss* submitted a lookup-based algorithm that kicked butt.
o *Pavlik* submitted straight C code which beat all ASM to that point.  The *compiler* (helped by an excellent, efficient algorithm) seemed to be smarter than all of us!
o *DanRollins* sipped a brew while generating some benchmarking code.
o *Jonin's* sprintf-based code illustrated an important point:  The 'obvious' way is at least 100 times slower than an optimized, targeted solution.
o AvonWyss's second solution used a clever non-lookup technique in both the hex and ASCII part. He even explained how that darn thing worked!
o *absong* showed an interesting hex-calc technique that used the upper nibble (after a subtraction) as a mask.  Clever, but failed in certain situations.
o We opened new thread http:Q.20280946.html
o AvonWyss and DanRollins worked on minor tweaks to Avon's non-lookup code, trying to squeeze a few cycles out.
o abel magically reappeared, bearing an MMX-based solution that knocked our socks off! (at least twice as fast as the fastest, but initially flawed in that it was not spacing out the hexdigit pairs).  Abel's code forsakes the lookup tables in favor of the MMX ability to process 8 bytes in parallel.
o The spacing-out of the hex digits has ended up being complicated and time-consuming, but we have worked out several promising techniques.  The resulting code remains fastest to date: about 50% faster than Pavlik's pure lookup.

o We continued here, with these thoughts in mind:

? EMMX: worth delving into?
? Can Prefetch give new life to lookup tables?
? Can code-interleave to avoid pipeline stalls speed things up?
? Loading MMX mask values from 32-bit regs -- faster?

Come on in!  
0
Comment
Question by:DanRollins
32 Comments
 
LVL 49

Author Comment

by:DanRollins
Comment Utility
Avon,
I did check out your link about optimization:
   http://www.agner.org/assem
and here is the link to the Intel P3 manuals, including one with optimization guidelines:
 http://developer.intel.com/design/PentiumIII/manuals/

The concepts are pretty clear, but implementing the suggestions seems difficult (scramble the code, time it, compare to previously-scrambled code, etc...).

In fact, it seems exactly like something a computer could do.  Has anyone ever heard of a program that takes a stream of P3 opcodes and analyses them, and then reorders them to maximize performance?  I see Intel provides a 30-day trial for its VTune(tm) toolkit...

-- Dan
0
 
LVL 49

Author Comment

by:DanRollins
Comment Utility
able,
In attempting to avoid loading the masks from RAM, I tried:

#define M_LoadMaskPsllq( mm, n ) __asm{\
__asm mov eax, n \
__asm movd mm0, eax \
__asm movq mm, mm0 \
__asm psllq mm,32 \
__asm por mm,mm0 \
}

#define M_LoadMaskUnpk( mm, n ) __asm{\
__asm mov eax, n \
__asm movd mm0,eax \
__asm punpcklbw mm,mm0 /*I am too sexy for my shirt!*/\
}

#define M_LoadMaskDirect( mm, addr ) __asm{\
__asm movq mm, qword ptr [addr] \
}

I timed the results with the no-hexdigit-output version (as at the end of the previous thread):

   59,662  Direct movq mmX, addr
   57,074  using punpcklbw
   56,206  using psllq

So, it appears that the conciseness of the original code wins out ... perhaps because the masks stay in L1 cache after the first pass.
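
For completeness, here is a minimal usage sketch (it assumes mm4 is the target register, and keep in mind that the two register-based macros clobber EAX and MM0):

static LONGLONG mask0F_demo = 0x0f0f0f0f0f0f0f0fi64;   // the i64 suffix keeps the compiler quiet

void LoadMaskDemo()
{
     M_LoadMaskDirect( mm4, mask0F_demo );   // movq mm4, qword ptr [mask0F_demo]
     M_LoadMaskPsllq ( mm4, 0x0f0f0f0f );    // rebuilds the same 64-bit value from a 32-bit immediate
     __asm emms                               // leave the MMX state clean before returning
}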
=-=-=-=-=-=-=-=-=-=-
For fun, I also tried some experiments to see if misaligning the masks made a difference, for instance,

unsigned char xTest[9]= {0x00, 0x0F, 0x0F, 0x0F, 0x0F, 0x0F, 0x0F, 0x0F, 0x0F};
...
movq  mm4, qword ptr [xTest+1]

The effect, if any, was minimal (or I need a better timer).  Nevertheless, I recommend using:

LONGLONG mask0F=  0x0f0f0f0f0f0f0f0f;
LONGLONG mask30=  0x3030303030303030;
LONGLONG mask39=  0x3939393939393939;
LONGLONG mask07=  0x0707070707070707;

Just because it is prestigious (i.e., way cool) to use 64-bit datatypes when doing 64-bit computing.

-- Dan
0
 
LVL 14

Expert Comment

by:AvonWyss
Comment Utility
Dan, excellent summary of yours... I see that you're also good at writing in human languages, not only computer languages... ;-)

abel, I have not yet had the time to check which commands the AMD processor supports. Sorry about that, but I haven't forgotten.

Dan, regarding optimization: I believe that the code which was generated by the C compiler out of Pavlik's code was already pretty good. And I fully agree that squeezing cycles out of existing code looks pretty much like the ideal job for a program. That said, I must admit that I have not got much farther than to interleave opcodes somewhat, as you can see.
0
 
LVL 39

Expert Comment

by:abel
Comment Utility
Before I continue posting more optimizations, I would like to make the following statement first:

The original challenge was to create a hand-optimized asm function that outputs 16 bytes of memory as a 64-byte human-readable hexadecimal string. We've come up with basically three approaches:

- a C(++) version (Pavlik) using a lookup table,
- an asm version using x86 assembler
- and an MMX version using the MMX extensions (Pentium MMX and later).

Currently this last version also uses some instructions introduced with SSE. A fourth approach is requested: using SSE2 with the 128-bit XMM registers.

I think those four approaches cannot be compared in a fair and equal way. Using a lookup table makes more use of memory than the calculating versions. Comparing them means comparing apples and pears. The algorithm itself should be compared and the memory accesses should be factored out.

Comparing the calculating versions is obviously easier, but comparing an MMX version against an x86 version is simply not fair. I'd rather compare x86 versions against x86 versions and MMX versions against MMX versions. This gives both approaches the appraisal they deserve, and as a market approach it's fairer: if you include an MMX algorithm, you should also include the x86 algorithm, simply for those systems that don't have MMX extensions. The same applies to the SSE2/EMMX approach, or to using any special extensions of other modern CPUs.

As an extreme example, anyone could buy a preliminary version of the Itanium processor, enabling software development of 64-bit applications. I reckon that even the first trial 64-bit implementation of this algorithm will beat us all easily. That does not make the algorithm smarter, nor does it make the other algorithms slower, simply because they cannot be interchanged.

Just for a test, I would like to run all three approaches on an old Pentium Pro machine with 33 MHz memory. I wonder if Pavlik is still the fastest. And on an 80486? The only compatible code will be Pavlik's and AvonWyss's. Does anyone want to make a bet on who's gonna be the winner then?

I just want to put all these brilliant algorithms in a new light. They deserve more, I think.

Cheers!
Abel
0
 
LVL 14

Expert Comment

by:AvonWyss
Comment Utility
Abel, a very good statement of yours! Especially the use of more-or-less proprietary extensions makes a comparison useless, except to come to the conclusion that the extensions do enable some kinds of calculations to be performed faster than the "normal" way.

Does anyone still have an old machine lying around, like a 486 or so, to do the tests? The slowest available at my place (but not even plugged in *g*) is a Pentium 100 non-MMX with EDO ram modules.
0
 
LVL 39

Expert Comment

by:abel
Comment Utility
I do have a Pentium 75, not plugged into the network, but I can use those lovely 3.5" discs, can't I? I even happen to have two other systems that I can use, but that will take more time because they are not mine, but I can borrow them: one 486/33 and even one 286 laptop! But that last one won't be of much use: there is no 16-bit algorithm around here ;)

Does anybody know of a table of assembly instructions showing when each was introduced? Strangely enough, Intel does not seem to provide one anymore in their regular manuals, or I can't seem to find it.

Btw, do you know that the printed Pentium manuals will be shipped to you free of charge? I did not even have to pay the s&h!

Abel
0
 
LVL 49

Author Comment

by:DanRollins
Comment Utility
I agree about the comparing apples to orangutans thing...

But stepping back a bit from the actual challenge... I think this thread is really about exploring the ins and outs of using ASM to get really Really REALLY fast results (and of course, the pleasure of exploration and discovery).  

I'm not so sure that there is much to gain by finding (for instance) the fastest 286-only version of a hex converter -- even though that would be an interesting puzzle.

We are not bound by some whip-cracking boss saying "this must run on IE4" or "make it compatible with my Palm Pilot".  Here, we can take flight into new realms.

How about making this 'implicit' in the challenge: Somewhere there is a CPUID test or whatever that selects from among the best of these routines for the platform that is currently active.
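
Something like this, perhaps -- a rough, untested sketch (bit 23 of EDX from CPUID function 1 is the MMX flag; the fallback function name is made up, and I assume CPUID itself is available):

void HexLineOut( unsigned char* pSrc, char* pszDest );           // the MMX routine
void Pavlik_C_HexLineOut( unsigned char* pSrc, char* pszDest );  // plain lookup version (name assumed)

typedef void (*HexLineOutFn)( unsigned char* pSrc, char* pszDest );

HexLineOutFn PickHexLineOut()
{
     unsigned int features = 0;
     __asm {
          mov  eax, 1
          cpuid
          mov  features, edx
     }
     if ( features & (1 << 23) )     // MMX present?
          return HexLineOut;
     return Pavlik_C_HexLineOut;
}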

I have no qualms about awarding points for the best non-MMX code or the best non-lookup, non-MMX solution.  I won't complain if we spend time in that query-space (heck, I might enjoy contributing).  But I'd like to see us move mainly forward, not backward.

-- Dan
(opinions expressed here are not my own; they are posted only to increase confusion and to promote brand-name recognition)
0
 
LVL 39

Expert Comment

by:abel
Comment Utility
>  'implicit'
You mean like pretending it's already there? Ok :)

> whip-cracking boss
These asm manuals sometimes sure feel like whips to me

> But I'd like to see us move mainly forward, not backward.
Yes, but I'm still curious about the performance on differently configured machines.

And talking about those F16s flying into new realms: you'd probably like to use the most incredibly configured machine available, just to see if we can make it faster by using the new techniques available on it. I still don't have this Itanium machine available, but I do happen to have a dual P500 ready. If anybody is interested in trying his/her hands on a dual-Pentium optimized solution, please do so. People often say that a dual CPU does not add much performance to applications; sometimes it even decreases it.

Maybe I can finally utilize its full potential with this even-faster-than-the-currently-fastest algorithm! And about digging, exploring, finding out etc.: if anybody knows where to start this same converter at privilege level 0, go ahead and tell me, and I'll try to get both CPUs to work :)

On the other hand, exploring further what we already know and comparing (new) results from new insights would be very interesting. One thing for starters: the PREFETCHxx instructions probably have a different effect with my way of testing (same buffers all the time) than with your (Dan) way of testing. Did you also gain about 1.5 percent?

Btw, have you optimized your testing methods by using thread priority and the performance-timer API?

VTune: I have it, but I have not used it on these algorithms (yet). I wanted to do it all by hand, as that makes me captain of the ship.

Abel
0
 
LVL 49

Author Comment

by:DanRollins
Comment Utility
>>where to start this same converter at privilege level 0,...

Multiprocessors:
I think a multi-threaded version might take advantage of the second CPU without doing anything beyond setting up two threads.  For instance, a HexifyReallyBigBuffer() fn could spin off one thread to hex-convert the first half (say, 1MB) of the data and then call the same fn to convert the second half.  WinNT might schedule one thread to run on one CPU and the other on the other.  You can also call SetThreadAffinityMask to force a thread to use a particular CPU.
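
Something along these lines -- just a sketch, with names invented and error handling omitted (it assumes the HexLineOut(pSrc, pszDest) signature from the previous thread and 65 output bytes per line):

#include <windows.h>

void HexLineOut( unsigned char* pSrc, char* pszDest );   // from the previous thread

struct HexJob { unsigned char* pSrc; char* pszDest; int nLines; };

static DWORD WINAPI HexWorker( LPVOID pv )
{
     HexJob* p = (HexJob*)pv;
     for ( int i = 0; i < p->nLines; i++ )
          HexLineOut( p->pSrc + i*16, p->pszDest + i*65 );   // 16 bytes in, 64 chars + null out
     return 0;
}

void HexifyReallyBigBuffer( unsigned char* pSrc, char* pszDest, int nLines )
{
     HexJob a = { pSrc, pszDest, nLines/2 };
     HexJob b = { pSrc + (nLines/2)*16, pszDest + (nLines/2)*65, nLines - nLines/2 };
     DWORD tid;
     HANDLE h = CreateThread( NULL, 0, HexWorker, &a, 0, &tid );
     // SetThreadAffinityMask( h, 2 );    // optionally pin the worker to the second CPU
     HexWorker( &b );                     // convert the other half on this thread
     WaitForSingleObject( h, INFINITE );
     CloseHandle( h );
}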

I did not try PREFETCHxx because my input buffer exceeds normal cache size and presumably the 65-byte output buffer would be in L1 cache after the first or second call anyway.

I added SetThreadPriority() and Sleep(0) to the code (right before recording the start time) a while back.  I recall seeing an immediate improvement.  Also, it is now harder, but not impossible, to stop the process; i.e., clicking the [Stop debugging] button takes a while to take effect (the first time, I thought my PC had frozen).
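
Roughly, the relevant part of the timing code looks like this (just a sketch; the conversion loop itself is elided):

#include <windows.h>
#include <stdio.h>

void TimeIt()
{
     LARGE_INTEGER freq, t0, t1;
     QueryPerformanceFrequency( &freq );

     SetThreadPriority( GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL );
     Sleep( 0 );                          // give up the rest of this timeslice first

     QueryPerformanceCounter( &t0 );
     // ... the conversion loop goes here ...
     QueryPerformanceCounter( &t1 );

     SetThreadPriority( GetCurrentThread(), THREAD_PRIORITY_NORMAL );
     printf( "%f seconds\n", (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart );
}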

Vtune:
From the docs, it sounds like it just 'suggests improvements'.  Do you know if it has the capability to actually re-order the opcodes?

Manuals:
Call me dense, but it took me a while to figure out the best way to use these in Acrobat reader:  Open up the 'Navigation pane' so you can click on an opcode by name.  I had tried clicking on the opcode name in the Table Of Contents, but that fails with some sort of error.

-- Dan
0
 
LVL 39

Expert Comment

by:abel
Comment Utility
> Do you know if it has the capability to actually re-order the opcodes?
If you use it with Intel's C++ compiler, which works in the MS VS environment, you can use intrinsic functions instead of the assembler codes. I know that those intrinsic functions will be optimized in several ways, including reordering.
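
For illustration (not our actual code), this is the kind of thing those intrinsics look like -- the low-nibble masking step written with the <mmintrin.h> MMX intrinsics, which the compiler is then free to schedule and reorder on its own:

#include <mmintrin.h>

__m64 LowNibbles( __m64 src )
{
     __m64 mask0F = _mm_set1_pi8( 0x0F );     // 0F0F0F0F0F0F0F0F
     return _mm_and_si64( src, mask0F );      // pand
}
// (call _mm_empty() -- i.e. EMMS -- before returning to floating-point code)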

>to use these in Acrobat reader
You can get the paper/book version for free (they even ship it priority). I like it much more than the Acrobat version.

> presumably the 65-byte output buffer would be in L1 cache.
I should say so too, but I was actually able to measure a performance gain by using it. And if you are right, then after 400,000 loops or so this gain should not show up anymore. Yet it remained the same percentage.

Multiprocessors:
Yes, it would be interesting to see what that already does, though it is not asm :)  I also remember that NT/Win2k does not use a very good algorithm to spread the threads. I once compared it with BeOS and found out that BeOS was indeed quicker than Win2k, but then again, these are the monkeys against the elephants: hard to compare.

Abel (bedtime...)
0
 
LVL 39

Expert Comment

by:abel
Comment Utility
0
 
LVL 39

Expert Comment

by:abel
Comment Utility
0
 
LVL 39

Expert Comment

by:abel
Comment Utility
[I can't get my comment to be added to the database. Trying in smaller parts now:]

Back on the subject at hand. I have a new Pavlik version, quite ugly and using huge amounts of memory, but almost twice as fast as my MMX routine! It'd be interesting to see what happens on different machines, because slow memory access or not enough physical memory will probably cause this algorithm to fall way back. If you're going to test it, about 142 MB of free physical memory would be helpful.

I know this isn't asm, but with VC++ optimization turned up to the max it's faster than what we did by hand (tested on an 800 MHz single processor with 256 MB Rambus memory).
0
 
LVL 39

Expert Comment

by:abel
Comment Utility
0
 
LVL 39

Expert Comment

by:abel
Comment Utility
[part 2: no success yet, retry:]


//SW = starts with
#pragma pack(1)
struct SW_BYTE
{
     unsigned short byte_1;
     unsigned char space_1;
     unsigned short byte_2;
     unsigned char space_2;
     unsigned short byte_3;
};
struct SW_SPACE
{
     unsigned char space_1;
     unsigned short byte_1;
     unsigned char space_2;
     unsigned short byte_2;
     unsigned char space_3;
     unsigned char nibble;
};
struct SW_NIBBLE_DASH
{
     unsigned char nibble;
     unsigned char space_1;
     unsigned short byte_1;
     unsigned char space_2;
     unsigned short byte_2;
     unsigned char dash;
};
struct SW_NIBBLE_SPACE
{
     unsigned char nibble;
     unsigned char space_1;
     unsigned short byte_1;
     unsigned char space_2;
     unsigned short byte_2;
     unsigned char space_3;
};
union CHARS_TYPE
{
     struct
     {
          unsigned char char_1;
          unsigned char char_2;
     }duo;
     unsigned short disp_val;
};

union MERGE_
{
     struct{
          CHARS_TYPE     duo_1;
          CHARS_TYPE     duo_2;
          CHARS_TYPE     duo_3;
          CHARS_TYPE     duo_4;
     }chars;
     SW_BYTE               byte;
     SW_SPACE               space;
     SW_NIBBLE_DASH          nibbledash;
     SW_NIBBLE_SPACE     nibblespace;
     __int64               disp_val;
};

const unsigned int FLU_MAX_HI = 256 * 256 * 256;
const unsigned int FLU_MAX_LO = 256 * 256 * 16;
const unsigned int CHARS_MAX  = 256 * 256;
static MERGE_          sw_byte[FLU_MAX_HI];
static MERGE_          sw_space[FLU_MAX_LO];
static MERGE_          sw_nibble_dash[FLU_MAX_LO];
static MERGE_          sw_nibble_space[FLU_MAX_LO];
static CHARS_TYPE     chars_lookup[CHARS_MAX];
0
 
LVL 39

Expert Comment

by:abel
Comment Utility
[part 3]

void fillArrays(void)
{
     for(unsigned int i = 0; i < 256; i++)
     {
          for(unsigned int j = 0; j < 256; j++)
          {
               for(unsigned int k = 0; k < 256; k++)
               {
                    unsigned int cnt = (i<<16)|((j<<8)|k);
                    sw_byte[cnt].byte.byte_1 = lookup[k];
                    sw_byte[cnt].byte.byte_2 = lookup[j];
                    sw_byte[cnt].byte.byte_3 = lookup[i];
                    sw_byte[cnt].byte.space_1 = '\x20';
                    sw_byte[cnt].byte.space_2 = '\x20';
                    if(i < 16)
                    {
                         unsigned int cnt = (k<<12)|((j<<4)|i);
                         sw_nibble_dash[cnt].nibbledash.byte_1 = lookup[j];
                         sw_nibble_dash[cnt].nibbledash.byte_2 = lookup[k];
                         sw_nibble_dash[cnt].nibbledash.nibble = lookup[i] >> 8;
                         sw_nibble_dash[cnt].nibbledash.space_1 = '\x20';
                         sw_nibble_dash[cnt].nibbledash.space_2 = '\x20';
                         sw_nibble_dash[cnt].nibbledash.dash = '\x2d';

                         sw_nibble_space[cnt].nibblespace.byte_1 = lookup[j];
                         sw_nibble_space[cnt].nibblespace.byte_2 = lookup[k];
                         sw_nibble_space[cnt].nibblespace.nibble = lookup[i] >> 8;
                         sw_nibble_space[cnt].nibblespace.space_1 = '\x20';
                         sw_nibble_space[cnt].nibblespace.space_2 = '\x20';
                         sw_nibble_space[cnt].nibblespace.space_3 = '\x20';
                    }
               }
               for(unsigned char k = 0; k < 16; k++)
               {
                    unsigned int cnt = (k<<16)|((j<<8)|i);
                    sw_space[cnt].space.byte_1 = lookup[i];
                    sw_space[cnt].space.byte_2 = lookup[j];
                    sw_space[cnt].space.nibble = lookup[k] >> 8;
                    sw_space[cnt].space.space_1 = '\x20';
                    sw_space[cnt].space.space_2 = '\x20';
                    sw_space[cnt].space.space_3 = '\x20';
               }
               unsigned int cnt = (i<<8)|j;
               chars_lookup[cnt].duo.char_1= chars[j];
               chars_lookup[cnt].duo.char_2 = chars[i];
          }
     }
}
0
 
LVL 39

Expert Comment

by:abel
Comment Utility
0
 
LVL 1

Expert Comment

by:Moondancer
Comment Utility
Sorry you have had problems; our Engineering team is working diligently to correct the problem and intermittent outages you've experienced.  Working on the solution as I type this for Milestone Release 3.0.1, noted in "What's New".
Your patience and understanding is appreciated.

The NO TEXT problem "could be" the result of trying to paste too much code at once; or pasting formatted text from a word processor (if so, convert via Notepad first to strip formatting before pasting here).  If this continues, let me know in the CS Q in which I responded moments ago please.

:)
Moondancer - EE Moderator
0
 
LVL 49

Author Comment

by:DanRollins
Comment Utility
I can't tell yet, but are you pre-calculating output for every possible 4-byte value?  A lookup table on the order of 200 MB?  That is either ridiculous or sublime or both.  In any case it is outside of the box so I LIKE it.

Surely there is a middle ground...

-- Dan
0
 
LVL 39

Expert Comment

by:abel
Comment Utility
[part 4 (third try)]

//this is where it happens ;)


void Abel_C_HexLineOut2(unsigned char* pSrc, char* pszBuf )
{
     static bool bFirstTime = true;
     if(bFirstTime)
     {
          fillArrays();
          bFirstTime = false;
     }
     //first, three bytes (0, 1, 2)
     *(__int64*)pszBuf = sw_byte[(*(unsigned int *) pSrc) & 0x00ffffff].disp_val;
     *(__int64*)(pszBuf+9) = sw_byte[(*(unsigned int *) (pSrc+3)) & 0x00ffffff].disp_val;
     *(__int64*)(pszBuf+18) = sw_byte[(*(unsigned int *) (pSrc+6)) & 0x00ffffff].disp_val;
     *(__int64*)(pszBuf+27) = sw_byte[(*(unsigned int *) (pSrc+9)) & 0x00ffffff].disp_val;
     *(__int64*)(pszBuf+36) = sw_byte[(*(unsigned int *) (pSrc+12)) & 0x00ffffff].disp_val;
     *(short *)(pszBuf+45) = lookup[(*(unsigned char *) (pSrc+15))];

     //the string-part
     //This appeared to be faster than 4- or 8-byte writes
     *((unsigned short *) (pszBuf+48)) = chars_lookup[*(unsigned short *) (pSrc)].disp_val;
     *((unsigned short *) (pszBuf+50)) = chars_lookup[*(unsigned short *) (pSrc+2)].disp_val;
     *((unsigned short *) (pszBuf+52)) = chars_lookup[*(unsigned short *) (pSrc+4)].disp_val;
     *((unsigned short *) (pszBuf+54)) = chars_lookup[*(unsigned short *) (pSrc+6)].disp_val;
     *((unsigned short *) (pszBuf+56)) = chars_lookup[*(unsigned short *) (pSrc+8)].disp_val;
     *((unsigned short *) (pszBuf+58)) = chars_lookup[*(unsigned short *) (pSrc+10)].disp_val;
     *((unsigned short *) (pszBuf+60)) = chars_lookup[*(unsigned short *) (pSrc+12)].disp_val;
     *((unsigned short *) (pszBuf+62)) = chars_lookup[*(unsigned short *) (pSrc+14)].disp_val;
}


void Abel_C_HexLineOut(unsigned char* pSrc, char* pszBuf )
{
     static bool bFirstTime = true;
     if(bFirstTime)
     {
          fillArrays();
          bFirstTime = false;
     }

     unsigned int * pSrcTmp = (unsigned int *) pSrc;
     unsigned __int64 * pBufTmp = (unsigned __int64 *) pszBuf;
     
     //first, three bytes (0, 1, 2)
     *pBufTmp = sw_byte[(*pSrcTmp) & 0x00ffffff].disp_val;
     
     //second, two bytes plus nibble (3, 4, 5)
     unsigned int tmp = (*((unsigned int *) (pSrc+3)) & 0x00ffffff);
     tmp = ((tmp & 0x00f00000) >> 4) | (tmp & 0x0000ffff);
     *(pBufTmp+1) = sw_space[tmp].disp_val;

     //third, nibble plus two bytes (5, 6, 7), including dash
     tmp = (*((unsigned int *) (pSrc+5)) & 0x00ffffff);
     tmp = (tmp & 0xf) | (tmp >> 4 & 0x000ffff0);
     *(pBufTmp+2) = sw_nibble_dash[tmp].disp_val;

     //fourth, three bytes (8, 9, 10) just after the dash
     *(pBufTmp+3) = sw_byte[(*(pSrcTmp+2)) & 0x00ffffff].disp_val;

     //fifth, two bytes plus nibble (11, 12, 13)
     tmp = (*((unsigned int *) (pSrc+11)) & 0x00ffffff);
     tmp = ((tmp & 0x00f00000) >> 4) | (tmp & 0x0000ffff);
     *(pBufTmp+4) = sw_space[tmp].disp_val;

     //sixth, nibble plus two bytes (13, 14, 15), including space
     tmp = (*((unsigned int *) (pSrc+13)) & 0x00ffffff);
     tmp = (tmp & 0xf) | (tmp >> 4 & 0x000ffff0);
     *(pBufTmp+5) = sw_nibble_space[tmp].disp_val;

     //This appeared to be faster than 4- or 8-byte writes
     *((unsigned short *) (pszBuf+48)) = chars_lookup[*(unsigned short *) (pSrc)].disp_val;
     *((unsigned short *) (pszBuf+50)) = chars_lookup[*(unsigned short *) (pSrc+2)].disp_val;
     *((unsigned short *) (pszBuf+52)) = chars_lookup[*(unsigned short *) (pSrc+4)].disp_val;
     *((unsigned short *) (pszBuf+54)) = chars_lookup[*(unsigned short *) (pSrc+6)].disp_val;
     *((unsigned short *) (pszBuf+56)) = chars_lookup[*(unsigned short *) (pSrc+8)].disp_val;
     *((unsigned short *) (pszBuf+58)) = chars_lookup[*(unsigned short *) (pSrc+10)].disp_val;
     *((unsigned short *) (pszBuf+60)) = chars_lookup[*(unsigned short *) (pSrc+12)].disp_val;
     *((unsigned short *) (pszBuf+62)) = chars_lookup[*(unsigned short *) (pSrc+14)].disp_val;
}
0
 
LVL 39

Expert Comment

by:abel
Comment Utility
Sorry for all those mis-postings. It seems that it's finally working again. Thanks, Moondancer, for your timely comment.

> That is either ridiculous or sublime or both
Definitely ridiculous, absolutely unusable in a "real world" app, but I wanted to give it a try ;)

I wanted to post a comment about how it all works and what exactly it is that I'm doing, but EE did not respond for quite a long time. And big comments (all four parts at once) did not seem to work at all. You should use the code with the lookup tables provided by Pavlik. I only changed the name "digits[256]" to "lookup[256]" because I thought I had to alter the table. But it appeared to work correctly, just as I expected, without modification.

There are three functions. One does the one-off filling of the lookup tables (which take 142 megabytes) and the other two are two different implementations and approaches of the algorithm. The Abel_C_HexLineOut function is the fastest version I created that also writes the spaces (it happens to be faster than including an _asm segment with an MMX calculation for the stringized part). The Abel_C_HexLineOut2 function assumes the spaces are already in place. It is about 10-12 percent faster than Abel_C_HexLineOut.
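
If you want to try it, a minimal driver could look like this (a sketch; it assumes Pavlik's lookup[256] table and the code above are compiled into the same program):

#include <stdio.h>

void Abel_C_HexLineOut( unsigned char* pSrc, char* pszBuf );

int main()
{
     unsigned char src[16];
     char out[65];                       // 64 output characters plus a terminating null
     for ( int i = 0; i < 16; i++ )
          src[i] = (unsigned char)(i * 17);
     Abel_C_HexLineOut( src, out );      // the first call also runs fillArrays()
     out[64] = '\0';                     // the function does not write the null itself
     printf( "%s\n", out );
     return 0;
}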

In my next comment I'll explain a little.

Cheers
Abel
0
 
LVL 14

Expert Comment

by:AvonWyss
Comment Utility
Well, IMHO, the SNR is pretty bad with such a huge table... ;-)
0
 
LVL 39

Expert Comment

by:abel
Comment Utility
> are you pre-calculating output for every possible 4-byte value?
I cannot do that. That would yield (assuming 12 bytes for the possible output strings) 256*256*256*256*12 = 51539607552 bytes = 48 GB! I don't have that much available memory; not many systems do.

The idea for this approach was simple: what is the easiest way to write quadwords into memory, if possible also aligned on quadword boundaries? I will explain the code in three parts:
- loading the data, plus what we need and why;
- the first approach to the lookup code, Abel_C_HexLineOut;
- and last but not least, an easier yet unaligned version of the algorithm which can be faster under certain circumstances, Abel_C_HexLineOut2.

But before I kick off, let me emphasize again that this algorithm will be totally unusable if parts of the memory are paged out! I know it is possible to lock memory so it cannot be paged, but I found that a little too far off. Just make sure you run it on a system that has enough free physical memory.

Discussing the lookup tables
---------------------------------------
If you cut the target string in pieces of 8 bytes, you'll notice that there are four distinct types of string in it, excluding the 16 bytes at the end. The formats are (for visibility at EE I included two spaces where one is requested):
"xx  xx  xx"      SW_BYTE
"  xx  xx  x"      SW_SPACE
"x  xx  xx-"      SW_NIBBLE_DASH
"x  xx  xx  "      SW_NIBBLE_SPACE

I named these after the character they start (and sometimes end) with. The total string is formed as:
SW_BYTE + SW_SPACE + SW_NIBBLE_DASH + SW_BYTE + SW_SPACE + SW_NIBBLE_SPACE.

All types have a length of eight bytes or characters. The number of possibilities, and the resulting number of storage bytes for the lookup tables, is as follows:
SW_BYTE = 256 * 256 * 256 * 8 = 128 MB
all others = 256 * 256 * 16 * 8 = 8 MB

Total memory = 128 + 3 * 8 = 152 MB

When I first saw this I thought: this is foolish! This will result in a very slow algorithm. But no, the algorithm is not slow; only the timing will depend heavily on the type of memory used.

To make the code understandable I decided to make some human readable datatypes, instead of simply different char arrays of the data. The union makes it easy to convert the datatypes to __int64, which is used to write 8 bytes at once to memory.

Creating the lookup tables
---------------------------------------
Why a #pragma pack(1)? Simply because __declspec(align(1)) did not work on my VC++ compiler. There still was an alignment problem: the types were 10 bytes instead of the required 8 bytes, and as a result they were padded with nulls. #pragma pack(1) packs the types on a one-byte boundary, which is what I wanted. If you use a different compiler than VC++ you may need a different pragma directive.
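
For what it's worth, the push/pop form is a bit safer because it restores whatever packing was in effect afterwards (a sketch, using the first struct as an example):

#pragma pack(push, 1)     // save the current packing, then pack on byte boundaries
struct SW_BYTE
{
     unsigned short byte_1;
     unsigned char  space_1;
     unsigned short byte_2;
     unsigned char  space_2;
     unsigned short byte_3;
};                        // sizeof(SW_BYTE) == 8
#pragma pack(pop)         // restore the previous packing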

The types are self-explanatory, I think: they are laid out exactly as the output is printed. The union MERGE_ uses disp_val to write to memory. The struct chars in MERGE_ can be ignored; it is not used (it was too slow).
The union CHARS_TYPE is used to write the string part in chunks of two bytes. This was simply the only workable solution I could think of.

I think that looping and filling the tables is pretty straightforward. I set the byte order by experimenting. Maybe someone can explain it to me; it has something to do with the different byte order of a string and a numeric value (__int64 in this case).
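
For instance, this is the kind of effect I mean (x86 is little-endian, so the first character of a string lands in the lowest byte of the integer); a tiny check:

#include <stdio.h>
#include <string.h>

int main()
{
     unsigned short v = 0;
     memcpy( &v, "AB", 2 );        // 'A' = 0x41 goes into the low byte
     printf( "%04X\n", v );        // prints 4241 on x86, not 4142
     return 0;
}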



to be continued...
0
 
LVL 39

Expert Comment

by:abel
Comment Utility
> Well, IMHO, the SNR is pretty bad with such a huge table... ;-)
I think I agree. It's ugly, yet funny and it has remarkable results. What is an SNR?
0
 
LVL 39

Expert Comment

by:abel
Comment Utility
I'll be a bit shorter from now on ;-)


Abel_C_HexLineOut
----------------------
The basic lookup system is simple: just take the value of the source at a certain location and use that as an offset. That's pretty much the same as Pavlik's algorithm. There were some troubles with the nibbled parts. The nibble I need to convert is not in the right spot (the high nibble must become a low nibble or vice versa to end up in the right position), so I needed to take the right nibble out, shift it, and place it back. You see that happening in four of the six lookup places.
Note that the source and the destination are cast to different pointer types (unsigned int* vs. __int64*) to make getting and storing the values easier. I don't know if it is faster or slower to do so.

I forgot to mention the fifth lookup table in my previous comment. It is (again) straightforward. All 65536 possible combinations of two bytes are stored in chars_lookup. This is then used to get the respective values using (again) the input data in chunks of two bytes as offset.

I see now that I forgot to include the terminating null character. It should be there, of course. The last statement should be:
*(pszBuf+64) = '\0';


Abel_C_HexLineOut2
----------------------
This is merely a brute-force approach (isn't this all?). Without trying to be elegant or smart, I simply write three-byte values, interleaved with a one-byte offset which should already contain a space. This only requires the sw_byte lookup table and is much easier to understand than the previous function, but it has the disadvantage of not writing the spaces. If you do write the spaces, it is quite a bit slower; in that case Abel_C_HexLineOut is preferable.


Well, there's not much more to say about this strange and silly algorithm. Probably in a (not so) far future, where every system has at least 64 GB of internal memory accessible over a 10 GHz bus, this will be a workable solution. But till then, I understand the criticisms about the code and its ridiculousness.

But wasn't the original idea to create the fastest algorithm possible?

Btw: I'm very curious about other timings. They may differ significantly of course.

Cheers!
Abel
0
 
LVL 39

Expert Comment

by:abel
Comment Utility
SNR? -> Signal to Noise Ratio? Yep, you are right of course ;)
0
 
LVL 14

Accepted Solution

by:
AvonWyss earned 100 total points
Comment Utility
SNR = Signal (to) Noise Ratio :-)
0
 
LVL 49

Author Comment

by:DanRollins
Comment Utility
Have you checked the ASM code generated by the VC++ compiler?  I'm seeing
     
*(__int64*)pszBuf= n64.disp_val;
0040221A  mov         eax,dword ptr [n64 (5BFD60h)]
0040221F  mov         dword ptr [pszBuf (5BFD70h)],eax
00402224  mov         ecx,dword ptr [n64+4 (5BFD64h)]
0040222A  mov         dword ptr [pszBuf+4 (5BFD74h)],ecx

I have "Optimize for Processor" set to "Pentium Pro and Above" but I see no loads/stores of MMX regs.  The VS docs seem to imply that MMX opcodes can be used only in _asm blocks.

Thus, it appears that with some asm, you could significantly enhance the speed of this monster from the seventh circle of hell :)

I haven't done a timing test.  On my 196MB machine, I'd probably be page faulting enough to make a grown man weep.

I've been thinking about a middle ground.  Looking at four-byte output, we have

xx_x
x_xx
_xx_  (or _xx- )

making for 12-bit indexes: three (or four) tables of 4096 elements, each 4 bytes in length.  So the lookup tables amount to 4096*4*4 = 64K of lookup data.

Now each output line requires 12 lookups rather than Pavlik's 16 or your 6, so one would expect performance somewhere in the middle.  HOWEVER, isn't it possible that the entire 64K of lookup data could end up in L1 or at least L2 cache on most systems?  If so, that might make up for the extra opcodes.
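
Roughly what I have in mind (a sketch only; the table names are made up):

typedef struct {
     char c[4];                          // 4 output characters per 12-bit index
} FourChars;

static FourChars tbl_xx_x[4096];         // "xx x" pattern
static FourChars tbl_x_xx[4096];         // "x xx" pattern
static FourChars tbl_sp_xx_sp[4096];     // " xx " pattern (a fourth table would hold " xx-")
// three tables = 48K; with the fourth, 4096*4*4 = 64K of lookup data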

I was rather disappointed in my results when I tried to speed up Pavlik's code by using MMX.  But since then I have learned a few things.  I just noticed SHLD/SHRD, which let you move 4-bit chunks out of one reg32 and into another efficiently.  I wish there were a way to move two reg32s directly into an mmx.
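
(For reference, the kind of thing I mean -- a two-instruction sketch that feeds the low nibble of EDX into the top of EAX:)

__asm shrd eax, edx, 4     // eax = (eax >> 4) | (edx << 28): edx's low nibble enters at the top
__asm shr  edx, 4          // consume the nibble just moved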

Anyway, if I have some time, I might give the 12-bit algorithm a try tomorrow.

-- Dan
0
 
LVL 49

Author Comment

by:DanRollins
Comment Utility
I gave up on 12-bit processing... too much hassle.  I found a 'middler' middle ground with this

   512KB lookup table

technique.  The speed is quite good, probably faster than any except abel's last one (which I did not test).

I am getting just over
   61,000 conversions per second on the first run and...
   70,000 conversions per second on subsequent runs

The difference is that the table needs to be filled on the first run.  I'm using abel's MMX ASCII output routine, which seems to be the fastest.

One thing that is a little surprising... I am outputting more bytes than necessary... that is, only the first 6 bytes of the 8-byte table entries are used, so the 64-bit writes overlap each other.  Thus I am writing 16 bytes more than necessary, but doing so eliminates a lot of data manipulation.  Also, the writes are not 64-bit aligned, but *are* 16-bit aligned, which may be providing some mitigation.

The 512KB table has padding so that I can do 8-byte aligned reads AND it makes calculating the offsets much faster.

-- Dan
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
typedef struct {
     BYTE chars[8];
} EightBytes;

EightBytes xx_xx_00[65536];  // allocate 512K

char hexDigits[]= "0123456789ABCDEF";
void SetupTables()
{
     EightBytes* p= xx_xx_00;
     for( int j=0; j<65536; j++ ) {
          p->chars[ 1]= hexDigits[ (j >> 0 ) & 0x000f ];
          p->chars[ 0]= hexDigits[ (j >> 4 ) & 0x000f ];
          p->chars[ 2]= ' ';
          p->chars[ 4]= hexDigits[ (j >> 8 ) & 0x000f ];
          p->chars[ 3]= hexDigits[ (j >>12 ) & 0x000f ];
          p->chars[ 5]= ' ';
          p->chars[ 7]= 0;
          p->chars[ 6]= 0;
          p++;
     }
}

//------------------------------ input in mm0
#define M_ProcessHex8 __asm{\
__asm M_PEXTRW(eax_mm0,0) \
__asm movq mm2, [ ebx + eax*8 ] \
__asm M_PEXTRW(eax_mm0,1) \
__asm movq mm3, [ ebx + eax*8 ] \
__asm M_PEXTRW(eax_mm0,2) \
__asm movq mm4, [ ebx + eax*8 ] \
__asm M_PEXTRW(eax_mm0,3) \
__asm movq mm5, [ ebx + eax*8 ] \
__asm movq [edi],    mm2 \
__asm movq [edi+6],  mm3 \
__asm movq [edi+12], mm4 \
__asm movq [edi+18], mm5 \
}

//------------------------------- input at [esi]
#define M_ProcessAscii16 __asm{\
__asm movq     mm3, qword ptr [mask1f] \
__asm movq     mm4, qword ptr [mask2e] \
__asm movq     mm0, [esi] \
__asm movq     mm1, mm0 \
__asm pcmpgtb  mm0, mm3 \
__asm movq     mm2, mm0 \
__asm pandn    mm0, mm4 \
__asm pand     mm1, mm2 \
__asm paddb    mm1, mm0 \
__asm movq     qword ptr [edi+48], mm1 \
__asm movq     mm0, [esi+8] \
__asm movq     mm1, mm0 \
__asm pcmpgtb  mm0, mm3 \
__asm movq     mm2, mm0 \
__asm pandn    mm0, mm4 \
__asm pand     mm1, mm2 \
__asm paddb    mm1, mm0 \
__asm movq     qword ptr [edi+56], mm1 \
}

void HexLineOut(unsigned char * pSrc, char * pszDest)
// void HexLineOut512K(unsigned char * pSrc, char * pszDest)
{
     static BOOL fTablesOK= FALSE;

     if ( !fTablesOK ) {
          SetupTables();
          fTablesOK= TRUE;
     }

     _asm {
     mov esi, pSrc
     mov edi, pszDest
     lea ebx, xx_xx_00

     movq mm0, [esi]
     M_ProcessHex8

     movq mm0, [esi+8]
     add edi, 24
     M_ProcessHex8

     sub edi, 24
     mov  byte ptr [edi+23], '-'

     M_ProcessAscii16
     mov  byte ptr  [edi+64], 0x00

     EMMS
     }
}
0
 
LVL 39

Expert Comment

by:abel
Comment Utility
Hi Dan,

It's funny to see how two people can work on the same idea and even work it out similarly. My code only differs in the use of literals ;)

But I also created a version that does not use PEXTRW, but reads the words directly into eax. I don't know why, but it appeared to be faster than using PEXTRW! Or so I thought. It seems that my system cheats on me sometimes. Just a few builds later (and still the same code) they measured exactly the same! Below is the non-PEXTRW code.

I also tried my hands on a version that uses this three-byte lookup table but writes only in quadword chunks, but I burnt them (my hands, I mean). Too many opcodes and way too slow.

I liked seeing that I was smarter than the VC optimizer. I use VC++.NET, but just like the 6.0 version, this one does not optimize using MMX registers. You were right about that, and that is also our edge over the machine. I don't know about VTune yet.


//Just the part that does the lookup. two_bytes is my lookup table,
//but it has (internally) the same format as yours.

          movzx     ecx, word ptr [esi]
          movq     mm0, qword ptr [ecx*8+two_bytes]
          movq     qword ptr [edi], mm0

          movzx     ecx, word ptr [esi+2]
          movq     mm0, qword ptr [ecx*8+two_bytes]
          movq     qword ptr [edi+6], mm0

          movzx     ecx, word ptr [esi+4]
          movq     mm0, qword ptr [ecx*8+two_bytes]
          movq     qword ptr [edi+0xC], mm0

          movzx     ecx, word ptr [esi+6]
          movq     mm0, qword ptr [ecx*8+two_bytes]
          movq     qword ptr [edi+0x12], mm0

          movzx     ecx, word ptr [esi+8]
          movq     mm0, qword ptr [ecx*8+two_bytes]
          movq     qword ptr [edi+0x18], mm0

          movzx     ecx, word ptr [esi+0xA]
          movq     mm0, qword ptr [ecx*8+two_bytes]
          movq     qword ptr [edi+0x1E], mm0

          movzx     ecx, word ptr [esi+0xC]
          movq     mm0, qword ptr [ecx*8+two_bytes]
          movq     qword ptr [edi+0x24], mm0

          movzx     ecx, word ptr [esi+0xE]
          movq     mm0, qword ptr [ecx*8+two_bytes]
          movq     qword ptr [edi+0x2A], mm0

0
 
LVL 39

Expert Comment

by:abel
Comment Utility
On the 12-bit version: I think you'd end up having too much code for extracting the right nibbles each time. It cost me quite a few opcodes in the huge-table version, and you'd have to double that for the 12-bit version.

Forgot to say: I still use a PREFETCHT0 instruction for edi. I don't know if it helps much, but if you get different timings, you know where they might come from.
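
(I.e., something like this near the top of the routine, once edi has been loaded -- the exact placement is a guess, adjust to taste:)

__asm prefetcht0 [edi]     // hint: pull the output line into the caches before the stores hit it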

Cheers
Abel
0
 
LVL 49

Author Comment

by:DanRollins
Comment Utility
This thread seems to have run its course.  I have assigned some points to this 'continuation Q' and accepted AvonWyss's comment as an answer, in appreciation of the fine work he has done and, what the heck, I don't have anything else to do with the points.  I especially like your ASCII filter and your clear explanation of how it works.  My mind just doesn't work that way!  Lord help me, I LOVE this stuff!

I am also posting some points for abel and have previously posted points for Pavlik.  Thank you one and all.  Let's do it again sometime!  I made a new challenge here:

   http:Q.20288519.html

-- Dan
0
