[Okta Webinar] Learn how to a build a cloud-first strategyRegister Now

x
  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 343
  • Last Modified:

Can this be optimized? If so, how?

in the following code, pPacket->Buffer() returns a pointer to a BYTE* of size 1000000  (an 8bit grayscale bitmap)

I'm converting it to 24bits (just copying all values red=green=blue), and this is happening 10-15 times per second. Can it be optimized at all?


             int iSize = pPacket->BufferSize();   // equals 1000000 in this case
                int iNewSize = iSize *3;
            BYTE* pBuffer = (BYTE*)pPacket->Buffer();
     
            BYTE* pMappedBuffer = new BYTE[iNewSize ];
                if (!pMappedBuffer)
                       return 0; // error, out of memory

               memset(pMappedBuffer, 0, iNewSize  );

            int x;
            BYTE* p = pMappedBuffer;
            for( x = 0; x < iSize; x ++)
            {
                  *p++ = pBuffer[x];
                  *p++ = pBuffer[x];
                  *p++ = pBuffer[x];
            }
0
PMH4514
Asked:
PMH4514
  • 6
  • 4
  • 2
  • +5
3 Solutions
 
jkrCommented:
If the buffer size is constant at each call, there's no need to reallocate 'pMappedBuffer' each time, e.g.

static    BYTE* pMappedBuffer = NULL;
static    int iNewSize = 0;

           int iSize = pPacket->BufferSize();   // equals 1000000 in this case
           int iTmpSize = iSize *3;
         BYTE* pBuffer = (BYTE*)pPacket->Buffer();

         if ( iTmpSize != iNewSize){

              // delete buffer if size does not match
              delete [] pMappedBuffer, pMappedBuffer = NULL
         }
     
         if ( !pMappedBuffer) {

             pMappedBuffer = new BYTE[iNewSize ];
               if (!pMappedBuffer)
                      return 0; // error, out of memory
        }

         //  memset(pMappedBuffer, 0, iNewSize  ); not needed, since everything will be overwritten
 
         int x;
         BYTE* p = pMappedBuffer;
         for( x = 0; x < iSize; x ++)
         {
              *p++ = pBuffer[x];
              *p++ = pBuffer[x];
              *p++ = pBuffer[x];
         }
0
 
jkrCommented:
Ooops, there's an important detail missing, that shoould be

         if ( iTmpSize != iNewSize){

             iNewSize = iTmpSize; // <---

             // delete buffer if size does not match
             delete [] pMappedBuffer, pMappedBuffer = NULL;
        }
0
 
swamp-dogCommented:
I've no specific answer (graphics isn't my area) but at least do what jkr says and use a fixed/one-time allocated buffer. For better performance you'll have to resort to specific OS calls. Both windows & unix will have this functionality buried in their resprective API's somewhere. Those calls will have the advantage of being able to use hardware where possible. I guess you'd want to look at win32 bmp handling, DirectX and/or OpenGL.
0
Independent Software Vendors: We Want Your Opinion

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 
PMH4514Author Commented:
I'm using DirectX to ultimately display the image. I haven't been using it for any kind of processing (I'm not very familiar with what it has to offer in that regard)

jkr - cool, thanks.. The loop itself can't be optimized?
0
 
jkrCommented:
>>The loop itself can't be optimized?

You mean

        int x;
        BYTE* p = pMappedBuffer;
        for( x = 0; x < iSize; x ++)
         {
             *p++ = pBuffer[x];
             *p++ = pBuffer[x];
             *p++ = pBuffer[x];
        }

?

I don't see much room for optimizations here, except rewriting that  in assemply language.

Eliminating 'memse()' for arrays of that size should have a considerable effect alredy, and not reallocating big buffers will hopefully add up a lil' bit also.
0
 
jhshuklaCommented:
intel has MMX instructions that might aid in parallel execution of each entry in loop. but you will have to resort to assembly language and it will be hardware specific. and your compiler must be able to understand those instructions. most modern compilers should be fine. google will be your helper.
0
 
jkrCommented:
>> intel has MMX instructions that might aid in parallel execution of each entry in loop

I was thinking about that also (and still cannot remember the AMD counterpart's name), but that would also mean to blow up the storage to a 64bit granularity, which is/can be kinda heavy load...
0
 
brettmjohnsonCommented:
As has already been mentioned, preallocate the 3x buffer.
You also don't need to fill it with 0s, since you will be filling
it with data from the source buffer.  The real key to speed
in this type of thing is to avoid byte-level access to memory
and to reduce the number of loop iterations.  The former can
be achieved by using natural word-aligned word-sized reads
and writes.  The latter can be achieved with loop-unrolling.
The following fragment of code does both:


//#define BIG_ENDIAN   for BIG_ENDIAN architectures (Sparc, Power, Motorola, etc)
//#define IS64BIT   for 64-bit architectures (technically native 64 bit longs)

/* This leverages processor optimizations when dealing with natural word size
 * data aligned on word sized boundaries as well as some loop unrolling.
 * We read 4 (or 8) bytes at a time from the source buffer and write 12 (or 24) bytes
 * per loop iteration into the destination buffer.  (Since each source byte gets tripled
 * in the destination, we get 3*4=12 (3*8=24) bytes per iteration).  The following code
 * supports 32 or 64 bit little-endian or big-endian architectures.
 */
typedef unsigned long ulong;
ulong *srcBuff = (ulong *)pBuffer;
ulong *dstBuff = (ulong *)pMappedBuffer;
int x;
for( x = 0; x < iSize/(sizeof(ulong)); x ++)
{
  unsigned char src[sizeof(ulong)];
  *(ulong *)src = srcBuff[x];
#ifdef BIG_ENDIAN
  *dstBuff++ = (ulong)src[0] << 24 | (ulong)src[0] << 16 | (ulong)src[0] << 8 | (ulong)src[1];
  *dstBuff++ = (ulong)src[1] << 24 | (ulong)src[1] << 16 | (ulong)src[2] << 8 | (ulong)src[2];
  *dstBuff++ = (ulong)src[2] << 24 | (ulong)src[3] << 16 | (ulong)src[3] << 8 | (ulong)src[3];
#ifdef IS64BIT
  *dstBuff++ = (ulong)src[4] << 24 | (ulong)src[4] << 16 | (ulong)src[4] << 8 | (ulong)src[5];
  *dstBuff++ = (ulong)src[5] << 24 | (ulong)src[5] << 16 | (ulong)src[6] << 8 | (ulong)src[6];
  *dstBuff++ = (ulong)src[6] << 24 | (ulong)src[7] << 16 | (ulong)src[7] << 8 | (ulong)src[7];
#endif
#else
  *dstBuff++ = (ulong)src[1] << 24 | (ulong)src[0] << 16 | (ulong)src[0] << 8 | (ulong)src[0];
  *dstBuff++ = (ulong)src[2] << 24 | (ulong)src[2] << 16 | (ulong)src[1] << 8 | (ulong)src[1];
  *dstBuff++ = (ulong)src[3] << 24 | (ulong)src[3] << 16 | (ulong)src[3] << 8 | (ulong)src[2];
#ifdef IS64BIT
  *dstBuff++ = (ulong)src[5] << 24 | (ulong)src[4] << 16 | (ulong)src[4] << 8 | (ulong)src[4];
  *dstBuff++ = (ulong)src[6] << 24 | (ulong)src[6] << 16 | (ulong)src[5] << 8 | (ulong)src[5];
  *dstBuff++ = (ulong)src[7] << 24 | (ulong)src[7] << 16 | (ulong)src[7] << 8 | (ulong)src[6];
#endif
#endif
}
0
 
_corey_Commented:
>>I don't see much room for optimizations here, except rewriting that  in assemply language.

Since it is a byte pointer, you *couuld* explicitly create a 4-byte cast and do a single assignment.  

I'm unsure if your code would every naturally optimize to a single assigment by the compiler.


>>I'm using DirectX to ultimately display the image. I haven't been using it for any kind of processing (I'm not very familiar with what it has to offer in that regard)

DirectX has color space converters, but until I know more about how you're feeding the data, I'd just say use your transform code for now.

corey
0
 
PaulCaswellCommented:
>>        for( x = 0; x < iSize; x ++)
>>         {
>>             *p++ = pBuffer[x];
>>             *p++ = pBuffer[x];
>>             *p++ = pBuffer[x];
>>        }

I think this could be optimised significantly.

You are putting the same byte into three bytes of the destination. This involves three writes (assuming the compiler isnt optimising). Lets assume a 32bit unsigned long and try to work with that.

The problem is that there are 3 bytes written each time around, this makes it a bit messy.

unsigned long * dst = (unsigned long *)p;
unsigned char * src = &pBuffer[0];
int count = iSize / 4;
for ( x = 0; x < count; x++ )
{
 register unsigned char g1 = *srce++;
 register unsigned char g2 = *srce++;
 register unsigned char g3 = *srce++;
 register unsigned char g4 = *srce++;
 // This might be different depending on the endianness of your system.
 register long v = (g1 << 24) | (g1 << 16) | (g1 << 8) | g2;
 *dst++ = v;
 v = (g2 << 24) | (g2 << 16) | (g3 << 8) | g3;
 *dst++ = v;
 v = (g3 << 24) | (g4 << 16) | (g4 << 8) | g4;
 *dst++ = v;

}

Looking back, I think this is thye same as brett suggested but it demonstrates the technique.

Paul
0
 
PMH4514Author Commented:
interesting stuff..  I'll have to try some of these techniques when I get back to my code on Monday.  

what does register do?

ps. this will always be running on Intel PIV architecture.
0
 
brettmjohnsonCommented:
> ps. this will always be running on Intel PIV architecture.

"ALWAYS."   Ah, the confidence of a young programmer.  Its just as well, as
the 64-bit stuff I posted was just plain wrong.  Here is the 32-bit little-endian
version of the code:


/* This leverages processor optimizations when dealing with natural word size
 * data aligned on word sized boundaries as well as some loop unrolling.
 * We read 4 bytes at a time from the source buffer and write 12 bytes
 * per loop iteration into the destination buffer.  (Since each source byte gets tripled
 * in the destination, we get 3*4=12 bytes per iteration).  The following code
 * supports 32 bit little-endian architectures.
 */
uint32_t *srcBuff = (uint32_t *)pBuffer;
uint32_t *dstBuff = (uint32_t *)pMappedBuffer;
int x;
for( x = 0; x < iSize/(sizeof(uint32_t)); x ++)
{
  unsigned char src[sizeof(uint32_t)];
  *(uint32_t *)src = srcBuff[x];
  *dstBuff++ = (uint32_t)src[1] << 24 | (uint32_t)src[0] << 16 | (uint32_t)src[0] << 8 | (uint32_t)src[0];
  *dstBuff++ = (uint32_t)src[2] << 24 | (uint32_t)src[2] << 16 | (uint32_t)src[1] << 8 | (uint32_t)src[1];
  *dstBuff++ = (uint32_t)src[3] << 24 | (uint32_t)src[3] << 16 | (uint32_t)src[3] << 8 | (uint32_t)src[2];
}

0
 
PaulCaswellCommented:
>>what does register do?
It 'encourages' the compiler to preserve this variable in a register. Note the wording, it does REQUIRE the compiler to do this. I.E. There is no guarantee that it WILL be in a register.

brett,

Just as a matter of interest, what's going on here? This seems a strange technique:

  unsigned char src[sizeof(uint32_t)];
  *(uint32_t *)src = srcBuff[x];

whats wrong with:

unsigned char * src = (unsigned char *)&srcBuff[x];

Paul
0
 
gkatzCommented:
Here's a completely different method that might be faster depending on how you are using the grayscale image.  This will probably only be faster if the image is already loaded into video card memory as a texture and you probably want it to remain in vidoe card memory.  In other words you've got an image on the screen and you want to change it from grayscale to something else.  
 
   If the image exists on the video card then a large portion of your time will not be spend in that function, but in moving the data from the graphics card and back again.  A way to avoid this is to use rendertexture by Mark Harris.  It can be downloaded at http://www.markmark.net/misc/rendertexture.html  Basically what you can do is render the image onto a 2d quad inside of a rendertexture call.  Render texture will stop the image from being displayed on the screen and will provide a faster writing method than the opengl call CopyToTexture.  If you setup rendertexture to use a 24bit texture then your image will be converted over to 24bits.  Also since this is being done on the graphics card then it will be faster than the CPU because it will be done in parallel.  GPU (Graphics Processing Units) have upto 16 pipes working in parallel, processing hundreds of pixels at a time (they switch to a new pixel while waiting for memory reads).  

  This is a much different process than your original code but should give you the speed required if your code meets the conditions above.  You should have decent knowlege of the openGL pipeline if you want to understand and use this method.  

Hope it helps,
   Gkatz
0
 
PMH4514Author Commented:
Ok, a few comments now that I'm trying some of these techniques.

JKR - I like your thought to not reallocate the buffer each time. I did not include in my first post (guess I should have) - that after I map the existing pPacket->Buffer() to 24bit, I then do this:

  // delete the old 8bit buffer
  delete[] pPacket->Buffer();
  pPacket->m_pBuffer = NULL;

  // set my buffer pointer to the address of the mapped buffer
  pPacket->m_pBuffer = pMappedBuffer;

  pMappedBuffer = NULL;


Because it's this new "mapped buffer" that I need to "travel further down the line" along with pPacket, is why I think I was creating a new allocation each time - because that new allocation is what my "packet" then owned. (Please forgive me if that was an important bit of info for everybody's responses, recall my thought when I posted was that the optimization would be in the loop, which I now know is only partly the case)

That said, for the time being I'm leaving the new allocation in place to try these other techniques.

As written, my function averaged 81.3 ticks for execution (for 19 iterations). (there are a couple other steps I take after copying the buffer, now that I know there is probably much room for optimization, I'm going to post the entire function at the end here, as written before any of your above suggestions. I'm also going to up the points on the question.)


Paul/Brett -  keeping in mind that at this point I am still allocating pMappedBuffer each time.  Brett provided this loop:

uint32_t *srcBuff = (uint32_t *)pBuffer;
uint32_t *dstBuff = (uint32_t *)pMappedBuffer;
int x;
for( x = 0; x < iSize/(sizeof(uint32_t)); x ++)
{
  unsigned char src[sizeof(uint32_t)];
  *(uint32_t *)src = srcBuff[x];
  *dstBuff++ = (uint32_t)src[1] << 24 | (uint32_t)src[0] << 16 | (uint32_t)src[0] << 8 | (uint32_t)src[0];
  *dstBuff++ = (uint32_t)src[2] << 24 | (uint32_t)src[2] << 16 | (uint32_t)src[1] << 8 | (uint32_t)src[1];
  *dstBuff++ = (uint32_t)src[3] << 24 | (uint32_t)src[3] << 16 | (uint32_t)src[3] << 8 | (uint32_t)src[2];
}


And the function averaged 21 ticks for 19 iterations. Which is approx. 75% improvement!  Paul then suggested "as a matter of interest" substituting this:

  unsigned char src[sizeof(uint32_t)];
  *(uint32_t *)src = srcBuff[x];


with this:

  unsigned char * src = (unsigned char *)&srcBuff[x];

I made that change, and the function averaged 38 ticks for 19 iterations. Interestingly, the tick count did range from 16-140. Brett's original suggestion, all iterations were around the average. My code as written ranged from 31-150.. why such big differences?)


gkatz: interesting solution. Not sure that would work in this case, the image is a frame of video, so it is constantly changing and it is converted from grayscale before it ever gets to the screen.


So, all that said, I think we have good solutions for optimizing the loop itself. I like the idea of not re-allocating the mapped buffer each time, but how then do I handle the fact that I am assigning pMappedBuffer to the packet, and discarding the packet's original 8bit buffer each time?


Brett also said:
>>"ALWAYS."   Ah, the confidence of a young programmer.  

Yes, that was a bit naive. I am pretty confident though that if we change platforms, this entire app will be re-written as MFC will no longer be available. It's funny, I'm going on 10 years writing software, primarilly on the e-commerce end as the internet went boom, but only 1 year with C++ now and a lot of this lower level stuff. I continue to be amazed at how little I know, despite how much I know :)


Again - here is my original function prior to any of Paul and Brett's changes to show you entirely what I was doing:


----- original function----------

// Every frame captured from the 8bit grayscale camera is pushed through the processing chain as
// an instance of CFramePacket ( an auto-reference counting instance that dies when no longer referenced)
// Conversion to 24bit is the first step. (24bit color images are later transparent blitted into it)
DWORD CPacketFilter24BitColor::MakeGrayscale24Bit(auto_ref_ptr<CFramePacket> pPacket)
{
  DWORD dwReturn = ERROR_SUCCESS;

  if (pPacket.get_ref_count() > 0)
  {
    int iSize = pPacket->BufferSize();

    int iNewSize = iSize * 3;
    BYTE* pBuffer = (BYTE*)pPacket->Buffer();
    BYTE* pMappedBuffer = new BYTE[iNewSize];      
    if (!pMappedBuffer)
       return ERROR_OUTOFMEMORY;

    memset(pMappedBuffer, 0, iNewSize  );

    int x;
    BYTE* p = pMappedBuffer;
    for( x = 0; x < iSize; x ++)
    {
       *p++ = pBuffer[x];
       *p++ = pBuffer[x];
       *p++ = pBuffer[x];
    }

    pPacket->m_iSize = iNewSize;
    pPacket->m_iBitDepth = 24;
    pPacket->m_iByteStep = 3;

    // delete the old 8bit buffer
    delete[] pPacket->Buffer();
    pPacket->m_pBuffer = NULL;

    // set my buffer pointer to the address of the mapped buffer
    pPacket->m_pBuffer = pMappedBuffer;
    pMappedBuffer = NULL;

    // update BITMAPINFO
    if (pPacket->m_pbmInfo)
      delete pPacket->m_pbmInfo;

    pPacket->m_pbmInfo = new BITMAPINFO;
    ZeroMemory( &pPacket->m_pbmInfo->bmiHeader, sizeof(BITMAPINFOHEADER) );

    pPacket->m_pbmInfo->bmiHeader.biWidth=pPacket->m_pSourceRect->Width();
    pPacket->m_pbmInfo->bmiHeader.biHeight=pPacket->m_pSourceRect->Height();
    pPacket->m_pbmInfo->bmiHeader.biPlanes=1;
    pPacket->m_pbmInfo->bmiHeader.biBitCount=pPacket->m_iBitDepth;
    pPacket->m_pbmInfo->bmiHeader.biSizeImage = pPacket->BufferSize();
    pPacket->m_pbmInfo->bmiHeader.biSize=sizeof(BITMAPINFOHEADER);
    pPacket->m_pbmInfo->bmiHeader.biClrUsed=0;
    pPacket->m_pbmInfo->bmiHeader.biClrImportant=0;
    pPacket->m_pbmInfo->bmiHeader.biCompression = BI_RGB;
}
return dwReturn;

}
0
 
PMH4514Author Commented:
additionally - when CFramePacket is instantiated,  it checks the type of camera that is creating it (I have two cameras, an 8bit grayscale, and a 24bit color) and it allocates the buffer to the necessary size.. Since I know I am upsampling the grayscale buffer, I could just allocate pPacket->m_pBuffer once as 1000x1000x3 and then when it gets to MakeGrayscale24Bit() - there won't be a need for a source and dest, I just copy from source to source. Would that be better?
0
 
PMH4514Author Commented:
and of course I don't need to recreate the stored BITMAPINFO structure after changing the bitdepth, I can just execute this:

 pPacket->m_pbmInfo->bmiHeader.biBitCount=pPacket->m_iBitDepth;
 pPacket->m_pbmInfo->bmiHeader.biSizeImage = pPacket->BufferSize();
0
 
PMH4514Author Commented:
anyway - I learned alot from this thread folks - I've been profiling methods and using some of these techniques throughout my app, realizing anywhere from 30% to 60% improvements in many of my functions. .Thanks!
0

Featured Post

Industry Leaders: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

  • 6
  • 4
  • 2
  • +5
Tackle projects and never again get stuck behind a technical roadblock.
Join Now