Solved

speed up yuv422 to yuv420 software conversion

Posted on 2012-03-19
Last Modified: 2012-08-13
Hi,

I have used the following code to convert yuv422 to yuv420 images.

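#include <stdint.h>

/* Converts a packed UYVY (4:2:2) frame to planar YUV 4:2:0.
   Assumes width and height are even. */
void ConvertUyvyToYuv420P(uint8_t* destFrame,
                          const uint8_t* srcFrame,
                          int width,
                          int height)
{
    /* Output layout: Y plane, then U plane, then V plane. */
    uint8_t* pyFrame = destFrame;
    uint8_t* puFrame = pyFrame + width*height;
    uint8_t* pvFrame = puFrame + width*height/4;

    /* UYVY stores 2 bytes per pixel, so the next row is width*2 bytes
       away (the original width*4 pointed two rows down). */
    int uvOffset = width * 2;

    int i,j;

    /* Cover every row; stopping at height-2 left the last rows unconverted. */
    for(i=0; i<height; i++)
    {
        for(j=0; j<width; j+=2)
        {
            uint16_t calc;
            if ((i&1) == 0)
            {
                /* Average U with the row below (even rows only). */
                calc = *srcFrame;
                calc += *(srcFrame + uvOffset);
                calc /= 2;
                *puFrame++ = (uint8_t) calc;
            }
            srcFrame++;
            *pyFrame++ = *srcFrame++;
            if ((i&1) == 0)
            {
                /* Average V with the row below (even rows only). */
                calc = *srcFrame;
                calc += *(srcFrame + uvOffset);
                calc /= 2;
                *pvFrame++ = (uint8_t) calc;
            }
            srcFrame++;
            *pyFrame++ = *srcFrame++;
        }
    }
}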

When I use this on 1080p input at 30 frames per second, I can convert only 15 frames per second. Is there any way to improve the speed of the above snippet, or is there a better algorithm for the conversion?

Any help would be great!!
Thanks
Question by:Shiv_Sg

Accepted Solution

by:
algorith earned 500 total points
Hi, a lot of this depends on what system you are programming for. As you probably know, many have multiple CPUs, and hyperthreading can make an individual CPU look like 2.  In this case you could separate your outer loop above into multiple threads, each doing their work in parallel.  Or, you could put your entire subroutine into a thread, then spawn as many frame processing threads as there are (effective) CPUs so that you could do multiple frames in parallel - this is my suggestion for the best approach.
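As a rough sketch of splitting the outer loop, the function below converts only the source rows in [rowBegin, rowEnd). Each worker thread would call it on its own band, for example rows 0 to height/2 on one core and the rest on the other. The name ConvertUyvyBand and its parameters are hypothetical, even dimensions and a one-row chroma average are assumed, and thread creation is left out because it depends on the platform's threading API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: converts source rows [rowBegin, rowEnd) of a packed
   UYVY frame into a planar YUV 4:2:0 destination. Band limits must be even
   so that every band owns whole chroma rows and bands never overlap. */
void ConvertUyvyBand(uint8_t *dest, const uint8_t *src,
                     int width, int height, int rowBegin, int rowEnd)
{
    assert(width % 2 == 0 && height % 2 == 0);
    assert(rowBegin % 2 == 0 && rowEnd % 2 == 0);
    assert(0 <= rowBegin && rowBegin <= rowEnd && rowEnd <= height);

    const size_t srcStride = (size_t)width * 2;   /* 2 bytes per pixel */
    uint8_t *uPlane = dest + (size_t)width * height;
    uint8_t *vPlane = uPlane + (size_t)width * height / 4;

    for (int row = rowBegin; row < rowEnd; row += 2) {
        const uint8_t *top = src + row * srcStride;
        const uint8_t *bottom = top + srcStride;
        uint8_t *yTop = dest + (size_t)row * width;
        uint8_t *yBottom = yTop + width;
        uint8_t *u = uPlane + (size_t)(row / 2) * (width / 2);
        uint8_t *v = vPlane + (size_t)(row / 2) * (width / 2);

        for (int col = 0; col < width; col += 2) {
            /* One U, Y0, V, Y1 group from each of the two rows. */
            *u++ = (uint8_t)((top[0] + bottom[0]) / 2);
            *v++ = (uint8_t)((top[2] + bottom[2]) / 2);
            yTop[0] = top[1];
            yTop[1] = top[3];
            yBottom[0] = bottom[1];
            yBottom[1] = bottom[3];
            top += 4;
            bottom += 4;
            yTop += 2;
            yBottom += 2;
        }
    }
}
```

Because every band writes to disjoint parts of the Y, U and V planes, the bands need no locking between them. Whether two bands actually run twice as fast still depends on the memory system.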

On the other hand, if the memory bandwidth is too low then no amount of threading will improve the situation. To check this for your target system: each frame requires reading width * height * 2 bytes of UYVY input and writing width * height * 3/2 bytes of planar output, so make sure this is not an outrageous amount to move at 30 fps on whatever hardware you have. For example, 1920*1080*30 is about 62 million pixels per second, i.e. roughly 124 MB/s read plus 93 MB/s written: not an outrageous number for some hardware, out of the question for others. Keep in mind that addressing many MB of serial locations in memory may not be as fast as you would expect from the computer's specs, so this needs to be tested.
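The memory traffic for a given frame size can be worked out with a couple of helper functions. This is only a back-of-the-envelope sketch: the function names are hypothetical, and it counts just the bytes the conversion itself reads and writes, ignoring caches and any capture or display traffic.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes read per second: packed UYVY 4:2:2 stores 2 bytes per pixel. */
uint64_t Uyvy422ReadBytesPerSecond(int width, int height, int fps)
{
    assert(width > 0 && height > 0 && fps > 0);
    return (uint64_t)width * height * 2 * fps;
}

/* Bytes written per second: planar YUV 4:2:0 stores 1.5 bytes per pixel. */
uint64_t Yuv420WriteBytesPerSecond(int width, int height, int fps)
{
    assert(width > 0 && height > 0 && fps > 0);
    return (uint64_t)width * height * 3 / 2 * fps;
}
```

For 1920x1080 at 30 fps this gives 124,416,000 bytes read and 93,312,000 bytes written per second.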

Instead of multi-threading, other approaches may be warranted. One of the simpler things to do is to remove anything you can from the inner loop. However, depending on the processor and its out-of-order scheduling, it is sometimes surprising what exactly will speed up a loop.

In particular, it is often possible to trade off increased space for increased speed, so adjusting your algorithm to make 2 passes might enable you to eliminate the tests "if((i&1) == 0) {...}". I have not played with this kind of bit twiddling for a while, but sometimes just changing the addressing scheme from dereferencing pointers to using array subscripts might enable the compiler optimizer or the processor scheduling to figure out what to do.
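One way the two-pass idea might look is sketched below: the first pass copies luma with no row test at all, and the second pass visits only even rows to average chroma, using array subscripts instead of moving pointers. The name ConvertUyvyTwoPass is hypothetical, and even dimensions and a one-row chroma average are assumed. It is not guaranteed to be faster, since the source is read twice, so it needs to be measured on the target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical two-pass variant: packed UYVY in, planar YUV 4:2:0 out. */
void ConvertUyvyTwoPass(uint8_t *dest, const uint8_t *src,
                        int width, int height)
{
    assert(width % 2 == 0 && height % 2 == 0);

    const size_t pixels = (size_t)width * height;
    const size_t srcStride = (size_t)width * 2;   /* 2 bytes per pixel */

    /* Pass 1: luma is every second byte of the packed source. */
    for (size_t i = 0; i < pixels; i++)
        dest[i] = src[2 * i + 1];

    /* Pass 2: average the chroma of each even row with the row below it. */
    uint8_t *u = dest + pixels;
    uint8_t *v = u + pixels / 4;
    for (int row = 0; row < height; row += 2) {
        const uint8_t *top = src + row * srcStride;
        const uint8_t *bottom = top + srcStride;
        for (size_t k = 0; k < srcStride; k += 4) {
            *u++ = (uint8_t)((top[k] + bottom[k]) / 2);
            *v++ = (uint8_t)((top[k + 2] + bottom[k + 2]) / 2);
        }
    }
}
```

Both loops are simple strided accesses without branches in the body, which gives the compiler a better chance to vectorize them when optimization is enabled.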

Finally, I assume that optimization is turned on in the compiler? Sometimes optimizing for speed is not the best way to go, and optimizing for minimal code size works better. Again, though, I have not played with this for some time, and today this may no longer be a distinction.

Good luck!

Author Comment

by:Shiv_Sg
Hi, thanks a lot for the quick response. I currently have a dual-core CPU, so I will try adding more threads and see how much it improves.

I am also trying to modify the code based on your space trade-off suggestion in the fourth paragraph. I'll let you know how things improve.

Thanks again.
Regards
Shiv