Solved

speed up yuv422 to yuv420 software conversion

Posted on 2012-03-19
Last Modified: 2012-08-13
Hi,

I have used the following code to convert yuv422 to yuv420 images.

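#include <stdint.h>

/* Converts packed UYVY (4:2:2) to planar YUV 4:2:0.
   Assumes width and height are both even. */
void ConvertUyvyToYuv420P(uint8_t* destFrame,
                          const uint8_t* srcFrame,
                          int width,
                          int height)
{
      uint8_t* pyFrame = destFrame;
      uint8_t* puFrame = pyFrame + width*height;
      uint8_t* pvFrame = puFrame + width*height/4;

      /* UYVY uses 2 bytes per pixel, so the row below is width*2 bytes away
         (width*4 would skip ahead two rows). */
      int uvOffset = width * 2;

      int i, j;

      /* Visit every row; on an even row the row below always exists. */
      for (i = 0; i < height; i++)
      {
            for (j = 0; j < width; j += 2)
            {
                  uint16_t calc;

                  /* U: on even rows, average with the sample one row below. */
                  if ((i & 1) == 0)
                  {
                        calc = *srcFrame;
                        calc += *(srcFrame + uvOffset);
                        calc /= 2;
                        *puFrame++ = (uint8_t) calc;
                  }
                  srcFrame++;
                  *pyFrame++ = *srcFrame++;   /* Y0 */

                  /* V: same vertical averaging as U. */
                  if ((i & 1) == 0)
                  {
                        calc = *srcFrame;
                        calc += *(srcFrame + uvOffset);
                        calc /= 2;
                        *pvFrame++ = (uint8_t) calc;
                  }
                  srcFrame++;
                  *pyFrame++ = *srcFrame++;   /* Y1 */
            }
      }
}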

When I use this on 1080p input at 30 frames per second, I can only convert at 15 frames per second. Is there any way to improve the speed of the above snippet, or is there a better algorithm for the conversion?

Any help would be great!!
Thanks
Question by:Shiv_Sg
Accepted Solution

by:
algorith earned 500 total points
ID: 37744127
Hi, a lot of this depends on what system you are programming for. As you probably know, many systems have multiple CPUs, and hyperthreading can make an individual CPU look like 2. In this case you could separate your outer loop above into multiple threads, each doing its work in parallel. Or, you could put your entire subroutine into a thread, then spawn as many frame processing threads as there are (effective) CPUs so that you could do multiple frames in parallel. This second option is my suggestion for the best approach.
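As a rough sketch of the first option (splitting the outer loop), the version below computes every pointer from the row-pair index, so iterations are independent and can be handed to different threads. It assumes OpenMP is available (for example `-fopenmp` with GCC or Clang); without that flag the pragma is ignored and the loop simply runs serially. The names `ConvertRowPair` and `ConvertUyvyToYuv420PParallel` are made up for illustration, and the code assumes even width and height, UYVY byte order, and source rows of exactly width*2 bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: converts source rows 2*pair and 2*pair+1.
   Every address is derived from 'pair', so calls do not depend on each other. */
static void ConvertRowPair(uint8_t* dest, const uint8_t* src,
                           int width, int height, int pair)
{
    const size_t srcStride = (size_t)width * 2;      /* UYVY: 2 bytes per pixel */
    const size_t pixels = (size_t)width * height;
    const uint8_t* top = src + (size_t)(2 * pair) * srcStride;
    const uint8_t* bottom = top + srcStride;
    uint8_t* yTop = dest + (size_t)(2 * pair) * width;
    uint8_t* yBottom = yTop + width;
    uint8_t* u = dest + pixels + (size_t)pair * (width / 2);
    uint8_t* v = dest + pixels + pixels / 4 + (size_t)pair * (width / 2);
    int j;

    for (j = 0; j < width / 2; j++)
    {
        /* Chroma: average the two rows. */
        u[j] = (uint8_t)((top[4 * j] + bottom[4 * j]) / 2);
        v[j] = (uint8_t)((top[4 * j + 2] + bottom[4 * j + 2]) / 2);
        /* Luma: copy both rows unchanged. */
        yTop[2 * j] = top[4 * j + 1];
        yTop[2 * j + 1] = top[4 * j + 3];
        yBottom[2 * j] = bottom[4 * j + 1];
        yBottom[2 * j + 1] = bottom[4 * j + 3];
    }
}

void ConvertUyvyToYuv420PParallel(uint8_t* destFrame, const uint8_t* srcFrame,
                                  int width, int height)
{
    int pair;

    /* Each row pair writes to its own part of the Y, U and V planes. */
    #pragma omp parallel for
    for (pair = 0; pair < height / 2; pair++)
        ConvertRowPair(destFrame, srcFrame, width, height, pair);
}
```

Working on row pairs also removes the per-pixel odd/even row test, which may help even in a single-threaded build. Whether threading pays off on a given machine still has to be measured.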

On the other hand, if the memory bandwidth is too low then no amount of threading will improve the situation. To check this for your target system: you need to address at least (width * height) locations per frame, so make sure this is not an outrageous number to do at 30 fps on whatever hardware you have. E.g. 1920*1080*30 is about 62 million locations per second (roughly 62 MB/s for the luma bytes alone, and more once the chroma reads and the output writes are counted): not an outrageous number for some hardware, out of the question for others. Keep in mind that addressing many MB of serial locations in memory may not be as fast as you would expect from the computer's specs, so this needs to be tested.
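To put numbers on the memory traffic, here is a small sketch that counts the bytes touched per second. The function names are hypothetical, and it assumes a packed UYVY source at 2 bytes per pixel, a planar 4:2:0 destination at 1.5 bytes per pixel, and no row padding.

```c
#include <stdint.h>

/* Bytes read per second from a packed UYVY source (2 bytes per pixel). */
uint64_t UyvyReadBytesPerSecond(int width, int height, int fps)
{
    return (uint64_t)width * height * 2 * fps;
}

/* Bytes written per second to a planar 4:2:0 destination (1.5 bytes per pixel). */
uint64_t Yuv420WriteBytesPerSecond(int width, int height, int fps)
{
    return (uint64_t)width * height * 3 / 2 * fps;
}
```

For 1920x1080 at 30 fps this gives 124,416,000 bytes read and 93,312,000 bytes written per second, about 218 MB/s in total. Comparing that with a measured copy rate on the target machine (rather than the datasheet figure) shows how much headroom is left.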

In lieu of multi-threading, other approaches may be warranted. Among the simpler things to do is to remove anything you can from the inner loop. However, depending on the processor and its out-of-order execution, it is sometimes surprising what exactly will speed up a loop.

In particular, it is often possible to trade increased space for increased speed, so adjusting your algorithm to make 2 passes might enable you to eliminate the tests "if((i&1) == 0) {...}". I have not played with this kind of bit twiddling for a while, but sometimes just changing the addressing scheme from dereferencing pointers to using array subscripts might enable the compiler optimizer or the processor scheduling to figure out what to do.
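One possible reading of the two-pass idea, sketched with array subscripts: the first pass copies only luma, the second pass walks the source two rows at a time for chroma, and neither loop needs the odd/even row test. `ConvertUyvyToYuv420PTwoPass` is a hypothetical name, and the code assumes even width and height, UYVY byte order, and source rows of exactly width*2 bytes.

```c
#include <stddef.h>
#include <stdint.h>

void ConvertUyvyToYuv420PTwoPass(uint8_t* destFrame, const uint8_t* srcFrame,
                                 int width, int height)
{
    const size_t pixels = (size_t)width * height;
    const size_t srcStride = (size_t)width * 2;   /* UYVY: 2 bytes per pixel */
    uint8_t* y = destFrame;
    uint8_t* u = destFrame + pixels;
    uint8_t* v = u + pixels / 4;
    size_t i, row, j;

    /* Pass 1: luma is every second source byte, starting at offset 1. */
    for (i = 0; i < pixels; i++)
        y[i] = srcFrame[2 * i + 1];

    /* Pass 2: chroma, averaged over each pair of rows. */
    for (row = 0; row + 1 < (size_t)height; row += 2)
    {
        const uint8_t* top = srcFrame + row * srcStride;
        const uint8_t* bottom = top + srcStride;

        for (j = 0; j < srcStride; j += 4)
        {
            *u++ = (uint8_t)((top[j] + bottom[j]) / 2);          /* U */
            *v++ = (uint8_t)((top[j + 2] + bottom[j + 2]) / 2);  /* V */
        }
    }
}
```

The source is read twice here, so on a machine that is already limited by memory bandwidth this could turn out slower than a single pass. It is worth timing both variants.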

Finally, I assume that optimization is turned on in the compiler? Sometimes optimizing for speed is not the correct way to go, and optimizing for minimal code size works better. Although again, I have not played with this for some time, and today this distinction may no longer matter.

good luck!

Author Comment

by:Shiv_Sg
ID: 37746071
Hi, thanks a lot for the quick response. I currently have a dual-core CPU, so I will try adding more threads and see how much it improves.

I am also trying to modify the code based on your space tradeoff suggestion in para 3. I'll let you know how things improve.

Thanks again.
Regards
Shiv