• C

How code organization is related to performance (MIPS architecture)

I'm working on code for a packet-processing device running on a MIPS processor under Wind River's VxWorks operating system. Recently I started measuring the performance of the system, and I found that minor code changes, completely unrelated to the core function, cause big performance differences. Studying it further, I found that adding a few bytes of dummy code at certain locations can hurt performance by up to 25%.
I could not link the effect to any compiler/linker switches or to the link order.
I know MIPS is sensitive to code alignment, but I'm sure the compiler and linker are taking care of that.
Any idea what can cause such a big difference in performance?

Appreciate your help,

Slordak Commented:
Speaking generally, there are a huge number of things which can impact the performance of a given block of code, and the performance can in fact vary widely with just a few changes which don't influence the core algorithm.  It is true that with an efficient optimizing compiler, most of these can be addressed by the compiler (instruction scheduling, loop unrolling, register allocation), but there are still cases where this is not true.  The best architectures tend to be those where the compiler can work together with the processor to improve performance, such as by giving hints about which way a branch will typically be taken or by pre-fetching code which will be needed soon.

Some examples of areas in which minor changes can have relatively large impacts:

1) Cache sizes / Working sets - If the code and data for a given routine fits entirely in the L1 cache of the processor, it's easy to see how this will give the best performance.  Adding a few stray variables or chunks of code which push the size over the limit can result in having to fetch cache blocks in and out, in effect thrashing the cache.

  An interesting example of this... Let's say your data cache, with an LRU replacement policy, can cache 32 elements of some array you're looping through.  If you have an array of size 32, you're in luck: the entire array is cached after going through the loop once.  However, consider what happens if you change the size to 33.  Now, when you hit the very last element, the first element is evicted.  Next you attempt to work with the first element, but it's been evicted, so you fetch it and it replaces the second element, and so on.  Now every single access misses the cache, on each and every iteration!  Very bad.

2) Instruction & register scheduling - While admittedly this likely doesn't apply to any embedded MIPS processors, on "out of order execution" processors, having the right types of instructions available at the right times with minimal interdependencies greatly increases the average number of instructions which can be executed per clock.  The compiler usually takes care of this, thankfully, but it is possible to give bad hints even on sequential execution processors by doing things such as asking to have infrequently used variables always stored in registers (by declaring variables of type "register") when the register could more effectively be used for other values.

  It's more than likely that what you're seeing is in some way related to #1 above, either that you've crossed page boundaries by adding a few lines of code or that you've increased the size of the working set by just enough to greatly reduce the efficiency of the cache.
Well, the easiest thing to do is to disassemble the binary and look at what the actual assembly code is doing: where your code is placed, how your structures are packed, and so on; there can be many variants. A long time ago I did this for Alpha: write C code, disassemble it, and analyze the assembly to fine-tune.
If you haven't yet, you should definitely read this book. Strongly recommended.


Dan_Shule (Author) Commented:
Looking further into this, I believe it has to do with the use of packed structures in the code.
Using packed structures eases handling of TCP/IP packets by enabling direct access to the various fields. It also improves performance; for example, when preparing a packet to be sent, a block copy can be used instead of a byte-by-byte copy.
It seems the packed structures cause the alignment problem. I was sure the compiler was taking care of padding uneven structures...
Am I supposed to take care of the padding manually? How?
Is there any special treatment for packed structures that I'm missing?

Be careful.  All CPUs have different assumptions about how structures can/should be packed.  The compiler takes care of it, the best it can.  It depends on how optimized the compiler is for your particular CPU.  There are generally LOTS of rules you have to know if you want to manually optimize this in assembly language.

Perhaps it would be good to step back for a minute.

Here is how I would recommend you approach it:

1. What is the performance problem?  Too long to do a certain operation?  Too much memory used?  Is the CPU waiting for data, or is the rest of the program waiting for the CPU?

2. What are the exact steps to reproduce the performance problem?  You'll need to know this in order to verify that any enhancements you make actually work.

3. Where in the code does it look like the problem is?  Use comments and timing to figure out which blocks are running too slow or using too much memory.  On Windows, you can use GetTickCount() to see how long a particular block of code takes to run.

When you find the block (large or small) where the problem seems to be, then shift to the next part.  Diagram on your whiteboard how data moves and how control flows, to figure out what is really happening (not what you think is happening).  Things to look for:

* Is there a loop that gets executed many times (like millions or more)?
* Are you doing some searching?
* Are you doing some sorting?
* Are there large structures (linked lists, queues, etc.) being created and destroyed multiple times?

If you are having a problem in the search or sort code, then consider changing algorithms.  Maybe you started with a quick and easy bubble sort, but now you have so much data, that you need to change to a quicksort.  Maybe you are using a simple linear search, but now you need to upgrade to a hash table.  Maybe there is something you are doing repeatedly inside a loop that can be moved outside the loop, and the intermediate result saved.

In general, consider looking at things in this order, to maximize your time, and only fix what needs to be fixed:

1. Algorithms
2. Unnecessary memory allocation/deallocation
3. Unnecessary file I/O
4. Unnecessary repeated calls to same function
5. Loop optimization
6. Unnecessary parameter passing
7. Functions that can be moved inline, instead of having the overhead of a call.

Those are a few thoughts.

Dan_Shule (Author) Commented:
Your comments are interesting, though they did not help solve my specific problem.
