Go Premium for a chance to win a PS4. Enter to Win


How code organization is related to performance (MIPS architecture)

Posted on 2001-06-25
Medium Priority
Last Modified: 2010-04-15
I'm working on code for packet processing device running on MIPS processor using Windriver's VxWorks Operating System. Recently I started measuring the performance of the system. I found that minor code changes not related at all to the core function is causing big performance difference. Studying it further I found that adding dummy code of few bytes on certain locations in the code may hit the performance up to 25%.
I did not find any linkage to any compiler/linker switches or to the linking order.
I know MIPS is sensitive to code alignment but I'm sure the compiler / Linker are taking care of that.
Any Idea what can cause such a big difference in performance?

Appreciate your help,

Question by:Dan_Shule

Expert Comment

ID: 6226228
well, the easiest thing to do is look at the actual assembly code after you disassemble it and look at what it's doing. Where you put code, how your structures are packed, etc.. there can be many variants. Long time ago I did this for Alpha... write C code, disassemble it and analyze the assembly code to fine tune.

Expert Comment

ID: 6226230
If you haven't yet. You should def. read this book. Strongly recommended.


Accepted Solution

Slordak earned 600 total points
ID: 6233463
Speaking generally, there are a huge number of things which can impact the performance of a given block of code, and the performance can in fact vary widely with just a few changes which don't influence the core algorithm.  It is true that with an efficient optimizing compiler, most of these can be addressed by the compiler (instruction scheduling, loop unrolling, register allocation), but there are still cases where this is not true.  The best architectures tend to be those where the compiler can work together with the processor to improve performance, such as by giving hints about which way a branch will typically be taken or by pre-fetching code which will be needed soon.

Some examples of areas in which minor changes can have relatively large impacts:

1) Cache sizes / Working sets - If the code and data for a given routine fits entirely in the L1 cache of the processor, it's easy to see how this will give the best performance.  Adding a few stray variables or chunks of code which push the size over the limit can result in having to fetch cache blocks in and out, in effect thrashing the cache.

  An interesting example of this... Let's say your data cache with LRU replacement policy can cache 32 elements of some array you're looping through.  If you have an array of size 32, you're in luck, the entire array is cached after going through the loop once.  However consider if you change the size to 33.  Now, when you hit the very last element, the first element is replaced.  Next you attempt to work with the first element, but it's been replaced, so you fetch the first element and replace the second element with it, etc. etc.  Now, nothing ever hits in the cache on each and every iteration!  Very bad.

2) Instruction & register scheduling - While admittedly this likely doesn't apply to any embedded MIPS processors, on "out of order execution" processors, having the right types of instructions available at the right times with minimal interdependencies greatly increases the average number of instructions which can be executed per clock.  The compiler usually takes care of this, thankfully, but it is possible to give bad hints even on sequential execution processors by doing things such as asking to have infrequently used variables always stored in registers (by declaring variables of type "register") when the register could more effectively be used for other values.

  It's more than likely that what you're seeing is in some way related to #1 above, either that you've crossed page boundaries by adding a few lines of code or that you've increased the size of the working set by just enough to greatly reduce the efficiency of the cache.

Modern healthcare requires a modern cloud. View this brief video to understand how the Concerto Cloud for Healthcare can help your organization.


Author Comment

ID: 6256658
Looking further into this I believe it has to do with the use of packed structures in the code.
Using packed structures eases handling of TCP/IP packets by enabling direct access to the various fields. It also improve the performance, for example, when preparing a packet to be sent, a block copy can be used instead of byte by byte copy.
It seems like the packed structures cause the alignment problem. I was sure the compiler is taking care of padding aneven structures...
Am I supposed to take care of the padding manually? How?
Is there any special treatment for packed structure that I'm missing?


Expert Comment

ID: 6725306
Be careful.  All CPUs have different assumptions about how structures can/should be packed.  The compiler takes care of it, the best it can.  It depends on how optimized the compiler is for your particular CPU.  There are generally LOTS of rules you have to know if you want to manually optimize this in assembly language.

Perhaps it would be good to step back for a minute.

Here is how I would recommend you approach it:

1. What is the performance problem?  Too long to do a certain operation?  Too much memory used?  Is the CPU waiting for data, or is the rest of the program waiting for the CPU?

2. What are the exact steps to reproduce the performance problem?  You'll need to know this in order to verify that any enhancements you make actually work.

3. Where in the code does it look like the problem is?  Use comments and timing to figure out which blocks are running too slow or using too much memory.  On Windows, you can use GetTickCount() to see how long a particular block of code takes to run.

When you find the block (large or small) where the problem seems to be, then shift to the next part.  Diagram on your white board, how data moves, and how control goes to figure out what is really happening (not what you think is happening.  Things to look for:

* Is there a loop that gets executed many times (like millions or more)?
* Are you doing some searching?
* Are you doing some sorting?
* Are there large structures (linked lists, queues, etc.) being created and destroyed multiple times?

If you are having a problem in the search or sort code, then consider changing algorithms.  Maybe you started with a quick and easy bubble sort, but now you have so much data, that you need to change to a quicksort.  Maybe you are using a simple linear search, but now you need to upgrade to a hash table.  Maybe there is something you are doing repeatedly inside a loop that can be moved outside the loop, and the intermediate result saved.

In general, consider looking at things in this order, to maximize your time, and only fix what needs to be fixed:

1. Algorithms
2. Unnecessary memory allocation/deallocation
3. Unnecessary file I/O
4. Unnecessary repeated calls to same function
5. Loop optimization
6. Unnecessary parameter passing
7. Functions that can be moved inline, instead of having the overhead of a call.

Those are a few thoughts.


Author Comment

ID: 6949175
Your comments are interesting though It did not help to solve my specific problem.


Featured Post

Industry Leaders: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

This tutorial is posted by Aaron Wojnowski, administrator at SDKExpert.net.  To view more iPhone tutorials, visit www.sdkexpert.net. This is a very simple tutorial on finding the user's current location easily. In this tutorial, you will learn ho…
Examines three attack vectors, specifically, the different types of malware used in malicious attacks, web application attacks, and finally, network based attacks.  Concludes by examining the means of securing and protecting critical systems and inf…
The goal of this video is to provide viewers with basic examples to understand how to use strings and some functions related to them in the C programming language.
Video by: Grant
The goal of this video is to provide viewers with basic examples to understand and use while-loops in the C programming language.
Suggested Courses

885 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question