Why might a Delphi Multi-threaded Windows application show no performance improvement?

I am developing a 32-bit Windows application using Delphi XE3

I have a section code that carries out detailed computation that needs speeding up and can be run in multiple threads

I have designed it so that there is no need to use Mutex's or Critical Sections to control access while the individual threads run.
I make no attempt to set affinity or suggest ideal processor to use

The analysis takes about 1 minute in a single thread
The analysis takes about 40 seconds using multiple threads using all 6 available cores - i was expecting closer to 15-20 seconds (accepting some performance loss as inevitable)

Looking at the CPU performance .. the application does not seem to be using the full 6 cores.

If I replace the analysis code with a simple loop counter to burn CPU time I see full core usage .. and close to 6x speed improvement

So I suspect there is something in the analysis code that is causing threads to wait for each other.

The analysis does make extensive use of lists and creates new objects to put in these lists.

Are there implicit mutex/critical section issues associated with object creation/memory usage that might explain the lack of performance ?
Are there any good tips that might help unlock this performance issue?

Many thanks in anticipation for your thoughts/help

Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

MerijnBSr. Software EngineerCommented:
This is a known issue, this is a good read to start: http://blog.synopse.info/post/2010/07/15/Delphi-doesn-t-like-multi-core-CPUs-%28or-the-contrary%29

So basically memory allocation and string handling will cause these issues.
MerijnBSr. Software EngineerCommented:
Ultimate Tool Kit for Technology Solution Provider

Broken down into practical pointers and step-by-step instructions, the IT Service Excellence Tool Kit delivers expert advice for technology solution providers. Get your free copy now.

Sinisa VukCommented:
There is a great unit: threadpool from Amine Moulay Ramdane (he deals with threads, parallel processing,...)

...another EE question
Geert GOracle dbaCommented:
mostly a calculation can be done in several ways
some are fast, some are very slow

the accuracy of the calculation is also a performance item

there is some threading libraries out there.
by far the most extensive is otl from Primoz ... but also somewhat complex

i can't believe that critical sections (or mutex) would take such a toll performance wise
to find the most time consuming item, you could  use a profiler
this logs every func/proc with a time period
you could also write your own log.
the only thing needed is item/start/end
with excel and pivot it should be easy to determine what the badly performing item is
jon_rsAuthor Commented:
Thanks for all your contributions.

So in summary (as MerijnB states) my problem is associated with
1) Memory Allocation
2) String usage.

My algorithm is a complex recursive search that uses TLists.
The algorithm creates new objects to put into the lists.
So its very design involves creating and freeing memory constantly.
Is there an alternative memory manager that will help me ?
The associated links people have suggested led me to start using ScaleMM2 ( https://code.google.com/p/scalemm/ )
This claims to resolve most of the issues associated with memory allocation locks and does seem to have brought some performance inprovements  

I have reduced my usage of strings within the objects .. but some string parsing is still required.
I have followed the guidelines suggested here http://synopse.info/forum/viewtopic.php?id=322 and converted my use of string passing in func/proc calls to be (const : AString : string)

This also seems to have resulted in me seeing some periods when then core usage is increased .. but it's not solved all my problems.

Would a complete move to using pchar be worth considering?
Is there an alternative to string  (TStringBuilder?) worth considering ... some comments suggest that the standard Delphi version of this still uses strings internally so may not be all that helpful.

MerijnBSr. Software EngineerCommented:
The hard thing is that it's all small steps. The steps you've taken are some (I have to say I tried ScaleMM2 but ended up not using it over FastMM). Moving to pchar might help, using a stringbuilder alternative as well, it all depends on what is actually the bottleneck.

Did you run a profiler? If not, do that first to see what the most problematic bottlenecks are.
jon_rsAuthor Commented:
I haven't run a profiler yet ...
BUT would a profiler help me here?
I am not concerned about the performance of my algorithm ... I am trying to find and avoid hidden asm LOCKS inserted by the compiler that prevent my threads from making full use of the cores.
MerijnBSr. Software EngineerCommented:
You are right, can you try to replace some parts temporary? So instead of making new objects all the time, use some you have pre created up front, or the other way around, do more object creation. See if efficiency goes up or down, and which change (string changes vs object creation) gains the most.

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
Geert GOracle dbaCommented:
this is only for determining where time is spent ... it's a tool to identify the bottleneck
not a solution to a problem
be aware that a profiler will add to the total time while it's active
it's a tool to get an overall idea where the time consuming parts are

recursive ?
try and write it in a non-recursive way ...

programming something in a recursive way increases the use of the stack
a function call has to push it's attributes on the stack and then pop those from the stack when finished
technically ... everything which is recursive can be written in a non-recursive way
(with a little more code)
rewriting this might only give an extra sec gain
MerijnBSr. Software EngineerCommented:
as an addition to Geert's comment "be aware that a profiler will add to the total time while it's active":

There are 2 kinds of profilers, something called a sampling profilers and instrumenting profilers.
Geert's comment is true for instrumenting profilers, not for sampling profilers.

But Geert, what we need to find out is what code disallowing proper multithreading, not what code is slow, hence a profiler is not really useful.
Geert GOracle dbaCommented:
so you know what time is spent where ?
proper multithreading really depends on what needs to be consecutively executed and what not

for example:
filling up a swimming pool and then jumping in from 5m high i would only do in that order
 the jumping in part only takes a few seconds, but doing it first can have consequences
Geert GOracle dbaCommented:
on a side note: ... what code ?
i haven't seen much code here
jon_rsAuthor Commented:
My code is an optimisation algorithm to find the best combination of  products to manufacture given variations in income/costs associated with the volumes chosen.  It's complicated, and object oriented (so is quite non-linear in its sequencing and dispersion across multiple inherited poly-morphic elements). Trust me .. you don't want to see the code :-)

It is a decision making element for an agent based model ... so these choices are being made by each agent.
The algorithm is a method within an object.
The model has many hundreds of agents, each needing to make this decision every so often

I have until now processed each agent in sequence .. but as each agent's decision making at this stage is independent of the other agents it is suitable for the agents to be batched up and managed within a number of threads.

The search algorithm has to potentially manage an exponentially increasing combination of options.  I use a recursion routine to minimize the search space. The alternative is to consider every single combination ... doing this would cause a massive loss of speed.

So .. for my problem I cannot move aware from a recursive solution
The problem is also memory intensive .. so I manage this by creating objects and lists as and when I need them .. otherwise the whole thing blows up.
I need to stay within a 32-bit environment for the moment because my model interfaces with another 32-bit model

To focus on the actual issue that I think is preventing me making the most of the CPU power:

1) Is there a way to ensure that objects created by another object within a thread will allocate the memory either locally (on the stack) or in some protected part of the heap so that the LOCK does not need to be applied.
ShareMM2 seems to claim to do the second bit ... but my understanding of where Delphi grabs memory for objects in general is limited

2) Are there approaches to managing string concatenation and manipulation that minimize or eliminate the need for the compiler to insert LOCK instructions

MerijnBSr. Software EngineerCommented:
This is old, but maybe you can give it a shot? http://rmarsh.com/2005/08/28/objects-on-the-stack-a-delphi-relic/

This thread is changing from interesting-to-help someone to interesting-to-learn-something ;)
Geert GOracle dbaCommented:
ok, it's a lot of code ... :)
in general,
optimization/planning programs need to load lots of data and investigate lots of possibilities
loading and unloading those data objects does take time
the calculation process usually takes less time than loading

every gain on performance is a plus...
but what is your goal ?

if you need to calculate items and return them into a protected container
maybe a lock-free list to insert them into could also be a solution

the OTL library has such a list mechanism (as do others)
it is something which many multi-thread programmers run into sooner or later ...
queueing and dequeueing a workload can fail or work incredibly fast depending on how the list is setup
jon_rsAuthor Commented:
This is a complex interaction problem that has lots of potential solutions / approaches and which I will be delving into in the coming weeks.
Thanks for all your suggestions, and comments.
I'll close for the time being because the onus is now on me to act on some of the suggestions :-)

If I find out anything else / gain some helpful insights I'll post again

Best wishes

MerijnBSr. Software EngineerCommented:
Good luck and yes, please keep us posted jon, curious how this one rolls out :)
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today

From novice to tech pro — start learning today.

Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.