Solved

Why might a Delphi Multi-threaded Windows application show no performance improvement?

Posted on 2014-09-12
18
503 Views
Last Modified: 2014-09-19
I am developing a 32-bit Windows application using Delphi XE3

I have a section code that carries out detailed computation that needs speeding up and can be run in multiple threads

I have designed it so that there is no need to use Mutex's or Critical Sections to control access while the individual threads run.
I make no attempt to set affinity or suggest ideal processor to use

The analysis takes about 1 minute in a single thread
The analysis takes about 40 seconds using multiple threads using all 6 available cores - i was expecting closer to 15-20 seconds (accepting some performance loss as inevitable)

Looking at the CPU performance .. the application does not seem to be using the full 6 cores.

If I replace the analysis code with a simple loop counter to burn CPU time I see full core usage .. and close to 6x speed improvement

So I suspect there is something in the analysis code that is causing threads to wait for each other.

The analysis does make extensive use of lists and creates new objects to put in these lists.

Are there implicit mutex/critical section issues associated with object creation/memory usage that might explain the lack of performance ?
Are there any good tips that might help unlock this performance issue?

Many thanks in anticipation for your thoughts/help

Jon
0
Comment
Question by:jon_rs
  • 8
  • 5
  • 4
  • +1
18 Comments
 
LVL 19

Assisted Solution

by:MerijnB
MerijnB earned 429 total points
ID: 40319011
This is a known issue, this is a good read to start: http://blog.synopse.info/post/2010/07/15/Delphi-doesn-t-like-multi-core-CPUs-%28or-the-contrary%29

So basically memory allocation and string handling will cause these issues.
0
 
LVL 19

Assisted Solution

by:MerijnB
MerijnB earned 429 total points
ID: 40319018
0
 
LVL 19

Assisted Solution

by:MerijnB
MerijnB earned 429 total points
ID: 40319060
0
 
LVL 25

Assisted Solution

by:Sinisa Vuk
Sinisa Vuk earned 71 total points
ID: 40319119
There is a great unit: threadpool from Amine Moulay Ramdane (he deals with threads, parallel processing,...)

...another EE question
0
 
LVL 37

Expert Comment

by:Geert Gruwez
ID: 40322524
mostly a calculation can be done in several ways
some are fast, some are very slow

the accuracy of the calculation is also a performance item

there is some threading libraries out there.
by far the most extensive is otl from Primoz ... but also somewhat complex
http://otl.17slon.com/

i can't believe that critical sections (or mutex) would take such a toll performance wise
to find the most time consuming item, you could  use a profiler
this logs every func/proc with a time period
you could also write your own log.
the only thing needed is item/start/end
with excel and pivot it should be easy to determine what the badly performing item is
0
 

Author Comment

by:jon_rs
ID: 40322781
Thanks for all your contributions.

So in summary (as MerijnB states) my problem is associated with
1) Memory Allocation
2) String usage.

My algorithm is a complex recursive search that uses TLists.
The algorithm creates new objects to put into the lists.
So its very design involves creating and freeing memory constantly.
Is there an alternative memory manager that will help me ?
The associated links people have suggested led me to start using ScaleMM2 ( https://code.google.com/p/scalemm/ )
This claims to resolve most of the issues associated with memory allocation locks and does seem to have brought some performance inprovements  

I have reduced my usage of strings within the objects .. but some string parsing is still required.
I have followed the guidelines suggested here http://synopse.info/forum/viewtopic.php?id=322 and converted my use of string passing in func/proc calls to be (const : AString : string)

This also seems to have resulted in me seeing some periods when then core usage is increased .. but it's not solved all my problems.

Would a complete move to using pchar be worth considering?
Is there an alternative to string  (TStringBuilder?) worth considering ... some comments suggest that the standard Delphi version of this still uses strings internally so may not be all that helpful.

Jon
0
 
LVL 19

Assisted Solution

by:MerijnB
MerijnB earned 429 total points
ID: 40322821
The hard thing is that it's all small steps. The steps you've taken are some (I have to say I tried ScaleMM2 but ended up not using it over FastMM). Moving to pchar might help, using a stringbuilder alternative as well, it all depends on what is actually the bottleneck.

Did you run a profiler? If not, do that first to see what the most problematic bottlenecks are.
0
 

Author Comment

by:jon_rs
ID: 40322930
I haven't run a profiler yet ...
 
BUT would a profiler help me here?
I am not concerned about the performance of my algorithm ... I am trying to find and avoid hidden asm LOCKS inserted by the compiler that prevent my threads from making full use of the cores.
0
 
LVL 19

Accepted Solution

by:
MerijnB earned 429 total points
ID: 40323497
You are right, can you try to replace some parts temporary? So instead of making new objects all the time, use some you have pre created up front, or the other way around, do more object creation. See if efficiency goes up or down, and which change (string changes vs object creation) gains the most.
0
How to run any project with ease

Manage projects of all sizes how you want. Great for personal to-do lists, project milestones, team priorities and launch plans.
- Combine task lists, docs, spreadsheets, and chat in one
- View and edit from mobile/offline
- Cut down on emails

 
LVL 37

Expert Comment

by:Geert Gruwez
ID: 40324869
profiler
this is only for determining where time is spent ... it's a tool to identify the bottleneck
not a solution to a problem
be aware that a profiler will add to the total time while it's active
it's a tool to get an overall idea where the time consuming parts are

recursive ?
try and write it in a non-recursive way ...

programming something in a recursive way increases the use of the stack
a function call has to push it's attributes on the stack and then pop those from the stack when finished
technically ... everything which is recursive can be written in a non-recursive way
(with a little more code)
rewriting this might only give an extra sec gain
0
 
LVL 19

Expert Comment

by:MerijnB
ID: 40324871
as an addition to Geert's comment "be aware that a profiler will add to the total time while it's active":

There are 2 kinds of profilers, something called a sampling profilers and instrumenting profilers.
Geert's comment is true for instrumenting profilers, not for sampling profilers.

But Geert, what we need to find out is what code disallowing proper multithreading, not what code is slow, hence a profiler is not really useful.
0
 
LVL 37

Expert Comment

by:Geert Gruwez
ID: 40324983
so you know what time is spent where ?
proper multithreading really depends on what needs to be consecutively executed and what not

for example:
filling up a swimming pool and then jumping in from 5m high i would only do in that order
 the jumping in part only takes a few seconds, but doing it first can have consequences
0
 
LVL 37

Expert Comment

by:Geert Gruwez
ID: 40324987
on a side note: ... what code ?
i haven't seen much code here
0
 

Author Comment

by:jon_rs
ID: 40325033
My code is an optimisation algorithm to find the best combination of  products to manufacture given variations in income/costs associated with the volumes chosen.  It's complicated, and object oriented (so is quite non-linear in its sequencing and dispersion across multiple inherited poly-morphic elements). Trust me .. you don't want to see the code :-)

It is a decision making element for an agent based model ... so these choices are being made by each agent.
The algorithm is a method within an object.
The model has many hundreds of agents, each needing to make this decision every so often

I have until now processed each agent in sequence .. but as each agent's decision making at this stage is independent of the other agents it is suitable for the agents to be batched up and managed within a number of threads.

The search algorithm has to potentially manage an exponentially increasing combination of options.  I use a recursion routine to minimize the search space. The alternative is to consider every single combination ... doing this would cause a massive loss of speed.

So .. for my problem I cannot move aware from a recursive solution
The problem is also memory intensive .. so I manage this by creating objects and lists as and when I need them .. otherwise the whole thing blows up.
I need to stay within a 32-bit environment for the moment because my model interfaces with another 32-bit model

To focus on the actual issue that I think is preventing me making the most of the CPU power:

1) Is there a way to ensure that objects created by another object within a thread will allocate the memory either locally (on the stack) or in some protected part of the heap so that the LOCK does not need to be applied.
ShareMM2 seems to claim to do the second bit ... but my understanding of where Delphi grabs memory for objects in general is limited

2) Are there approaches to managing string concatenation and manipulation that minimize or eliminate the need for the compiler to insert LOCK instructions

Jon
0
 
LVL 19

Assisted Solution

by:MerijnB
MerijnB earned 429 total points
ID: 40325081
This is old, but maybe you can give it a shot? http://rmarsh.com/2005/08/28/objects-on-the-stack-a-delphi-relic/

This thread is changing from interesting-to-help someone to interesting-to-learn-something ;)
0
 
LVL 37

Expert Comment

by:Geert Gruwez
ID: 40330504
ok, it's a lot of code ... :)
in general,
optimization/planning programs need to load lots of data and investigate lots of possibilities
loading and unloading those data objects does take time
the calculation process usually takes less time than loading

every gain on performance is a plus...
but what is your goal ?

if you need to calculate items and return them into a protected container
maybe a lock-free list to insert them into could also be a solution

the OTL library has such a list mechanism (as do others)
it is something which many multi-thread programmers run into sooner or later ...
queueing and dequeueing a workload can fail or work incredibly fast depending on how the list is setup
0
 

Author Closing Comment

by:jon_rs
ID: 40333718
This is a complex interaction problem that has lots of potential solutions / approaches and which I will be delving into in the coming weeks.
Thanks for all your suggestions, and comments.
I'll close for the time being because the onus is now on me to act on some of the suggestions :-)

If I find out anything else / gain some helpful insights I'll post again

Best wishes

Jon
0
 
LVL 19

Expert Comment

by:MerijnB
ID: 40334105
Good luck and yes, please keep us posted jon, curious how this one rolls out :)
0

Featured Post

How your wiki can always stay up-to-date

Quip doubles as a “living” wiki and a project management tool that evolves with your organization. As you finish projects in Quip, the work remains, easily accessible to all team members, new and old.
- Increase transparency
- Onboard new hires faster
- Access from mobile/offline

Join & Write a Comment

Suggested Solutions

Title # Comments Views Activity
Using YubiKey with REST API application 2 82
strCount chalenge 3 51
Working with hours 3 31
Homework Math Question 1 65
Article by: Nadia
Linear search (searching each index in an array one by one) works almost everywhere but it is not optimal in many cases. Let's assume, we have a book which has 42949672960 pages. We also have a table of contents. Now we want to read the content on p…
In this post we will learn how to connect and configure Android Device (Smartphone etc.) with Android Studio. After that we will run a simple Hello World Program.
Viewers will learn how to properly install Eclipse with the necessary JDK, and will take a look at an introductory Java program. Download Eclipse installation zip file: Extract files from zip file: Download and install JDK 8: Open Eclipse and …
In this seventh video of the Xpdf series, we discuss and demonstrate the PDFfonts utility, which lists all the fonts used in a PDF file. It does this via a command line interface, making it suitable for use in programs, scripts, batch files — any pl…

746 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

10 Experts available now in Live!

Get 1:1 Help Now