As chip makers focus on adding processor cores over increasing clock speed, developers need to utilize the features of modern CPUs. One of the ways we can do this is by implementing parallel algorithms in our software.
One recent task I needed to perform at home was to find and document large files in certain folders. I regularly back up documents and source code, and large binaries in those folders can overflow the media I use for storage. So I wanted a program that could scan through all the files in a folder and build a list of files over a certain size. I thought this would be a good opportunity to use some multithreading and see how much the performance improved.
In terms of parallel processing, .NET developers have several choices. BackgroundWorker objects are useful for performing lengthy calculations off the UI thread so that the UI can remain responsive. You can create your own threads by instantiating new Thread objects with delegates to the routines you wish to call. You can also use the built-in ThreadPool.
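As a minimal sketch of the "create your own thread" option, the console module below starts one dedicated Thread with a delegate and waits for it with Join. The module name, the SumNumbers routine, and the mTotal field are hypothetical and exist only for illustration; they are not part of the example project.

```vbnet
Imports System
Imports System.Threading

Module ThreadSketch
    'Hypothetical shared result written by the worker thread.
    Private mTotal As Long

    'A Sub with no parameters matches the ThreadStart delegate.
    Private Sub SumNumbers()
        Dim total As Long = 0
        For i As Integer = 1 To 1000
            total += i
        Next
        mTotal = total
    End Sub

    Public Function RunOnDedicatedThread() As Long
        Dim worker As New Thread(AddressOf SumNumbers)
        worker.Start()
        'Join blocks the caller until the worker thread has finished.
        worker.Join()
        Return mTotal
    End Function

    Sub Main()
        Console.WriteLine(RunOnDedicatedThread()) 'prints 500500
    End Sub
End Module
```

This works well for a handful of long-running tasks, but every Thread object carries its own start-up cost, which is one reason to look at a pool when there are many small jobs.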
If you want an aggressively multithreaded application, the ThreadPool object is ideal. It is designed so that you can queue work items, and as worker threads become available to the pool, the threads will execute the items you queued. It’s convenient as the framework handles the business of queuing and signaling the threads to start.
In terms of design, I took the easy way out. I thought I would create a new work item for each folder and dump it in the queue. I will use a ManualResetEvent object for the worker threads to signal that they are finished. The ThreadPool does not have an easy built-in way to determine the state of individual work items.
I have a simple class that saves thread state information and is passed into the work item.
Private Class ThreadState
    Private mDirInfo As DirectoryInfo
    Public Property DirInfo As DirectoryInfo
        Get
            Return mDirInfo
        End Get
        Set(ByVal value As DirectoryInfo)
            mDirInfo = value
        End Set
    End Property
    'Sum of size of all files in this directory (non-recursive.)
    Private mDirectorySize As Long
    Public Property DirectorySize As Long
        Get
            Return mDirectorySize
        End Get
        Set(ByVal value As Long)
            mDirectorySize = value
        End Set
    End Property
    Public Sub New(ByVal DirInfo As DirectoryInfo)
        Me.DirInfo = DirInfo
    End Sub
End Class
Here are some form level declarations which will be explained as we proceed.
'Track files which match the criteria
Private FileList As New List(Of FileInfo)
'Keep track of number of threads still unfinished
Private mNumActiveThreads As Integer
'Allow worker threads to signal back to main thread via waithandle.
Private mEv As New ManualResetEvent(False)
'This is just a handy list to maintain all the directory size information that is gathered.
'ThreadState objects are passed into the procedure used by the individual threads.
Private ThreadStateList As New List(Of ThreadState)
The call to QueueUserWorkItem takes a WaitCallback parameter, which is a delegate to the subroutine that you want the work item to execute. The last parameter is an object that I use to pass the ThreadState object to the subroutine. So that we know when the ThreadPool has finished processing the queue, we have a form-level counter that gets decremented each time a work item finishes. When the counter reaches zero, the worker calls .Set on the ManualResetEvent. The calling thread calls the WaitOne method on the reset event object, which blocks until .Set is called by the worker thread.
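The counter-plus-event pattern described above can be sketched in isolation as a small console module. This is a sketch under stated assumptions rather than the code from the example project: the names mPending, mDone, mSum, DoWork, and RunAll are hypothetical, and each work item just adds a number instead of scanning a folder.

```vbnet
Imports System
Imports System.Threading

Module CountdownSketch
    'Hypothetical names; they only mirror the roles described above.
    Private mPending As Integer
    Private mDone As New ManualResetEvent(False)
    Private mSum As Long

    'Matches the WaitCallback signature: a single Object parameter.
    Private Sub DoWork(ByVal state As Object)
        Interlocked.Add(mSum, CLng(CInt(state)))
        'The work item that brings the counter to zero signals the event.
        If Interlocked.Decrement(mPending) = 0 Then
            mDone.Set()
        End If
    End Sub

    Public Function RunAll(ByVal itemCount As Integer) As Long
        'With nothing queued, nobody would ever call Set, so return early.
        If itemCount <= 0 Then Return 0
        mSum = 0
        mDone.Reset()
        'Set the counter before queuing so no work item can reach zero early.
        mPending = itemCount
        For i As Integer = 1 To itemCount
            ThreadPool.QueueUserWorkItem(New WaitCallback(AddressOf DoWork), i)
        Next
        'Blocks the calling thread until a work item calls Set.
        mDone.WaitOne()
        Return Interlocked.Read(mSum)
    End Function

    Sub Main()
        Console.WriteLine(RunAll(10)) 'prints 55
    End Sub
End Module
```

Setting the counter once, before any item is queued, matters: if it were incremented inside the loop, an early work item could decrement it to zero and signal the event while later items were still waiting to be queued.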
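''' <summary>
''' This is the main routine that adds each directory as a
''' new threadpool work item.
''' </summary>
''' <param name="StartFolder"></param>
Public Sub ScanAllFilesMultithreaded(ByVal StartFolder As Object)
    FolderSize = 0
    NumFiles = 0
    Dim di As New DirectoryInfo(CStr(StartFolder))
    Dim DirInfoList As New List(Of DirectoryInfo)
    'Assumed reconstruction: the lines that filled DirInfoList were lost.
    'Here the start folder and all of its subfolders are collected.
    DirInfoList.Add(di)
    DirInfoList.AddRange(di.GetDirectories("*", SearchOption.AllDirectories))
    sw = Stopwatch.StartNew
    mNumActiveThreads = DirInfoList.Count
    For Each di In DirInfoList
        AddWorkItem(di)
    Next
    'WaitOne blocks until Set is called on the object in the
    'worker thread.
    mEv.WaitOne()
    sw.Stop()
End Sub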
''' <summary>
''' This adds items to the ThreadState list
''' and adds new work items to the thread pool, all
''' based on the directoryinfo object that is passed in.
''' </summary>
''' <param name="di"></param>
Private Sub AddWorkItem(ByVal di As DirectoryInfo)
    'Dim Ev As New ManualResetEvent(False)
    Dim ts As New ThreadState(di)
    ThreadStateList.Add(ts)
    ThreadPool.QueueUserWorkItem( _
        New WaitCallback(AddressOf ScanFiles), ts)
End Sub
The code that is called by each work item is in the ScanFiles subroutine. Notice the use of Interlocked.Add and Interlocked.Increment. The Interlocked class provides these thread-safe methods, which update a shared variable as a single atomic operation so that concurrent updates from different threads are not lost. Similarly, the FileList.Add method is wrapped in a SyncLock block. SyncLock allows only one thread at a time to execute the enclosed code. It is recommended that you don't overuse it: if large blocks of code are restricted this way, you create a bottleneck for your multithreading.
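To see the two mechanisms side by side outside the file scanner, here is a small sketch, assuming hypothetical names throughout (mCount, mItems, mItemsLock, Worker, RunWorkers). Several threads bump a shared counter with Interlocked.Increment and append to a shared list inside a short SyncLock block.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Threading

Module LockingSketch
    Private mCount As Integer
    Private mItems As New List(Of Integer)
    'A private object used only as the lock token.
    Private ReadOnly mItemsLock As New Object()

    'One Object parameter, so this matches ParameterizedThreadStart.
    Private Sub Worker(ByVal state As Object)
        Dim id As Integer = CInt(state)
        For i As Integer = 1 To 1000
            'Atomic read-modify-write; a plain mCount += 1 could lose updates.
            Interlocked.Increment(mCount)
        Next
        'Keep the locked region small: only the Add itself.
        SyncLock mItemsLock
            mItems.Add(id)
        End SyncLock
    End Sub

    Public Function RunWorkers(ByVal workerCount As Integer) As Integer
        mCount = 0
        mItems.Clear()
        Dim threads As New List(Of Thread)
        For i As Integer = 1 To workerCount
            Dim newThread As New Thread(AddressOf Worker)
            threads.Add(newThread)
            newThread.Start(i)
        Next
        For Each running As Thread In threads
            running.Join()
        Next
        Return mCount
    End Function

    Public Function ItemCount() As Integer
        Return mItems.Count
    End Function

    Sub Main()
        Console.WriteLine(RunWorkers(4)) 'prints 4000
        Console.WriteLine(ItemCount())   'prints 4
    End Sub
End Module
```

The counter needs no lock because each increment is a single atomic operation, while List(Of T) is not safe for concurrent writers, so its Add call is the only statement placed inside the SyncLock.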
''' <summary>
''' This is the procedure called by the thread pool work items.
''' See AddWorkItem for how this is set up.
''' </summary>
''' <param name="state"></param>
Private Sub ScanFiles(ByVal state As Object)
    Dim TS As ThreadState = CType(state, ThreadState)
    Dim FInfo As FileInfo
    Try
        For Each FInfo In TS.DirInfo.GetFiles
            'Since this routine can run simultaneously
            'in different threads, it is important to use
            'the Interlocked Increment and Add methods to ensure
            'that the form level variables are thread safe and
            'not subject to race conditions.
            Interlocked.Increment(NumFiles)
            Interlocked.Add(FolderSize, FInfo.Length)
            TS.DirectorySize += FInfo.Length 'Assumed: per-directory total
            If FInfo.Length > Limit Then
                'SyncLock ensures that only this thread
                'can access the filelist during this Add
                SyncLock FileList
                    FileList.Add(FInfo)
                End SyncLock
            End If
        Next
    Catch ex As Exception
        'Assumed: the original handler body was lost; errors are ignored here.
    End Try
    'When mNumActiveThreads reaches 0, all the workers are finished.
    If Interlocked.Decrement(mNumActiveThreads) = 0 Then
        mEv.Set()
    End If
End Sub
Testing and Results
I set up stopwatch objects and implemented a single threaded method that performed the same task, timing it in the same way.
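For comparison, a single-threaded baseline timed with a Stopwatch might look like the sketch below. This is not the method from the example project, which is not shown here; ScanSingleThreaded, its parameters, and the choice of the temp folder and a 1 MB limit in Main are all assumptions made for illustration.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Diagnostics
Imports System.IO

Module BaselineSketch
    'Hypothetical single-threaded scan: returns files larger than limitBytes.
    Public Function ScanSingleThreaded(ByVal startFolder As String, _
                                       ByVal limitBytes As Long) As List(Of FileInfo)
        Dim result As New List(Of FileInfo)
        'Explicit stack instead of recursion to walk the folder tree.
        Dim pending As New Stack(Of DirectoryInfo)
        pending.Push(New DirectoryInfo(startFolder))
        While pending.Count > 0
            Dim current As DirectoryInfo = pending.Pop()
            Try
                For Each f As FileInfo In current.GetFiles()
                    If f.Length > limitBytes Then result.Add(f)
                Next
                For Each d As DirectoryInfo In current.GetDirectories()
                    pending.Push(d)
                Next
            Catch ex As UnauthorizedAccessException
                'Skip folders the current user cannot read.
            Catch ex As IOException
                'Skip folders or files that vanish or fail during the scan.
            End Try
        End While
        Return result
    End Function

    Sub Main()
        Dim sw As Stopwatch = Stopwatch.StartNew()
        Dim big As List(Of FileInfo) = _
            ScanSingleThreaded(Path.GetTempPath(), 1024L * 1024L)
        sw.Stop()
        Console.WriteLine("{0} files over limit in {1} ms", _
                          big.Count, sw.ElapsedMilliseconds)
    End Sub
End Module
```

When comparing timings, it helps to run each variant more than once, since the operating system caches directory information after the first pass and the second scan of the same folder is usually faster regardless of threading.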
There are definitely large improvements in performance on my quad core system when using multithreading. The multithreaded scan consistently takes around 30%-40% of the time taken to scan in a single thread. This was better than I expected, as I was thinking that doing a single directory per work item might be less efficient because of the extra overhead. But the ThreadPool does a pretty good job of making your multithreading as efficient as it can be.
The ThreadPool class is a must-have tool for your kit if you have a large number of IO operations that need to happen asynchronously, or if you can spread calculations out in multiple chunks. The full source code for the test project is here:
FileSizes Example Project Download
It was written in Visual Studio 2005.