<

A Better Concatenate Function

Published on
35,538 Points
19,438 Views
16 Endorsements
Last Modified:
Awarded
Editor's Choice

Introduction

While answering a recent question (http:/Q_27311462.html), I created an alternative function to the Excel Concatenate() function that you might find useful.  I tested several solutions and share the results in this article as well as the performance analysis and recommendations.  This article serves as a complement to my Fast Data Push to Excel article, since VBA functions often 'pull' data from worksheets.

The Problematic Concatenate() function
I think the biggest problem with the Concatenate() function is that its parameters are text values and not truly range values.  This means that if you wanted to concatenate D4:H4 cell values, you would have to type
=CONCATENATE(D4,E4,F4,G4,H4)

Open in new window


If you supply a range as a parameter, as shown below, you will only see the contents of the D4 cell.
=CONCATENATE(D4:H4)

Open in new window


A secondary problem with the Concatenate() function is the 255 limit on the number of parameters it supports.  This is a secondary problem, since you could concatenate the results of multiple Concatenate() function results.  Users of older versions of Excel were limited to 255 characters per cell, so such a nesting of Concatenate() results would require multiple cells.

Note: If you are only interested in the better concatenate function, you may skip the educational sections that follow and proceed to the end of the article.

There are two parts major components that constitute this better concatenate function: string concatenation and Excel data retrieval.  There are many different combinations of functions and code that might accomplish these two constituent components.  However, some of these perform better and more consistently than others.  I thank my editor's patience while I ran the many performance tests required for me to evaluate which function and code combination is best.

Better Concatenation -- Part 1 (string concatenation)

There are many times where you must concatenate a lot of text data to produce the results you need.  Common examples are:
Creating SQL statements
Creating HTML
Creating delimited lists

The Performance Curve Knee
Most developers are taught, or learn through experience, that string concatenation can be a performance killer.  I'm not writing about individual statements that concatenate string literals.  I'm writing about string concatenation inside of a looping structure (For…Next, For Each…Next, Do…Loop).  You may do a lot of testing and never see a performance problem before your code goes into production.  Once the application sees data in a production environment, your application might slow to a crawl and slow other applications in the process.  That is because concatenation's performance curve has at least one 'knee'.  A 'knee' is a sharp bend in a curve that might reflect very flat/linear growth on one side of the bend and a very sharp rise on the other side of the bend.  It might reflect a noticeable change in the slopes of the lines on either side of the 'knee'.

Concatenation Methods and Their Performance
There are four different methods that can be used to concatenate a lot of string values in a loop.  The first method uses the ampersand (&) or plus sign (+) operator.  This will be our 'traditional' concatenation method -- append each string to the string we had built from the prior iteration.  The other three methods involve populating an array and then using the JOIN() function to concatenate all the array items -- we'll call them DicAndJoin, ColToArrayAndJoin, and ArrayAndJoin.

The ConcatPerfTestResults worksheet in the attached workbook (md5-testing.xls) contains the summarized results.

This test of standard concatenation will ignore/remove the functional feature that allows a delimiter string to be inserted between concatenated values.  This simplifies the code to a point where we are only measuring the performance of the concatenation operation.

Please note that the 'knee' will move as the size of the strings get large.  In this large-string test, I concatenated strings that were seven times larger than my first test (163 versus 23 characters).  The 'knee' moved closer to 2000 iterations, with a smaller 'knee' around 140 iterations.

As expected, the cost-per-iteration increases when using the ampersand operator.  The cost of executing the VBA code is the greatest contributor to the performance at the 10 iteration level, so you should weigh those rows lightly when evaluating the performance of standard concatenation.  What we can easily see is that there is little difference between the two fastest methods and ampersand concatenation operator performance at the 100 iteration level.
Min     	Max     	Avg     	performance per iteration
4.05E-05	4.59E-05	4.28E-05	4.28E-07 (array no delimiter)
6.25E-04	7.12E-04	6.44E-04	6.44E-06 (dictionary no delimiter)

5.13E-05	5.27E-05	5.19E-05	5.19E-07 (amp. concat small string)
2.25E-04	2.50E-04	2.36E-04	2.36E-06 (amp. concat large string)

Open in new window


Once you pass the 100 iteration level, the ampersand concatenation performance starts to degrade below that of the two other methods.  Although we see a 'knee' after 400 iterations, where the iteration increase is less than the incremental (per-iteration) cost.  I continued to test past 10000 iterations and encountered unacceptable performance, so I set 10000 as my new baseline just so I could see what happens in worst-case scenarios.  When you look at the test results in the attached workbook, such baseline changes are noted as a blue fill color and a comment.

Three Clear Winners
As expected, ArrayAndJoin, ColToArrayAndJoin, and DicAndJoin are far superior methods for concatenation, especially as the iterations or size of the string increase.  Although ArrayAndJoin gives better performance, it does require some knowledge of the number of strings that it needs to handle.  If you are feeding the ArrayAndJoin implementation function from an Excel worksheet, you would likely know the number of Cells, Ranges, or Areas to be concatenated.  If feeding from a data source, you would know the number of fields or records.  
Note: There is a follow-up article to this one showing a general purpose string concatenation function.

Since there are two different list objects (collection and dictionary), I needed to compare their relative performance profiles.  Unlike the dictionary's Items array, a collection requires me to transfer the strings to a string array.  That additional work is offset by additional work (hashing) performed by the dictionary and the additional Join() function work (convert variant data to string) when passed the dictionary's Items variant array.

The following table is a comparison to the collection and dictionary Add methods and the subsequent Join() process.  The reported times are normalized by the number of iterations of the Add methods.
Avg           	Min           	Max          	JoinTest
8.78709E-06	4.09898E-06	3.89142E-05	TestColObj
8.93989E-06	4.092E-06 	3.91921E-05	TestColObjWithJoin
1.8528E-05	2.83479E-06	6.59483E-05	TestDicObj
2.15729E-05	5.37892E-06	6.80673E-05	TestDicObjWithJoin

Open in new window


Notes:
Unless noted with a blue color, comparisons are made from the previous row.  Blue color indicates the baseline for subsequent row comparisons.
There is at least one known concatenate operation that bests the Join() operator, courtesy of the VBSpeed folks: http://www.xbeat.net/vbspeed/c_Join.htm#Join09 
If you code a lot of statements that do a lot of string concatenation, you may also encounter this problem without any looping.
A 'knee' is sometimes referred to as a 'hockey stick'
In some cases, results were excluded from the test results, since the VBA engine seems to give poor performance the first execution after a compile and Windows interrupts aren't completely prevented.
The compiled VB run-time environment runs on an 800MHz CPU.  The Excel/VBA run-time environment runs on a 2GHz CPU.

Delimiter Insertion Methods and Their Performance
I tested different methods of inserting a delimiter in the ampersand concatenation loops.  This might seem like a pointless exercise now, but when I started the testing, I didn't want to let any assumptions prevent testing of all my options.  Later testing supported the adage that concatenation can be a poor performer (under the right circumstances), so these results are for your general education.
CheckLen - Use the Len() function to check if the concatenated string length is zero, indicating that this is the first iteration and a delimiter is not needed.
ValueAssign - Always concatenate a delimiter string, but the delimiter string variable is an empty string during the first iteration.  For every subsequent iteration, the delimiter string variable is assigned the value of the delimiter.
PreLoop - The concatenated string is assigned the value of the first item and the looping structure starts iterating at the second item.  This is not easily implemented with a For Each…Next looping structure.
CheckInit - Rather than using the Len() function, a different (intrinsic data type) variable is used to indicate the first time through the loop.  In the test, I used a Boolean variable.

Notes:
There is no substantive performance difference between these four methods of handling delimiters.
There is no substantive difference in the presence of the delimiter for the Join() methods.


Better Concatenation -- Part 2 (accessing cell data)

I am only going to consider solutions that use one of the three knee-less concatenation implementations, ArrayAndJoin, ColToArrayAndJoin, or DicAndJoin.  Since we're going to be working with arrays, it is time to segue into the complementary article material.  In my Fast Data Push to Excel article, I showed you how to populate a contiguous range of cells with a single statement, assigning the range's values to the array.  In this article, we are doing the opposite -- Pulling Data from Excel.  Analogous to the data push method, we can directly assign a contiguous range's values to an array.

The challenge is that we assume that we have a non-contiguous range.  So, we need to do some testing to see which one is the best performer.  What are our method choices?  We can iterate areas or iterate cells.  When iterating areas, we can have six choices:
Iterate the cells in the range
Iterate the cell values in the range
Iterate the cells in each area
Iterate the cell values in each area
Iterate the rows in each area
Assign each area to an array and iterate the array items - expected to be the best performer, since it is the inverse of the Fast Data Push method.

Before I get to the performance testing, I need to explain why I considered six methods.  It all has to do with the JOIN() function and the difference between and array and a vector.  The JOIN() function expects a one dimension array parameter -- what I will call a vector.  However, assigning range/area to an array always produces a two dimensional array.  Even if you access a single row, it doesn't produce the vector expected by the JOIN() function.  I had to find the fastest method to produce a vector, not just an array.

AreasToArrayToVector -- iterate the areas, transferring each area's data into an array, and then copying the array to a vector.
AreasToArrayToDictionary -- iterate the areas, transferring each area's data into an array, and then copying the array to a dictionary object.
ArrayCellsToVector -- iterate the areas, iterate the cells within each area, transferring each cell's data to a vector.
ArrayCellsToDictionary -- iterate the areas, iterate the cells within each area, transferring each cell's data to a dictionary object.
AreasRowsToVector -- iterate the areas, iterate the rows within each area, transferring each row's data to a vector.
AreasRowsToDictionary -- iterate the areas, iterate the rows within each area, transferring each row's data to a dictionary object.
RangeCellsToVector -- iterate the cells, transferring each cell's data to a vector.
RangeCellsToDictionary -- iterate the cells, transferring each cell's data to a dictionary object.

The following summary performance data reflects the best-to-worst times to retrieve Excel data in Avg/area order.
AVG/Cell  	Avg/Area  	Sum Avg/Cell	Sum Avg/Area	Function
2.1344E-04	6.9581E-03	5.4946E-04	4.8717E-02	AreasToArrayToCollection call time
2.1519E-04	7.0152E-03	5.5329E-04	5.4984E-02	AreaCellsToVector call time 
2.8314E-04	9.2304E-03	6.2914E-04	5.7246E-02	AreasToArrayToVector call time
1.8809E-04	6.1317E-03	5.4603E-04	6.7155E-02	AreasToArrayToCollectionToVector call time
2.3292E-04	7.5934E-03	6.2414E-04	7.9608E-02	AreaCellsToDictionary call time 
2.4519E-04	7.9932E-03	6.1508E-04	8.2653E-02	AreasToArrayToDictionary call time
8.8562E-04	2.8871E-02	2.3699E-03	9.8202E-01	AreasRowsToVector call time 
1.1134E-03	3.6298E-02	2.6251E-03	1.0009E+00	AreasRowsToDictionary call time 
1.9728E-04	6.4314E-03	2.5503E-03	3.2107E+00	AreasToArrayAmp call time 
4.6731E-03	1.5234E-01	2.3274E-02	1.6690E+01	RangeCellsToDictionary_RNG call time 
4.6049E-03	1.5012E-01	2.2864E-02	1.6729E+01	RangeCellsToVector_RNG call time 
4.6420E-03	1.5133E-01	2.3340E-02	1.7398E+01	RangeCellsToCollection call time 
4.6296E-03	1.5093E-01	2.3453E-02	1.7438E+01	RangeCellsToVector call time 
5.2866E-03	1.7234E-01	2.4183E-02	1.7714E+01	RangeCellsToDictionary call time 

Open in new window


The following summary performance data reflects the best-to-worst times to retrieve Excel data in Avg/cell order.
AVG/Cell	Avg/Area	Sum Avg/Cell	Sum Avg/Area	Event
1.8809E-04	6.1317E-03	5.4603E-04	6.7155E-02	AreasToArrayToCollectionToVector call time
2.1344E-04	6.9581E-03	5.4946E-04	4.8717E-02	AreasToArrayToCollection call time
2.1519E-04	7.0152E-03	5.5329E-04	5.4984E-02	AreaCellsToVector call time 
2.4519E-04	7.9932E-03	6.1508E-04	8.2653E-02	AreasToArrayToDictionary call time
2.3292E-04	7.5934E-03	6.2414E-04	7.9608E-02	AreaCellsToDictionary call time 
2.8314E-04	9.2304E-03	6.2914E-04	5.7246E-02	AreasToArrayToVector call time
8.8562E-04	2.8871E-02	2.3699E-03	9.8202E-01	AreasRowsToVector call time 
1.9728E-04	6.4314E-03	2.5503E-03	3.2107E+00	AreasToArrayAmp call time 
1.1134E-03	3.6298E-02	2.6251E-03	1.0009E+00	AreasRowsToDictionary call time 
4.6049E-03	1.5012E-01	2.2864E-02	1.6729E+01	RangeCellsToVector_RNG call time 
4.6731E-03	1.5234E-01	2.3274E-02	1.6690E+01	RangeCellsToDictionary_RNG call time 
4.6420E-03	1.5133E-01	2.3340E-02	1.7398E+01	RangeCellsToCollection call time 
4.6296E-03	1.5093E-01	2.3453E-02	1.7438E+01	RangeCellsToVector call time 
5.2866E-03	1.7234E-01	2.4183E-02	1.7714E+01	RangeCellsToDictionary call time

Open in new window

Detailed Join() and Excel data retrieval performance data is here:
JoinAndExcelData-PerfData.xls

My initial tests were run with VBA code inside an Excel workbook.  Unfortunately, the results were too inconsistent, so I moved my testing to a compiled VB classic environment using Excel automation.  I during my tests, I could see the processor time dominated by Excel and the VB application at roughly a 2:1 ratio.  Think of Excel object references like a database call or I/O operation -- rather CPU intensive.  As expected, the fastest Excel data retrieval methods are found when invoking the fewest Excel object references.  Since much of the VB processing takes place in memory, the Excel object methods and property references could be relatively expensive to the speed of the function.  In my smallest test (163 cells in 5 areas), the fastest function spent 10% of its time with a single reference to get the number of cells.
The VB project is here:
BetterConcat-Data.zip
Conditional Compilation Variable Notes:
TimingDetail = 0 -- only time the function call
TimingDetail = 1 -- time the data structure allocation/deallocation and the function call
TimingDetail = 2 -- time the data structure allocation/deallocation, the loop, and the function call
TimingDetail = 4 -- time everything
boolRunInVB=-1 or boolRunInExcel=-1 -- facilitate path determination of output log file.

A fellow software instructor once said, "Every dot in an object reference is a function call."  I had the opportunity to see this aphorism in action.  One occasion, I replaced
parmRangeAreas.Cells.Count

Open in new window

with
parmRangeAreas.Count

Open in new window

After the change, I could see a noticeable improvement in the execution time of that one call and its function.

Notes:
Each object reference requires a function call.
Each period in a reference requires a function call.
From a performance perspective, each instance of data access is the equivalent to an I/O.
While testing array iteration methods, I discovered that a For Each…Next structure iterates a column at a time rather than a row at a time.

Now that I have the data, I will use the best performing of the concatenate and Excel-access methods for the Better Concatenate Function, BCF().

Solution: The BCF() Function

A substantial amount of testing on two different systems leads me to recommend this function as the fastest and most reliable alternative to Excel's Concatenate function.  It has the following advantages:
Accepts multi-area ranges
Provides automatic insertion of a delimiter string
Allows you to control the order (by-row or by-column) of cell iteration in each area.

Public Function BCF(parmRangeAreas As Range, Optional ByVal parmDelim As String, _
                        Optional ByVal parmConcatByCol As Boolean) As String
    'BCF is a Better Concatenation Function
    'Process: Iterate Areas; transfer the area cell data to a
    '    variant array; copy items to string array,
    '    concatenated and returned.
    'Parameters:
    '    parmRangeAreas is a parentheses-enclosed range area and the BCF()
    '    parmDelim is an optional parameter allowing you to add a
    '       delimiter string between the concatenated cell values.
    '    parmConcatByCol is an optional parameter allowing you to control the
    '       order (by row, by column) that area cells are added to the vector
    Dim rngArea As Range
    Dim lngNext As Long
    Dim lngRow As Long
    Dim lngCol
    Dim lngRowCount As Long
    Dim lngColCount As Long
    Dim varCells() As Variant
    Dim strCells() As String
    Dim varItem As Variant
    
    ReDim strCells(1 To parmRangeAreas.Count)
    
    For Each rngArea In parmRangeAreas.Areas
        varCells = rngArea.Value
        If parmConcatByCol Then
            For Each varItem In varCells
                lngNext = lngNext + 1
                strCells(lngNext) = varItem
            Next
        Else
            lngRowCount = UBound(varCells, 1)
            lngColCount = UBound(varCells, 2)
            For lngRow = 1 To lngRowCount
                For lngCol = 1 To lngColCount
                    lngNext = lngNext + 1
                    strCells(lngNext) = varCells(lngRow, lngCol)
                Next
            Next
        End If
    Next
    
    BCF = Join(strCells, parmDelim)
End Function

Open in new window


As I noted in the comment, the range parameter, parmRangeAreas, must be enclosed in parentheses when referencing a multi-area range.
=BCF((F4:Q4,F6:Q6,F8:Q8))

Open in new window


The parentheses are optional if the range only has one area.
=BCF(F4:Q4)

Open in new window


The optional parameter, parmDelim, allows you to inject a character string between the concatenated cell values.  It would be an easy way to create a comma-separated list or multi-line concatenation.

Notes and Thoughts

Although you could pass a string of the range address, it would require additional coding to make sure the cell is recalculated properly, since Excel wouldn't recognize the string as a cell reference.
Since I had to feed a lot of areas into my functions, I ran up against the 255 character limit on range references.  As a work-around, I created the CreateMetaRange() function you will find in the VB code.  I don't expect anyone to need to concatenate 1600+ areas with 60k+ cells, but it can be done.
In my testing, both date and string values concatenate properly.  However, you might need to enhance the BCF() function to format date values.
There is a follow-up article with a general purpose string concatenation function that can be used in all VBA environments.

The workbook containing the source data for performance timing is here:
Compiled-Timing-ResultsHD.xls

Kudos to matthewspatrick for his Word template, used in the writing of this article.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
If you liked this article and want to see more from this author,  please click here.

If you found this article helpful, please click the Yes button near the:

      Was this article helpful?

label that is just below and to the right of this text.   Thanks!
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
16
Author:aikimark
Ask questions about what you read
If you have a question about something within an article, you can receive help directly from the article author. Experts Exchange article authors are available to answer questions and further the discussion.
Get 7 days free