Want to protect your cyber security and still get fast solutions? Ask a secure question today.Go Premium


Excel VBA Efficient Method to Preform Multiple Quick Searches

Posted on 2012-09-03
Medium Priority
Last Modified: 2012-09-07
Hi Experts, I have kind of a challenging problem that I'm trying to figure out.  I need an efficient way to perform a series of fast searches through a list of about 100,000 items in an Excel workbook.  I'm writing a program that will return a set of the closest matches to a given string of characters within a list; it will eventually iterate through all 100,000 items and return the top five closest matches based on a similar matching algorithm.

The issue that I'm having is getting the program to run efficiently.  If I have a search string of 11 characters (for example 'TEST STRING') I need to scan my entire list searching for each possible sub string within the search string (limiting it to a minimum five consecutive characters). (e.g., 'TEST STRING', 'TEST STRIN', 'EST STRING', 'TEST STRI', 'EST STRIN', 'ST STRING'... all the way down to 5 consecutive characters... 'TEST ', 'EST S', 'ST ST', 'T STR', ' STRI', 'STRIN', 'TRING'). And return a set of items that contains each of those sub strings.

As you can see this requires A LOT of searching, and the number of searches grows exponentially depending on the length of the search string.  Initially I did this with brute force by looping through my entire list for each search (which actually worked OK when I only had about 2,000 items).  I've tried creating a database of the items in access then executing SQL statements with a LIKE clause, but these are still pretty inefficient.  I've had some luck using the .find method, but this still takes nearly 10 hours to process the entire list (and I've been seeing some buggy results).

I'm looking for suggestions on different methods of preforming the searches.  I think that there should be some way that I can create a hash which will help me quickly return a set of all items that contain specific subset of characters.  But I haven't worked much with hashes and I haven't been able to figure something out.

-Apologies for the long request; I've actually been asking smaller chunks of this question along the way, but I just keep hitting roadblocks.  Thanks in advance
Question by:JamesCbury
  • 3
  • 2
LVL 46

Accepted Solution

aikimark earned 2000 total points
ID: 38364355
I think you need to implement an algorithm like the Levenshtein

It isn't the only algorithm in that class, so you might want to look at the others linked to at the bottom of the wiki article.

Your current algorithm would be faster if you performed multiple tests with each cell visited.  There is a much higher cost of iterating cells in a range than there is doing your InStr() test.  For each cell visited, you should assign the cell value to a local string variable and do all of your pattern matching using local string variables.

Also, not knowing how long your test strings are, it might be worth testing in a different order:
1. test the entire string, if not matched then 2, else end (found exact match)
2. test the five string combinations, if not matched, end (no match found), else 3
3. use your existing algorithm from Len(string)-1 down to 6

The reason for step 2 is that it prevents you from trying your existing algorithm if there are no possible matches at length 5.  You would proceed to step 3 once you found your first match at step 2.

There might be other efficiencies you can apply to your search.  I would have to know more about the data before I could make any further performance recommendations.
LVL 29

Expert Comment

ID: 38364648
Can you upload a sample spreadsheet and also what you expect your results to look like?

Author Comment

ID: 38366462
aikimark, thanks for the suggestions - I like that approach.  I'm setting the program up now to use the .find method for each five character sub string and create a much smaller collection of  strings that I have to search through.  This is the first that I've heard of Levenshtein distance, but I think that may be a more efficient way for me to perform my string comparison.  Do you happen to have a VBA function that I could plug into my code (similar to the pseudo code on the wiki site)

Sorry IrogSinta, I cannot post the spreadsheet as there is quite a bit of confidential information contained in there.  suffice it to say that the list is about 100,000 items long and is made up of vendor names, the max string length is 45 characters.
Concerto Cloud for Software Providers & ISVs

Can Concerto Cloud Services help you focus on evolving your application offerings, while delivering the best cloud experience to your customers? From DevOps to revenue models and customer support, the answer is yes!

Learn how Concerto can help you.

LVL 46

Expert Comment

ID: 38366562

I no longer have that code.  Our local .Net user group has code challenges and one of them was a Levenshtein algorithm.  I based what I did on (meta-)code I found in a Google search.  I optimized it because we were being judged on speed.

I would expect that you will be using the same character vector at the start of each item comparison.  I don't think it will make much performance difference whether the non-changing item is along the left edge or along the top.  However, I think it will be simpler to code if the non-changing item is along the left side.

You can also apply the matching to different parts of your list in a parallel fashion.  If you have a quad core processor, you can break your big list into 25k item lists and run the fuzzy matching process on each of the four processors.  You can do this on as many PCs as you have access to.  Even your current algorithm will end in a reasonable time with this approach.

In your question text, you mentioned hashes.  You could use hashes in the #2 step I've mentioned.  You will generate a long integer hash of the five character substrings of all the items in both lists.  If your max string length is 45, let's assume an average of 40 characters.  You will have (40-5=)35 substrings that you will hash for the big list items.  I don't know how many items you have in the small list, but infer that it is more than 2000 (with unknown length profile).  The hashes will be sorted and indexed for each item in each list.  Instead of iterating the length-five substrings in step #2, you would do a merge match of the two ordered lists of hashes.  You would need an additional 35 MB for the big list hash values.

Alternative 1: you could have an array of dictionary objects, using the hash values as keys (beware of duplicates, preventing with the .Exists method before adding). You don't have to do any sorting, but there will be more overhead for the dictionary object than just an array and hashed lookups are quick and require less coding than merge match.

Alternative 2: you could take advantage of the hashing that is built into the dictionary object and store all the possible substrings for the big list items.  You would still use the same algorithm I've described, but use the .Exists method instead of substring.  From a memory and efficiency standpoint, you could populate the dictionary object for each of the big list items as you iterate it and clear the dictionary object at the end of each iteration (.RemoveAll method).  I expect that there will be around 630 dictionary items added for each iteration.

In all of this, I assume you are keeping a top-5 list of best matches (in the small list) for each item in the big list.

Author Comment

ID: 38376248
Apologies for the delay in accepting your solution - there was a lot here to digest, but this is exactly the type of guidance I was looking for - THANKS.

I would be interested in learning more about how to split my processing across multiple processors (can I do that from within VB?)  that would save me a lot of time in the future.
LVL 46

Expert Comment

ID: 38376750
I don't know how your code is packaged.  You need to communicate with the code where it should start in the big list and how many big list items to process.  You might also need to tell it where to put the results, but that might be derived from the other parameter.

You could manually start the process on different PCs, changing some constants in the code.

If running on a multi-core PC, you could use the START command, using its /AFFINITY parameter.

You also have the option of moving these two lists out into separate ASCII files if that helps you distribute the work.  Sometimes, Office documents open exclusive.

Featured Post

Industry Leaders: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

This article shows how to get a list of available printers for display in a drop-down list, and then to use the selected printer to print an Access report or a Word document filled with Access data, using different syntax as needed for working with …
A quick solution showing how to control and open a POS Cash Register Drawer using VBA with MS Access.
Do you want to know how to make a graph with Microsoft Access? First, create a query with the data for the chart. Then make a blank form and add a chart control. This video also shows how to change what data is displayed on the graph as well as form…
Enter Foreign and Special Characters Enter characters you can't find on a keyboard using its ASCII code ... and learn how to make a handy reference for yourself using Excel ~ Use these codes in any Windows application! ... whether it is a Micr…

564 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question