Excel VBA Efficient Method to Perform Multiple Quick Searches

Posted on 2012-09-03
Last Modified: 2012-09-07
Hi Experts, I have kind of a challenging problem that I'm trying to figure out.  I need an efficient way to perform a series of fast searches through a list of about 100,000 items in an Excel workbook.  I'm writing a program that will return a set of the closest matches to a given string of characters within a list; it will eventually iterate through all 100,000 items and return the top five closest matches based on a similarity-matching algorithm.

The issue that I'm having is getting the program to run efficiently.  If I have a search string of 11 characters (for example 'TEST STRING'), I need to scan my entire list searching for each possible substring of the search string, limiting it to a minimum of five consecutive characters (e.g., 'TEST STRING', 'TEST STRIN', 'EST STRING', 'TEST STRI', 'EST STRIN', 'ST STRING'... all the way down to the five-character substrings 'TEST ', 'EST S', 'ST ST', 'T STR', ' STRI', 'STRIN', 'TRING'), and return the set of items that contains each of those substrings.
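
For reference, here's a quick VBA sketch of the substring enumeration I described (the routine name is just for illustration):

Sub ListSubstrings(ByVal s As String)
    ' Print every substring of s that is at least five
    ' consecutive characters long, longest first.
    Dim L As Long, start As Long
    For L = Len(s) To 5 Step -1
        For start = 1 To Len(s) - L + 1
            Debug.Print Mid$(s, start, L)
        Next start
    Next L
End Sub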

As you can see this requires A LOT of searching, and the number of searches grows rapidly (roughly with the square of the search string's length).  Initially I did this with brute force by looping through my entire list for each search, which actually worked OK when I only had about 2,000 items.  I've tried creating a database of the items in Access and then executing SQL statements with a LIKE clause, but those are still pretty inefficient.  I've had some luck using the .Find method, but it still takes nearly 10 hours to process the entire list (and I've been seeing some buggy results).

I'm looking for suggestions on different methods of performing the searches.  I think that there should be some way I can create a hash that would let me quickly return the set of all items containing a specific substring.  But I haven't worked much with hashes and I haven't been able to figure anything out.

Apologies for the long request; I've actually been asking smaller chunks of this question along the way, but I just keep hitting roadblocks.  Thanks in advance.
Question by:JamesCbury
    LVL 44

    Accepted Solution

    I think you need to implement an algorithm like the Levenshtein distance (https://en.wikipedia.org/wiki/Levenshtein_distance).

    It isn't the only algorithm in that class, so you might want to look at the others linked to at the bottom of the wiki article.

    Your current algorithm would be faster if you performed multiple tests with each cell visited.  Iterating cells in a range costs far more than your InStr() test does.  For each cell visited, assign the cell value to a local string variable and do all of your pattern matching against local string variables.
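
    Something along these lines (a sketch; the sheet and range names are placeholders for your own):

    Sub ScanList(ByVal pattern As String)
        ' Pull the whole list into a Variant array in one call --
        ' far cheaper than touching each cell through the Range object.
        Dim data As Variant, r As Long, itemText As String
        data = Worksheets("List").Range("A1:A100000").Value
        For r = 1 To UBound(data, 1)
            itemText = data(r, 1)                 ' local string variable
            If InStr(1, itemText, pattern, vbTextCompare) > 0 Then
                Debug.Print itemText              ' candidate match
            End If
        Next r
    End Sub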

    Also, not knowing how long your test strings are, it might be worth testing in a different order (sketched in code after this list):
    1. test the entire string, if not matched then 2, else end (found exact match)
    2. test the five-character substrings, if not matched, end (no match found), else 3
    3. use your existing algorithm from Len(string)-1 down to 6

    The reason for step 2 is that it keeps you from running your existing algorithm when no match is possible: every substring of length six or more contains a length-five substring, so a miss at every length-five test rules out matches at all greater lengths.  You would proceed to step 3 once you found your first match at step 2.
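
    Here is the staged test in sketch form (InStr() stands in for whatever match test you are using):

    Function StagedMatch(ByVal item As String, ByVal search As String) As Long
        ' Returns the longest length at which a substring of 'search'
        ' is found in 'item', or 0 if nothing matches at length five.
        Dim L As Long, start As Long
        ' Step 1: the entire string
        If InStr(1, item, search, vbTextCompare) > 0 Then
            StagedMatch = Len(search)
            Exit Function
        End If
        ' Step 2: the length-five filter
        Dim hitAtFive As Boolean
        For start = 1 To Len(search) - 4
            If InStr(1, item, Mid$(search, start, 5), vbTextCompare) > 0 Then
                hitAtFive = True
                Exit For
            End If
        Next start
        If Not hitAtFive Then Exit Function       ' no match possible
        ' Step 3: the existing algorithm, Len(search)-1 down to 6
        For L = Len(search) - 1 To 6 Step -1
            For start = 1 To Len(search) - L + 1
                If InStr(1, item, Mid$(search, start, L), vbTextCompare) > 0 Then
                    StagedMatch = L
                    Exit Function
                End If
            Next start
        Next L
        StagedMatch = 5                           ' only length five matched
    End Function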

    There might be other efficiencies you can apply to your search.  I would have to know more about the data before I could make any further performance recommendations.
    LVL 29

    Expert Comment

    Can you upload a sample spreadsheet and also what you expect your results to look like?

    Author Comment

    aikimark, thanks for the suggestions - I like that approach.  I'm setting the program up now to use the .Find method for each five-character substring to create a much smaller collection of strings that I have to search through.  This is the first I've heard of Levenshtein distance, but I think that may be a more efficient way for me to perform my string comparison.  Do you happen to have a VBA function that I could plug into my code (similar to the pseudocode on the wiki site)?

    Sorry IrogSinta, I cannot post the spreadsheet as there is quite a bit of confidential information in there.  Suffice it to say that the list is about 100,000 items long and is made up of vendor names; the max string length is 45 characters.
    LVL 44

    Expert Comment


    I no longer have that code.  Our local .Net user group has code challenges and one of them was a Levenshtein algorithm.  I based what I did on (meta-)code I found in a Google search.  I optimized it because we were being judged on speed.

    I would expect that you will be using the same character vector at the start of each item comparison.  I don't think it will make much performance difference whether the non-changing item runs along the left edge or along the top.  However, I think it will be simpler to code if the non-changing item is along the left side.
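
    A plain, unoptimized VBA sketch of the standard dynamic-programming form, computed row by row with just two rows in memory:

    Function Levenshtein(ByVal s As String, ByVal t As String) As Long
        ' Classic edit distance; s is the non-changing item.
        Dim prev() As Long, curr() As Long
        Dim i As Long, j As Long, cost As Long, d As Long
        ReDim prev(0 To Len(t))
        ReDim curr(0 To Len(t))
        For j = 0 To Len(t)
            prev(j) = j
        Next j
        For i = 1 To Len(s)
            curr(0) = i
            For j = 1 To Len(t)
                If Mid$(s, i, 1) = Mid$(t, j, 1) Then cost = 0 Else cost = 1
                d = prev(j) + 1                                       ' deletion
                If curr(j - 1) + 1 < d Then d = curr(j - 1) + 1       ' insertion
                If prev(j - 1) + cost < d Then d = prev(j - 1) + cost ' substitution
                curr(j) = d
            Next j
            prev = curr     ' VBA copies dynamic arrays on assignment
        Next i
        Levenshtein = prev(Len(t))
    End Function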

    You can also apply the matching to different parts of your list in parallel.  If you have a quad-core processor, you can break your big list into four 25k-item lists and run the fuzzy-matching process on each of the four cores.  You can do this on as many PCs as you have access to.  Even your current algorithm would finish in a reasonable time with this approach.

    In your question text, you mentioned hashes.  You could use hashes in the #2 step I've mentioned.  You would generate a long-integer hash of each five-character substring of all the items in both lists.  If your max string length is 45, let's assume an average of 40 characters.  That gives you (40-5+1=) 36 substrings to hash per big-list item.  I don't know how many items you have in the small list, but I infer that it is more than 2,000 (with an unknown length profile).  The hashes would be sorted and indexed for each item in each list.  Instead of iterating the length-five substrings in step #2, you would do a merge match of the two ordered lists of hashes.  You would need roughly 14 MB more for the big-list hash values (100,000 items x 36 four-byte Longs).
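
    For illustration, one way to produce a long-integer hash for each five-character substring (the hash formula here is my own choice; any well-mixed function will do):

    Function FiveCharHashes(ByVal s As String) As Collection
        ' Returns one Long hash per five-character substring of s.
        ' The prime modulus keeps the arithmetic inside a 32-bit Long.
        Dim result As New Collection
        Dim start As Long, k As Long, h As Long
        For start = 1 To Len(s) - 4
            h = 0
            For k = 0 To 4
                h = (h * 31 + Asc(Mid$(s, start + k, 1))) Mod 15485863
            Next k
            result.Add h
        Next start
        Set FiveCharHashes = result
    End Function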

    Alternative 1: you could have an array of Dictionary objects, using the hash values as keys (beware of duplicates; guard with the .Exists method before adding).  You don't have to do any sorting.  There is more overhead to the Dictionary object than to a plain array, but hashed lookups are quick and require less coding than a merge match.
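
    A sketch of Alternative 1, reusing the FiveCharHashes() sketch above (a late-bound Scripting.Dictionary, so no library reference is required):

    Sub AddItemHashes(ByVal s As String, ByVal dict As Object)
        ' Caller creates dict once: CreateObject("Scripting.Dictionary").
        ' .Exists guards against duplicate keys from repeated substrings.
        Dim h As Variant
        For Each h In FiveCharHashes(s)
            If Not dict.Exists(h) Then dict.Add h, True
        Next h
    End Sub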

    Alternative 2: you could take advantage of the hashing that is built into the Dictionary object and store all the possible substrings of each big-list item.  You would still use the same algorithm I've described, but use the .Exists method instead of an InStr() substring test.  From a memory and efficiency standpoint, you could populate the Dictionary object for each big-list item as you iterate it and clear the Dictionary at the end of each iteration (the .RemoveAll method).  With 40-character items, expect around 666 dictionary entries per iteration (the count of substrings of length five or more).
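
    In sketch form (again a late-bound Scripting.Dictionary; the function name is illustrative):

    Function BestSubstringLen(ByVal bigItem As String, ByVal search As String, ByVal dict As Object) As Long
        ' dict is created once by the caller with
        ' CreateObject("Scripting.Dictionary") (set .CompareMode there
        ' if you want case-insensitive keys) and is cleared here
        ' between big-list items via .RemoveAll.
        Dim L As Long, start As Long
        dict.RemoveAll
        ' Store every substring of the big-list item, length 5 and up.
        For L = 5 To Len(bigItem)
            For start = 1 To Len(bigItem) - L + 1
                If Not dict.Exists(Mid$(bigItem, start, L)) Then
                    dict.Add Mid$(bigItem, start, L), True
                End If
            Next start
        Next L
        ' Probe with the search string's substrings, longest first,
        ' using .Exists instead of InStr().
        For L = Len(search) To 5 Step -1
            For start = 1 To Len(search) - L + 1
                If dict.Exists(Mid$(search, start, L)) Then
                    BestSubstringLen = L
                    Exit Function
                End If
            Next start
        Next L
    End Function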

    In all of this, I assume you are keeping a top-5 list of best matches (in the small list) for each item in the big list.

    Author Comment

    Apologies for the delay in accepting your solution - there was a lot here to digest, but this is exactly the type of guidance I was looking for - THANKS.

    I would be interested in learning more about how to split my processing across multiple processors (can I do that from within VBA?).  That would save me a lot of time in the future.
    LVL 44

    Expert Comment

    I don't know how your code is packaged.  You need to communicate to the code where it should start in the big list and how many big-list items to process.  You might also need to tell it where to put the results, but that could be derived from the other parameters.

    You could manually start the process on different PCs, changing some constants in the code.

    If running on a multi-core PC, you could use the START command with its /AFFINITY parameter, for example:
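
    A small batch file like this pins one Excel instance per core on a quad-core machine (the workbook names are illustrative; the masks are hex bit masks: 1 = core 0, 2 = core 1, 4 = core 2, 8 = core 3):

    start "" /affinity 1 excel.exe "C:\jobs\match_part1.xlsm"
    start "" /affinity 2 excel.exe "C:\jobs\match_part2.xlsm"
    start "" /affinity 4 excel.exe "C:\jobs\match_part3.xlsm"
    start "" /affinity 8 excel.exe "C:\jobs\match_part4.xlsm"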

    You also have the option of moving the two lists out into separate ASCII files if that helps you distribute the work.  Sometimes Office documents are opened with an exclusive lock, which would keep the other instances from reading them.
