Link to home
Start Free TrialLog in
Avatar of Coleen Sullivan
Coleen SullivanFlag for United States of America

asked on

Query Help - possible VBA solution? Array?

Hello Experts,

I need help with an Access query that checks a 'Business_Name' field in a table with approximately 1.3m records, against a table of about 300 corporate identifier codes.  
My query provides results, but they are not complete.  I've included both tables (unjoined) and I'm using an InStr function to get these results.  I have a feeling there is a solution for this but that it will involve VBA, which I am NOT proficient in!

I've provide my SQL below.  FYI -  [business_name_tx] is in table "T_UCVI" and  [Code] is in table "T_BusinessCodes"

My current SQL:  
SELECT DISTINCTROW T_UCVI.ID, T_UCVI.BUSINESS_NAME_TX, InStr(1,[business_name_tx],([code])) AS Corp1, InStr(2,[business_name_tx],(" " & [code])) AS Corp2
FROM T_BusinessCodes, T_UCVI
GROUP BY T_UCVI.ID, T_UCVI.BUSINESS_NAME_TX, InStr(1,[business_name_tx],([code])), InStr(2,[business_name_tx],(" " & [code]))
HAVING (((InStr(1,[business_name_tx],([code])))=1)) OR (((InStr(2,[business_name_tx],(" " & [code])))>1))
ORDER BY T_UCVI.BUSINESS_NAME_TX;

Thanks for all help!
Avatar of aikimark
aikimark
Flag of United States of America image

I've included both tables

I don't see any attachments.
Avatar of Coleen Sullivan

ASKER

I was referring to my query where I included both tables - I did not attach anything here - provided the sql

If some of the names start with spaces, run an update query first to clean up that mess.


What is missing?  We can't see your results.  We have no idea what your data looks like so we have no idea what might be wrong with your query.

I did not attach anything here - provided the sql

Not useful enough for us to help you - we aren't psychic.

Without sample data is worse than guessing...Your data could be everything..

One thing is certain...the Instr is probably a performance killer...

Hi Coleen


   As the others had said, it's a bit difficult to say if there is a better way without seeing an example of the data from both tables.   Just a few rows from each would be helpful.


  With that said:


a. You never want a query without a join if you can avoid it.   Even a join using a WHERE expression is better than nothing.

b.  HAVING is always applied after the records are fetched.   Always try and put any restrictions on the data as a WHERE.


Jim.

My query provides results, but they are not complete.
How do you know that?
Thank you for all the replies.  I've attached a sample database.  
   T_BusinessCodes: Contains 500+ scenarios of potential codes/phrases that would indicate a business name
   T_UCVI: Contains a subset of Business_Names
   q_CpChk: is the query I am intersted in optimizing
   q_delta: displays those records that should have been captured in the q_CpChk query

The ultimate goal is to identify Business Names from Individual's Names.  It is understood this is a first-pass with manual reviews being done on the results, however this should capture more than 90% of the business names.
Query q_Cpchk is capturing most instances.  The reords it doesn't capture (q_delta) contain messy data, i.e. they contain a phrase from T_BusinessCodes in the business_name, but they are not preceeded with, or followed by spaces. Perhaps I should be using something other than InStr?
I know I should never have unjoined tables in a query, I am concerned about crashing though when using 'real data', which is over 1.5m records/ month.   I have in my head this may a good use for any array - but thats as far as my knowledge base goes!
Thanks in advance.
EE_Sample.accdb

Here is one option to test.  I'm going to post another one shortly.

Add a module to your database and paste the following routine:
Option Explicit

Public Function PartialCodeMatch(ByVal parmBusinessCode As Variant) As Boolean
    Dim vParts As Variant
    Dim vItem As Variant
    If IsNull(parmBusinessCode) Then
        PartialCodeMatch = False
        Exit Function
    End If
    vParts = Split(parmBusinessCode, " ")
    For Each vItem In vParts
        If IsNull(DLookup("Code", "[T_BusinessCodes]", "code='" & Replace(vItem, "'", "''") & "'")) Then
        Else
            PartialCodeMatch = True
            Exit Function
        End If
    Next
    PartialCodeMatch = False
End Function

Open in new window


Run this query.  It took about 10 seconds on your sample database.
SELECT T_UCVI.ID, T_UCVI.BUSINESS_NAME
FROM T_UCVI
WHERE PartialCodeMatch([T_UCVI].[BUSINESS_NAME])=True
ORDER BY T_UCVI.ID;

Open in new window

Since we only have about 500 (not 300) codes to check, it was worth exploring a regular expression solution.  This is a much faster solution.

Add this routine to your module.
Public Function RegexMatch(ByVal parmBusinessCode As Variant) As Boolean
    Static oRE As Object
    Dim rs As Recordset
    Dim strCodes As String
    
    If oRE Is Nothing Then
        Set rs = DBEngine(0)(0).OpenRecordset("Select Code from [T_BusinessCodes]")
        Do Until rs.EOF
            strCodes = strCodes & "|" & Trim(rs!Code)
            rs.MoveNext
        Loop
        Set oRE = CreateObject("vbscript.regexp")
        oRE.Global = False
        oRE.ignorecase = True
        oRE.pattern = "\b(?:" & Mid(strCodes, 2) & ")\b"
    End If
    
    If IsNull(parmBusinessCode) Then
        RegexMatch = False
        Exit Function
    End If
    
    RegexMatch = oRE.test(parmBusinessCode)

End Function

Open in new window


Test this query.
SELECT T_UCVI.ID, T_UCVI.BUSINESS_NAME
FROM T_UCVI
WHERE RegexMatch([T_UCVI].[BUSINESS_NAME])=True
ORDER BY T_UCVI.ID;

Open in new window


This solution produces more matches than the first one I posted.  You will need to test and validate the results.
I get error:  "Undefined Function RegexMatch in expression"
Out of curiosity, how many matches does your second solution produce?
My current query q_CpChk produces 4124 results out of 4131 records.  If your solution produces fewer than this, it is not helpful.
As mentioned, the 7 records my query does not produce (q_delta) contain a valid business code within the name, but InStr is not picking them up.
Did you add the routine (code) I posted?
Did you save the module after posting the code?

The second produces 3425 rows.  The first solution produces 3333 rows.

Please invoke that second solution and post a row that your solution finds that neither of my solutions returns.
All code is added and module saved -
You can see the results of my query by running q_CpChk
Since I can't run your query, I cannot provide the delta for you
Since I can't run your query
You are receiving a Undefined Function RegexMatch in expression?
correct
Please post the sample database where you've pasted the regexmatch function
EE_Sample.accdb
Your query: q_eeSolution2
Module: Mod_ee
thanks
1. Mystery solved.  You created a Class Module, not a regular module.  We need to resolve your exact matching requirements.

2.  Are you looking for exact word matches?
In your original example, rows 4-5 match the "Usa" code, but that only matches the start of the word in the business name.  Is that what you really need?  I have been trying to match the entire word.
BUSINESS_NAME
**************N├O USAR******************
*NON Usare

Open in new window


===========================
Tweaking the regex pattern to only match the leading characters of the word returns 4124 rows.

Please test the attached database.
EE_Sample--1-.accdb
I am using the InStr function to find ANY occurrance of [code] within the business_name field, regardless of start-location. I'm not clear how this could be misunderstood.
I'll have a look at your database tomorrow, although it sounds as if your solution is just producing the same results as my original query - and it is still not capturing the 7 other records,.

Just to understand something...the problem is the No or records ...or the speed?
And for clarification how many T_BusinessCodes you expect to have ...if they are just 500 maybe things could get some speed.
Which 7 records/rows are you missing?
I'm not clear how this could be misunderstood.
For me, it isn't a matter of "understanding", rather it is a matter of "why would someone do such a thing in the first place".  I'm sure that you or your client has a reason.
ASKER CERTIFIED SOLUTION
Avatar of aikimark
aikimark
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Thank you!
Your last sample captures as many records as I believe we can expect -and- it is much faster than my query - great!
The two records it does not capture are:
     4128. name S.R.L
     4130. name Sαrl            aux mains de Mr name

Record 4128 has a punctuation error - missing the last ".".  I can accommodate for this in the T_BusinessCodes table - not a big deal.
Record 4130 contains the word "Sarl" but it's not being picked up. I'm thinking the "a" is being interpreted as another character based on the font for that one character. The input for this data can be from anywhere in the world, so there is a possibility some characters don't translate properly I suppose.
Bottom line - this is a much faster and more efficient solution than what I had, thank you.
Since some of your codes have period characters, please replace your current RegexMatch routine with this one.  It forces the period in character(s) in the code to match a period character.
Public Function RegexMatch(ByVal parmBusinessCode As Variant) As Boolean
    Static oRE As Object
    Dim rs As Recordset
    Dim strCodes As String
    
    If oRE Is Nothing Then
        Set rs = DBEngine(0)(0).OpenRecordset("Select Code from [T_BusinessCodes]")
        Do Until rs.EOF
            strCodes = strCodes & "|" & Replace(rs!Code, ".", "\.")
            rs.MoveNext
        Loop
        Set oRE = CreateObject("vbscript.regexp")
        oRE.Global = False
        oRE.ignorecase = True
        oRE.pattern = Mid(strCodes, 2)
    End If
    
    If IsNull(parmBusinessCode) Then
        RegexMatch = False
        Exit Function
    End If
    
    RegexMatch = oRE.test(parmBusinessCode)

End Function

Open in new window


================================
You are correct that the business name in row 4130 is not in the standard ASCII set.  It is an extended ASCII character with a value of 225.
It is possible to match extended ASCII characters as well as unicode characters (> 255).  However, it isn't a trivial task.
Thank you. I replaced the RegexMatch routine as you suggested, however it is still not picking up record 4128.
I beleive I understand what what this line of code is doing, but would you please clarify for me?
     strCodes = strCodes & "|" & Replace(rs!Code, ".", "\.") 

Thanks also for the explanation on the extended ASCII character set.  This is not a huge issue at this time, so I will leave it for now.
It is possible I may revisit it in the future though.
Replace(rs!Code, ".", "\.") 

Open in new window

In the above piece of code, I'm replacing every period character with two characters ("\."), a backslash followed by a period character.

In regex patterns, the backslash character is an escape for the character that follows it.  This is necessary in cases where the character we want to match is one of those we use for regex pattern definitions.  For instance, if we wanted to match parentheses characters or square bracket characters, we would precede them with a backslash character.

===============================
If you need to match the extended ASCII set, you would need to change the (mostly) vowels in the codes to accommodate differently accented vowels in the 192-255 range.  Even though the regex object I'm instantiating ignores case, it won't address this (extended ASCII) character encoding issue.  Here are the extended Latin letter characters (reference: https://www.ascii-code.com/ ).

Latin capital letter A with grave
Latin capital letter A with acute
Latin capital letter A with circumflex
Latin capital letter A with tilde
Latin capital letter A with diaeresis
Latin capital letter A with ring above
Latin capital letter AE
Latin capital letter C with cedilla
Latin capital letter E with grave
Latin capital letter E with acute
Latin capital letter E with circumflex
Latin capital letter E with diaeresis
Latin capital letter I with grave
Latin capital letter I with acute
Latin capital letter I with circumflex
Latin capital letter I with diaeresis
Latin capital letter ETH
Latin capital letter N with tilde
Latin capital letter O with grave
Latin capital letter O with acute
Latin capital letter O with circumflex
Latin capital letter O with tilde
Latin capital letter O with diaeresis
Latin capital letter O with slash
Latin capital letter U with grave
Latin capital letter U with acute
Latin capital letter U with circumflex
Latin capital letter U with diaeresis
Latin capital letter Y with acute
Latin capital letter THORN
Latin small letter sharp s - ess-zed
Latin small letter a with grave
Latin small letter a with acute
Latin small letter a with circumflex
Latin small letter a with tilde
Latin small letter a with diaeresis
Latin small letter a with ring above
Latin small letter ae
Latin small letter c with cedilla
Latin small letter e with grave
Latin small letter e with acute
Latin small letter e with circumflex
Latin small letter e with diaeresis
Latin small letter i with grave
Latin small letter i with acute
Latin small letter i with circumflex
Latin small letter i with diaeresis
Latin small letter eth
Latin small letter n with tilde
Latin small letter o with grave
Latin small letter o with acute
Latin small letter o with circumflex
Latin small letter o with tilde
Latin small letter o with diaeresis
Latin small letter o with slash
Latin small letter u with grave
Latin small letter u with acute
Latin small letter u with circumflex
Latin small letter u with diaeresis
Latin small letter y with acute
Latin small letter thorn
Latin small letter y with diaeresis


Both the standard and extended alternatives would need to be placed into the code we use for the pattern.

Example:
Code = Sarl
matching regex pattern = S[aÀÁÂÃÄÅàáâãäå]rl

As I stated in an earlier comment, it can be done but it isn't trivial.
Thanks. Yes, I will definitely leave trying to match the extended ASCII characters alone for now!
Thanks also for your explanation regarding the line of code:  
strCodes = strCodes & "|" & Replace(rs!Code, ".", "\.")
In regex patterns, the backslash character is an escape for the character that follows it.  This is necessary in cases where the character we want to match is one of those we use for regex pattern definitions.
I thought you were trying to capture record 4128 which was missing the last period. But if it's not there, how can you test for it, right?
All good. Thanks again for your help and insight.


SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial