We help IT Professionals succeed at work.
Troubleshooting Question

Revising a Regex Function: MS Access

47 Views
Last Modified: 2020-11-20
DD_Proj_Ex.accdb
Hello Experts,
I need help revising a regex function that was created for me earlier this year by aikimark: https://www.experts-exchange.com/questions/29174458/Query-Help-possible-VBA-solution-Array.html .  The link points to the original solution however the attached db contains an updated version of the function.
 
The purpose of the function is to identify records in [T_MMA].[Business_Names_TX]  with business/corporate names –vs- individual/proper names, based on a table of business codes, [T_BusinessCodes].[Code], which consists of abbreviations, single words, or phrases.  This table is updated frequently based on new business names that are not being captured.  
 
I have been running this function every month on data from non-US countries, which have consisted (mostly) of true corporate names.  This month however, I have a set of US data with many individual/proper names and I realize something is not right as the function is capturing every record.  
 
I believe the VBA code is currently written to match the [Code] “anywhere within” [Business_Names_TX].  For more accuracy however, I want to change this so that a match will occur only if the entire [Code] is found in [Business_Names_TX], using spaces to isolate the [Code] in one of three scenarios:
 
Beginning of Field:      [Code]_space_*any character*
Within Field:                *any character*_space_[Code]_space_*any character*
End of field:                 *any character*_space_[Code]

To be clear, there should be no adjustment to the Codes [Code] in the T_BusinessCodes table.

Query ‘q_MMA_Updt’ in the attached file shows those records the Regex function is identifying as ‘Corp’ but that I have identified (manually) as ‘Ind’.  Of the first 300 records that are selected, I see only 3 that should be included:  
110432: “– ingaged”;
129319: “citizens”;
131249: “custom” or “components”.
 
Thanks in advance for any help.
Coleen
Comment
Watch Question

aikimarkSocial distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014
looking at it.  thanks for posting
aikimarkSocial distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014
First, I'm going to streamline the SQL.
SELECT T_MMA.ID, T_MMA.BizType, T_MMA.BUSINESS_NAME_TX, 
       T_MMA.BizCd, "Corp" AS NewCd

FROM T_MMA

WHERE (((T_MMA.BizType) Is Null) AND 
((T_MMA.BUSINESS_NAME_TX) Not Like "*[0-9]*" AND 
((T_MMA.BizCd) Like "Ind*") AND 
((InStr(1,[business_name_tx]," "))>0) AND 
((RegexMatch([T_MMA].[BUSINESS_NAME_TX]))=True))) 

OR 

(((T_MMA.BizType) Not Like "*D*") AND 
((T_MMA.BUSINESS_NAME_TX) Not Like "*[0-9]*" AND 
((T_MMA.BizCd) Like "Ind*") AND 
((RegexMatch([T_MMA].[BUSINESS_NAME_TX]))=True)));

Open in new window

aikimarkSocial distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014
You have over 6.7 times as many patterns that you did in the original question.  I need to do some testing to see what is getting generated.  There may be a scaling issue here.  I don't know at this poing.
Coleen SullivanIndependent Contractor

Author

Commented:
Yes, the number of codes has increased significantly.  I wondered if that had something to do with the problem.

aikimarkSocial distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014
I think I had originally created a pattern that matched on word-letter breaks (\bpattern\b).  I might need to return to that.

Also, there might be some duplicate code patterns that could be eliminated.  I'll need to check that as well.
Coleen SullivanIndependent Contractor

Author

Commented:
OK.  I don't recall seeing  (\bpattern\b) but if it works - great!  I know I originally wanted to match the "start" of any word in [business_name_tx]  with [code], so that I could reduce the number of code patterns to use, but I was coming up with too may incorrect matches, i.e. "France" was matching "Frances", etc.  and many of the 2-character abbreviations were matching a multitude of records, so I think you tried to allow for that.  
There shouldn't be any duplicates in T_Businesscodes.Code; I have a query to use only unique values in my main db.
aikimarkSocial distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014
The VBA code does some cleaning up of the business code text.  I think it is possible that there may be duplicates after this clean up.
Coleen SullivanIndependent Contractor

Author

Commented:
What is it doing?  
I'm leary of any adjustment to the codes aside from adding preceding or trailing spaces...
aikimarkSocial distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014
I don't recall seeing  (\bpattern\b)

The \b in regular expressions denotes a boundary between a word character (0-9A-Za-z_) and a non-word character.  This should handle beginning, middle, and end-of-string matches.  However, if the business code pattern begins or ends with a non-word character, then I can't use \b to delimit the pattern.
Coleen SullivanIndependent Contractor

Author

Commented:
I need to match on the entire code
aikimarkSocial distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014
What is it doing?

I need a bit more time with the code before I can answer that question.  The original question was 8 months ago.  I need to re-familiarize myself with the code.
Coleen SullivanIndependent Contractor

Author

Commented:
OK... I think the \b should work.  If a non-word character is a space, then that's fine as I have removed all preceding spaces in the code table, thinking it would be better to build that in to the VBA code to capture the three scenarios of the code pattern as I outlined above.

Coleen SullivanIndependent Contractor

Author

Commented:
I need a bit more time with the code before I can answer that question.  The original question was 8 months ago.  I need to re-familiarize myself with the code. 
Ok. Yes, this may be the issue.  There should be no 'clean-up' of the codes in the BusinessCodes table.  

aikimarkSocial distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014
Here are the first seven codes.  None of them start with a word character.  The last one neither begins nor ends with a word character.  I don't know if the code removes these or keeps them.  Certainly \b boundaries won't work with such patterns.
- Ben
- BONDS
- ingaged
- ME
& Ass
& Son
(ANDANCE)

Open in new window

Coleen SullivanIndependent Contractor

Author

Commented:
Ah - and there is no option to build in a way to treat these characters?  I remember you posted a supplement to the original solution that was in regard to how to handle text within parentheses - I will find it and copy here for you.

Coleen SullivanIndependent Contractor

Author

Commented:
Last few posts in this link are in reference to the supplement code you posted:
https://www.experts-exchange.com/questions/29176495/Help-with-a-solution-provided-on-3-4-20.html
aikimarkSocial distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014
I think the questions I had were related to matching the words versus matching the entire business code string.  Thanks for the additional link.  Additional discussion should help with the context and my understanding of the problem and solution(s).
Coleen SullivanIndependent Contractor

Author

Commented:
Yes.  Since the VBA code wasn't working as written, I now want to match on the entire code-string anywhere in the business_name, using spaces to isolate the string, so that 
Code: Painting 
will match:
Painting Supplies [Start of string: followed by a space and anything else]
Supplies for Painting [End of string: preceded by a space and anything before the space]
Supplies for Painting and  [Middle of string: surrounded by spaces]
Will not match:
Paint

aikimarkSocial distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014
There are 349 business codes (see below) with non word (ASCII) characters, other than an internal space character.  Should we match the business code string or the string without the punctuation?

Some of these codes contain upper range ASCII characters.  Codes such as "recupΘrer" and "Café-Bar" may be foreign or may have been changed during the importing process.  I mentioned the high range (foreign) characters in a comment in the prior questions.

Given your current specification, using a space character as the delimiter, I can change the code to work.  However, I'd like you to agree on how the (below) codes should be treated before I roll up my sleeves.

The current VBA code looks for characters that are used by the regex engine to specify the regular expression pattern.  Currently, I am changing all period, open parenthesis, and closed parenthesis characters, prepending a backslash character.  This is because these three characters have special meaning to the regex engine.  The prepended backslash character 'escapes' these characters, causing the regex engine to treat them as characters rather than as part of a pattern (wildcard character, start of a group, end of a group).

Looking through the list, I will need to add a plus sign character to the characters that I escape.
- Ben
- BONDS
- ingaged
- ME
& Ass
& Son
(ANDANCE)
(Botswana)
(CALAIS)
(CAMBIO RUT)
(CHOMERAC)
(COGNY)
(EN FAILLITE
(FOX
(Fuji
(FZC
(GENEVE)
(GENTILLY)
(GUINGAMP)
(HOA)
(In Faling
(Midas)
(MMA
(NCCNP)
(NOAO
(Norway)
(Payment
(Publ
(PwC
(SIBELCO)
(SVAF
(Switzerland)
(TMC
(UK
(USNR
.COM
/Sons
A recupΘrer
A rΘcuperer
A rΘcupΘrer
A/B
A/S
A+
ACE -
ACI-PRENSA
ACN / FNIB
ACS -
AD'APPRO
ADDD/ADDS
ADMIN.
AEG -
AEMB/VBAM
AIG -
Akant-Prof
Al - NASR
All-Pro
Amb.
Anheuser-Busch
Art'Otel
Ass.
Az.
B & B
B&B
B.A.
B.E.C.F.
B.U.S.O.
B.V.
B.V.B.A.
Banker's
Bay.org
Bed & Breakfast
Bee's Keys
BEH -
Bio-
Bonnie's Kids
Brewin'
By-Rite
C. POR A.
C. V
C. V.
C.A.
C.B.
C.C.
C.C.I.A.A
C.E.P.S
C.I.S
C.O.T.B.i
C.S.B.M
C.V.
c/o
Café-Bar
CAP'ENFANTS
Capt/Own
CAR-GAS-TRANS
CDAD.
CHEF D`
Children's Fund
Children's home
CIG -
CNI-
Co.
Coca-Cola
COFITEM-COFIMUR
COL.OF
Comunitα
Condo.
Cons.
Constr&Eng
contabilitα
Co-op
CPE-BARDOUT
CSV.
D & S Trim
D.I.Y.
d.o.o
d.o.o.
D.W.PLASTICS
D/B/A
DAM/
D'ARCHITECTURES
D'AUBY
DE PROP.
Deai/
Defi/
Dem-Pay
Depart.
Dept.
D'Excellence
DIELIX -
Dipl.
Diving & marine
E & O
E&O
E.A.E.S.
E.I.R.L
e.o.
E.P.S.
EDF/
Edif.
ENG.
Epp.
EST.
Exp_Motor
Feed &
Feed & Grain
Fidal R/
Fiduc.
Flats/
Fonsai -
FRAME -
Franτaise
G.B.R.
G.K.
G.M.B.H
GEMEENSCH.
Gentlemen's
Ges.m.b.H
Ges.m.b.H.
Geval - UO
Gez. Eig.
Gleis-Genossenschaft Ristet-Bergermoos
GNLI -
Graaff-
Graaff -
Grill & Pub
HB JARDINS D'AZUR
HDI -
HELIODROMUS,SE
Hope's Inn
Hosp.
I.D.F.E.E.V.
I.K.E
I.R.Q
ICA Nµr
ICA.
ICC -
IDF)
IMM.
IMMOB.
Inc.
ing.
Ing.arch.
ingaged -
Inh.
Inn & Spa
Ins.
INST.
INT.
InterpetrolS.r.l.
Int'l
inv.
Itinera/CIMOLAI
JEUDY MARTIN - T.J.M. TPS
Johnny Gryll's
K.K.
k.s.
Karl's Coins
KENF (ABLIS)
Kft.
KGWN-TV
KNEP-TV
Kurier-team
KWA-ZULU
L.L.C
L.L.C.
L.L.P.
L.P
L.P.
L.T.D.
Land-
-Law
Law, APC
L'EAU VIVE
L'Entre Deux
L'Internation
Loc.
L'Oreal
Ltd.
m.b.h
M.E.
M.I.B
M.P.I
Marista -
Mazda-Volkswagon
Meds & Food
Menighedsrσd
Michelle's Shells
Mondi-Pak
Mt. Pleasant
Multi-
N.A.T.O
N.G.Kerk
N.V.
NCV-NEBON
Nicole's Kids
Norte -
NV/SA
O.E.
O.P.S.
O.Y.
O/A
OGK-5
Oper/
Other -
P&C
P. A
P. A.
P. C
P. C.
P.A.
P.C.
P.T.
P.V.B.A.
Painting &
Pens.
PF OFG SDN. BHD.
Pick 'n Pay
Pipe-Roll
Prev.
PRICK 'N STITCH
PROME Y/O
Prot.
PROV.
QSR,inc
Q-SYS BV
R&D
R.V.T
RAPPEL/
RAPPEL/FRAIS
Redi-Box
Res.
RMS -
RSA -
s r.o.
S.A.R.L.
S.A.S.
S.C.
S.E.C
S.L.
S.L.P.
s.n.c
S.P.
S.P.A.
S.P.R.L.
S.R.L
s.r.o
S.R.O.
S.α.r.l.
S/A
Sµdding Hede
SAP (
Schmidt Rebell-Houben
Sci-fi
SDN. BHD.
Seav -
Serv.
Sevia -
SITH - SIEGE
SMI -
SNAVEB -
SOA -
Soc.
SODI -
Spencer Gas-
St, Louis
St.
St. Louis
St. Vincent De Paul
Super Subby's
Swift/Merieux
SWK -
Syner'val
Sαrl
T.L.C
T.R.S
T.R.T.C.
T.S.A.
T.T.R. ARAKEL
T/A
Tais -
Tek-Este
T-Icy Roll
Tool & Die
Tri-Star
Trygg-Hansa
TTS/LTS
Tungs-
U.A.E
U.A.E.
U.C.C.SA
U.T.E
V.B.R.
V.O.F.
V.V.
V.Z.W.
VEF -
Ver.
Veri -
Very Rev.
Vorl.
W. M. C.
W.E.P.A.
Wal-mart
Wind & Spirits
Wine & Spirits
Witch's Brew
Wolff's Den
Work Now!
Zrt.

Open in new window

Coleen SullivanIndependent Contractor

Author

Commented:
Thanks for the explanation.
I need to match the entire string as entered in the table.  Otherwise, we risk matching proper names.  Although there are 300+ codes that contain non ASCII characters, there look to be only about 10 or so of those characters.
Regarding the upper-level ASCII characters, yes, these are foreign words.  If these cannot be addressed, we will have to leave them for now. I'll need to keep these in a separate table and run a supplement query for these.

aikimarkSocial distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014
How performant does this query need to be?
Coleen SullivanIndependent Contractor

Author

Commented:
I'm not sure I understand?  
What other option is there?

aikimarkSocial distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014
The query is taking a while to finish.  If waiting for results is acceptable, then I don't need to concern myself with performance.  If someone is going to be sitting at the keyboard, waiting for results, then performance might be an issue.  This is especially true if the person is a C-level executive.
Coleen SullivanIndependent Contractor

Author

Commented:
How long is a while?  
It was previously taking approximately 20 minutes on 2mm records in the main db.
I am the only one using it at this time, so I can plan around run-time, but if it is taking more than 30 minutes for 250k records, I wont be able to apply this on my full set of data.
Coleen SullivanIndependent Contractor

Author

Commented:
I have an idea.
aikimarkSocial distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014
The table has ~2.8 million rows.
About 2234 rows are returned by the query that I'm testing (see above). When I drop down to the last row, it is taking a little less than 6 minutes.

I'd like to go over the selection logic for the criteria on both sides of the OR keyword.
Coleen SullivanIndependent Contractor

Author

Commented:
Lets break the function into three separate functions as
  1. Searches for the [Code] at the beginning of the field only, followed by a space
  2. Searches for the [Code] in the middle of the field, surrounded by a space
  3. Searches for the [Code] at the end of of the field, preceded by a space
I'll create subsequent queries to grab the true results from each of these and then pull back together in a final query.

Coleen SullivanIndependent Contractor

Author

Commented:
The table has ~2.8 million rows. 
No, T_MMA has 284,905 records.  My main db has over 2mm records.  How long did the query take to run on the 284,905 records?
Which line of code are your referring to with the OR statemet?
FYI - I included some criteria in this sample to isolate the records that were an issue as compared to my manual effort of updating the business code.  
For example don't worry about:
    T_MMA.BUSINESS_NAME_TX) Not Like "*[0-9]*" AND  
T_MMA.BizCd) Like "Ind*" AND  
InStr(1,[business_name_tx]," "))>0)
aikimarkSocial distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014
The beginning/middle/end conditions aren't a concern.

I'm running this query:
SELECT T_MMA.ID, T_MMA.BizType, T_MMA.BUSINESS_NAME_TX, 
       T_MMA.BizCd, "Corp" AS NewCd

FROM T_MMA

WHERE (((T_MMA.BizType) Is Null) AND 
((T_MMA.BUSINESS_NAME_TX) Not Like "*[0-9]*" AND 
((T_MMA.BizCd) Like "Ind*") AND 
((InStr([business_name_tx]," "))>0) AND 
((RegexMatch([T_MMA].[BUSINESS_NAME_TX]))=True))) 

OR 

(((T_MMA.BizType) Not Like "*D*") AND 
((T_MMA.BUSINESS_NAME_TX) Not Like "*[0-9]*" AND 
((T_MMA.BizCd) Like "Ind*") AND 
((RegexMatch([T_MMA].[BUSINESS_NAME_TX]))=True)));

Open in new window


This is the RegexMatch() function that the query is invoking:
Public Function RegexMatch(ByVal parmBusinessCode As Variant) As Boolean
    Static oRE As Object
    Dim rs As Recordset
    Dim strCodes As String
'    Dim strWholeWordCodes As String
'    Dim strDelimitedCodes As String
    
    If oRE Is Nothing Then
        Set oRE = CreateObject("vbscript.regexp")
        oRE.Global = True
        oRE.Pattern = "([.()+])"
        
        Set rs = DBEngine(0)(0).OpenRecordset("Select Trim(Code) As trim_code from [T_BusinessCodes]")
        Do Until rs.EOF
            strCodes = strCodes & "|" & oRE.Replace(rs![trim_code], "\$1")
            rs.MoveNext
        Loop
        
        oRE.Global = False
        oRE.ignorecase = True
        strCodes = Mid(strCodes, 2)
        strCodes = " (?:" & strCodes & ") "
        oRE.Pattern = strCodes
    End If
    
    If IsNull(parmBusinessCode) Then
        RegexMatch = False
        Exit Function
    End If
     
    RegexMatch = oRE.test(" " & parmBusinessCode & " ")

End Function

Open in new window

Coleen SullivanIndependent Contractor

Author

Commented:
I see the problem. Line 9 should be repeated on the other side of OR, before the Regex line, so there should be 18 lines of code.
Coleen SullivanIndependent Contractor

Author

Commented:
I had been testing before sending and obviously didn't put that back in... sorry.
aikimarkSocial distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014
You are right.  The table has 284905 rows.  Blame my old eyes for the prior count.

It took between 5 and 6 minutes to return all the rows that met the criteria.  If you invoke the query against 2.8 million rows, you are looking at 50-60 minutes for the full-sized run.
Coleen SullivanIndependent Contractor

Author

Commented:
LOL - I'm in same boat with my eyes today ;)
My db 'only' has 2mm rows, so I would guess it should be around 40 mins to run, 2x longer than currently, but I can work with that.
I would love it if you could comment the code lines, so that I understand more clearly what they're doing.  
I'm still wondering if it might be faster to run the three checks separately, so it's only reading the name field once, instead of three times?

aikimarkSocial distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014
I added the line to the query and it still ran in the 5-6 min time.  Still getting 2234 rows returned.

What I've noticed is that there are full business names in the business codes table.  Is there a way for you to discriminate between full business names and "codes"?

I don't see any rows with a Null BizCd value.  Am I supposed to see any with this sample of data?
aikimarkSocial distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014
wondering if it might be faster to run the three checks separately
Maybe.  I combined the two expressions on both sides of the OR and reran the query.  It only took about 4 minutes to reach the bottom.

This is the query I just tested.
SELECT T_MMA.ID, T_MMA.BizType, T_MMA.BUSINESS_NAME_TX, T_MMA.BizCd, "Corp" AS NewCd

FROM T_MMA

WHERE (
(((T_MMA.BizType) Is Null) OR ((T_MMA.BizType) Not Like "*D*")) AND 
((T_MMA.BUSINESS_NAME_TX) Not Like "*[0-9]*" AND 
((T_MMA.BizCd) Like "Ind*") AND 
((InStr([business_name_tx]," "))>0) AND 
((RegexMatch([T_MMA].[BUSINESS_NAME_TX]))=True))
) 

Open in new window

aikimarkSocial distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014
Ah Ha!  The business_name_tx field is a memo data type.  Did any of the businesses have names longer than 255 characters?
Coleen SullivanIndependent Contractor

Author

Commented:
What I've noticed is that there are full business names in the business codes table.  Is there a way for you to discriminate between full business names and "codes"? 
I've included full business names in the table to avoid pulling unwanted records on a truncated version of that name.  Is there a benefit to discriminating between the two?  I could add a field to discriminate.
I don't see any rows with a Null BizCd value.  Am I supposed to see any with this sample of data? 
No - I updated BizCd (manually as best I could) so that I could display how the Regex function was including Individual names.  The BizCd's you're seeing now on the 2234 rows should all be 'Corp".  If not, no worries, I've been updating the BusinessCodes table all day, so there will be some that have not been caught with the data you have.

Coleen SullivanIndependent Contractor

Author

Commented:
Ah Ha!  The business_name_tx field is a memo data type.  Did any of the businesses have names longer than 255 characters? 
With this data... I wouldn't be surprised,  I didn't even notice that data type changed.

Coleen SullivanIndependent Contractor

Author

Commented:
Did you post the updated function, or are you still working on that?
aikimarkSocial distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014
There are only 36 matches of entire business names.  These results come back within a matter of seconds.

Note: I took the left-most 255 characters of the business name for this test.

ID     BizType	NewCd	Code
21611		Corp	Teeny Pies
27891		Corp	Industrie
22577		Corp	Right Up My Ali
87380		Corp	Sky River
108913		Corp	RC TEX
88113		Corp	Queer SD
114124		Corp	One Fam
114966		Corp	Knights of Columbus
152404		Corp	North Penn
217123		Corp	Nicole's Kids
200795		Corp	OnPoint Digs
238975		Corp	Gardner Bay
306507		Corp	Cie Newcor
265610		Corp	Knights of Columbus
328633		Corp	Cape GOP
328944		Corp	INDIVIDUAL
356064		Corp	Inside Out Works
544214		Corp	BASF
631115  Reg_MF	Corp	Afiniti
745104		Corp	Diamond
767526		Corp	Sunlink
810947		Corp	Pro Ag
826982		Corp	Nextel
913787		Corp	Coopers Creek
985166		Corp	Shake it
1022762		Corp	Q-SYS BV
1066258		Corp	Kill Brand
1069264		Corp	Michelle's Shells
1153870		Corp	Wonder Works
1176915		Corp	Testing
1304684		Corp	EZ Sleep Inn
1305642		Corp	Bio Regen
1284217		Corp	Muscle Powered
1289357		Corp	Greystone Power
1392724		Corp	Edmunds JV
1656541		Corp	Karl's Coins

Open in new window

Coleen SullivanIndependent Contractor

Author

Commented:
I see it -

aikimarkSocial distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014
I am still testing the same function that I posted earlier.  Only the query has changed.
Coleen SullivanIndependent Contractor

Author

Commented:
I'm not clear why whether the 'code' contains the full name or not matters?
Coleen SullivanIndependent Contractor

Author

Commented:
I am still testing the same function that I posted earlier.  Only the query has changed. 
OK, Thanks.
aikimarkSocial distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014
Here is the commented routine.  I renamed the parameter to be more clear about what I'm doing.

Public Function RegexMatch(ByVal parmBusinessName As Variant) As Boolean
    Static oRE As Object
    Dim rs As Recordset
    Dim strCodes As String
    
    If oRE Is Nothing Then      'only do this once in the life of the open database
        Set oRE = CreateObject("vbscript.regexp")
        oRE.Global = True       'look at all possible matches
        oRE.Pattern = "([.()+])"   'characters that need escaping
        
        Set rs = DBEngine(0)(0).OpenRecordset("Select Trim(Code) As trim_code from [T_BusinessCodes]")
        Do Until rs.EOF
            'clean up and concatenate the codes
            strCodes = strCodes & "|" & oRE.Replace(rs![trim_code], "\$1")
                                        'the Replace is escaping the characters
                                        'with a backslash
            rs.MoveNext
        Loop
        
        oRE.Global = False              'stop looking with the first match
        oRE.ignorecase = True
        strCodes = Mid(strCodes, 2)     'skip the leading pipe character
        strCodes = " (?:" & strCodes & ") "     'put finishing touches on the pattern
        oRE.Pattern = strCodes
    End If
    
    If IsNull(parmBusinessName) Then
        RegexMatch = False
        Exit Function
    End If
    
    'prepend and append a space to the business name and invoke the
    'regex .test() method
    RegexMatch = oRE.test(" " & parmBusinessName & " ")

End Function

Open in new window

aikimarkSocial distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014
why whether the 'code' contains the full name or not matters
Exact string matching is much faster than invoking a user-defined function.  A test of this inner join with the left 255 characters of the business name and the code column in the business codes table runs in mere seconds.

From a performance perspective, if you know that some business codes are found more often than others, we can test for those codes first.

One of the configurations of this solution would be to break up the pattern into multiple regex objects in a collection, each containing a subset of the pattern.  For each business name, iterate the collection until we get a match or until we exhaust the collection.  I don't know how long this big regex pattern is taking compared to multiple smaller patterns.
Coleen SullivanIndependent Contractor

Author

Commented:
Thank you.  Please identify where it checks for the code in the beginning of the string and at the end of the string,  I only see where it is checking for the 2nd scenario (spaceCodeSpace)

aikimarkSocial distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014
Every business name, passed to the function, is prepended and appended with a space character.

The final 'wrap up' of the pattern includes a leading and trailing space character.  This should handle all three scenarios.
Coleen SullivanIndependent Contractor

Author

Commented:
One of the configurations of this solution would be to break up the pattern into multiple regex objects in a collection, each containing a subset of the pattern.  For each business name, iterate the collection until we get a match or until we exhaust the collection.  I don't know how long this big regex pattern is taking compared to multiple smaller patterns. 
This is what I was thinking about earlier, but running on the three scenarios, rather than on a true full-name -vs- a code.  With over 2mm records, this subset of 4000 'codes' is the best I can do at this point.  I like the idea of the collection of regex objects. As mentioned earlier, I would like to have three regex objects as:
  1. Matching [Code] at the beginning of the field only, followed by a space
  2. Matching [Code] in the middle of the field, surrounded by a space
  3. Matching [Code] at the end of of the field, preceded by a space

Coleen SullivanIndependent Contractor

Author

Commented:
Every business name, passed to the function, is prepended and appended with a space character.
The final 'wrap up' of the pattern includes a leading and trailing space character.  This should handle all three scenarios. 
Will this logic find:
" Co."  in the string: The Campbell Soup Co.
"Painting" in the string: "Painting and Supplies"
aikimarkSocial distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014

Will this logic find:
" Co."  in the string: The Campbell Soup Co.
"Painting" in the string: "Painting and Supplies"
It should.  As I am building the pattern, I Trim() the codes as I iterate the business code table.  This will transform
" Co." to "Co\."

When "The Campbell Soup Co." is passed, it is transformed to " The Campbell Soup Co. "
" (Co\.) " should match.

Similarly, when "Painting and Supplies" is passed, it is transformed to " Painting and Supplies "
" (Painting) " should match.
Coleen SullivanIndependent Contractor

Author

Commented:
It has been my experience that they will not match
Coleen SullivanIndependent Contractor

Author

Commented:
The way this is currently written will miss all codes at the beginning of the string and those at the end.  This will only find the codes that have preceding and trailing spaces, i.e. those in the middle of the string
aikimarkSocial distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014
Please test what I have posted (query and function).
Coleen SullivanIndependent Contractor

Author

Commented:
OK great -  it looks like its catching the codes at the end. I get different results when adding a space to a field in sql, which is why I was concerned. I'll check the beginning of the strings in the morning and run more tests.  Will also think more about how to break up the codes to run in different regex objects for speed performance.
Thanks VERY MUCH for all your help!
aikimarkSocial distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014
Hypothetically, the following should run a bit faster.  I'm breaking up the logical expressions into those that are purely SQL (the inner query) and those that require the invocation of the user-defined function (the outer query).

The inner query returns 155K rows, so we have potentially eliminated 130K invocations of the user-defined function.

SELECT I.ID, I.BizType, I.BUSINESS_NAME_TX, I.BizCd, "Corp" AS NewCd

From (

Select ID, BizType, 
       BUSINESS_NAME_TX, BizCd

FROM T_MMA

WHERE (
(((T_MMA.BizType) Is Null) OR ((T_MMA.BizType) Not Like "*D*")) AND 
((T_MMA.BUSINESS_NAME_TX) Not Like "*[0-9]*" AND 
((T_MMA.BizCd) Like "Ind*") AND 
((InStr([business_name_tx]," "))>0) 
) 
)
) As I

WHERE RegexMatch(I.[BUSINESS_NAME_TX])=True

Open in new window

aikimarkSocial distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014
Multiple regex objects would require you to know your data.

For instance, we know that companies that include "Inc" and "AB" should match more companies than "S.P.R.L." or "R.V.T".  So, we would use that knowledge to reorder the codes by their matching frequency.  Business codes that only match a single company (exact match on company name) would be at the bottom of the list.
Coleen SullivanIndependent Contractor

Author

Commented:
Thanks, I need think more about grouping the codes - good to know though.
155k sound like too many rows, but I'll review in the AM.
Thanks again.

aikimarkSocial distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014
The 155K rows was reduced down to the 2234 rows by the user-defined function.

I wrote a quick routine to find the most frequent words.  I used 600 as my cutoff value.  I used the Split() function, so punctuation is included.  There are 128K unique 'words' in your sample data.

Here is that list:
45470	&
37967	LLC
27962	INC.
11014	AND
10450	INC
5999	OF
4939	JOHN
4722	DBA
4416	JAMES
4380	ROBERT
4175	MICHAEL
3972	THE
3846	DAVID
3624	A.
3463	J.
3430	WILLIAM
3426	M.
3268	A
2992	L.
2906	THOMAS
2753	J
2654	M
2601	MARY
2467	RICHARD
2443	COMPANY
2390	L
2380	
2349	E.
2179	CONSTRUCTION
2124	R.
2022	CORPORATION
2008	JR.
2005	C.
2000	SERVICES
1949	JOSEPH
1930	ASSOCIATION
1855	GROUP
1837	D.
1809	W.
1800	R
1789	CHARLES
1784	SERVICES,
1762	C
1726	MARK
1718	E
1664	SUSAN
1654	D
1652	S.
1580	GROUP,
1546	PAUL
1533	S
1456	SMITH
1453	LINDA
1440	PATRICIA
1425	SCOTT
1422	DANIEL
1409	MANAGEMENT
1406	TRUCKING
1398	MR.
1395	COMPANY,
1356	B.
1356	B
1331	CENTER
1318	BARBARA
1316	W
1305	H.
1305	ELIZABETH
1303	ANN
1228	JOHNSON
1215	JR
1195	P.
1194	H
1189	III
1166	STEVEN
1165	MD
1156	G.
1143	STEPHEN
1136	CHRISTOPHER
1124	NANCY
1067	HEALTH
1065	KAREN
1061	BRIAN
1055	JEFFREY
1051	GEORGE
1033	F.
1031	G
1026	DONALD
1026	PROPERTIES
1019	MATTHEW
1014	JENNIFER
1009	T.
1003	GARY
999	DR.
998	COUNTY
994	SCHOOL
990	CAROL
989	P
974	ANDREW
965	FAMILY
965	LEE
953	CONDO
946	SERVICE
946	HOLDINGS,
940	T
927	ASSOCIATES
922	EDWARD
920	CONDOMINIUM
918	RONALD
915	PROPERTIES,
912	TRUST
909	MEDICAL
902	NORTH
899	KEVIN
879	MILLER
870	PETER
867	K
862	K.
854	KENNETH
852	CARE
846	CITY
843	ENTERPRISES
841	MARGARET
838	ANTHONY
835	LISA
825	COMMUNITY
818	ESTATE
814	CONSTRUCTION,
814	HOME
814	FOUNDATION
809	WILLIAMS
798	CO.
798	TIMOTHY
796	KATHLEEN
795	JONES
793	FOR
783	ENTERPRISES,
780	AT
775	BROWN
774	CLUB
771	F
771	-
769	SANDRA
767	CO.,
764	SOLUTIONS
760	TRANSPORT
754	DEVELOPMENT
752	CHURCH
743	FRANK
738	DEBORAH
736	LTD
735	DONNA
733	ASSOCIATION,
729	CORP
723	CO
722	MARTIN
713	ASSOCIATES,
713	LP
706	PARK
698	HOMEOWNERS
698	GREGORY
697	CORP.
697	PATRICK
679	WEST
674	DAVIS
671	AMERICAN
668	DIANE
667	HOLDINGS
666	JASON
661	SOLUTIONS,
660	LAURA
660	KELLY
659	MANAGEMENT,
657	LARRY
656	DOUGLAS
652	TAYLOR
651	DENNIS
649	ERIC
646	TRANSPORTATION
643	PARTNERS
642	NEW
637	SHARON
636	LLP
633	JANE
626	RYAN
622	CAPITAL
612	PC
612	ANDERSON
607	INTERNATIONAL
605	SARAH

Open in new window

aikimarkSocial distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014
Here is the code that I used to print that frequently-used words list.
Sub Bus_Name_Word_Freq()
    Dim oDic As Object
    Dim vParsed As Variant
    Dim rs As Recordset
    Dim vItem As Variant
    Dim oKeys As Object
    Dim oValues As Object
    Dim lngLoop As Long
    Dim oList As Object
    
    Set oDic = CreateObject("system.collections.hashtable")
    
    Set rs = DBEngine(0)(0).OpenRecordset("Select UCASE(Trim(business_name_tx)) As trim_name from [T_MMA]")
    Do Until rs.EOF
        vParsed = Split(rs![trim_name], " ")
        For Each vItem In vParsed
            If oDic.containskey(vItem) Then
                oDic(vItem) = oDic(vItem) + 1
            Else
                oDic(vItem) = 1
            End If
        Next
        rs.MoveNext
    Loop
    
    Set oKeys = CreateObject("system.collections.arraylist")
    Set oValues = CreateObject("system.collections.arraylist")
    Set oList = CreateObject("system.collections.arraylist")
    oKeys.addrange oDic.Keys
    oValues.addrange oDic.Values
    For lngLoop = 0 To oValues.Count - 1
        oList.Add Format(oValues(lngLoop), "0000000") & "|" & oKeys(lngLoop)
    Next
    
    oList.Sort
    
    'print the top 200 most frequent words
    For lngLoop = (oValues.Count - 1) To (oValues.Count - 200) Step -1
        vParsed = Split(oList(lngLoop), "|")
        Debug.Print vParsed(0), vParsed(1)
        
    Next
    
End Sub

Open in new window


If the user-defined function had a designated list of frequently used words, it could create a second smaller regex object and use both of them to test the business name for a pattern match.  If it matched the smaller pattern, it could skip the larger pattern.

If the user-defined function knew which of the codes were actual business names, we could build a dictionary/hashtable/hashset of the names and do a hashed look up, which is extremely fast.
Coleen SullivanIndependent Contractor

Author

Commented:
The problem I see with the frequently used words idea is that there are too many instances where actual business names include proper names.  
That being said, I am using your suggestion and building out the BusinessCodes table, with fields identifying codes to be used in separate regex objects, thereby excluding them from the first.  Perhaps your suggestion above can be used with this set?  I will share it when I've completed.

aikimarkSocial distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014
I wasn't suggesting you would use items from the frequent word list without consideration.  In a prior comment, I gave the example of "LLC" and "INC" as good candidates for a quick check list.
Coleen SullivanIndependent Contractor

Author

Commented:
Thanks - got it.
Another question - in  the commented code you provided above, where you renamed the parameter from (I believe) 'parmBusinessCode' to 'parmBusinessName' - is this calling the 'partial code match' routine?
Public Function PartialCodeMatch(ByVal parmBusinessCode As Variant) As Boolean
    Dim vParts As Variant
    Dim vItem As Variant
    If IsNull(parmBusinessCode) Then
        PartialCodeMatch = False
        Exit Function
    End If
    vParts = Split(parmBusinessCode, " ")
    For Each vItem In vParts
        If IsNull(DLookup("Code", "[T_BusinessCodes]", "code='" & Replace(vItem, "'", "''") & "'")) Then
        Else
            PartialCodeMatch = True
            Exit Function
        End If
    Next
    PartialCodeMatch = False
End Function

aikimarkSocial distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014
I have not touched the PartialCodeMatch routine.  I think that is a relic from a prior version of the database.
Coleen SullivanIndependent Contractor

Author

Commented:
 oRE.Pattern = "([.()+])"   'characters that need escaping 
Should these 5 characters also be included?; & - / , ' 
Coleen SullivanIndependent Contractor

Author

Commented:
 oRE.Pattern = "([.()+])"   'characters that need escaping 
Actually 6 characters?:   -  / \ , '
aikimarkSocial distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014
If there are backslash characters, then yes.  A hyphen/dash character only has regex meaning inside of square brackets.  If there were any question mark characters or asterisk characters, then they would also need to be escaped.
Coleen SullivanIndependent Contractor

Author

Commented:
so ampersands, commas, apostrophes, colin and semi-colin are all OK?
I have no idea what characters are -or will be- in the data, so I want to be sure we are accommodating for all of the ones we know of that need to be addressed/escaped.
aikimarkSocial distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014
so ampersands, commas, apostrophes, colin and semi-colin are all OK?
yes.

Remember that this only matters for the business codes table.  The business names table contents don't matter.
aikimarkSocial distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014
As a follow-up on the frequent words, I've seen some instances where a business code is not delimited by a trailing space character.  Look at "Corp" (not "Corporation").  There are instances where the word in a business name is followed by a punctuation character.  In order to match this you will need to have multiple entries in your business code table ("Code", "Code.", "Code,", "Corp;").  A change to the generated regex pattern might greatly simplify your efforts.
Coleen SullivanIndependent Contractor

Author

Commented:
Yes, thanks.  I will review the standard business acronyms and ensure there are entries covering each punctuation possibility. 
aikimarkSocial distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014
It would be a better use of your time if the regex pattern handles this.
Social distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014
This problem has been solved!
(Unlock this solution with a 7-day Free Trial)
UNLOCK SOLUTION
Coleen SullivanIndependent Contractor

Author

Commented:
So I don't need to add those to my table?  I just did, but not a problem to remove.  
I'll test this version this afternoon - thank you.

Coleen SullivanIndependent Contractor

Author

Commented:
To be clear, this will still pick up the standard code-word from the table with
 - no trailing space or punctuation at the end of the string
and
 - no preceding space at the start of the string, followed by no punctuation, one space, and then anything?
aikimarkSocial distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014
It will match the item in the code table preceded by a space and followed by any of the following characters: space, period, comma, semicolon.

The codes you put into the code table should not have punctuation or any additional non-word characters unless that is a requirement for your match.  Do not add spaces before or after any of these codes.

Note: I trim all the codes of leading/trailing spaces as part of the transformation into a regular expression pattern
Coleen SullivanIndependent Contractor

Author

Commented:
Will it find the code word at the start of the string?
Law Office of anyname
I have removed all trailing punctuation, and I have no preceding spaces.
I have 2 entries that start with hyphen-space, three that start with ampersand-space, one that starts with a period ".com", one that starts with "/"and a handful that start with an open parentheses.


aikimarkSocial distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014
If you put the punctuation in, it will match that string that includes that punctuation character.

Will it find the code word at the start of the string?
It should

Law Office of anyname
I'm not sure what you mean.  If the business code row is "Law Office", then all rows with that string should be matched.

You can test anything in the immediate window (Ctrl+G) in the VB project window.  Type a question mark followed by RegexMatch("business name or string you want to test") and press Enter.  It should return a True or False value.
Coleen SullivanIndependent Contractor

Author

Commented:
  strCodes = strCodes & "|" & oRE.Replace(rs![trim_code], "\$1")
I'm getting a 'Run-time error 13' on this line of code again.
I've decompiled, compacted closed the access and re-opened a few times, but it's still happening.  I can tell what is different with this line  versus previous versions.  Any ideas?
Coleen SullivanIndependent Contractor

Author

Commented:
*I can 'not' tell what is different with this code versus previous versions.
aikimarkSocial distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014
Have you compiled the module after the /decompile action?

In such circumstances, I've been able to 'fix' the problem with the following steps.
1. Export the module.  It will create a .bas file
2. Remove the module from the VB Project
3. If there are any remaining modules, do a Debug Compile in that module.
4. Close the database.
5. Open the database with /decompile
6. Import the module (1)
7. Debug Compile the code
8. Save the module
Coleen SullivanIndependent Contractor

Author

Commented:
Thank you.
Coleen SullivanIndependent Contractor

Author

Commented:
and yes- I have always been compiling after decompiling.
I ended up going to the last good backup I had (from 11/17) and manually edited the module with any new code. Decompiled, compiled, closed out, came back and I'm ok now.  It's good to know of all the options to use in the current db though - saves some effort.
Coleen SullivanIndependent Contractor

Author

Commented:
It happened again and I've followed all the steps you outline above, but no change.
Should I just create a new db and import everything over?

Coleen SullivanIndependent Contractor

Author

Commented:
It's concerning to me though as this is the same line of code that was causing this issue back in March also... is there something we can tweak in it so this doesn't keep happening?

Coleen SullivanIndependent Contractor

Author

Commented:
Looking at this more closely - is there a missing open-paren?

aikimarkSocial distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014
How large is the business code table? (row count and total string length)
aikimarkSocial distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014
If there were a syntax error, like a missing "(", then the code wouldn't compile.

The error is happening as it is building the regex pattern string, not during the matching process.
Coleen SullivanIndependent Contractor

Author

Commented:
4895  rows, 40716 total characters
Coleen SullivanIndependent Contractor

Author

Commented:
I think I fixed it...

aikimarkSocial distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014
Are there any Null rows in the business code table?
aikimarkSocial distance; Wear a mask; Don't touch your face; Wash your hands for 20 seconds
CERTIFIED EXPERT
Top Expert 2014
I think I fixed it...
What did you do?
Coleen SullivanIndependent Contractor

Author

Commented:
The Business Codes table is created from a linked excel file - which is where all  revisions are made.  I went back to that file and deleted blank rows and columns, which seemed to do the trick.
Coleen SullivanIndependent Contractor

Author

Commented:
Thank you for all your help.  The new version is performing exactly as I hoped, with precise results.  It has also exposed some inconsistencies in my source table that I am revising.  In addition,  believe I can make my entire process more efficient by expanding the field hierarchy in my source table.  I will post this as a new question.
Thank you again for all of your help!