futr_vision


How do I use REreplace to filter out abbreviations?

I am using Verity and need to expand the "stop words" list to filter out terms commonly used in company names, such as: company, incorporated, corporation, etc.

I need to filter out not only those terms but also their abbreviations, such as: com, inc, inc., corp, etc.

I initially used replaceList, which works well for whole words but stumbles on abbreviations. For example, using replaceList to filter "inc" works, but it also changes "Lincoln" to "L oln" and "Communications" to "munications".

The solution, I believe, lies in passing the result from replaceList to an REReplace filter, but I am not sure how to write the regular expression. I want to filter each abbreviation both with and without a period. Below is the code I have so far; I've shortened the list of terms I am replacing since it is quite long.
<cfset search_term = lcase(url.searchTerm)>
<cfset search_term_cleaned = replaceList(search_term, "associates,assoc,bank,companies,company,com,corp,holdings,incorporated,industries,trust,corporation"," , , , , , , , , , , , , ,")>
<cfset search_term_final = REreplace(search_term_cleaned, "REGEX here","")>


ddrudik

\binc\b matches "inc" only when it is bordered on each side by a \W character (that is, [^A-Za-z0-9_]) or by the start or end of the string. Maybe that will help you.
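To make the difference concrete, here is a minimal sketch comparing a plain substring replace with a word-boundary REReplace. The sample string and the variable names (sample, naive, bounded) are made up for illustration and are not from the asker's code.

```cfm
<!--- Hypothetical sample input, already lowercased as in the original code --->
<cfset sample = "lincoln financial inc">

<!--- Plain substring replacement also hits the "inc" inside "lincoln" --->
<cfset naive = replace(sample, "inc", " ", "all")>

<!--- \b anchors the match to word boundaries, so only the standalone "inc" is replaced --->
<cfset bounded = REReplace(sample, "\binc\b", " ", "all")>

<cfoutput>
naive: [#naive#]<br>
bounded: [#bounded#]
</cfoutput>
```

In the naive result "lincoln" is broken apart, which is the problem described in the question; in the bounded result it is left intact and only the trailing "inc" is removed.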
futr_vision

ASKER

I have a series of these abbreviations I need to filter. Is there a way to include them all in one REReplace statement?
ASKER CERTIFIED SOLUTION
ddrudik

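The text of the accepted answer is not reproduced in this thread. Purely as a sketch, and not as ddrudik's actual answer, one common way to handle several abbreviations in a single REReplace call is to join them with | inside a group and wrap the group in \b. The stop_words list, the pattern variable, and the sample input below are assumptions for illustration.

```cfm
<!--- Hypothetical sample input standing in for url.searchTerm --->
<cfset search_term = lcase("Lincoln Communications Inc")>

<!--- Hypothetical shortened list of terms and abbreviations --->
<cfset stop_words = "incorporated,inc,corporation,corp,company,com">

<!--- Turn the comma list into an alternation: \b(incorporated|inc|...)\b --->
<cfset pattern = "\b(" & replace(stop_words, ",", "|", "all") & ")\b">

<!--- Replace every whole-word match with a space, then trim the ends --->
<cfset search_term_final = trim(REReplace(search_term, pattern, " ", "all"))>

<cfoutput>#search_term_final#</cfoutput>
```

Because every alternative must sit between word boundaries, "inc" and "com" are removed only when they stand alone, so "lincoln" and "communications" pass through unchanged. A pattern like this could also replace the replaceList step entirely, since it covers the full words as well as the abbreviations.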
Great! And if they use a period after the abbreviation, as in "inc.", I need to escape the period using "\", correct?
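Yes, but note that "." is a \W character, so a \b placed after the escaped period only matches when a \w character follows the period.

Given
\binc\.\b

would match:
test inc.a

but not:
test inc.
test inc.,
test inc. something

In those last three cases the period is followed by the end of the string or by another \W character ("," or " "), so there is no word boundary after it.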
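If the period did need handling, one option (a sketch, not something proposed in this thread) is to keep \b directly after the letters and make the escaped period optional with \.? so that no word boundary is required after the period. The pattern and the sample strings below are assumptions for illustration.

```cfm
<!--- Whole-word abbreviation, optionally followed by a literal period --->
<cfset pattern = "\b(inc|corp)\b\.?">

<!--- Hypothetical samples, separated by | --->
<cfloop list="test inc|test inc.|test inc. something|lincoln corp." delimiters="|" index="sample">
    <cfset cleaned = trim(REReplace(sample, pattern, "", "all"))>
    <cfoutput>[#sample#] -> [#cleaned#]<br></cfoutput>
</cfloop>
```

Each sample loses the abbreviation together with its trailing period when one is present, while "lincoln" is left untouched because its "inc" is not bounded by word boundaries.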
Looking at this, it is probably not necessary to account for the "." since "." will not return any results in a search. I'll go with your solution as-is. Thanks.
Thanks for the question and the points.