futr_vision
How do I use REreplace to filter out abbreviations?
I am using Verity and needed to expand the "stop words" list to filter out common terms used in company names, such as: company, incorporated, corporation, etc.
Not only did I need to filter out those terms, but their abbreviations as well, such as: com, inc, inc., corp, etc.
I initially used replaceList, which works great, but it hiccups when filtering abbreviations. For example, using replaceList to filter "inc" works, but it also changes "Lincoln" to "L oln" and "Communications" to "munications".
The solution, I believe, lies in passing the result from replaceList to a REReplace filter, but I am not sure how to write the regular expression. I want to filter any abbreviation, with or without a trailing period. Below is the code I have so far. I've shortened the list of terms I am replacing since it is quite long.
<cfset search_term = lcase(url.searchTerm)>
<cfset search_term_cleaned = replaceList(search_term, "associates,assoc,bank,companies,company,com,corp,holdings,incorporated,industries,trust,corporation"," , , , , , , , , , , , , ,")>
<cfset search_term_final = REreplace(search_term_cleaned, "REGEX here","")>
\binc\b would match "inc" when it is bordered by a \W character [^A-Za-z0-9_] or start/end of a string, maybe that will help you.
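To make the word-boundary point concrete, here is a minimal sketch comparing a plain substring replacement with a \b-bounded REReplace. The sample string and the variable names sample, naive and bounded are assumptions for illustration, not part of the original code.

```cfm
<!--- Hypothetical input chosen to show the difference; not from the original post --->
<cfset sample = "lincoln communications inc">

<!--- Plain substring replacement also hits the "inc" inside "lincoln" --->
<cfset naive = replace(sample, "inc", " ", "all")>

<!--- \b restricts the match to "inc" standing alone as a whole word --->
<cfset bounded = REReplace(sample, "\binc\b", " ", "all")>

<cfoutput>
naive: [#naive#]<br>
bounded: [#bounded#]
</cfoutput>
```

With the plain replacement, "lincoln" turns into "l oln"; with the bounded pattern, only the standalone "inc" is replaced. The fourth argument "all" matters here, since REReplace replaces only the first match by default.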
ASKER
I have a series of these abbreviations I need to filter. Is there a way to include them all in one REReplace statement?
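One way this could be done is with alternation inside the word boundaries, so a single REReplace call covers every abbreviation. This is only a sketch under assumptions: the sample input, the shortened abbreviation list, and the variable names abbreviations and pattern are hypothetical, and it is not necessarily the approach taken in the accepted answer.

```cfm
<!--- Hypothetical input; the real code would use lcase(url.searchTerm) --->
<cfset search_term = "lincoln communications corp">

<!--- Shortened, hypothetical list of abbreviations to strip --->
<cfset abbreviations = "assoc,com,corp,inc">

<!--- Turn the comma list into an alternation: \b(assoc|com|corp|inc)\b --->
<cfset pattern = "\b(" & replace(abbreviations, ",", "|", "all") & ")\b">

<!--- Replace every whole-word match with a space --->
<cfset search_term_final = REReplace(search_term, pattern, " ", "all")>

<cfoutput>[#search_term_final#]</cfoutput>
```

Because \b sits on both sides of the group, "com" does not match inside "communications" and "inc" does not match inside "lincoln". Only the standalone "corp" is replaced in this sample.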
ASKER CERTIFIED SOLUTION
ASKER
Great! And if they use a period after the abbreviation, such as in "inc.", I need to escape the period using "\", correct?
Yes, but note that . is in \W, so the \b after the period only matches when a \w character follows it.
Given
\binc\.\b
would match:
test inc.a
but not:
test inc.
test inc.,
Also, note that \binc\.\b would not match "test inc. something" given that "." and " " are both in \W.
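If the trailing period does need to be handled, one possible workaround is to keep the \b directly after the abbreviation and make the period optional after it. The sketch below assumes the pattern \binc\b\.? and a handful of hypothetical sample strings; it is an illustration, not something proposed in the thread.

```cfm
<!--- \b after "inc" keeps the whole-word check; \.? then consumes a period if one is present --->
<cfset pattern = "\binc\b\.?">

<cfoutput>
<!--- Hypothetical samples, separated by | so the comma inside one of them is preserved --->
<cfloop list="test inc.|test inc., ltd|test inc. something|lincoln" index="s" delimiters="|">
  [#s#] becomes [#REReplace(s, pattern, "", "all")#]<br>
</cfloop>
</cfoutput>
```

Here the period is removed together with "inc" whether it is followed by a space, a comma, or the end of the string, while "lincoln" is left untouched.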
ASKER
Looking at this, it is probably not necessary to account for the "." since "." will not return any results in a search. I'll go with your solution as-is. Thanks.
Thanks for the question and the points.