Link to home
Start Free TrialLog in
Avatar of LarryLoogla
LarryLoogla

asked on

Regular expression to identify a word in a sentence that does not previously contain another word

I need a regular expression that would match the word "foo" in the second sentence, but not in the first two.  The criteria would be that "foo" can only be matched if it is not followed by any of several words.  In this example they filter words might be "sentence" and/or "question".  The filter word might immediately precede the match word, but if it occurs in the previous sentence it should not be considered.

This sentence has a the word foo in it.  This bar, also has foo.  Does this question have "foo" in it?

I tried the following, but it always matches "foo".  Given that a negative lookbehind may not be able to contain a regex, is this even possible.  It will run in the .Net Regex engine.
(?<!(?:sentence|question)[^\.?!:;]+)foo

Open in new window

Avatar of abel
abel
Flag of Netherlands image

What you need is a negative look forward (they are zero width, the are). The following works for me:

string threeSentenceNoMatch = "first sentence with foo\nsecond foo sentence\na sentence with foo";
string threeSentenceMatch = "first lecture with foo\nsecond foo leture\na lecture with foo";
string threeSentenceMatch2 = "first sentence with foo\nsecond foo sentence\na foo sentence";
if (Regex.IsMatch(threeSentenceNoMatch, "(.*\n){2}(?!.*sentence(?=.*foo)).*foo"))
    Debug.WriteLine("wrong! should not match");
if (Regex.IsMatch(threeSentenceMatch, "(.*\n){2}(?!.*sentence(?=.*foo)).*foo"))
    Debug.WriteLine("coorect! should match");
if (Regex.IsMatch(threeSentenceMatch2, "(.*\n){2}(?!.*sentence(?=.*foo)).*foo"))
    Debug.WriteLine("coorect! should match");

Open in new window

So, after two sentences, we arrive at the start of the third sentence. Then we ask "is there a word "sentence" in this part?". If the answer is yes, we ask again "is it followed by "foo"? If it is, the whole thing is "negative look forward", which means the match will fail. If it is not (i.e., there is no "sentence" or there is no "foo" following the word "sentence") then the match will succeed.

On my test, the output was too lines with "coorect! should match".
Important note: the above does NOT work with the "s" modifiers as they change the meaning of the dot to include the \n.
Avatar of LarryLoogla
LarryLoogla

ASKER

Hi abel,

I see now how the positive lookahead (?=.*foo)) wrapped in the negative lookahead could work.  Unfortunately, one of my constraints is that the text will be in paragraph form, not individual sentences.  I can't change these internals.  The sentence should match any occurrences of "foo" in a sentence where it is not preceded by "sentence" or "question".

The goal would be to match "foo" in the second sentence only.  I tried to apply your method, but it matches all three "foo"s.  What am I doing wrong?
string threeSentence = "This sentence has a the word foo in it.  This bar, also has foo.  Does this question have \"foo\" in it?";
            Regex regex = new Regex(@"(?!(?:sentence|question)(?=[^\.?!:;]+foo))foo");
            foreach (Match match in regex.Matches(threeSentence))
            {
                Debug.WriteLine("Word foo at index " + match.Index.ToString() + " has text " + match.Value);
            }

Open in new window

You mean, a sentence is a real sentence, not one separated by newlines. Can you show me three examples, like I did, so that I can be sure of amending the regex correctly for you? Make sure they represent your case(s) well.
There's something else I don't understand. You now say second sentence, and you said this: "I need a regular expression that would match the word "foo" in the second sentence, but not in the first two. " It cannot both not be in the first two and match in the second... which is why I came up with three sentences.
Abel,

Sorry for the error and ambiguity. You are absolutely right.  I caught my mistake after submitting.  It should have read  I need a regular expression that would match the word "foo" in the second sentence, but not in the first or third.


 The goal would be to match "foo" in the sentences that don't contain the filter words "sentence" or "question".  Sentences are separated by normal punctuation characters, i.e. ".?!:; and combined with other sentences in the text to be matched.  The filter words are arbitrary and will be a long list of words in practice.


Here are some samples and my stab at a regex that matches all instances of "foo".  It should match "foo" as indicated by the string variable names below.  Each string contains three sentences.



string matchesSecond = "This sentence has a the word foo in it.  This bar, also has foo.  Does this question have \"foo\" in it?";
 
string matchesFirstOnly = "Foo statement doesn't have filter word.  Sentence two is foo!  Is question a foo filter?"
 
string matchesSecondAndThird = "This sentence shouldn't match foo:  This one should match foo.   In this example, foo should match too."
 
Regex testRegex = new Regex(@"(?!(?:[Ss]entence|[Qq]uestion)(?=[^\.?!:;]+foo))[Ff]oo");

Open in new window

To clarify my statement "Here are some samples and my stab at a regex that matches all instances of "foo".  It should match "foo" as indicated by the string variable names below".  What I mean is "...regex that errantly matches all instances of "foo".  It should only match "foo" as indicated by the string variable names below.  For example, in the first string, matchesSecond, only the second occurrence of foo should be matched. "

I should also add that if "foo" occurs more than once in a sentence that does not contain a filter word, all instances should match. so "This foo has foo twice" should match "foo" twice.
Avatar of Nasir Razzaq
What about manual string matching in a loop?
Split the text on . may be and then use contains method.

if str.contains("foo") then

end if
(sorry for the late follow-up, I was caught up in something and forgot about it)

> I need a regular expression that would match the word "foo" in the second sentence, but not in the first or third.

and:

> It should only match "foo" as indicated by the string variable names below.  

not sure if it is clear now. Your second string says matchesFirstOnly, but that does not comply with the first statement above, as it is in the second sentence. Let met try if I understand it:

> This sentence has a the word foo in it.  This bar, also has foo.  Does this question have "foo" in it?
                                                              ^
(because the first and last don't match because they are preceded by "sentence" or "question")
 
> Foo statement doesn't have filter word.  Sentence two is foo!  Is question a foo filter?";
  ^
(because the name of the variable says "matchesFirstOnly", or because sentence two and three are out, because preceded with "sentence" and "question"?
 
> This sentence shouldn't match foo:  This one should match foo.   In this example, foo should match too.
                                                            ^                       ^
(the first does not match, it has "sentence" preceding foo, the second does not, so it must match, the third does not, so it must match too)

Open in new window

All the above matches would be found when I would create a regular expression that would match "foo" if not preceded by the word "sentence" or "question". If that is true, we can ignore your idea about "must match in second sentence, not in first or third".

For the follow-ups, I will replace the word "sentence" with "searchtermA" and "question" with "searchtermB", because I am under the impression that the choice of search terms meddles with understanding the problem (the question is how to find foo in a sentence where sentence or question is not preceding foo in the sentence, savvy? ;-)

in other words, this is a tad bit more readable: "the question is how to find foo in a sentence where searchTermA and searchTermB are not preceding foo". I think that explains it correctly, at least for the three terms above.

I'll be back with a new try on that last requirement.
Ok, here you go. The trick I pulled, to make it manageable, is to find all sentences containing searchTerm followed by foo. This is straightforward, and we need the opposite. So we replace all we find with nothing and the result is searched again for the presence of foo. Not too hard, eventually, it takes a little while to find the twist though. Here it is (code could be prettier, but I leave that to you, please run it first as-is and only then try to change it to your real requirements).

It prints this to the debug out window:

line contains:   This bar, also has foo.
line contains: Foo statement doesn't have filter word.
line contains:   This one should match foo.
line contains:    In this example, foo should match too.

string matchesSecond = "This searchTermA has a the word foo in it.  This bar, also has foo.  Does this searchTermB have \"foo\" in it?";
string matchesFirstOnly = "Foo statement doesn't have filter word.  SearchtermA two is foo!  Is searchTermB a foo filter?";
string matchesSecondAndThird = "This searchTermA shouldn't match foo:  This one should match foo.   In this example, foo should match too.";
 
Regex testRegex = new Regex(@"([^.?!:;]*(?:searchTermA|searchTermB)[^.?!:;]*foo[^.?!:;]*[.?!:;])", RegexOptions.IgnoreCase);
Regex fooFinder = new Regex("[^.?!:;]*foo[^.?!:;]*[.?!:;]", RegexOptions.IgnoreCase);
 
foreach (Match match in fooFinder.Matches(testRegex.Replace(matchesSecond, "")))
    Debug.WriteLine("line contains: " + match.Value);
 
foreach (Match match in fooFinder.Matches(testRegex.Replace(matchesFirstOnly, "")))
    Debug.WriteLine("line contains: " + match.Value);
 
foreach (Match match in fooFinder.Matches(testRegex.Replace(matchesSecondAndThird, "")))
    Debug.WriteLine("line contains: " + match.Value);

Open in new window

note, that you used [\.] in your expressions, this should be [.] as the dot does not have special meaning inside character classes.
BTW: the above approach is probably easier if you follow CodeCruiser's idea and split the string on [.?!:;] and then check whether it matches "(?:searchTermA|searchTermB).*foo". If it does not match that, you should check whether it matches "foo" and then you have a match. You can do that in one go as follows, using a negative look-forward or a look-behind. Here's the latter, which only works on .NET and not in any other regex flavor, using some LINQ for ease of coding. It will print:

Result:   This one should match foo
Result:    In this example, foo should match too

var results = from s in matchesSecondAndThird.Split(".;:?!".ToCharArray())
                where Regex.IsMatch(s, "(?<!(?:searchTermA|searchTermB).*)foo")
                select s;
 
foreach (string s in results)
    Debug.WriteLine("Result: " + s);

Open in new window

BTW2: the reason your earlier attempt on using "lookaround" did not work, is because it is so hard to create an "anchor". If you use look-forward or look-behind, it is very important to know how a regular expression engine works and how backtracking is implemented, to also understand how look around works. This is a trivial case when you split the string, but the anchoring problem (where should the regex start the subregex?) is very hard to implement correctly if you don't split up. Without anchoring and when you reverse the match (search for everything that's wrong) you make it doable (in this solution).
No worries on the late reply.  In example string two, the one named matchesFirstOnly, "foo"  should be matched in the first sentence only because it is an occurrence of the word "foo"  that is not preceded by a filter word.  The other sentences in the string contain the word "foo"but preceded by the filter word in the same sentence.

I thought about taking your approach to eliminate sentences containing filter words or using codecruiser's tactic, but as I mentioned I can't change the internals without weeks of rework.  The exact goal is to detect the index of the occurrence of "foo" not preceded by filter words in the same sentence, then push it on for further processing.  This regex will be one among thousands stored in a DB that will be doing other kinds of word identification.  Unfortunately, I can't change the internals to use another method without reworking how all of these other regexes are used.  The process would be prohibitive.  

Without overloading you with unrelated details, it must return a match on the word foo so I can use it's index to look it up in another source.

It's beginning to sound like establishing an anchor is impossible?  Is there no statement that can look left to right and anchor when it matches "foo"  then backtrack until it finds a punctuation or string start to make sure there is no filter word?  

I could potentially set the RightToLeft RegexOption if it would somehow help?
ASKER CERTIFIED SOLUTION
Avatar of abel
abel
Flag of Netherlands image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
You are kidding me.  Sorry if the forehead slapping sound startled you.

The issue was that for testing, it's much faster to use my regex notepad than to run everything through the .Net sandbox.  Being a regex novice, I didn't realize the extent to which the engines could differ.  Score one for Microsoft!

Thanks for the assist.  never would of known the answer was hiding in plain sight without your insights.
See my last comment on the page.  Also thanks for the informative explanations.
Thanks for the compliment. Rereading the thread shows me where I went wrong (the third sentence prb), now that I understood that requirements fully, it became apparent that you were well on the right way yourself all along.

Indeed, the regular expression engines are largely different, and luckily there's also consensus, the PCRE (perl style) regular expressions. It is at least much better then, say, 10 years ago, when the differences were unsurmountable and many languages didn't even have regex engines built in.

More info on regular expressions is on http://www.regular-expressions.info (contains engine comparison tables as well) and a must read if you do this regularly: Mastering Regular Expressions by Jeffrey Friedl.

Glad to be of help and tx for the points ;-)
I'm getting more familiar with the site all the time.  It's the best resource on the web;)

Here's the final solution for future readers
(?<=(?:(?:[Ff]ilterOne|[Ff]ilterTwo)(?:[^.?!:;](?!foo))+)\s)foo

Open in new window

Looks good on the first view. If you are using .NET regexes, consider using the "n" option, then every normal parenthesis is not a capturing one anymore, it makes it more readable in cases like yours: http://msdn.microsoft.com/en-us/library/yd1hzczs.aspx