Solved

Regular expression to identify a word in a sentence that does not previously contain another word

Posted on 2009-05-18
24
1,178 Views
Last Modified: 2012-05-07
I need a regular expression that would match the word "foo" in the second sentence, but not in the first two.  The criteria would be that "foo" can only be matched if it is not followed by any of several words.  In this example they filter words might be "sentence" and/or "question".  The filter word might immediately precede the match word, but if it occurs in the previous sentence it should not be considered.

This sentence has a the word foo in it.  This bar, also has foo.  Does this question have "foo" in it?

I tried the following, but it always matches "foo".  Given that a negative lookbehind may not be able to contain a regex, is this even possible.  It will run in the .Net Regex engine.
(?<!(?:sentence|question)[^\.?!:;]+)foo

Open in new window

0
Comment
Question by:LarryLoogla
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 14
  • 7
24 Comments
 
LVL 39

Expert Comment

by:abel
ID: 24419361
What you need is a negative look forward (they are zero width, the are). The following works for me:

string threeSentenceNoMatch = "first sentence with foo\nsecond foo sentence\na sentence with foo";
string threeSentenceMatch = "first lecture with foo\nsecond foo leture\na lecture with foo";
string threeSentenceMatch2 = "first sentence with foo\nsecond foo sentence\na foo sentence";
if (Regex.IsMatch(threeSentenceNoMatch, "(.*\n){2}(?!.*sentence(?=.*foo)).*foo"))
    Debug.WriteLine("wrong! should not match");
if (Regex.IsMatch(threeSentenceMatch, "(.*\n){2}(?!.*sentence(?=.*foo)).*foo"))
    Debug.WriteLine("coorect! should match");
if (Regex.IsMatch(threeSentenceMatch2, "(.*\n){2}(?!.*sentence(?=.*foo)).*foo"))
    Debug.WriteLine("coorect! should match");

Open in new window

0
 
LVL 39

Expert Comment

by:abel
ID: 24419378
So, after two sentences, we arrive at the start of the third sentence. Then we ask "is there a word "sentence" in this part?". If the answer is yes, we ask again "is it followed by "foo"? If it is, the whole thing is "negative look forward", which means the match will fail. If it is not (i.e., there is no "sentence" or there is no "foo" following the word "sentence") then the match will succeed.

On my test, the output was too lines with "coorect! should match".
0
 
LVL 39

Expert Comment

by:abel
ID: 24419381
Important note: the above does NOT work with the "s" modifiers as they change the meaning of the dot to include the \n.
0
Independent Software Vendors: We Want Your Opinion

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 

Author Comment

by:LarryLoogla
ID: 24419542
Hi abel,

I see now how the positive lookahead (?=.*foo)) wrapped in the negative lookahead could work.  Unfortunately, one of my constraints is that the text will be in paragraph form, not individual sentences.  I can't change these internals.  The sentence should match any occurrences of "foo" in a sentence where it is not preceded by "sentence" or "question".

The goal would be to match "foo" in the second sentence only.  I tried to apply your method, but it matches all three "foo"s.  What am I doing wrong?
string threeSentence = "This sentence has a the word foo in it.  This bar, also has foo.  Does this question have \"foo\" in it?";
            Regex regex = new Regex(@"(?!(?:sentence|question)(?=[^\.?!:;]+foo))foo");
            foreach (Match match in regex.Matches(threeSentence))
            {
                Debug.WriteLine("Word foo at index " + match.Index.ToString() + " has text " + match.Value);
            }

Open in new window

0
 
LVL 39

Expert Comment

by:abel
ID: 24419654
You mean, a sentence is a real sentence, not one separated by newlines. Can you show me three examples, like I did, so that I can be sure of amending the regex correctly for you? Make sure they represent your case(s) well.
0
 
LVL 39

Expert Comment

by:abel
ID: 24420239
There's something else I don't understand. You now say second sentence, and you said this: "I need a regular expression that would match the word "foo" in the second sentence, but not in the first two. " It cannot both not be in the first two and match in the second... which is why I came up with three sentences.
0
 

Author Comment

by:LarryLoogla
ID: 24424822
Abel,

Sorry for the error and ambiguity. You are absolutely right.  I caught my mistake after submitting.  It should have read  I need a regular expression that would match the word "foo" in the second sentence, but not in the first or third.


 The goal would be to match "foo" in the sentences that don't contain the filter words "sentence" or "question".  Sentences are separated by normal punctuation characters, i.e. ".?!:; and combined with other sentences in the text to be matched.  The filter words are arbitrary and will be a long list of words in practice.


Here are some samples and my stab at a regex that matches all instances of "foo".  It should match "foo" as indicated by the string variable names below.  Each string contains three sentences.



string matchesSecond = "This sentence has a the word foo in it.  This bar, also has foo.  Does this question have \"foo\" in it?";
 
string matchesFirstOnly = "Foo statement doesn't have filter word.  Sentence two is foo!  Is question a foo filter?"
 
string matchesSecondAndThird = "This sentence shouldn't match foo:  This one should match foo.   In this example, foo should match too."
 
Regex testRegex = new Regex(@"(?!(?:[Ss]entence|[Qq]uestion)(?=[^\.?!:;]+foo))[Ff]oo");

Open in new window

0
 

Author Comment

by:LarryLoogla
ID: 24424926
To clarify my statement "Here are some samples and my stab at a regex that matches all instances of "foo".  It should match "foo" as indicated by the string variable names below".  What I mean is "...regex that errantly matches all instances of "foo".  It should only match "foo" as indicated by the string variable names below.  For example, in the first string, matchesSecond, only the second occurrence of foo should be matched. "

I should also add that if "foo" occurs more than once in a sentence that does not contain a filter word, all instances should match. so "This foo has foo twice" should match "foo" twice.
0
 
LVL 83

Expert Comment

by:CodeCruiser
ID: 24438829
What about manual string matching in a loop?
Split the text on . may be and then use contains method.

if str.contains("foo") then

end if
0
 
LVL 39

Expert Comment

by:abel
ID: 24439322
(sorry for the late follow-up, I was caught up in something and forgot about it)

> I need a regular expression that would match the word "foo" in the second sentence, but not in the first or third.

and:

> It should only match "foo" as indicated by the string variable names below.  

not sure if it is clear now. Your second string says matchesFirstOnly, but that does not comply with the first statement above, as it is in the second sentence. Let met try if I understand it:

> This sentence has a the word foo in it.  This bar, also has foo.  Does this question have "foo" in it?
                                                              ^
(because the first and last don't match because they are preceded by "sentence" or "question")
 
> Foo statement doesn't have filter word.  Sentence two is foo!  Is question a foo filter?";
  ^
(because the name of the variable says "matchesFirstOnly", or because sentence two and three are out, because preceded with "sentence" and "question"?
 
> This sentence shouldn't match foo:  This one should match foo.   In this example, foo should match too.
                                                            ^                       ^
(the first does not match, it has "sentence" preceding foo, the second does not, so it must match, the third does not, so it must match too)

Open in new window

0
 
LVL 39

Expert Comment

by:abel
ID: 24439357
All the above matches would be found when I would create a regular expression that would match "foo" if not preceded by the word "sentence" or "question". If that is true, we can ignore your idea about "must match in second sentence, not in first or third".

For the follow-ups, I will replace the word "sentence" with "searchtermA" and "question" with "searchtermB", because I am under the impression that the choice of search terms meddles with understanding the problem (the question is how to find foo in a sentence where sentence or question is not preceding foo in the sentence, savvy? ;-)

in other words, this is a tad bit more readable: "the question is how to find foo in a sentence where searchTermA and searchTermB are not preceding foo". I think that explains it correctly, at least for the three terms above.

I'll be back with a new try on that last requirement.
0
 
LVL 39

Expert Comment

by:abel
ID: 24439570
Ok, here you go. The trick I pulled, to make it manageable, is to find all sentences containing searchTerm followed by foo. This is straightforward, and we need the opposite. So we replace all we find with nothing and the result is searched again for the presence of foo. Not too hard, eventually, it takes a little while to find the twist though. Here it is (code could be prettier, but I leave that to you, please run it first as-is and only then try to change it to your real requirements).

It prints this to the debug out window:

line contains:   This bar, also has foo.
line contains: Foo statement doesn't have filter word.
line contains:   This one should match foo.
line contains:    In this example, foo should match too.

string matchesSecond = "This searchTermA has a the word foo in it.  This bar, also has foo.  Does this searchTermB have \"foo\" in it?";
string matchesFirstOnly = "Foo statement doesn't have filter word.  SearchtermA two is foo!  Is searchTermB a foo filter?";
string matchesSecondAndThird = "This searchTermA shouldn't match foo:  This one should match foo.   In this example, foo should match too.";
 
Regex testRegex = new Regex(@"([^.?!:;]*(?:searchTermA|searchTermB)[^.?!:;]*foo[^.?!:;]*[.?!:;])", RegexOptions.IgnoreCase);
Regex fooFinder = new Regex("[^.?!:;]*foo[^.?!:;]*[.?!:;]", RegexOptions.IgnoreCase);
 
foreach (Match match in fooFinder.Matches(testRegex.Replace(matchesSecond, "")))
    Debug.WriteLine("line contains: " + match.Value);
 
foreach (Match match in fooFinder.Matches(testRegex.Replace(matchesFirstOnly, "")))
    Debug.WriteLine("line contains: " + match.Value);
 
foreach (Match match in fooFinder.Matches(testRegex.Replace(matchesSecondAndThird, "")))
    Debug.WriteLine("line contains: " + match.Value);

Open in new window

0
 
LVL 39

Expert Comment

by:abel
ID: 24439578
note, that you used [\.] in your expressions, this should be [.] as the dot does not have special meaning inside character classes.
0
 
LVL 39

Expert Comment

by:abel
ID: 24440201
BTW: the above approach is probably easier if you follow CodeCruiser's idea and split the string on [.?!:;] and then check whether it matches "(?:searchTermA|searchTermB).*foo". If it does not match that, you should check whether it matches "foo" and then you have a match. You can do that in one go as follows, using a negative look-forward or a look-behind. Here's the latter, which only works on .NET and not in any other regex flavor, using some LINQ for ease of coding. It will print:

Result:   This one should match foo
Result:    In this example, foo should match too

var results = from s in matchesSecondAndThird.Split(".;:?!".ToCharArray())
                where Regex.IsMatch(s, "(?<!(?:searchTermA|searchTermB).*)foo")
                select s;
 
foreach (string s in results)
    Debug.WriteLine("Result: " + s);

Open in new window

0
 
LVL 39

Expert Comment

by:abel
ID: 24440229
BTW2: the reason your earlier attempt on using "lookaround" did not work, is because it is so hard to create an "anchor". If you use look-forward or look-behind, it is very important to know how a regular expression engine works and how backtracking is implemented, to also understand how look around works. This is a trivial case when you split the string, but the anchoring problem (where should the regex start the subregex?) is very hard to implement correctly if you don't split up. Without anchoring and when you reverse the match (search for everything that's wrong) you make it doable (in this solution).
0
 

Author Comment

by:LarryLoogla
ID: 24443761
No worries on the late reply.  In example string two, the one named matchesFirstOnly, "foo"  should be matched in the first sentence only because it is an occurrence of the word "foo"  that is not preceded by a filter word.  The other sentences in the string contain the word "foo"but preceded by the filter word in the same sentence.

I thought about taking your approach to eliminate sentences containing filter words or using codecruiser's tactic, but as I mentioned I can't change the internals without weeks of rework.  The exact goal is to detect the index of the occurrence of "foo" not preceded by filter words in the same sentence, then push it on for further processing.  This regex will be one among thousands stored in a DB that will be doing other kinds of word identification.  Unfortunately, I can't change the internals to use another method without reworking how all of these other regexes are used.  The process would be prohibitive.  

Without overloading you with unrelated details, it must return a match on the word foo so I can use it's index to look it up in another source.

It's beginning to sound like establishing an anchor is impossible?  Is there no statement that can look left to right and anchor when it matches "foo"  then backtrack until it finds a punctuation or string start to make sure there is no filter word?  

I could potentially set the RightToLeft RegexOption if it would somehow help?
0
 
LVL 39

Accepted Solution

by:
abel earned 500 total points
ID: 24446591
Maybe I can make you happy after all. I found out (didn't know before) that the regex engine of .NET supports regular expressions inside look-behind expressions, and it's the only one so far that I know of.

The code below is scaringly equal to the code that you tried yourself. In fact, it IS equal, and I'm breaking my head about it why I didn't try your own regex first. In fact, it just works (even with the wrong \. in it). So I'm wondering how your system perhaps wraps up the regex or how it is added to an even large regex that it doesn't work for you. Here's what I tried and what printed the following to my debug window, which is correct I believe:
found: 'foo' at position 63
found: 'Foo' at position 0
found: 'foo' at position 61
found: 'foo' at position 85


- Abel --

string matchesSecond = "This searchTermA has a the word foo in it.  This bar, also has foo.  Does this searchTermB have \"foo\" in it?";
string matchesFirstOnly = "Foo statement doesn't have filter word.  SearchtermA two is foo!  Is searchTermB a foo filter?";
string matchesSecondAndThird = "This searchTermA shouldn't match foo:  This one should match foo.   In this example, foo should match too.";
 
 
Regex newRegex = new Regex("(?<!(?:searchTermA|searchTermB)[^.?!:;]*)foo", RegexOptions.IgnoreCase);
foreach(string s in new string[] {matchesSecond, matchesFirstOnly, matchesSecondAndThird})
{
    foreach (Match match in newRegex.Matches(s))
    {
        Debug.WriteLine("found: '" + match.Value + "' at position " + match.Index); 
    }
}

Open in new window

0
 

Author Comment

by:LarryLoogla
ID: 24447796
You are kidding me.  Sorry if the forehead slapping sound startled you.

The issue was that for testing, it's much faster to use my regex notepad than to run everything through the .Net sandbox.  Being a regex novice, I didn't realize the extent to which the engines could differ.  Score one for Microsoft!

Thanks for the assist.  never would of known the answer was hiding in plain sight without your insights.
0
 

Author Closing Comment

by:LarryLoogla
ID: 31582880
See my last comment on the page.  Also thanks for the informative explanations.
0
 
LVL 39

Expert Comment

by:abel
ID: 24448922
Thanks for the compliment. Rereading the thread shows me where I went wrong (the third sentence prb), now that I understood that requirements fully, it became apparent that you were well on the right way yourself all along.

Indeed, the regular expression engines are largely different, and luckily there's also consensus, the PCRE (perl style) regular expressions. It is at least much better then, say, 10 years ago, when the differences were unsurmountable and many languages didn't even have regex engines built in.

More info on regular expressions is on http://www.regular-expressions.info (contains engine comparison tables as well) and a must read if you do this regularly: Mastering Regular Expressions by Jeffrey Friedl.

Glad to be of help and tx for the points ;-)
0
 

Author Comment

by:LarryLoogla
ID: 24477001
I'm getting more familiar with the site all the time.  It's the best resource on the web;)

Here's the final solution for future readers
(?<=(?:(?:[Ff]ilterOne|[Ff]ilterTwo)(?:[^.?!:;](?!foo))+)\s)foo

Open in new window

0
 
LVL 39

Expert Comment

by:abel
ID: 24478860
Looks good on the first view. If you are using .NET regexes, consider using the "n" option, then every normal parenthesis is not a capturing one anymore, it makes it more readable in cases like yours: http://msdn.microsoft.com/en-us/library/yd1hzczs.aspx
0

Featured Post

Industry Leaders: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

A short article about problems I had with the new location API and permissions in Marshmallow
It was really hard time for me to get the understanding of Delegates in C#. I went through many websites and articles but I found them very clumsy. After going through those sites, I noted down the points in a easy way so here I am sharing that unde…
With the power of JIRA, there's an unlimited number of ways you can customize it, use it and benefit from it. With that in mind, there's bound to be things that I wasn't able to cover in this course. With this summary we'll look at some places to go…
Progress

734 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question