Solved

Regular expression to identify a word in a sentence that does not previously contain another word

Posted on 2009-05-18
24
1,144 Views
Last Modified: 2012-05-07
I need a regular expression that would match the word "foo" in the second sentence, but not in the first two.  The criteria would be that "foo" can only be matched if it is not followed by any of several words.  In this example they filter words might be "sentence" and/or "question".  The filter word might immediately precede the match word, but if it occurs in the previous sentence it should not be considered.

This sentence has a the word foo in it.  This bar, also has foo.  Does this question have "foo" in it?

I tried the following, but it always matches "foo".  Given that a negative lookbehind may not be able to contain a regex, is this even possible.  It will run in the .Net Regex engine.
(?<!(?:sentence|question)[^\.?!:;]+)foo

Open in new window

0
Comment
Question by:LarryLoogla
  • 14
  • 7
24 Comments
 
LVL 39

Expert Comment

by:abel
Comment Utility
What you need is a negative look forward (they are zero width, the are). The following works for me:

string threeSentenceNoMatch = "first sentence with foo\nsecond foo sentence\na sentence with foo";

string threeSentenceMatch = "first lecture with foo\nsecond foo leture\na lecture with foo";

string threeSentenceMatch2 = "first sentence with foo\nsecond foo sentence\na foo sentence";

if (Regex.IsMatch(threeSentenceNoMatch, "(.*\n){2}(?!.*sentence(?=.*foo)).*foo"))

    Debug.WriteLine("wrong! should not match");

if (Regex.IsMatch(threeSentenceMatch, "(.*\n){2}(?!.*sentence(?=.*foo)).*foo"))

    Debug.WriteLine("coorect! should match");

if (Regex.IsMatch(threeSentenceMatch2, "(.*\n){2}(?!.*sentence(?=.*foo)).*foo"))

    Debug.WriteLine("coorect! should match");

Open in new window

0
 
LVL 39

Expert Comment

by:abel
Comment Utility
So, after two sentences, we arrive at the start of the third sentence. Then we ask "is there a word "sentence" in this part?". If the answer is yes, we ask again "is it followed by "foo"? If it is, the whole thing is "negative look forward", which means the match will fail. If it is not (i.e., there is no "sentence" or there is no "foo" following the word "sentence") then the match will succeed.

On my test, the output was too lines with "coorect! should match".
0
 
LVL 39

Expert Comment

by:abel
Comment Utility
Important note: the above does NOT work with the "s" modifiers as they change the meaning of the dot to include the \n.
0
 

Author Comment

by:LarryLoogla
Comment Utility
Hi abel,

I see now how the positive lookahead (?=.*foo)) wrapped in the negative lookahead could work.  Unfortunately, one of my constraints is that the text will be in paragraph form, not individual sentences.  I can't change these internals.  The sentence should match any occurrences of "foo" in a sentence where it is not preceded by "sentence" or "question".

The goal would be to match "foo" in the second sentence only.  I tried to apply your method, but it matches all three "foo"s.  What am I doing wrong?
string threeSentence = "This sentence has a the word foo in it.  This bar, also has foo.  Does this question have \"foo\" in it?";

            Regex regex = new Regex(@"(?!(?:sentence|question)(?=[^\.?!:;]+foo))foo");

            foreach (Match match in regex.Matches(threeSentence))

            {

                Debug.WriteLine("Word foo at index " + match.Index.ToString() + " has text " + match.Value);

            }

Open in new window

0
 
LVL 39

Expert Comment

by:abel
Comment Utility
You mean, a sentence is a real sentence, not one separated by newlines. Can you show me three examples, like I did, so that I can be sure of amending the regex correctly for you? Make sure they represent your case(s) well.
0
 
LVL 39

Expert Comment

by:abel
Comment Utility
There's something else I don't understand. You now say second sentence, and you said this: "I need a regular expression that would match the word "foo" in the second sentence, but not in the first two. " It cannot both not be in the first two and match in the second... which is why I came up with three sentences.
0
 

Author Comment

by:LarryLoogla
Comment Utility
Abel,

Sorry for the error and ambiguity. You are absolutely right.  I caught my mistake after submitting.  It should have read  I need a regular expression that would match the word "foo" in the second sentence, but not in the first or third.


 The goal would be to match "foo" in the sentences that don't contain the filter words "sentence" or "question".  Sentences are separated by normal punctuation characters, i.e. ".?!:; and combined with other sentences in the text to be matched.  The filter words are arbitrary and will be a long list of words in practice.


Here are some samples and my stab at a regex that matches all instances of "foo".  It should match "foo" as indicated by the string variable names below.  Each string contains three sentences.



string matchesSecond = "This sentence has a the word foo in it.  This bar, also has foo.  Does this question have \"foo\" in it?";
 

string matchesFirstOnly = "Foo statement doesn't have filter word.  Sentence two is foo!  Is question a foo filter?"
 

string matchesSecondAndThird = "This sentence shouldn't match foo:  This one should match foo.   In this example, foo should match too."
 

Regex testRegex = new Regex(@"(?!(?:[Ss]entence|[Qq]uestion)(?=[^\.?!:;]+foo))[Ff]oo");

Open in new window

0
 

Author Comment

by:LarryLoogla
Comment Utility
To clarify my statement "Here are some samples and my stab at a regex that matches all instances of "foo".  It should match "foo" as indicated by the string variable names below".  What I mean is "...regex that errantly matches all instances of "foo".  It should only match "foo" as indicated by the string variable names below.  For example, in the first string, matchesSecond, only the second occurrence of foo should be matched. "

I should also add that if "foo" occurs more than once in a sentence that does not contain a filter word, all instances should match. so "This foo has foo twice" should match "foo" twice.
0
 
LVL 83

Expert Comment

by:CodeCruiser
Comment Utility
What about manual string matching in a loop?
Split the text on . may be and then use contains method.

if str.contains("foo") then

end if
0
 
LVL 39

Expert Comment

by:abel
Comment Utility
(sorry for the late follow-up, I was caught up in something and forgot about it)

> I need a regular expression that would match the word "foo" in the second sentence, but not in the first or third.

and:

> It should only match "foo" as indicated by the string variable names below.  

not sure if it is clear now. Your second string says matchesFirstOnly, but that does not comply with the first statement above, as it is in the second sentence. Let met try if I understand it:

> This sentence has a the word foo in it.  This bar, also has foo.  Does this question have "foo" in it?

                                                              ^

(because the first and last don't match because they are preceded by "sentence" or "question")
 

> Foo statement doesn't have filter word.  Sentence two is foo!  Is question a foo filter?";

  ^

(because the name of the variable says "matchesFirstOnly", or because sentence two and three are out, because preceded with "sentence" and "question"?
 

> This sentence shouldn't match foo:  This one should match foo.   In this example, foo should match too.

                                                            ^                       ^

(the first does not match, it has "sentence" preceding foo, the second does not, so it must match, the third does not, so it must match too)

Open in new window

0
 
LVL 39

Expert Comment

by:abel
Comment Utility
All the above matches would be found when I would create a regular expression that would match "foo" if not preceded by the word "sentence" or "question". If that is true, we can ignore your idea about "must match in second sentence, not in first or third".

For the follow-ups, I will replace the word "sentence" with "searchtermA" and "question" with "searchtermB", because I am under the impression that the choice of search terms meddles with understanding the problem (the question is how to find foo in a sentence where sentence or question is not preceding foo in the sentence, savvy? ;-)

in other words, this is a tad bit more readable: "the question is how to find foo in a sentence where searchTermA and searchTermB are not preceding foo". I think that explains it correctly, at least for the three terms above.

I'll be back with a new try on that last requirement.
0
How your wiki can always stay up-to-date

Quip doubles as a “living” wiki and a project management tool that evolves with your organization. As you finish projects in Quip, the work remains, easily accessible to all team members, new and old.
- Increase transparency
- Onboard new hires faster
- Access from mobile/offline

 
LVL 39

Expert Comment

by:abel
Comment Utility
Ok, here you go. The trick I pulled, to make it manageable, is to find all sentences containing searchTerm followed by foo. This is straightforward, and we need the opposite. So we replace all we find with nothing and the result is searched again for the presence of foo. Not too hard, eventually, it takes a little while to find the twist though. Here it is (code could be prettier, but I leave that to you, please run it first as-is and only then try to change it to your real requirements).

It prints this to the debug out window:

line contains:   This bar, also has foo.
line contains: Foo statement doesn't have filter word.
line contains:   This one should match foo.
line contains:    In this example, foo should match too.

string matchesSecond = "This searchTermA has a the word foo in it.  This bar, also has foo.  Does this searchTermB have \"foo\" in it?";

string matchesFirstOnly = "Foo statement doesn't have filter word.  SearchtermA two is foo!  Is searchTermB a foo filter?";

string matchesSecondAndThird = "This searchTermA shouldn't match foo:  This one should match foo.   In this example, foo should match too.";
 

Regex testRegex = new Regex(@"([^.?!:;]*(?:searchTermA|searchTermB)[^.?!:;]*foo[^.?!:;]*[.?!:;])", RegexOptions.IgnoreCase);

Regex fooFinder = new Regex("[^.?!:;]*foo[^.?!:;]*[.?!:;]", RegexOptions.IgnoreCase);
 

foreach (Match match in fooFinder.Matches(testRegex.Replace(matchesSecond, "")))

    Debug.WriteLine("line contains: " + match.Value);
 

foreach (Match match in fooFinder.Matches(testRegex.Replace(matchesFirstOnly, "")))

    Debug.WriteLine("line contains: " + match.Value);
 

foreach (Match match in fooFinder.Matches(testRegex.Replace(matchesSecondAndThird, "")))

    Debug.WriteLine("line contains: " + match.Value);

Open in new window

0
 
LVL 39

Expert Comment

by:abel
Comment Utility
note, that you used [\.] in your expressions, this should be [.] as the dot does not have special meaning inside character classes.
0
 
LVL 39

Expert Comment

by:abel
Comment Utility
BTW: the above approach is probably easier if you follow CodeCruiser's idea and split the string on [.?!:;] and then check whether it matches "(?:searchTermA|searchTermB).*foo". If it does not match that, you should check whether it matches "foo" and then you have a match. You can do that in one go as follows, using a negative look-forward or a look-behind. Here's the latter, which only works on .NET and not in any other regex flavor, using some LINQ for ease of coding. It will print:

Result:   This one should match foo
Result:    In this example, foo should match too

var results = from s in matchesSecondAndThird.Split(".;:?!".ToCharArray())

                where Regex.IsMatch(s, "(?<!(?:searchTermA|searchTermB).*)foo")

                select s;
 

foreach (string s in results)

    Debug.WriteLine("Result: " + s);

Open in new window

0
 
LVL 39

Expert Comment

by:abel
Comment Utility
BTW2: the reason your earlier attempt on using "lookaround" did not work, is because it is so hard to create an "anchor". If you use look-forward or look-behind, it is very important to know how a regular expression engine works and how backtracking is implemented, to also understand how look around works. This is a trivial case when you split the string, but the anchoring problem (where should the regex start the subregex?) is very hard to implement correctly if you don't split up. Without anchoring and when you reverse the match (search for everything that's wrong) you make it doable (in this solution).
0
 

Author Comment

by:LarryLoogla
Comment Utility
No worries on the late reply.  In example string two, the one named matchesFirstOnly, "foo"  should be matched in the first sentence only because it is an occurrence of the word "foo"  that is not preceded by a filter word.  The other sentences in the string contain the word "foo"but preceded by the filter word in the same sentence.

I thought about taking your approach to eliminate sentences containing filter words or using codecruiser's tactic, but as I mentioned I can't change the internals without weeks of rework.  The exact goal is to detect the index of the occurrence of "foo" not preceded by filter words in the same sentence, then push it on for further processing.  This regex will be one among thousands stored in a DB that will be doing other kinds of word identification.  Unfortunately, I can't change the internals to use another method without reworking how all of these other regexes are used.  The process would be prohibitive.  

Without overloading you with unrelated details, it must return a match on the word foo so I can use it's index to look it up in another source.

It's beginning to sound like establishing an anchor is impossible?  Is there no statement that can look left to right and anchor when it matches "foo"  then backtrack until it finds a punctuation or string start to make sure there is no filter word?  

I could potentially set the RightToLeft RegexOption if it would somehow help?
0
 
LVL 39

Accepted Solution

by:
abel earned 500 total points
Comment Utility
Maybe I can make you happy after all. I found out (didn't know before) that the regex engine of .NET supports regular expressions inside look-behind expressions, and it's the only one so far that I know of.

The code below is scaringly equal to the code that you tried yourself. In fact, it IS equal, and I'm breaking my head about it why I didn't try your own regex first. In fact, it just works (even with the wrong \. in it). So I'm wondering how your system perhaps wraps up the regex or how it is added to an even large regex that it doesn't work for you. Here's what I tried and what printed the following to my debug window, which is correct I believe:
found: 'foo' at position 63
found: 'Foo' at position 0
found: 'foo' at position 61
found: 'foo' at position 85


- Abel --

string matchesSecond = "This searchTermA has a the word foo in it.  This bar, also has foo.  Does this searchTermB have \"foo\" in it?";

string matchesFirstOnly = "Foo statement doesn't have filter word.  SearchtermA two is foo!  Is searchTermB a foo filter?";

string matchesSecondAndThird = "This searchTermA shouldn't match foo:  This one should match foo.   In this example, foo should match too.";
 
 

Regex newRegex = new Regex("(?<!(?:searchTermA|searchTermB)[^.?!:;]*)foo", RegexOptions.IgnoreCase);

foreach(string s in new string[] {matchesSecond, matchesFirstOnly, matchesSecondAndThird})

{

    foreach (Match match in newRegex.Matches(s))

    {

        Debug.WriteLine("found: '" + match.Value + "' at position " + match.Index); 

    }

}

Open in new window

0
 

Author Comment

by:LarryLoogla
Comment Utility
You are kidding me.  Sorry if the forehead slapping sound startled you.

The issue was that for testing, it's much faster to use my regex notepad than to run everything through the .Net sandbox.  Being a regex novice, I didn't realize the extent to which the engines could differ.  Score one for Microsoft!

Thanks for the assist.  never would of known the answer was hiding in plain sight without your insights.
0
 

Author Closing Comment

by:LarryLoogla
Comment Utility
See my last comment on the page.  Also thanks for the informative explanations.
0
 
LVL 39

Expert Comment

by:abel
Comment Utility
Thanks for the compliment. Rereading the thread shows me where I went wrong (the third sentence prb), now that I understood that requirements fully, it became apparent that you were well on the right way yourself all along.

Indeed, the regular expression engines are largely different, and luckily there's also consensus, the PCRE (perl style) regular expressions. It is at least much better then, say, 10 years ago, when the differences were unsurmountable and many languages didn't even have regex engines built in.

More info on regular expressions is on http://www.regular-expressions.info (contains engine comparison tables as well) and a must read if you do this regularly: Mastering Regular Expressions by Jeffrey Friedl.

Glad to be of help and tx for the points ;-)
0
 

Author Comment

by:LarryLoogla
Comment Utility
I'm getting more familiar with the site all the time.  It's the best resource on the web;)

Here's the final solution for future readers
(?<=(?:(?:[Ff]ilterOne|[Ff]ilterTwo)(?:[^.?!:;](?!foo))+)\s)foo

Open in new window

0
 
LVL 39

Expert Comment

by:abel
Comment Utility
Looks good on the first view. If you are using .NET regexes, consider using the "n" option, then every normal parenthesis is not a capturing one anymore, it makes it more readable in cases like yours: http://msdn.microsoft.com/en-us/library/yd1hzczs.aspx
0

Featured Post

How to run any project with ease

Manage projects of all sizes how you want. Great for personal to-do lists, project milestones, team priorities and launch plans.
- Combine task lists, docs, spreadsheets, and chat in one
- View and edit from mobile/offline
- Cut down on emails

Join & Write a Comment

Suggested Solutions

Title # Comments Views Activity
C# code editing and collaboration 3 39
VB.Net - For Loop Error 5 23
Get String split 5 31
Unable  to create new object 9 19
Displaying an arrayList in a listView using the default adapter is rarely the best solution. To get full control of your display data, and to be able to refresh it after editing, requires the use of a custom adapter.
Whether you've completed a degree in computer sciences or you're a self-taught programmer, writing your first lines of code in the real world is always a challenge. Here are some of the most common pitfalls for new programmers.
An introduction to basic programming syntax in Java by creating a simple program. Viewers can follow the tutorial as they create their first class in Java. Definitions and explanations about each element are given to help prepare viewers for future …
In this fifth video of the Xpdf series, we discuss and demonstrate the PDFdetach utility, which is able to list and, more importantly, extract attachments that are embedded in PDF files. It does this via a command line interface, making it suitable …

771 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

10 Experts available now in Live!

Get 1:1 Help Now