
Regular Expression using negation

Posted on 2005-04-29
Last Modified: 2008-02-01
Ok, yet another regular expression question... (btw, if anyone knows of a good book or resource on regular expressions, feel free to throw it my way)

So, I'm in the process of matching links found in a web page based on 4 types of criteria

Ex links:
match this         -> http://www.test.com/goodinfo/index.html
don't match this -> http://www.test.com/spam/index.html


1) The link starts with a given string (http://)
2) The link contains a given string (test.com)
3) The link ends with a given string (index.html)
4) The link doesn't contain a given string (spam)

I've implemented 1-3

^http://.*test\.com.*index\.html$

 but I'm not sure how to implement 4.  I know how to not match specific individual characters, but I really want to not match a specific string.

Can someone enlighten me?

Thanks! -A.R.
Question by:AaronReams

Expert Comment

by:Jesse Houwing
ID: 13903829

^http://.*test\.com(?!.*SPAMWORD).*index\.html$

This looks forward from the given point and fails if the spamword is found. Keep in mind that the .* in front of the spamword is a performance killer, so if possible you should write it like this:

^http://.*test\.com/(?!SPAMWORD/)[^/]+/index\.html$

And if you need to check for multiple words:

^http://.*test\.com/(?!(?:SPAMWORD|ANOTHERWORD|YETANOTHERWORD)/)[^/]+/index\.html$

If the path is variable in length also consider this:
^http://.*test\.com/(?:(?!SPAMWORD/)[^/]+/)+index\.html$

which will check each part of the path for the spamword, but doesn't need the .*
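
The multi-word and variable-depth patterns above can be combined. Below is a minimal sketch, not a definitive implementation: `SegmentFilter` and `Build` are hypothetical names, and the words "spam" and "ads" are placeholders. It assumes a path segment should be rejected only when it equals a blocked word exactly, which is what the `(?!WORD/)` form checks.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class SegmentFilter
{
    // Hypothetical helper: builds the variable-depth pattern for a list of blocked words.
    public static Regex Build(params string[] blockedWords)
    {
        // Regex.Escape keeps characters such as '.' or '+' inside a word literal.
        string alternation = string.Join("|", blockedWords.Select(w => Regex.Escape(w)));
        string pattern = @"^http://.*test\.com/(?:(?!(?:" + alternation + @")/)[^/]+/)+index\.html$";
        return new Regex(pattern, RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        Regex r = Build("spam", "ads");
        Console.WriteLine(r.IsMatch("http://www.test.com/goodinfo/index.html")); // True
        Console.WriteLine(r.IsMatch("http://www.test.com/a/spam/index.html"));   // False
        Console.WriteLine(r.IsMatch("http://www.test.com/spammy/index.html"));   // True: only whole segments are rejected
    }
}
```

Note that "spammy" passes because the lookahead requires the blocked word to be followed directly by a slash.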

As for the literature suggestions:
- O'Reilly - Mastering Regular Expressions 2nd Edition
- DesaWare - .Net Regular Expressions (ebook only available from their site)

Assisted Solution

by:Bob Learned
Bob Learned earned 200 total points
ID: 13909564
Regex Resources:

Regular Expression Library:
http://www.regexlib.com/

Regular Expression Tutorial:
http://www.regular-expressions.info/

30 minute Regular Expression Tutorial:
http://www.codeproject.com/dotnet/RegexTutorial.asp

Bob

Author Comment

by:AaronReams
ID: 13971283
TheLearnedOne - Thanks for the references.  I was aware of the first and last one, but the middle one looks interesting.

ToAoM - Thanks for the comprehensive answer!   I was trying out various test strings and I ran into one problem though.  I need to discard the URL if the spam string is anywhere in the string.  I think your examples only work if the spam word has leading and trailing forward slashes.  I tried adding the .* before and after the negation but that causes the negation to fail in all cases.

Many thanks for all the help!  - Aaron

Here's a test snippet that shows what I mean...

using System;
using System.Text.RegularExpressions;

namespace RegExTestConsole
{
      class Demo
      {
            public static void Main()
            {
                  string ptrn = @"^http://.*test\.com/(?:(?!SPAMWORD/)[^/]+/)+index\.html$";
                  Regex r = new Regex( ptrn, RegexOptions.IgnoreCase );
                  Console.WriteLine( r.IsMatch("only at beginning http://www.test.com/goodinfo/index.html") );          // FALSE - correct
                  Console.WriteLine( r.IsMatch("http://www.test.com/goodinfo/index.html") );     // TRUE - correct
                  Console.WriteLine( r.IsMatch("http://www.test.com/SPAMWORD/index.html") );     // FALSE - correct
                  Console.WriteLine( r.IsMatch("http://www.test.com/anychars_SPAMWORD_anychars/index.html") );     // TRUE - not correct
                  Console.ReadLine();
            }                  
      }
}

Accepted Solution

by:Jesse Houwing
Jesse Houwing earned 1800 total points
ID: 13972766
I think this will do:

using System;
using System.Text.RegularExpressions;

namespace RegExTestConsole
{
     class Demo
     {
          public static void Main()
          {
               string ptrn = @"^http://.*test\.com/(?:(?![^/]*SPAMWORD)[^/]+/)*index\.html$";
               Regex r = new Regex( ptrn, RegexOptions.IgnoreCase );
               Console.WriteLine( r.IsMatch("only at beginning http://www.test.com/goodinfo/index.html") );          // FALSE - correct
               Console.WriteLine( r.IsMatch("http://www.test.com/goodinfo/index.html") );     // TRUE - correct
               Console.WriteLine( r.IsMatch("http://www.test.com/SPAMWORD/index.html") );     // FALSE - correct
               Console.WriteLine( r.IsMatch("http://www.test.com/anychars_SPAMWORD_anychars/index.html") );     // FALSE - correct
               Console.ReadLine();
          }              
     }
}
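
If the word has to be rejected wherever it appears in the URL (host name included, not just inside a path segment), one option is to put the negative lookahead right after the `^` anchor. This is only a sketch under that assumption: `WholeUrlFilter` and `IsWanted` are hypothetical names, and the rest of the pattern is the original one from the question.

```csharp
using System;
using System.Text.RegularExpressions;

static class WholeUrlFilter
{
    // The lookahead sits right after ^, so it scans the entire URL
    // for SPAMWORD before the rest of the pattern is tried.
    static readonly Regex Pattern = new Regex(
        @"^(?!.*SPAMWORD)http://.*test\.com.*index\.html$",
        RegexOptions.IgnoreCase);

    public static bool IsWanted(string url) => Pattern.IsMatch(url);

    public static void Main()
    {
        Console.WriteLine(IsWanted("http://www.test.com/goodinfo/index.html"));                   // True
        Console.WriteLine(IsWanted("http://www.test.com/anychars_SPAMWORD_anychars/index.html")); // False
        Console.WriteLine(IsWanted("http://SPAMWORD.test.com/goodinfo/index.html"));              // False
    }
}
```

This brings back the `.*` in front of the spamword, so the performance caveat mentioned earlier in the thread applies; the per-segment version is likely the better choice when only the path needs checking.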


Author Comment

by:AaronReams
ID: 13972816
Thanks, you rock! These regular expressions can be quite tricky.  I need to allocate some time to thoroughly understanding them.  

Keep up the good info!

Cheers,
Aaron

Author Comment

by:AaronReams
ID: 13981993
After playing around with this regular expression some more I realized that this only works if the string has ':' before the spam word and a char after '/' after the spam word.  Ideally I need to match strings on any OR all of the criteria above.  

This might require posting a new question to EE but I figured I'd ask you first.  (let me know)

Do you know how to search any continuous string of characters(until a new line \n) for a match based on the string not containing a word?  I might not always be able to specify what's at the beginning and end of the string.

Take for example

jkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjk   // MATCH
bbbbbcccccaaaaaxxxx.........SPAMWORDmfmfmfmfmfmfmfmfm  // FAIL - NO MATCH
abcdefg12345SPAMWORDyayayayayayayayayayayayayayayaya  // FAIL - NO MATCH


btw, I've searched the links TheLearnedOne suggested but it's slow going trying to figure it out on my own.

Expert Comment

by:Jesse Houwing
ID: 13982781
For that you can just look for the word like so:

               string ptrn = @"SPAMWORD";
               Regex r = new Regex( ptrn, RegexOptions.IgnoreCase );
               Console.WriteLine( !r.IsMatch("jkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjk") );   // MATCH
               Console.WriteLine( !r.IsMatch("bbbbbcccccaaaaaxxxx.........SPAMWORDmfmfmfmfmfmfmfmfm") );  // FAIL - NO MATCH
               Console.WriteLine( !r.IsMatch("abcdefg12345SPAMWORDyayayayayayayayayayayayayayayaya") );  // FAIL - NO MATCH
               Console.ReadLine();
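
If the lines arrive as one block of text and the filtering has to happen inside the regex itself, a line-anchored negative lookahead is another way to do it. This is a rough sketch only: `LineFilter` and `CleanLines` are hypothetical names, and it assumes lines are separated by a plain `\n` (with `\r\n` input, the `\r` would stay on the end of each returned line).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class LineFilter
{
    // Multiline makes ^ and $ match at every line boundary; '.' never crosses '\n',
    // so the lookahead only inspects the current line.
    static readonly Regex CleanLine = new Regex(
        @"^(?!.*SPAMWORD).+$",
        RegexOptions.IgnoreCase | RegexOptions.Multiline);

    public static List<string> CleanLines(string text)
    {
        var result = new List<string>();
        foreach (Match m in CleanLine.Matches(text))
            result.Add(m.Value);
        return result;
    }

    public static void Main()
    {
        string text = "jkjkjkjk\nbbbbbccccc.....SPAMWORDmfmfmf\nabcdefg12345SPAMWORDyayaya\nclean line";
        foreach (string line in CleanLines(text))
            Console.WriteLine(line);   // prints "jkjkjkjk" then "clean line"
    }
}
```

Using `.+` rather than `.*` skips empty lines, which would otherwise count as matches.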

Small sidenote:
"I realized that this only works if the string has ':' before the spam word" is actually wrong. "(?: )" in a regex means you don't want to capture the part between the parentheses as a sub-match; the ':' is not matched literally. It improves speed if you write (?:) instead of ().
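
To make the difference visible, here is a small sketch that counts the groups a match exposes. `GroupDemo` and `GroupCount` are hypothetical names and the sample input is made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class GroupDemo
{
    // Groups[0] is always the whole match, so Count is 1 + the number of capturing groups.
    public static int GroupCount(string pattern, string input) =>
        Regex.Match(input, pattern).Groups.Count;

    public static void Main()
    {
        Console.WriteLine(GroupCount(@"(http)://(\w+)", "http://test"));   // 3
        Console.WriteLine(GroupCount(@"(?:http)://(\w+)", "http://test")); // 2
        // With (?:http) the first numbered group is the host part.
        Console.WriteLine(Regex.Match("http://test", @"(?:http)://(\w+)").Groups[1].Value); // test
    }
}
```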

Also note that  string ptrn = @"^http://.*test\.com(?:/(?![^/]*SPAMWORD)[^/]+)*/index\.html$"; might be an even better regex (note the slight changes to the order) if you don't want to change the behaviour.