Solved

Regex to exclude a domain

Posted on 2011-02-25
Last Modified: 2012-06-21
Hi,

I have a program that breaks down a website hit and records its metrics. That is, it splits the hit into the elements of the page (pictures, scripts, redirects, etc.) and records the loading time and other details for each. One domain is skewing the metrics, so I want to exclude it from the records.

The program accepts regex to match page elements. I came up with this regex to filter out all elements from www.foobar.com:

(?:(?!foobar).)*

This doesn't seem to work all the time, though. Is there another solution that works every time? Or is something missing from this regex?

Any help is appreciated.

Thanks,
Jose
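
A note on why the pattern in the question can seem to work only some of the time: the result depends on whether the host application asks for a full-string match or a substring search, which the thread has not established at this point. The sketch below assumes Java's java.util.regex and runs the same pattern both ways. The class name, method names, and the two URLs are illustrative and not taken from the application.

```java
import java.util.regex.Pattern;

class TemperedDotDemo {
    // The pattern from the question: consume characters one at a time, but
    // only while the text at the current position does not start with "foobar".
    static final Pattern NO_FOOBAR = Pattern.compile("(?:(?!foobar).)*");

    // Full-string match: true only when "foobar" appears nowhere in the input
    // (and the input contains no line breaks, which "." does not match).
    static boolean fullMatch(String url) {
        return NO_FOOBAR.matcher(url).matches();
    }

    // Substring search: always true, because the pattern is allowed to match
    // a prefix (or even the empty string) that stops just before "foobar".
    static boolean partialMatch(String url) {
        return NO_FOOBAR.matcher(url).find();
    }

    public static void main(String[] args) {
        // Hypothetical URLs, used only for illustration.
        String[] urls = {
            "http://www.example.com/index.html",
            "http://www.foobar.com/logo.png"
        };
        for (String url : urls) {
            System.out.println(url
                + "  matches()=" + fullMatch(url)
                + "  find()=" + partialMatch(url));
        }
    }
}
```

If the application searches rather than matches the whole string, anchoring the pattern as `^(?:(?!foobar).)*$` should make both calls agree.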
Question by:akatsuki27
17 Comments
 
LVL 16

Expert Comment

by:sjklein42
ID: 34981674
Is it as simple as this?

(!(www\.foobar\.com))
 
LVL 16

Expert Comment

by:sjklein42
ID: 34981689
I don't understand your example entirely, but I think you should be quoting the dot at least:

(?:(?!foobar)\.)*

 

Author Comment

by:akatsuki27
ID: 34981999
Well, that foobar domain has different third level domains depending on what type of element we are using from it.

Basically I'm grouping the string in a passive group and using a negative lookahead to filter out foobar and everything that goes after it. That's why I have the dot there so I don't think I want to be escaping that.

--Jose

 
LVL 16

Expert Comment

by:sjklein42
ID: 34982139
I have written many regex, but that approach sounds very complicated.

Please show us a few specific examples of strings you are trying to match (or not match) so we can understand why it is so difficult.  Also these examples will give us something to test against.
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 34982188
What about this:
(?<!foobar)\.com

 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 34982198
Correction:
[^/]+?(?<!foobar)\.com

 

Author Comment

by:akatsuki27
ID: 34982459
Kaufmed,

That example filters everything out, not just the foobar elements.

sjklein42,

Well, basically, let's say there are 5 elements that make up my site, coming from a few different URLs.

For example:

http://www.myhomepage.com/home.css    -----  takes 2 seconds to load
http://static.foobar.com/picture.jpg   ------   takes 1 second to load
http://videos.foobar.com/video.flv   -----   takes 4 seconds to load
http://www.adserver.com/ads/for/you.jpg   -----   takes 3 seconds to load
http://www.foobar.com/whatever   -----   takes 3 seconds to load

Very simplified but that's the idea. So my page took 13 seconds to load. I want to filter out all foobar.com stuff.

I would be left with 2 urls for a load time of 5 seconds. That's what I want to do. You can ignore the load time stuff since it's not pertinent to the regex but that's the idea behind wanting to filter out the foobar stuff.

Like I said, depending on the type of element, the subdomain is different so I need the regex to account for that. That's why I grouped foobar. At least that was my thinking. But it's not foolproof and I don't know why.

Does this clarify my problem?

--Jose
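
As a worked version of the arithmetic above (13 seconds in total, 5 seconds once foobar.com is excluded), here is a minimal sketch. The pattern in it is an assumption and not something proposed in the thread: it is anchored at both ends, so it should behave the same under matches() and find(), and its negative lookahead inspects only the host part of the URL. The class and method names are illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

class LoadTimeFilter {
    // Assumed pattern: accept a URL unless its host is foobar.com or any
    // subdomain of it. The lookahead walks the host labels ("static.",
    // "www.", ...) and rejects when they end in "foobar.com".
    static final Pattern NOT_FOOBAR = Pattern.compile(
        "^https?://(?!(?:[^/.:]+\\.)*foobar\\.com(?:[:/]|$)).*$");

    static boolean keep(String url) {
        return NOT_FOOBAR.matcher(url).matches();
    }

    // Sum the load times (in seconds) of the elements that pass the filter.
    static int filteredTotal(Map<String, Integer> loadTimes) {
        int total = 0;
        for (Map.Entry<String, Integer> entry : loadTimes.entrySet()) {
            if (keep(entry.getKey())) {
                total += entry.getValue();
            }
        }
        return total;
    }

    // The five sample elements and load times from the comment above.
    static Map<String, Integer> sampleData() {
        Map<String, Integer> data = new LinkedHashMap<>();
        data.put("http://www.myhomepage.com/home.css", 2);
        data.put("http://static.foobar.com/picture.jpg", 1);
        data.put("http://videos.foobar.com/video.flv", 4);
        data.put("http://www.adserver.com/ads/for/you.jpg", 3);
        data.put("http://www.foobar.com/whatever", 3);
        return data;
    }

    public static void main(String[] args) {
        System.out.println("Filtered load time: " + filteredTotal(sampleData()) + " seconds");
    }
}
```

Because the lookahead is tied to whole host labels, a different domain such as www.notfoobar.com would still be kept.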
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 34982505
Well, if your app is receiving the URL including the protocol, then we'd have to modify the pattern a bit. This should handle either scenario, but you can remove the question mark and non-capturing group if you expect you will always receive "http://" at the start of a URL.
^(?:http://)?[^/]+?(?<!foobar)\.com

 

Author Comment

by:akatsuki27
ID: 34982576
yes, the urls are always fqdn.
 
LVL 16

Expert Comment

by:sjklein42
ID: 34982642

This pattern matches all the "foobar" lines but not the others.

(\/\/[^\/]*?\.foobar\.com\/)



It checks for "//" and then, before it sees the next /, it must see .foobar.com followed by the /.

This "matches" all the foobar lines.   Do you need the expression negated within the pattern itself, or can you just use the "else" branch in your program where this is called?  (Not seeing any greater context makes it a little hard to guess what you're doing; it's not even clear what language you're programming in.)
 

Author Comment

by:akatsuki27
ID: 34982821
This is a pre-built java-based app. I'm not changing its programming. I'm using an option for pattern matching that accepts regex. I don't know exactly which regex it accepts as I'm not familiar with the different types and their differences.

I want to simply negate the pattern match.

Btw, that last example didn't work. Nothing was matched.
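
One possible reason nothing matched, assuming the application requires the regex to match the whole URL (which is not confirmed at this point in the thread): sjklein42's pattern describes only a fragment of the URL. The sketch below shows that, and also shows one way to negate such a pattern inside the regex itself by wrapping it in an anchored negative lookahead. The slashes are left unescaped, which is equivalent in Java. The class and method names are illustrative.

```java
import java.util.regex.Pattern;

class NegateInPattern {
    // sjklein42's pattern: "//", then host characters, then ".foobar.com/".
    static final Pattern IS_FOOBAR = Pattern.compile("(//[^/]*?\\.foobar\\.com/)");

    // The same fragment inside a negative lookahead, anchored at both ends,
    // so that a successful match means "this is NOT a foobar URL".
    static final Pattern NOT_FOOBAR = Pattern.compile("^(?!.*//[^/]*?\\.foobar\\.com/).*$");

    // Substring search: succeeds if the fragment occurs anywhere in the URL.
    static boolean findsFoobar(String url) {
        return IS_FOOBAR.matcher(url).find();
    }

    // Whole-string match: fails, because the fragment never covers the full URL.
    static boolean fullyMatchesFoobar(String url) {
        return IS_FOOBAR.matcher(url).matches();
    }

    // Negated form: usable when the host program can only "keep what matches".
    static boolean keep(String url) {
        return NOT_FOOBAR.matcher(url).matches();
    }

    public static void main(String[] args) {
        String foobar = "http://static.foobar.com/picture.jpg";
        String other = "http://www.myhomepage.com/home.css";
        System.out.println("find() on foobar URL:    " + findsFoobar(foobar));
        System.out.println("matches() on foobar URL: " + fullyMatchesFoobar(foobar));
        System.out.println("keep(foobar URL):        " + keep(foobar));
        System.out.println("keep(other URL):         " + keep(other));
    }
}
```

Note that the fragment requires a dot before "foobar", so a bare http://foobar.com/ (no subdomain) would not be recognized as a foobar URL by either form.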

 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 34982977
Can you tell us whether or not your application is using find() vs. match()? That would make a difference on the pattern construction  : )
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 34982995
Also, please note that since you say this is Java, you will have to double up any backslashes which are to be interpreted as regex escapes and not Java escapes.

E.g.
// Wrong
^(?:http://)?[^/]+?(?<!foobar)\.com

// Correct
^(?:http://)?[^/]+?(?<!foobar)\\.com
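
A note on the doubling rule above: it applies to regexes written as string literals in Java source code. Whether a regex typed into the application's settings needs the same doubling depends on how the app reads the value, which the thread does not say. The sketch below assumes the typed text is handed to Pattern.compile unchanged and shows what the regex engine would see in each case. The class, field, and method names are illustrative.

```java
import java.util.regex.Pattern;

class BackslashDemo {
    // In Java source, "\\." is two characters: one backslash and a dot.
    // The regex engine therefore receives \.com (an escaped, literal dot).
    static final String FROM_SOURCE = "\\.com";

    // Simulates a value typed into a settings field WITH the backslash
    // doubled: the engine receives \\.com, meaning a literal backslash,
    // then any character, then "com".
    static final String TYPED_DOUBLED = "\\\\.com";

    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println("Regex seen by the engine: " + FROM_SOURCE);
        System.out.println("Single backslash finds '.com':  "
            + found(FROM_SOURCE, "foobar.com"));
        System.out.println("Doubled backslash finds '.com': "
            + found(TYPED_DOUBLED, "foobar.com"));
    }
}
```

Under that assumption, the single-backslash form is the one to type into a settings field, and the doubled form is only for Java source.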

 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 34983109
Here's an example I worked up in NetBeans showing execution of the aforementioned pattern. Note that if I changed

    m.find();

to

    m.match();

the pattern would fail since match() expects to match the entire string to the regex, whereas find() just tries to match it against some substring in the source data.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Class name is illustrative; the original snippet showed only the two methods.
public class FindDemo {
    public static void main(String[] args) {
        // The doubled backslash is a Java string escape; the regex itself contains \.
        String pattern = "^(?:http://)?[^/]+?(?<!foobar)\\.com";

        System.out.println(CheckUrl(pattern, "http://www.myhomepage.com/home.css"));      // true
        System.out.println(CheckUrl(pattern, "http://static.foobar.com/picture.jpg"));    // false
        System.out.println(CheckUrl(pattern, "http://videos.foobar.com/video.flv"));      // false
        System.out.println(CheckUrl(pattern, "http://www.adserver.com/ads/for/you.jpg")); // true
        System.out.println(CheckUrl(pattern, "http://www.foobar.com/whatever"));          // false
    }

    // Returns true if the regex matches some substring of the URL.
    public static boolean CheckUrl(String regex, String url) {
        Pattern p = Pattern.compile(regex);
        Matcher m = p.matcher(url);

        return m.find();
    }
}

 

Author Comment

by:akatsuki27
ID: 34983296
Oh, I see. I don't know which one it is using; I don't have that kind of access to the app. But given your explanation, I think it's match(), because your regex is failing for me.
 
LVL 75

Accepted Solution

by:
käµfm³d   👽 earned 500 total points
ID: 34983423
Let's try modding the pattern to this then [P.S. it's actually matches() and not match()...  hey! I'm a C# guy, what can I say  ; )  ]:

^(?:http://)?[^/]+?(?<!foobar)\\.com(?:/.*)?$


import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Class name is illustrative; the original snippet showed only the two methods.
public class MatchesDemo {
    public static void main(String[] args) {
        // The trailing (?:/.*)?$ lets the pattern cover the path, so the whole URL can match.
        String pattern = "^(?:http://)?[^/]+?(?<!foobar)\\.com(?:/.*)?$";

        System.out.println(CheckUrl(pattern, "http://www.myhomepage.com/home.css"));      // true
        System.out.println(CheckUrl(pattern, "http://static.foobar.com/picture.jpg"));    // false
        System.out.println(CheckUrl(pattern, "http://videos.foobar.com/video.flv"));      // false
        System.out.println(CheckUrl(pattern, "http://www.adserver.com/ads/for/you.jpg")); // true
        System.out.println(CheckUrl(pattern, "http://www.foobar.com/whatever"));          // false
    }

    // Returns true only if the regex matches the entire URL.
    public static boolean CheckUrl(String regex, String url) {
        Pattern p = Pattern.compile(regex);
        Matcher m = p.matcher(url);

        return m.matches();
    }
}
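
The thread does not record which cases the accepted pattern missed for the asker. As a hedged sketch of where it could fall short, the check below runs it against a few extra URLs that are hypothetical and not from the thread: the pattern is tied to "http://" and to ".com", so anything else is rejected along with foobar. The class and method names are illustrative.

```java
import java.util.regex.Pattern;

class AcceptedPatternCheck {
    // The accepted pattern, written as a Java string literal.
    static final Pattern ACCEPTED = Pattern.compile(
        "^(?:http://)?[^/]+?(?<!foobar)\\.com(?:/.*)?$");

    // True when the whole URL matches, i.e. the element would be kept.
    static boolean keep(String url) {
        return ACCEPTED.matcher(url).matches();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.myhomepage.com/home.css",   // from the thread
            "http://static.foobar.com/picture.jpg", // from the thread
            "https://www.myhomepage.com/home.css",  // hypothetical: https scheme
            "http://www.example.org/index.html"     // hypothetical: non-.com host
        };
        for (String url : urls) {
            System.out.println(keep(url) + "  " + url);
        }
    }
}
```

If such URLs occur in the real data, the scheme and top-level-domain parts of the pattern would need loosening; whether that was the asker's remaining problem is a guess.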

 

Author Closing Comment

by:akatsuki27
ID: 35369234
It didn't completely answer my question, but I was able to use it as a base to answer my own question.
