
Regex to exclude a domain

Hi,

I have a program that breaks down a website hit and records its metrics. That is, it splits the hit into the elements of the page (pictures, scripts, redirects, etc.) and records the loading time and other metrics for each. There is a domain that is skewing the metrics, so I want to exclude it from the records.

The program accepts regex to match page elements. I came up with this regex to filter out all elements from www.foobar.com

(?:(?!foobar).)*

This doesn't seem to work all the time, though. Is there another solution that can work every time? Or is there something missing from this regex to complete it?

Any help is appreciated.

Thanks,
Jose

sjklein42

Is it as simple as this?

(!(www\.foobar\.com))

sjklein42

I don't understand your example entirely, but I think you should be quoting the dot at least:

(?:(?!foobar)\.)*


akatsuki27

ASKER

Well, that foobar domain has different third level domains depending on what type of element we are using from it.

Basically I'm grouping the string in a passive group and using a negative lookahead to filter out foobar and everything that goes after it. That's why I have the dot there so I don't think I want to be escaping that.

--Jose
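For readers following along, here is a minimal sketch of how the pattern from the question behaves under Java's two matching modes. It assumes the app uses java.util.regex, which the thread does not confirm; the class name, method names, and sample URLs are hypothetical and only serve the illustration.

```java
import java.util.regex.Pattern;

final class TemperedDotDemo {
    // The pattern from the question: consume one character at a time, but only
    // while the text at the current position does not start with "foobar".
    static final Pattern TEMPERED = Pattern.compile("(?:(?!foobar).)*");

    // Whole-string semantics, as with Matcher.matches().
    static boolean fullMatch(String url) {
        return TEMPERED.matcher(url).matches();
    }

    // Substring semantics, as with Matcher.find().
    static boolean partialMatch(String url) {
        return TEMPERED.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.myhomepage.com/home.css",   // hypothetical sample
            "http://static.foobar.com/picture.jpg"  // hypothetical sample
        };
        for (String url : urls) {
            System.out.println(url
                + " matches=" + fullMatch(url)
                + " find=" + partialMatch(url));
        }
    }
}
```

Under matches() the pattern accepts only strings that contain no "foobar" at all. Under find() it succeeds on every input, because the group is allowed to match just a prefix of the string, or even nothing. That difference is one plausible reason the pattern seems to work only some of the time.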

sjklein42

I have written many regexes, but that approach sounds very complicated.

Please show us a few specific examples of strings you are trying to match (or not match) so we can understand why it is so difficult. Also these examples will give us something to test against.

kaufmed

What about this:

(?<!foobar)\.com


kaufmed

Correction:

[^/]+?(?<!foobar)\.com


akatsuki27

ASKER

Kaufmed,

That example filters everything out, not just the foobar elements.

sjklein42,

Well, basically let's say there are 5 elements that make up my site, coming from a few different URLs.

For example:

http://www.myhomepage.com/home.css ----- takes 2 seconds to load
http://static.foobar.com/picture.jpg ------ takes 1 second to load
http://videos.foobar.com/video.flv ----- takes 4 seconds to load
http://www.adserver.com/ads/for/you.jpg ----- takes 3 seconds to load
http://www.foobar.com/whatever ----- takes 3 seconds to load

Very simplified but that's the idea. So my page took 13 seconds to load. I want to filter out all foobar.com stuff.

I would be left with 2 urls for a load time of 5 seconds. That's what I want to do. You can ignore the load time stuff since it's not pertinent to the regex but that's the idea behind wanting to filter out the foobar stuff.

Like I said, depending on the type of element, the subdomain is different so I need the regex to account for that. That's why I grouped foobar. At least that was my thinking. But it's not foolproof and I don't know why.

Does this clarify my problem?

--Jose

kaufmed

Well, if your app is receiving the URL including the protocol, then we'd have to modify the pattern a bit. This should handle either scenario, but you can remove the question mark and non-capturing group if you expect you will always receive "http://" at the start of a URL.

^(?:http://)?[^/]+?(?<!foobar)\.com


akatsuki27

ASKER

Yes, the URLs are always fully qualified (protocol included).

sjklein42

This pattern matches all the "foobar" lines but not the others.

(\/\/[^\/]*?\.foobar\.com\/)


It checks for "//" and then, before it sees the next /, it must see .foobar.com followed by the /.

This "matches" all the foobar lines. Do you need the expression negated within the pattern itself, or can you just use the "else" branch in your program where this is called? (Not seeing any greater context makes it a little hard to guess what you're doing; it's not even clear what language you're programming in.)
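As an illustration of the "match, then negate in the calling code" idea, here is a minimal Java sketch. It assumes the caller controls the surrounding code and uses find(); the class and method names are hypothetical, and the URLs are the sample ones posted earlier in the thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class FoobarFilter {
    // The pattern from this post: "//", then a host that ends in
    // ".foobar.com", then the "/" that starts the path.
    static final Pattern FOOBAR =
        Pattern.compile("(\\/\\/[^\\/]*?\\.foobar\\.com\\/)");

    // Keep only the URLs the pattern does NOT find, i.e. the "else" branch.
    static List<String> withoutFoobar(List<String> urls) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            if (!FOOBAR.matcher(url).find()) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
            "http://www.myhomepage.com/home.css",
            "http://static.foobar.com/picture.jpg",
            "http://videos.foobar.com/video.flv",
            "http://www.adserver.com/ads/for/you.jpg",
            "http://www.foobar.com/whatever");
        for (String url : withoutFoobar(urls)) {
            System.out.println(url);
        }
    }
}
```

Because the negation lives in the if statement, the regex itself stays simple. This only helps when the calling code can be changed.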

akatsuki27

ASKER

This is a pre-built Java-based app. I'm not changing its programming. I'm using an option for pattern matching that accepts regex. I don't know exactly which regex flavor it accepts, as I'm not familiar with the different types and their differences.

I want to simply negate the pattern match.

Btw, that last example didn't work. Nothing was matched.

kaufmed

Can you tell us whether your application is using find() or matches()? That would make a difference to the pattern construction : )

kaufmed

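Also, please note that since you say this is Java, any backslashes meant as regex escapes must be doubled when the pattern is written as a string literal in Java source code, so that Java does not treat them as its own escapes. (A pattern typed into a configuration field is not a Java literal and keeps its single backslashes.)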

E.g.

// Wrong
^(?:http://)?[^/]+?(?<!foobar)\.com

// Correct
^(?:http://)?[^/]+?(?<!foobar)\\.com
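To make the doubling concrete, here is a small sketch showing that the doubled backslash exists only in the Java source text; the string the regex engine receives contains a single backslash. The class, field, and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class EscapeDemo {
    // Written as a Java string literal, so the regex escape \. is typed as \\.
    static final String FROM_SOURCE = "^(?:http://)?[^/]+?(?<!foobar)\\.com";

    // Count the backslashes actually present in the string at run time.
    static int backslashCount(String s) {
        return (int) s.chars().filter(c -> c == '\\').count();
    }

    // Substring search, as with Matcher.find().
    static boolean found(String regex, String url) {
        return Pattern.compile(regex).matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(FROM_SOURCE);                  // the regex as the engine sees it
        System.out.println(backslashCount(FROM_SOURCE));  // number of backslashes at run time
    }
}
```

If the app reads the pattern from a text field or a settings file and hands it straight to the regex engine (an assumption; the thread does not say how the app loads it), the single-backslash form is the one to enter there.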


kaufmed

Here's an example I worked up in Netbeans showing execution of the aforementioned pattern. Note that if I changed

m.find();

to

m.matches();

the pattern would fail, since matches() requires the regex to match the entire string, whereas find() only needs it to match some substring of the source data.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlCheck {
    public static void main(String[] args) {
        String pattern = "^(?:http://)?[^/]+?(?<!foobar)\\.com";

        System.out.println(checkUrl(pattern, "http://www.myhomepage.com/home.css"));
        System.out.println(checkUrl(pattern, "http://static.foobar.com/picture.jpg"));
        System.out.println(checkUrl(pattern, "http://videos.foobar.com/video.flv"));
        System.out.println(checkUrl(pattern, "http://www.adserver.com/ads/for/you.jpg"));
        System.out.println(checkUrl(pattern, "http://www.foobar.com/whatever"));
    }

    public static boolean checkUrl(String regex, String url) {
        Pattern p = Pattern.compile(regex);  // Compile the regex into a Pattern
        Matcher m = p.matcher(url);

        // find() succeeds if any substring of url matches the pattern
        return m.find();
    }
}


akatsuki27

ASKER

Oh, I see. I don't know which one it is using, since I don't have that kind of access to the app, but given your explanation I think it's matches(), because your regex is failing for me.
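For readers in the same situation, here is one possible pattern that negates the domain inside the regex itself and still works when the whole string must match. This is a sketch only, not the accepted solution (which is not visible in this thread); the pattern, class name, and method name are assumptions for illustration.

```java
import java.util.regex.Pattern;

final class NegatedDomain {
    // A negative lookahead at the start rejects any URL whose host is
    // foobar.com or one of its subdomains; ".*$" then consumes the rest so
    // the whole string is matched.
    static final Pattern NOT_FOOBAR = Pattern.compile(
        "^(?!https?://(?:[^/]*\\.)?foobar\\.com(?:[/:?#]|$)).*$");

    // Whole-string semantics, as with Matcher.matches().
    static boolean keep(String url) {
        return NOT_FOOBAR.matcher(url).matches();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.myhomepage.com/home.css",
            "http://static.foobar.com/picture.jpg",
            "http://videos.foobar.com/video.flv",
            "http://www.adserver.com/ads/for/you.jpg",
            "http://www.foobar.com/whatever"
        };
        for (String url : urls) {
            System.out.println(url + " keep=" + keep(url));
        }
    }
}
```

Because the pattern is anchored at both ends, it gives the same answer under find() and matches(), so it does not depend on knowing which one the app calls. The required dot before "foobar" in the optional subdomain group keeps an unrelated host such as notfoobar.com from being excluded by accident.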

ASKER CERTIFIED SOLUTION

kaufmed


akatsuki27

ASKER

It didn't completely answer my question, but I was able to use it as a base to answer my own question.