Solved

Regex to exclude a domain

Posted on 2011-02-25
Last Modified: 2012-06-21
Hi,

I have a program that breaks down a website hit and records its metrics: it splits the hit into the elements of the page (pictures, scripts, redirects, etc.) and records the loading time of each. One domain is skewing the metrics, so I want to exclude it from the records.

The program accepts regex to match page elements. I came up with this regex to filter out all elements from www.foobar.com

(?:(?!foobar).)*

This doesn't seem to work all the time, though. Is there another solution that works every time? Or is something missing from this regex?

Any help is appreciated.

Thanks,
Jose
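
One possible explanation for the inconsistent results, sketched in Java under the assumption that the app uses java.util.regex: the unanchored pattern always succeeds under find() (it can match an empty or partial substring), while under matches() it succeeds only when "foobar" appears nowhere in the string. The class and method names below are illustrative and not taken from the app.

```java
import java.util.regex.Pattern;

class TemperedDotDemo {
    // The pattern from the question, unchanged.
    static final Pattern ORIGINAL = Pattern.compile("(?:(?!foobar).)*");

    // find(): true if any substring (even an empty one) matches.
    static boolean foundAnywhere(String url) {
        return ORIGINAL.matcher(url).find();
    }

    // matches(): true only if the entire string matches.
    static boolean matchesWhole(String url) {
        return ORIGINAL.matcher(url).matches();
    }

    public static void main(String[] args) {
        String foobar = "http://static.foobar.com/picture.jpg";
        String other = "http://www.myhomepage.com/home.css";

        System.out.println(foundAnywhere(foobar)); // true: matches "http://static."
        System.out.println(foundAnywhere(other));  // true
        System.out.println(matchesWhole(foobar));  // false: cannot get past "foobar"
        System.out.println(matchesWhole(other));   // true
    }
}
```

If the app searches for a substring, the pattern would need anchors (^ and $) to behave like the whole-string case.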
Question by:akatsuki27
17 Comments
 
LVL 16

Expert Comment

by:sjklein42
ID: 34981674
Is it as simple as this?

(!(www\.foobar\.com))
 
LVL 16

Expert Comment

by:sjklein42
ID: 34981689
I don't understand your example entirely, but I think you should be quoting the dot at least:

(?:(?!foobar)\.)*

 

Author Comment

by:akatsuki27
ID: 34981999
Well, that foobar domain has different third level domains depending on what type of element we are using from it.

Basically I'm grouping the string in a passive group and using a negative lookahead to filter out foobar and everything that goes after it. That's why I have the dot there so I don't think I want to be escaping that.

--Jose
 
LVL 16

Expert Comment

by:sjklein42
ID: 34982139
I have written many regex, but that approach sounds very complicated.

Please show us a few specific examples of strings you are trying to match (or not match) so we can understand why it is so difficult.  Also these examples will give us something to test against.
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 34982188
What about this:
(?<!foobar)\.com

 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 34982198
Correction:
[^/]+?(?<!foobar)\.com

 

Author Comment

by:akatsuki27
ID: 34982459
Kaufmed,

That example filters everything out, not just the foobar elements.

sjklein42,

Well, basically let's say there's 5 elements that make up my site coming from a few different urls.

For example:

http://www.myhomepage.com/home.css    -----  takes 2 seconds to load
http://static.foobar.com/picture.jpg   ------   takes 1 second to load
http://videos.foobar.com/video.flv   -----   takes 4 seconds to load
http://www.adserver.com/ads/for/you.jpg   -----   takes 3 seconds to load
http://www.foobar.com/whatever   -----   takes 3 seconds to load

Very simplified but that's the idea. So my page took 13 seconds to load. I want to filter out all foobar.com stuff.

I would be left with 2 urls for a load time of 5 seconds. That's what I want to do. You can ignore the load time stuff since it's not pertinent to the regex but that's the idea behind wanting to filter out the foobar stuff.

Like I said, depending on the type of element, the subdomain is different so I need the regex to account for that. That's why I grouped foobar. At least that was my thinking. But it's not foolproof and I don't know why.

Does this clarify my problem?

--Jose
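
To make the arithmetic above concrete, here is a rough Java sketch that sums the load times with and without the foobar.com elements. The pattern, class, and method names are assumptions made for illustration; the app's own matching option may behave differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

class LoadTimeFilter {
    // Hypothetical pattern: the host is foobar.com or any subdomain of it.
    static final Pattern FOOBAR_HOST =
        Pattern.compile("^https?://(?:[^/]*\\.)?foobar\\.com(?:[/:]|$)");

    // The five example elements from the post, with load times in seconds.
    static Map<String, Integer> sampleHits() {
        Map<String, Integer> hits = new LinkedHashMap<>();
        hits.put("http://www.myhomepage.com/home.css", 2);
        hits.put("http://static.foobar.com/picture.jpg", 1);
        hits.put("http://videos.foobar.com/video.flv", 4);
        hits.put("http://www.adserver.com/ads/for/you.jpg", 3);
        hits.put("http://www.foobar.com/whatever", 3);
        return hits;
    }

    // Sums the load times, optionally skipping elements served from foobar.com.
    static int totalSeconds(Map<String, Integer> hits, boolean excludeFoobar) {
        int total = 0;
        for (Map.Entry<String, Integer> hit : hits.entrySet()) {
            boolean isFoobar = FOOBAR_HOST.matcher(hit.getKey()).find();
            if (excludeFoobar && isFoobar) {
                continue;
            }
            total += hit.getValue();
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(totalSeconds(sampleHits(), false)); // 13
        System.out.println(totalSeconds(sampleHits(), true));  // 5
    }
}
```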
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 34982505
Well, if your app is receiving the URL including the protocol, then we'd have to modify the pattern a bit. This should handle either scenario, but you can remove the question mark and non-capturing group if you expect to always receive "http://" at the start of a URL.
^(?:http://)?[^/]+?(?<!foobar)\.com


 

Author Comment

by:akatsuki27
ID: 34982576
Yes, the URLs are always fully qualified.
 
LVL 16

Expert Comment

by:sjklein42
ID: 34982642

This pattern matches all the "foobar" lines but not the others.

(\/\/[^\/]*?\.foobar\.com\/)


It checks for "//" and then, before it sees the next /, it must see .foobar.com followed by the /.

This "matches" all the foobar lines. Do you need the expression negated within the pattern itself, or can you just use the "else" branch in your program where this is called? (Not seeing any greater context makes it a little hard to guess what you're doing; it's not even clear what language you're programming in.)
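
As a sketch of the "else branch" idea, assuming Java and a substring search with find(): the pattern above flags the foobar URLs, and the calling code negates the result. The class and method names are made up for illustration. Note that the same pattern returns false for every URL under matches(), because it does not describe the whole string.

```java
import java.util.regex.Pattern;

class FoobarHostCheck {
    // The pattern from this comment, written as a Java string literal.
    static final Pattern FOOBAR =
        Pattern.compile("(\\/\\/[^\\/]*?\\.foobar\\.com\\/)");

    // True when the URL is served from a foobar.com subdomain (substring search).
    static boolean isFoobar(String url) {
        return FOOBAR.matcher(url).find();
    }

    // Negation done in the calling code rather than inside the regex.
    static boolean keep(String url) {
        return !isFoobar(url);
    }

    public static void main(String[] args) {
        System.out.println(keep("http://www.myhomepage.com/home.css"));   // true
        System.out.println(keep("http://static.foobar.com/picture.jpg")); // false
        System.out.println(keep("http://www.foobar.com/whatever"));       // false

        // Whole-string matching fails even for a foobar URL.
        System.out.println(
            FOOBAR.matcher("http://www.foobar.com/whatever").matches());  // false
    }
}
```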
 

Author Comment

by:akatsuki27
ID: 34982821
This is a pre-built Java-based app. I'm not changing its programming. I'm using an option for pattern matching that accepts regex. I don't know exactly which regex flavor it accepts, as I'm not familiar with the different types and their differences.

I want to simply negate the pattern match.

Btw, that last example didn't work. Nothing was matched.

 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 34982977
Can you tell us whether or not your application is using find() vs. match()? That would make a difference on the pattern construction  : )
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 34982995
Also, please note that since you say this is Java, you will have to double up any backslashes that are to be interpreted as regex escapes and not Java escapes whenever the pattern is written as a Java string literal.

E.g.
// Wrong
^(?:http://)?[^/]+?(?<!foobar)\.com

// Correct
^(?:http://)?[^/]+?(?<!foobar)\\.com
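
A small Java sketch of the escaping rule, with illustrative class and constant names: inside Java source, "\\." is the two-character regex \. (a literal dot), whereas an unescaped dot matches any character.

```java
import java.util.regex.Pattern;

class BackslashDemo {
    // In Java source, "\\.com" holds the five characters \.com
    static final String ESCAPED_DOT_COM = "\\.com";

    // Without the backslash, '.' matches any single character.
    static final String UNESCAPED_DOT_COM = ".com";

    // Whole-string match of input against regex.
    static boolean wholeMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(ESCAPED_DOT_COM.length());              // 5
        System.out.println(wholeMatch(ESCAPED_DOT_COM, ".com"));   // true
        System.out.println(wholeMatch(ESCAPED_DOT_COM, "xcom"));   // false
        System.out.println(wholeMatch(UNESCAPED_DOT_COM, "xcom")); // true
    }
}
```

The doubling is a property of Java string literals. A pattern typed into an application's settings field is probably read as plain text, in which case a single backslash would be the form to try first.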

 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 34983109
Here's an example I worked up in Netbeans showing execution of the aforementioned pattern. Note that if I changed

    m.find();

to

    m.match();

the pattern would fail since match() expects to match the entire string to the regex, whereas find() just tries to match it against some substring in the source data.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlFilterFind {
    public static void main(String[] args) {
        String pattern = "^(?:http://)?[^/]+?(?<!foobar)\\.com";

        System.out.println(CheckUrl(pattern, "http://www.myhomepage.com/home.css"));      // true
        System.out.println(CheckUrl(pattern, "http://static.foobar.com/picture.jpg"));    // false
        System.out.println(CheckUrl(pattern, "http://videos.foobar.com/video.flv"));      // false
        System.out.println(CheckUrl(pattern, "http://www.adserver.com/ads/for/you.jpg")); // true
        System.out.println(CheckUrl(pattern, "http://www.foobar.com/whatever"));          // false
    }

    public static boolean CheckUrl(String regex, String url) {
        Pattern p = Pattern.compile(regex);  // Compile the regex
        Matcher m = p.matcher(url);

        return m.find();  // True if some substring of url matches
    }
}

 

Author Comment

by:akatsuki27
ID: 34983296
Oh, I see. I don't know which one it is using, since I don't have that kind of access to the app. But given your explanation, I think it's match(), because your regex is failing for me.
 
LVL 74

Accepted Solution

by:
käµfm³d   👽 earned 500 total points
ID: 34983423
Let's try modding the pattern to this then [P.S. it's actually matches() and not match()...  hey! I'm a C# guy, what can I say  ; )  ]:

^(?:http://)?[^/]+?(?<!foobar)\\.com(?:/.*)?$

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlFilterMatches {
    public static void main(String[] args) {
        String pattern = "^(?:http://)?[^/]+?(?<!foobar)\\.com(?:/.*)?$";

        System.out.println(CheckUrl(pattern, "http://www.myhomepage.com/home.css"));      // true
        System.out.println(CheckUrl(pattern, "http://static.foobar.com/picture.jpg"));    // false
        System.out.println(CheckUrl(pattern, "http://videos.foobar.com/video.flv"));      // false
        System.out.println(CheckUrl(pattern, "http://www.adserver.com/ads/for/you.jpg")); // true
        System.out.println(CheckUrl(pattern, "http://www.foobar.com/whatever"));          // false
    }

    public static boolean CheckUrl(String regex, String url) {
        Pattern p = Pattern.compile(regex);  // Compile the regex
        Matcher m = p.matcher(url);

        return m.matches();  // True only if the entire url matches
    }
}
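
As a possible variation for whole-string matching, the sketch below uses a negative lookahead on the host instead of a lookbehind before ".com". This is an assumption-laden alternative and not the pattern the asker ended up with: it is not limited to .com hosts, accepts http or https, and does not reject look-alike hosts such as notfoobar.com. The class and method names are illustrative.

```java
import java.util.regex.Pattern;

class ExcludeDomainDemo {
    // Matches any whole URL whose host is NOT foobar.com or one of its subdomains.
    static final Pattern NOT_FOOBAR = Pattern.compile(
        "^(?!https?://(?:[^/]*\\.)?foobar\\.com(?:[/:]|$)).*$");

    // Whole-string match, as Matcher.matches() would do.
    static boolean keep(String url) {
        return NOT_FOOBAR.matcher(url).matches();
    }

    public static void main(String[] args) {
        System.out.println(keep("http://www.myhomepage.com/home.css"));      // true
        System.out.println(keep("http://static.foobar.com/picture.jpg"));    // false
        System.out.println(keep("http://videos.foobar.com/video.flv"));      // false
        System.out.println(keep("http://www.adserver.com/ads/for/you.jpg")); // true
        System.out.println(keep("http://www.foobar.com/whatever"));          // false
        System.out.println(keep("http://notfoobar.com/x"));                  // true
    }
}
```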

 

Author Closing Comment

by:akatsuki27
ID: 35369234
It didn't completely answer my question, but I was able to use it as a base to answer my own question.
