Regex For URLs Without Extension

I have an application that uses the regex engine within Firefox to find specific links within a page. I need to develop a regex to look for links that DON'T point to a document. For example:

<a href="http://www.server.com/page/link">Link</a><br />
<a href="http://www.server.com/page/link.jpg">Link</a><br />

In the above example, I want to find the first link but NOT the second. How can I do this?

kaufmed

You might try:

<a [^>]*?href="[^"]+[^/.]+"[^>]*>

...however, extensions are just a human convenience. A document could very well exist which does not have an extension (common in the *nix world). It may suffice for you, though.
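A quick check of the pattern above, sketched in JavaScript on the assumption that a Firefox plugin uses the JavaScript regex engine. As written, `[^"]+` can absorb the dot and `[^/.]+` then only needs the final character, so both sample links appear to match. The `tightened` pattern is a hypothetical variant, not part of the post above, that requires the last path segment to contain no dot.

```javascript
// Pattern as posted above.
const posted = /<a [^>]*?href="[^"]+[^/.]+"[^>]*>/;

// Hypothetical variant (not from the thread): the final path segment,
// i.e. everything after the last "/", must not contain a dot.
const tightened = /<a [^>]*?href="[^"]*\/[^"./]+"[^>]*>/;

const pageLink = '<a href="http://www.server.com/page/link">Link</a><br />';
const fileLink = '<a href="http://www.server.com/page/link.jpg">Link</a><br />';

const results = {
  postedPage: posted.test(pageLink),
  postedFile: posted.test(fileLink),
  tightenedPage: tightened.test(pageLink),
  tightenedFile: tightened.test(fileLink),
};
console.log(results);
```

With these two sample links, only `tightenedFile` comes out `false`.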

CIPortAuthority

ASKER

I forgot to mention that the application strips out the HTML, so the links that get parsed are just:

http://www.server.com/page/link
http://www.server.com/page/link.jpg

How would the regex change?

This is something internal that I am using for a project, so the naming conventions are pretty standard. If there are actual documents with no extensions, they would be a very rare exception and it's not critical to catch them all.

kaufmed

Does your application/regex engine support modifiers, specifically, multi-line mode?

CIPortAuthority

ASKER

It runs as a plugin within Firefox so I imagine it uses whatever regex engine Firefox uses.

Terry Woods

It might help if you provided the name (or even better, a link) for the plugin.

I'm a bit confused about the requirement "In the above example, I want to find the first link but NOT the second." - the 2nd link is an image. I wouldn't call it a document, unless it's a website which holds documents scanned to images. Are you sure you got that requirement the right way around?

If the plugin can handle negative lookaheads, the following might meet your requirement, as you stated it, provided that you can list all the file extensions used by "documents":
(?!.*\.(jpg|gif|png|bmp)$)

You may need the ignore case option turned on. How that's done depends on the tool, but sometimes this way will work:
(?i)(?!.*\.(jpg|gif|png|bmp)$)
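If the plugin does use Firefox's JavaScript engine (an assumption), the inline `(?i)` prefix is not accepted there; case-insensitivity is set with the `i` flag instead. The sketch below uses the extension list from the post above and adds a leading `^` so the lookahead is only evaluated at the start of the URL. The helper name `isPageLink` is hypothetical.

```javascript
// Extension list taken from the post above; extend it as needed.
// "^" pins the lookahead to the start of the URL, and the "i" flag
// stands in for the inline (?i) modifier.
const notListedFile = /^(?!.*\.(jpg|gif|png|bmp)$)/i;

// Hypothetical helper name, for illustration only.
function isPageLink(url) {
  return notListedFile.test(url);
}

const sample = [
  "http://www.server.com/page/link",
  "http://www.server.com/page/link.jpg",
  "http://www.server.com/page/LINK.JPG",
];
const pageLinks = sample.filter(isPageLink);
console.log(pageLinks);

// The inline-modifier form is rejected by the JavaScript RegExp constructor.
let inlineModifierAccepted = true;
try {
  new RegExp("(?i)(?!.*\\.(jpg|gif|png|bmp)$)");
} catch (e) {
  inlineModifierAccepted = false;
}
console.log(inlineModifierAccepted);
```

Only the first sample URL should survive the filter, since the `i` flag makes `.JPG` count as well.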

CIPortAuthority

ASKER

I used a .jpg file as an example, but it could be a .doc or .xls or .xml: basically any served file other than another web page (.htm, .html, .php, etc. aren't used as extensions for web pages on the server).

Having to know all the possible extensions would be a minor issue, I guess, so long as it's not going to complain about the length of the regex.

I will give it a try in the morning and let you know.

Terry Woods

Ok, then this might work for you:
(?!.*\.[^/.]+$)
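To see how this lookahead treats different URLs, here is a minimal JavaScript sketch. The leading `^` is an addition for illustration, so that the lookahead is only tried at the start of each URL, and the last two sample URLs are made up to show edge cases. Note that a bare host name is rejected, because `.com` looks like an extension to this pattern.

```javascript
// "^" is added here so the lookahead is evaluated once, at the start.
const noExtension = /^(?!.*\.[^/.]+$)/;

const urls = [
  "http://www.server.com/page/link",     // no extension
  "http://www.server.com/page/link.jpg", // has an extension
  "http://www.server.com/page/",         // trailing slash (made-up example)
  "http://www.server.com",               // bare host (made-up example)
];

// Pair each URL with whether the pattern accepts it.
const verdicts = urls.map((url) => [url, noExtension.test(url)]);
console.log(verdicts);
```

The first and third URLs are accepted; the second and fourth are rejected.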

CIPortAuthority

ASKER

I have been messing around with the last regex you sent on http://regexpal.com/ and I can't seem to get it to work. I'm not sure whether regexpal supports the features your regex needs, though. I was using regexpal to test the expression before plugging it into the application, as it's much simpler to play with the regex that way.

ASKER CERTIFIED SOLUTION

Terry Woods

CIPortAuthority

ASKER

Thanks! That worked once the ^ at the beginning and the .* at the end were added.
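A sketch of why those two additions matter, assuming they were wrapped around the earlier lookahead `(?!.*\.[^/.]+$)` (the accepted pattern itself is not shown in this thread, so the exact form is an assumption). Without `^`, the engine can retry the lookahead at later positions and succeed with an empty match just after the last dot; with `^` it is tried once, and the trailing `.*` makes the match return the whole URL.

```javascript
const fileUrl = "http://www.server.com/page/link.jpg";
const pageUrl = "http://www.server.com/page/link";

const bare = /(?!.*\.[^/.]+$)/;        // lookahead only, as first posted
const anchored = /^(?!.*\.[^/.]+$).*/; // with "^" and ".*" added

// Unanchored: succeeds with an empty match right after the final dot.
const bareMatch = bare.exec(fileUrl);

// Anchored: no match for the file URL, full-URL match for the page URL.
const anchoredFileMatch = anchored.exec(fileUrl);
const anchoredPageMatch = anchored.exec(pageUrl);

console.log(bareMatch[0] === "", bareMatch.index);
console.log(anchoredFileMatch, anchoredPageMatch[0]);
```

This may also explain the earlier trouble in regexpal: an unanchored lookahead on its own only ever produces empty matches, which are easy to miss in a highlighter.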