Solved

Regex For URLs Without Extension

Posted on 2012-03-26
233 Views
Last Modified: 2012-06-21
I have an application that uses the regex engine within Firefox to find specific links within a page.  I need to develop a regex to look for links that DON'T point to a document.  For example:
<a href="http://www.server.com/page/link">Link</a><br />
<a href="http://www.server.com/page/link.jpg">Link</a><br />


In the above example, I want to find the first link but NOT the second.  How can I do this?
Question by:CIPortAuthority
10 Comments
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 37767338
You might try:

<a [^>]*?href="[^"]+[^/.]+"[^>]*>



...however, extensions are just a human convenience. A document could very well exist which does not have an extension (common in the *nix world). It may suffice for you, though.
 

Author Comment

by:CIPortAuthority
ID: 37767361
I forgot to mention that the application strips out the html stuff...  So the links that get parsed are just:
http://www.server.com/page/link
http://www.server.com/page/link.jpg


How would the regex change?

This is something internal that I am using for a project, so the naming conventions are pretty standard.  If there are actual documents with no extensions, they would be a very rare exception and it's not critical to catch them all.
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 37767376
Does your application/regex engine support modifiers, specifically, multi-line mode?
 

Author Comment

by:CIPortAuthority
ID: 37767390
It runs as a plugin within Firefox so I imagine it uses whatever regex engine Firefox uses.
 
LVL 35

Expert Comment

by:Terry Woods
ID: 37768760
It might help if you provided the name (or even better, a link) for the plugin.

I'm a bit confused about the requirement "In the above example, I want to find the first link but NOT the second." - the 2nd link is an image. I wouldn't call it a document, unless it's a website which holds documents scanned to images. Are you sure you got that requirement the right way around?

If the plugin can handle negative lookaheads, the following might meet your requirement, as you stated it, provided that you can list all the file extensions used by "documents":
(?!.*\.(jpg|gif|png|bmp)$)

You may need the ignore case option turned on. How that's done depends on the tool, but sometimes this way will work:
(?i)(?!.*\.(jpg|gif|png|bmp)$)
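A minimal sketch of the extension-list approach, written in Python only because its `re` module accepts the same lookahead syntax; the plugin's engine inside Firefox may behave differently. The helper name `is_page_link` and the extension list are illustrative assumptions, not part of the plugin. Case-insensitivity is set through a flag here, since not every engine accepts the inline `(?i)` form.

```python
import re

# Extensions treated as "documents" (assumed list, extend as needed).
DOC_EXTENSIONS = ("jpg", "gif", "png", "bmp")

# re.match only tries position 0, so the negative lookahead is
# evaluated once, against the whole URL.
_NOT_A_DOCUMENT = re.compile(
    r"(?!.*\.(?:" + "|".join(DOC_EXTENSIONS) + r")$)",
    re.IGNORECASE,
)

def is_page_link(url):
    """Return True when the URL does not end in a listed extension."""
    return _NOT_A_DOCUMENT.match(url) is not None
```

With this sketch, a URL ending in an extension that is not on the list (for example `.doc`) still counts as a page link, which is the main drawback of having to enumerate extensions.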

Author Comment

by:CIPortAuthority
ID: 37768781
I used a jpg file as an example, but it could be a .doc or .xls or .xml - basically any served file other than another web page (.htm, .html, .php, etc. aren't used as extensions for web pages on the server).

Having to know all the possible extensions would be a minor issue I guess so long as it's not going to complain about the length of the regex.  

I will give it a try in the morning and let you know.
 
LVL 35

Expert Comment

by:Terry Woods
ID: 37768820
Ok, then this might work for you:
(?!.*\.[^/.]+$)
 

Author Comment

by:CIPortAuthority
ID: 37772528
I have been messing around with the last regex you sent on http://regexpal.com/ and I can't seem to get it to work.  I'm not sure if regexpal supports the features your regex needs though.  I was using regexpal to test the expression before trying to plug it into the application, as it's much simpler to play with the regex that way.
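One likely reason the bare lookahead appears not to work: on its own it consumes no characters, so a tester has nothing to highlight, and an unanchored search can still succeed with an empty match somewhere inside a URL that ends in an extension. The sketch below illustrates this in Python (an assumption for illustration; regexpal itself runs on the browser's JavaScript engine). `first_match_span` is a hypothetical helper name.

```python
import re

# The lookahead exactly as posted, with no ^ anchor and no trailing .*
UNANCHORED = re.compile(r"(?!.*\.[^/.]+$)")

def first_match_span(url):
    """Return the (start, end) span of the first match, or None."""
    m = UNANCHORED.search(url)
    return m.span() if m else None

page = "http://www.server.com/page/link"
image = page + ".jpg"

# Both searches succeed, and both matches are zero-width.
# For the .jpg URL the engine simply slides forward until it is past
# the last dot, where the lookahead no longer sees an extension.
print(first_match_span(page))
print(first_match_span(image))
```

Anchoring the pattern at the start of the string stops the engine from sliding past the extension, and a trailing `.*` gives the match some visible text.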
 
LVL 35

Accepted Solution

by:Terry Woods (earned 500 total points)
ID: 37773573
I tried putting the values:
http://www.server.com/page/link
and
http://www.server.com/page/link.jpg

into http://regexpal.com/ separately, and this pattern matched as you specify:
^(?!.*\.[^/.]+$).*
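As a rough sketch, the accepted pattern can be applied to a list of stripped links, one per line, by turning on multi-line mode so that `^` and `$` work per line. This is Python for illustration only; the function name `page_links` is hypothetical and the plugin's own matching loop is unknown.

```python
import re

# The accepted pattern, with MULTILINE so ^ and $ apply to each line.
PAGE_LINK = re.compile(r"^(?!.*\.[^/.]+$).*", re.MULTILINE)

def page_links(text):
    """Return the lines of text that do not end in a file extension."""
    # Blank lines produce empty matches, so filter those out.
    return [m for m in PAGE_LINK.findall(text) if m]

links = (
    "http://www.server.com/page/link\n"
    "http://www.server.com/page/link.jpg\n"
)
print(page_links(links))
```

One caveat worth knowing: a bare host with no path, such as `http://www.server.com`, is rejected because `.com` looks like an extension to this pattern, while `http://www.server.com/` with a trailing slash is accepted.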
 

Author Closing Comment

by:CIPortAuthority
ID: 37781811
Thanks!  That worked once the ^ at the beginning and the .* at the end were added.
Question has a verified solution.
