Solved

Regular Expression parsing URL

Posted on 2007-08-02
Last Modified: 2010-10-05
This should be a quick one.  I have a pattern that breaks an HTML hyperlink into the address and the display text:
<\s*a.*href\s*=\s*"?(.*?)"?[^>]>(.*?)</\s*a>
This may not be ideal, but it works well enough for me with one exception.  Most of the urls I'm parsing are formatted:
<A HREF="/cgi-bin/show_case_doc?2,576695,,,">2</a>
This works out fine.  Two groups return, the first with the address (/cgi-bin/show_case_doc?2,576695,,,) and the second with the displayed text (2).
However, there are some URLs that leave out the quotation marks:
<A HREF=/cgi-bin/show_case_doc?2,576695,,,>2</a>
For some reason, when this happens the displayed text (2) returns fine, but the address truncates the last comma (/cgi-bin/show_case_doc?2,576695,,) and I can't figure out why.  Granted, I'm new to regular expressions.

I'm sure it's some stupid little thing I missed, but I'm at a loss.
Question by:mcorrente

Accepted Solution

by:
ddrudik earned 600 total points
ID: 19618242
I would try:
<\s*a.*href\s*=\s*"?(.*?)"?>(.*?)</\s*a>
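For anyone who wants to check the difference, here is a minimal sketch in Python that runs the original pattern and the one above against both sample links from the question. The thread never says which regex engine is in use, so Python's `re` module, the `parse_link` helper, and the case-insensitive flag (needed because the samples use `<A HREF=...>`) are all assumptions made for illustration.

```python
import re

# Original pattern from the question, and the suggested fix (only [^>] removed).
ORIGINAL = r'<\s*a.*href\s*=\s*"?(.*?)"?[^>]>(.*?)</\s*a>'
FIXED = r'<\s*a.*href\s*=\s*"?(.*?)"?>(.*?)</\s*a>'

QUOTED = '<A HREF="/cgi-bin/show_case_doc?2,576695,,,">2</a>'
UNQUOTED = '<A HREF=/cgi-bin/show_case_doc?2,576695,,,>2</a>'


def parse_link(pattern, html):
    """Return (address, display_text), or None if the pattern does not match."""
    # IGNORECASE is an assumption: the sample tags are written in upper case.
    match = re.search(pattern, html, re.IGNORECASE)
    return match.groups() if match else None


for name, pattern in (("original", ORIGINAL), ("fixed", FIXED)):
    for html in (QUOTED, UNQUOTED):
        print(name, parse_link(pattern, html))
```

With the original pattern the unquoted link comes back without its last comma; with the fixed pattern both links return the full address.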

Expert Comment

by:ddrudik
ID: 19618364
You might also consider:
<\s*a.*href\s*=\s*["']?([^ "'>]*).*>(.*?)</\s*a>

This would cover these additional HREF formats:
<A HREF=/cgi-bin/show_case_doc?2,576695,,, class="x">2</a>
<A HREF='/cgi-bin/show_case_doc?2,576695,,,'>2</a>
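A quick way to try all four HREF formats is sketched below in Python (the engine, the `parse_link` helper, and the case-insensitive flag are assumptions, not something stated in the thread). Note that the sketch writes the optional opening quote as `["']?` so that a single quote is consumed too. With only `"?` in that position, a single-quoted value would leave group 1 empty, because `[^ "'>]*` stops at the leading `'`.

```python
import re

# ["']? accepts either quote style before the address (an assumption made
# for this sketch); [^ "'>]* then reads up to the next space, quote, or >.
PATTERN = re.compile(
    r'''<\s*a.*href\s*=\s*["']?([^ "'>]*).*>(.*?)</\s*a>''',
    re.IGNORECASE,
)

SAMPLES = [
    '<A HREF="/cgi-bin/show_case_doc?2,576695,,,">2</a>',
    '<A HREF=/cgi-bin/show_case_doc?2,576695,,,>2</a>',
    '<A HREF=/cgi-bin/show_case_doc?2,576695,,, class="x">2</a>',
    "<A HREF='/cgi-bin/show_case_doc?2,576695,,,'>2</a>",
]


def parse_link(html):
    """Return (address, display_text), or None if the pattern does not match."""
    match = PATTERN.search(html)
    return match.groups() if match else None


for sample in SAMPLES:
    print(parse_link(sample))
```

Each of the four samples should print the same pair: the full address and the display text `2`.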

Assisted Solution

by:b0lsc0tt
b0lsc0tt earned 340 total points
ID: 19618404
The comment above took out the part that was causing the problem: the [^>] in the expression.  It matches exactly one character that is not >, and whatever it matches is left out of the group.  In the first example it matched the closing ", which was OK, but in the second example it matched the last , of the URL.  Is the href always the last attribute before the closing >?  For example, do you ever have <a href="URL" target="_blank"> or something like it?  I think your expression would have issues with a tag like that, so I doubt any of your tags are that complex.
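One way to see this for yourself is to wrap [^>] in its own group so the character it consumes shows up in the result. The sketch below does that in Python; the `PROBE` pattern and the `probe` helper are hypothetical names made up for this illustration, and the case-insensitive flag is an assumption needed for the upper-case sample tags.

```python
import re

# Same as the original pattern, except [^>] is wrapped in parentheses so the
# single character it consumes is reported as group 2.
PROBE = re.compile(
    r'<\s*a.*href\s*=\s*"?(.*?)"?([^>])>(.*?)</\s*a>',
    re.IGNORECASE,
)


def probe(html):
    """Return (address, char_taken_by_class, display_text), or None."""
    match = PROBE.search(html)
    return match.groups() if match else None


# Quoted: the class takes the closing quote, so the address is complete.
print(probe('<A HREF="/cgi-bin/show_case_doc?2,576695,,,">2</a>'))

# Unquoted: the class takes the last comma, so the address is one short.
print(probe('<A HREF=/cgi-bin/show_case_doc?2,576695,,,>2</a>'))
```

In the quoted case the optional "? ends up matching nothing, because [^>] still needs one character before the > and the quote is the only one available.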

Based on the 2 examples above I suggest you could simplify your expression to ...

<a.*href="?([^>]+?)"?>(.*?)</a>
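As a rough check of that simplified expression, here is a small Python sketch (the engine, the `parse_link` helper, and the case-insensitive flag are assumptions; the flag is needed because the pattern spells `a` and `href` in lower case while the samples use upper case).

```python
import re

# Simplified pattern: the address is one or more non-> characters, taken
# lazily so that an optional closing quote is left outside the group.
SIMPLE = re.compile(r'<a.*href="?([^>]+?)"?>(.*?)</a>', re.IGNORECASE)


def parse_link(html):
    """Return (address, display_text), or None if the pattern does not match."""
    match = SIMPLE.search(html)
    return match.groups() if match else None


print(parse_link('<A HREF="/cgi-bin/show_case_doc?2,576695,,,">2</a>'))
print(parse_link('<A HREF=/cgi-bin/show_case_doc?2,576695,,,>2</a>'))

# Limitation: with another attribute after href, the group runs on to the
# last character before the closing quote and >.
print(parse_link('<a href="URL" target="_blank">x</a>'))
```

Both sample links return the full address. The third call returns `URL" target="_blank` as the address, so this form only suits tags where href is the last attribute.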

Let me know if you have a question about it or how it works.  Ddrudik's comment may be all you need, but I thought I would point out the problem and why it happened, and offer an alternative.

bol

Author Comment

by:mcorrente
ID: 19618663
That worked.  I increased points a bit cause I'd love it if you could explain why that would affect it.  Like I said, I'm new to regular expressions but I do a lot of string parsing - and this seems like a really powerful tool if I can get a handle on it.

Author Comment

by:mcorrente
ID: 19618666
sry, posted before i refreshed. Lemme read.

Expert Comment

by:b0lsc0tt
ID: 19618706
Let me know if you still have a question or need an explanation.

Regular expressions are a great tool.  If you want help learning or using them, besides this site, I recommend RegexBuddy, along with the tutorial that comes with the program and on its site (http://www.regexbuddy.com/).

bol

Author Comment

by:mcorrente
ID: 19618711
Ok, I think I see.  [^>] actually matched the last character, so it didn't include it in the group.  Gotcha.

No, the URLs always follow this type of format.  It's automated HTML, so if it changes I'd address it then.  Parsing documents is always subject to those types of problems, so I don't mind that.

I increased points for an explanation from ddrudik before I saw your post, so ddrudik will get at least the points he would have gotten for the correct answer and still allow me to give you points for your additional comments.  Thanks to both.

Expert Comment

by:b0lsc0tt
ID: 19618842
Very fair. :)  I'm glad I could help a bit and didn't have to take something from Ddrudik.  Thanks for the grade, the points and the fun question.

bol

Expert Comment

by:ddrudik
ID: 19619428
Thanks for the question and the points, I was away from the keyboard during your last posts but I see b0lsc0tt has those answered.