Solved

Help creating a matching expression

Posted on 2011-02-18
Last Modified: 2012-08-14
Hi,

I am trying to extract some text from an HTML feed on Yahoo Pipes. I admit I am no good with regular expressions (they hate me, and secretly conspire against me).

It's basic: I am trying to extract the price between <span class="tgProductPrice"></span>.

I am using this expression, but it's gobbling up extra text: .*<span class="tgProductPrice">(.*)</span>.*


I just need the "$0.98" (dollar sign, digits, and decimal point).

Thanks,
Ryan
Output from the expression:

$0.98<br /> <span class="tgRssReviews">Customer Rating: <img src="http://g-ecx.images-amazon.com/images/G/01/x-locale/common/customer-reviews/stars-3-5._V192240731_.gif" width="64" alt="3.6" align="absbottom" height="12" border="0"/><br /></span> <br /> <span class="tgRssProductTag"></span> <span class="tgRssAllTags">Customer tags: <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/science%20fiction">science fiction</a>(96), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/will%20smith">will smith</a>(83), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/horror">horror</a>(53), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/action">action</a>(46), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/adventure">adventure</a>(36), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/fantasy">fantasy</a>(33), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/dvd">dvd</a>(30), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/movie">movie</a>(20), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/zombies">zombies</a>(15), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/i%20am%20legend">i am legend</a>(7), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/bad%20sci-fi">bad sci-fi</a>(4), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/last%20man%20on%20earth">last man on earth</a>(4)<br /></span>


Original Output, sans expression:

<div class="hreview" style="clear:both;"><div class="item"><div style="float:left;" class="tgRssImage"><a rel="nofollow" class="url" target="_blank" href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/B0013FDM7E"><img src="http://ecx.images-amazon.com/images/I/51MvRlzFlpL._SL160_SS160_.jpg" width="160" alt="I Am Legend (Widescreen Single-Disc Edition)" class="photo" height="160" border="0"/>  </a></div><span class="tgRssTitle fn summary">I Am Legend (Widescreen Single-Disc Edition) (<span class="tgRssBinding">DVD</span>)<br />By <span class="tgRssAuthor">Will Smith</span><br /></span> </div> <div class="description"> <br /> <span style="display:block;" class="tgRssPriceBlock"><span class="tgProductPriceLine"><a rel="nofollow" target="_blank" href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/B0013FDM7E">Buy new</a>: <span class="tgProductPrice">$5.49</span></span><br /><span class="tgProductUsedPrice"><a rel="nofollow" target="_blank" href="http://www.amazon.com/gp/offer-listing/B0013FDM7E" id="tag_rso_rs_eofr_used">294 used and new</a> from <span class="tgProductPrice">$0.98</span></span><br /></span> <span class="tgRssReviews">Customer Rating: <img src="http://g-ecx.images-amazon.com/images/G/01/x-locale/common/customer-reviews/stars-3-5._V192240731_.gif" width="64" alt="3.6" align="absbottom" height="12" border="0"/><br /></span> <br /> <span class="tgRssProductTag"></span> <span class="tgRssAllTags">Customer tags: <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/science%20fiction">science fiction</a>(96), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/will%20smith">will smith</a>(83), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/horror">horror</a>(53), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/action">action</a>(46), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/adventure">adventure</a>(36), <a rel="nofollow" target="_blank" 
href="http://www.amazon.com/tag/fantasy">fantasy</a>(33), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/dvd">dvd</a>(30), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/movie">movie</a>(20), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/zombies">zombies</a>(15), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/i%20am%20legend">i am legend</a>(7), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/bad%20sci-fi">bad sci-fi</a>(4), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/last%20man%20on%20earth">last man on earth</a>(4)<br /></span> </div></div>

Question by:rossryan
27 Comments
 

Author Comment

by:rossryan
ID: 34930632
Right, I am trying to get the data between a pair of html tags.
0
 
LVL 13

Expert Comment

by:Superdave
ID: 34930693
.*<span class="tgProductPrice">([^<]*)</span>.*

Most expression-matching languages/libraries wouldn't need the .* at the beginning or the end, either, but I don't know what you're using so I left it be.

The idea is to match anything except an open angle bracket, so you don't match too much.

Using .*? instead of .* is another way of doing that, if you have Perl/Python-compatible regular expressions.  [^<] should be even more portable though.
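
A small Python sketch of the difference, using a shortened stand-in for the feed HTML (the `html` string is an abbreviated assumption, not the full feed). It compares the greedy `(.*)`, the negated class `([^<]*)`, and the lazy `(.*?)`, and also shows that a leading `.*` steers the match to the last price in the string rather than the first:

```python
import re

# Shortened stand-in for the feed item: two price spans, then trailing markup.
html = ('Buy new: <span class="tgProductPrice">$5.49</span></span><br />'
        'from <span class="tgProductPrice">$0.98</span></span><br /></span> '
        '<span class="tgRssReviews">Customer Rating</span>')

# Greedy: (.*) runs on to the LAST </span> in the string.
greedy = re.search(r'<span class="tgProductPrice">(.*)</span>', html)

# Negated class: the capture stops at the first '<'.
negated = re.search(r'<span class="tgProductPrice">([^<]*)</span>', html)

# Lazy quantifier: the capture stops at the first </span>.
lazy = re.search(r'<span class="tgProductPrice">(.*?)</span>', html)

# A leading greedy .* backtracks from the end, so it finds the LAST price span.
last = re.match(r'.*<span class="tgProductPrice">([^<]*)</span>.*', html)

print(greedy.group(1))   # the price plus a lot of trailing markup
print(negated.group(1))  # $5.49
print(lazy.group(1))     # $5.49
print(last.group(1))     # $0.98
```

If the used price is the one wanted, keeping the leading `.*` (or anchoring on `tgProductUsedPrice`) is what selects it.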
0
 

Author Comment

by:rossryan
ID: 34930755
Nicely done. I will up the points to 500 (you're already getting the 350), if you can craft two more expressions:

<span class="tgRssTitle fn summary">I Am Legend (Widescreen Single-Disc Edition) (<span class="tgRssBinding">DVD</span>

I need to extract that title.

And:

<span class="tgRssAllTags">Customer tags: <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/science%20fiction">science fiction</a>(96), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/will%20smith">will smith</a>(83), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/horror">horror</a>(53), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/action">action</a>(46), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/adventure">adventure</a>(36), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/fantasy">fantasy</a>(33), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/dvd">dvd</a>(30), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/movie">movie</a>(20), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/zombies">zombies</a>(15), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/i%20am%20legend">i am legend</a>(7), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/bad%20sci-fi">bad sci-fi</a>(4), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/last%20man%20on%20earth">last man on earth</a>(4)<br /></span>

And the extraction of those tags.


If you can respond quickly, I could create two extra questions out of those, if that would work for you.


Thanks,
Ryan
0
 
LVL 13

Expert Comment

by:Superdave
ID: 34930813
Try:

<span class="tgRssTitle[^>]*>([^<]*)

and

href="([^"]*)

0
 

Author Comment

by:rossryan
ID: 34930817
For some odd reason, I can never get these things to work properly. And I am under the gun to get this done in an hour or so.
0
 

Author Comment

by:rossryan
ID: 34930827
This is the output for <span class="tgRssTitle[^>]*>([^<]*).

<div class="hreview" style="clear:both;"><div class="item"><div style="float:left;" class="tgRssImage"><a rel="nofollow" class="url" target="_blank" href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/B0013FDM7E"><img src="http://ecx.images-amazon.com/images/I/51MvRlzFlpL._SL160_SS160_.jpg" width="160" alt="I Am Legend (Widescreen Single-Disc Edition)" class="photo" height="160" border="0"/>  </a></div>I Am Legend (Widescreen Single-Disc Edition) (<span class="tgRssBinding">DVD</span>)<br />By <span class="tgRssAuthor">Will Smith</span><br /> </div> <div class="description"> <br /> <span style="display:block;" class="tgRssPriceBlock"><span class="tgProductPriceLine"><a rel="nofollow" target="_blank" href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/B0013FDM7E">Buy new</a>: <span class="tgProductPrice">$5.49</span></span><br /><span class="tgProductUsedPrice"><a rel="nofollow" target="_blank" href="http://www.amazon.com/gp/offer-listing/B0013FDM7E" id="tag_rso_rs_eofr_used">295 used and new</a> from <span class="tgProductPrice">$0.98</span></span><br /></span> <span class="tgRssReviews">Customer Rating: <img src="http://g-ecx.images-amazon.com/images/G/01/x-locale/common/customer-reviews/stars-3-5._V192240731_.gif" width="64" alt="3.6" align="absbottom" height="12" border="0"/><br /></span> <br /> <span class="tgRssProductTag"></span> <span class="tgRssAllTags">Customer tags: <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/science%20fiction">science fiction</a>(96), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/will%20smith">will smith</a>(83), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/horror">horror</a>(53), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/action">action</a>(46), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/adventure">adventure</a>(36), <a rel="nofollow" target="_blank" 
href="http://www.amazon.com/tag/fantasy">fantasy</a>(33), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/dvd">dvd</a>(30), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/movie">movie</a>(20), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/zombies">zombies</a>(15), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/i%20am%20legend">i am legend</a>(7), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/bad%20sci-fi">bad sci-fi</a>(4), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/last%20man%20on%20earth">last man on earth</a>(4)<br /></span> </div></div>

0
 

Author Comment

by:rossryan
ID: 34930841
Switching to: .*<span class="tgRssTitle[^>]*>([^<]*)


gives me the below:
I Am Legend (Widescreen Single-Disc Edition) (<span class="tgRssBinding">DVD</span>)<br />By <span class="tgRssAuthor">Will Smith</span><br />  <div class="description"> <br /> <span style="display:block;" class="tgRssPriceBlock"><span class="tgProductPriceLine"><a rel="nofollow" target="_blank" href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/B0013FDM7E">Buy new</a>: <span class="tgProductPrice">$5.49</span></span><br /><span class="tgProductUsedPrice"><a rel="nofollow" target="_blank" href="http://www.amazon.com/gp/offer-listing/B0013FDM7E" id="tag_rso_rs_eofr_used">295 used and new</a> from <span class="tgProductPrice">$0.98</span></span><br /></span> <span class="tgRssReviews">Customer Rating: <img src="http://g-ecx.images-amazon.com/images/G/01/x-locale/common/customer-reviews/stars-3-5._V192240731_.gif" width="64" alt="3.6" align="absbottom" height="12" border="0"/><br /></span> <br /> <span class="tgRssProductTag"></span> <span class="tgRssAllTags">Customer tags: <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/science%20fiction">science fiction</a>(96), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/will%20smith">will smith</a>(83), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/horror">horror</a>(53), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/action">action</a>(46), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/adventure">adventure</a>(36), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/fantasy">fantasy</a>(33), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/dvd">dvd</a>(30), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/movie">movie</a>(20), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/zombies">zombies</a>(15), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/i%20am%20legend">i am legend</a>(7), <a rel="nofollow" target="_blank" 
href="http://www.amazon.com/tag/bad%20sci-fi">bad sci-fi</a>(4), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/last%20man%20on%20earth">last man on earth</a>(4)<br /></span> </div>

0
 

Author Comment

by:rossryan
ID: 34930849
My above comment was close, but I need to ditch the rest of that string.

Any ideas?
0
 

Author Comment

by:rossryan
ID: 34930917
.*<span class="tgRssTitle[^>]*>([^<]*) \(<span.* works.


Now I need that last expression to work.
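
The outputs posted in this thread look like what a replace-style regex step would produce: only the matched text is swapped for the captured group, and anything outside the match is left in place. Assuming that is how the Pipes module behaves (an assumption, not something confirmed here), a Python `re.sub` sketch shows why wrapping the pattern in `.*` on both sides matters. The `html` string is a shortened stand-in for the feed item:

```python
import re

# Shortened stand-in for the title portion of the feed item.
html = ('<div class="item"><span class="tgRssTitle fn summary">'
        'I Am Legend (Widescreen Single-Disc Edition) '
        '(<span class="tgRssBinding">DVD</span>)<br /></span> </div>')

# Unanchored: only the opening span tag and the title are matched, so
# everything before and after the match survives the replacement.
partial = re.sub(r'<span class="tgRssTitle[^>]*>([^<]*)', r'\1', html, count=1)

# Wrapped in .* on both sides: the whole string is matched and replaced
# by the captured title. DOTALL lets '.' cross line breaks in the input.
whole = re.sub(r'.*<span class="tgRssTitle[^>]*>([^<]*) \(<span.*', r'\1',
               html, count=1, flags=re.DOTALL)

print(partial)
print(whole)  # I Am Legend (Widescreen Single-Disc Edition)
```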
0
 

Author Comment

by:rossryan
ID: 34930960
.*<a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/.*">([^<]*)</a>.* isn't working.
0
 
LVL 13

Expert Comment

by:Superdave
ID: 34930967
What does the last one need to extract, the hrefs or the text of the links?

If it's the hrefs, try this. The second captured group is the rest of the string, so you'll have to repeat the search on it unless you have a findall function:
.*href="([^"]*)(.*)

Otherwise, for the text:

.*href="[^"]* *>([^<]*)(.*)
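
Where a findall function is available, the repeat-the-search step is not needed. A minimal Python sketch, assuming a shortened copy of the tag list as input; note that the text pattern here includes the closing quote after the URL, which the link markup requires before the `>`:

```python
import re

# Shortened stand-in for the "Customer tags" span from the feed.
tags_html = (
    '<span class="tgRssAllTags">Customer tags: '
    '<a rel="nofollow" target="_blank" '
    'href="http://www.amazon.com/tag/science%20fiction">science fiction</a>(96), '
    '<a rel="nofollow" target="_blank" '
    'href="http://www.amazon.com/tag/will%20smith">will smith</a>(83), '
    '<a rel="nofollow" target="_blank" '
    'href="http://www.amazon.com/tag/horror">horror</a>(53)<br /></span>'
)

# All href values: capture everything between the quotes.
hrefs = re.findall(r'href="([^"]*)"', tags_html)

# All link texts: skip the URL and its closing quote, capture up to the next '<'.
texts = re.findall(r'href="[^"]*" *>([^<]*)', tags_html)

print(hrefs)
print(texts)  # ['science fiction', 'will smith', 'horror']
```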

0
 

Author Comment

by:rossryan
ID: 34930975
The text of the links.

<a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/dvd">dvd</a>


Should extract dvd from between the anchor links opening and closing tags.
0
 

Author Comment

by:rossryan
ID: 34930985
Yes, it hates: .*href="[^"]* *>([^<]*)(.*)


Gives me this output:
<div class="hreview" style="clear:both;"><div class="item"><div style="float:left;" class="tgRssImage"><a rel="nofollow" class="url" target="_blank" href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/B0013FDM7E"><img src="http://ecx.images-amazon.com/images/I/51MvRlzFlpL._SL160_SS160_.jpg" width="160" alt="I Am Legend (Widescreen Single-Disc Edition)" class="photo" height="160" border="0"/>  </a></div><span class="tgRssTitle fn summary">I Am Legend (Widescreen Single-Disc Edition) (<span class="tgRssBinding">DVD</span>)<br />By <span class="tgRssAuthor">Will Smith</span><br /></span> </div> <div class="description"> <br /> <span style="display:block;" class="tgRssPriceBlock"><span class="tgProductPriceLine"><a rel="nofollow" target="_blank" href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/B0013FDM7E">Buy new</a>: <span class="tgProductPrice">$5.49</span></span><br /><span class="tgProductUsedPrice"><a rel="nofollow" target="_blank" href="http://www.amazon.com/gp/offer-listing/B0013FDM7E" id="tag_rso_rs_eofr_used">295 used and new</a> from <span class="tgProductPrice">$0.98</span></span><br /></span> <span class="tgRssReviews">Customer Rating: <img src="http://g-ecx.images-amazon.com/images/G/01/x-locale/common/customer-reviews/stars-3-5._V192240731_.gif" width="64" alt="3.6" align="absbottom" height="12" border="0"/><br /></span> <br /> <span class="tgRssProductTag"></span> <span class="tgRssAllTags">Customer tags: <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/science%20fiction">science fiction</a>(96), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/will%20smith">will smith</a>(83), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/horror">horror</a>(53), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/action">action</a>(46), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/adventure">adventure</a>(36), <a rel="nofollow" target="_blank" 
href="http://www.amazon.com/tag/fantasy">fantasy</a>(33), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/dvd">dvd</a>(30), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/movie">movie</a>(20), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/zombies">zombies</a>(15), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/i%20am%20legend">i am legend</a>(7), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/bad%20sci-fi">bad sci-fi</a>(4), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/last%20man%20on%20earth">last man on earth</a>(4)<br /></span> </div></div>

0
 

Author Comment

by:rossryan
ID: 34931058
Mmmmmmmmmmm. Can you fix it?
0
 
LVL 13

Expert Comment

by:Superdave
ID: 34931065
Sorry, I missed a ".  Took me a while to figure it out!

'.*?href="[^"]*" *>([^<]*)(.*)'
0
 

Author Comment

by:rossryan
ID: 34931090
Hmmm.

This is not working.

If this is the html source:

<a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/science%20fiction">science fiction</a>(96)

I need to extract "science fiction" from between those tags. The href portion changes after /tag/, and it's not detecting it properly.
0
 

Author Comment

by:rossryan
ID: 34931097
With this expression: '.*?href="[^"]*" *>([^<]*)(.*)'


I get this output:
<div class="hreview" style="clear:both;"><div class="item"><div style="float:left;" class="tgRssImage"><a rel="nofollow" class="url" target="_blank" href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/B0013FDM7E/ref=tag_rso_rs_edpp_url"><img src="http://ecx.images-amazon.com/images/I/51MvRlzFlpL._SL160_SS160_.jpg" width="160" alt="I Am Legend (Widescreen Single-Disc Edition)" class="photo" height="160" border="0"/>  </a></div><span class="tgRssTitle fn summary">I Am Legend (Widescreen Single-Disc Edition) (<span class="tgRssBinding">DVD</span>)<br />By <span class="tgRssAuthor">Will Smith</span><br /></span> </div> <div class="description"> <br /> <span style="display:block;" class="tgRssPriceBlock"><span class="tgProductPriceLine"><a rel="nofollow" target="_blank" href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/B0013FDM7E/ref=tag_rso_rs_edpp_new">Buy new</a>: <span class="tgProductPrice">$5.49</span></span><br /><span class="tgProductUsedPrice"><a rel="nofollow" target="_blank" href="http://www.amazon.com/gp/offer-listing/B0013FDM7E/ref=tag_rso_rs_eofr_used" id="tag_rso_rs_eofr_used">292 used and new</a> from <span class="tgProductPrice">$0.98</span></span><br /></span> <span class="tgRssReviews">Customer Rating: <img src="http://g-ecx.images-amazon.com/images/G/01/x-locale/common/customer-reviews/stars-3-5._V192240731_.gif" width="64" alt="3.6" align="absbottom" height="12" border="0"/><br /></span> <br /> <span class="tgRssProductTag"></span> <span class="tgRssAllTags">Customer tags: <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/science%20fiction/ref=tag_rss_rs_itdp_item_at">science fiction</a>(96), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/will%20smith/ref=tag_rss_rs_itdp_item_at">will smith</a>(83), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/horror/ref=tag_rss_rs_itdp_item_at">horror</a>(53), <a rel="nofollow" target="_blank" 
href="http://www.amazon.com/tag/action/ref=tag_rss_rs_itdp_item_at">action</a>(46), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/adventure/ref=tag_rss_rs_itdp_item_at">adventure</a>(36), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/fantasy/ref=tag_rss_rs_itdp_item_at">fantasy</a>(33), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/dvd/ref=tag_rss_rs_itdp_item_at">dvd</a>(30), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/movie/ref=tag_rss_rs_itdp_item_at">movie</a>(20), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/zombies/ref=tag_rss_rs_itdp_item_at">zombies</a>(15), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/i%20am%20legend/ref=tag_rss_rs_itdp_item_at">i am legend</a>(7), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/bad%20sci-fi/ref=tag_rss_rs_itdp_item_at">bad sci-fi</a>(4), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/last%20man%20on%20earth/ref=tag_rss_rs_itdp_item_at">last man on earth</a>(4)<br /></span> </div></div>

0
 

Author Comment

by:rossryan
ID: 34931105
And with this expression: .*?href="[^"]*" *>([^<]*)(.*)

I get no output.
0
 

Author Comment

by:rossryan
ID: 34931111
So close, and yet so far. With that last expression, I am not getting any output...
desktop.png
0
 
LVL 13

Expert Comment

by:Superdave
ID: 34931121
Maybe try this, leaving out the optional blank I allowed for after the URL, since there doesn't seem to be one.  The first ones I typed in off the top of my head, but the last one I tested in Python and it works.  What language or R.E. library are you using?

.*?href="[^"]*">([^<]*)(.*)
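
Run against the full feed item rather than a single link, this pattern does match in Python, but the first `href` it reaches belongs to the image link at the top of the item. That link's content starts with `<img`, so `([^<]*)` captures an empty string. This may be one reason for an empty result (a guess; the Pipes engine may behave differently). The sketch below uses a shortened stand-in for the feed, and the second pattern, which requires `/tag/` in the URL, is a hypothetical variant rather than something from the thread:

```python
import re

# Shortened stand-in: the image link comes first, then one tag link.
html = (
    '<a rel="nofollow" class="url" target="_blank" '
    'href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/B0013FDM7E">'
    '<img src="http://ecx.images-amazon.com/images/I/51MvRlzFlpL._SL160_SS160_.jpg"/>  </a>'
    '<span class="tgRssAllTags">Customer tags: '
    '<a rel="nofollow" target="_blank" '
    'href="http://www.amazon.com/tag/science%20fiction">science fiction</a>(96)'
    '</span>'
)

# The lazy .*? stops at the FIRST href, which is the image link.
first = re.match(r'.*?href="[^"]*">([^<]*)(.*)', html)
print(repr(first.group(1)))  # ''

# Hypothetical variant: only accept hrefs whose URL contains /tag/.
tag_only = re.match(r'.*?href="[^"]*/tag/[^"]*">([^<]*)(.*)', html)
print(tag_only.group(1))     # science fiction
```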
0
 

Author Comment

by:rossryan
ID: 34931142
It's supposedly Perl-like. It's Yahoo's whatever you want to call it.

0
 

Author Comment

by:rossryan
ID: 34931157
No dice.

When I tell it to select from the second field, and I use this expression: .*<a ([^>]*)>([^<]*)</a>.*


I get this: last man on earth
0
 

Author Comment

by:rossryan
ID: 34931180
Hmm.

If this is the source HTML:

<div class="hreview" style="clear:both;"><div class="item"><div style="float:left;" class="tgRssImage"><a rel="nofollow" class="url" target="_blank" href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/B0013FDM7E/ref=tag_rso_rs_edpp_url"><img src="http://ecx.images-amazon.com/images/I/51MvRlzFlpL._SL160_SS160_.jpg" width="160" alt="I Am Legend (Widescreen Single-Disc Edition)" class="photo" height="160" border="0"/>  </a></div><span class="tgRssTitle fn summary">I Am Legend (Widescreen Single-Disc Edition) (<span class="tgRssBinding">DVD</span>)<br />By <span class="tgRssAuthor">Will Smith</span><br /></span> </div> <div class="description"> <br /> <span style="display:block;" class="tgRssPriceBlock"><span class="tgProductPriceLine"><a rel="nofollow" target="_blank" href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/B0013FDM7E/ref=tag_rso_rs_edpp_new">Buy new</a>: <span class="tgProductPrice">$5.49</span></span><br /><span class="tgProductUsedPrice"><a rel="nofollow" target="_blank" href="http://www.amazon.com/gp/offer-listing/B0013FDM7E/ref=tag_rso_rs_eofr_used" id="tag_rso_rs_eofr_used">292 used and new</a> from <span class="tgProductPrice">$0.98</span></span><br /></span> <span class="tgRssReviews">Customer Rating: <img src="http://g-ecx.images-amazon.com/images/G/01/x-locale/common/customer-reviews/stars-3-5._V192240731_.gif" width="64" alt="3.6" align="absbottom" height="12" border="0"/><br /></span> <br /> <span class="tgRssProductTag"></span> <span class="tgRssAllTags">Customer tags: <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/science%20fiction/ref=tag_rss_rs_itdp_item_at">science fiction</a>(96), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/will%20smith/ref=tag_rss_rs_itdp_item_at">will smith</a>(83), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/horror/ref=tag_rss_rs_itdp_item_at">horror</a>(53), <a rel="nofollow" target="_blank" 
href="http://www.amazon.com/tag/action/ref=tag_rss_rs_itdp_item_at">action</a>(46), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/adventure/ref=tag_rss_rs_itdp_item_at">adventure</a>(36), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/fantasy/ref=tag_rss_rs_itdp_item_at">fantasy</a>(33), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/dvd/ref=tag_rss_rs_itdp_item_at">dvd</a>(30), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/movie/ref=tag_rss_rs_itdp_item_at">movie</a>(20), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/zombies/ref=tag_rss_rs_itdp_item_at">zombies</a>(15), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/i%20am%20legend/ref=tag_rss_rs_itdp_item_at">i am legend</a>(7), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/bad%20sci-fi/ref=tag_rss_rs_itdp_item_at">bad sci-fi</a>(4), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/last%20man%20on%20earth/ref=tag_rss_rs_itdp_item_at">last man on earth</a>(4)<br /></span> </div></div>



Why is this backfiring so badly?
0
 

Author Comment

by:rossryan
ID: 34931192
Should be able to extract based on:

<a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/action/ref=tag_rss_rs_itdp_item_at">

Perhaps an expression that looks for: ref=tag_rss_rs_itdp_item_at
0
 

Author Comment

by:rossryan
ID: 34931202
Hmm.


tag_rss_rs_itdp_item_at(.*)


is just dumping me the source html.
0
 
LVL 13

Accepted Solution

by:
Superdave earned 500 total points
ID: 34931522
The one that you said gave "last man on earth" looks pretty close.  If the R.E.'s really are Perl/Python-like, you should be able to use *? to get the first one instead of the last one like this:

.*?<a ([^>]*)>([^<]*)</a>.*

Or, leave it like you had it and just capture the beginning:

(.*)<a ([^>]*)>([^<]*)</a>.*

Then loop, getting the link text from the last captured group and matching again on the first group (everything before the link).  Or do the same thing forward instead of backward:

.*?<a ([^>]*)>([^<]*)</a>(.*)

In your last comment (tag_rss_rs_itdp_item_at(.*)) you're probably getting everything from the first tag_rss_rs_itdp_item_at till the end of the file; you need to match up to some kind of terminator like

tag_rss_rs_itdp_item_at">(.*?)<
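
Both ideas can be sketched in Python: the forward loop that peels off one link per match, and the marker-plus-terminator pattern with the capture group's parentheses balanced. The `tags_html` string is a shortened stand-in for the feed, and `re.DOTALL` is used on the assumption that the input may contain line breaks:

```python
import re

# Shortened stand-in for the tag list, using the /ref=... style URLs.
tags_html = (
    'Customer tags: '
    '<a rel="nofollow" target="_blank" '
    'href="http://www.amazon.com/tag/action/ref=tag_rss_rs_itdp_item_at">action</a>(46), '
    '<a rel="nofollow" target="_blank" '
    'href="http://www.amazon.com/tag/adventure/ref=tag_rss_rs_itdp_item_at">adventure</a>(36)'
    '<br />'
)

# Forward loop: group 2 is the link text, group 3 is the remainder to search next.
link = re.compile(r'.*?<a ([^>]*)>([^<]*)</a>(.*)', re.DOTALL)
rest = tags_html
texts = []
while True:
    m = link.match(rest)
    if m is None:        # no more <a ...>...</a> pairs in the remainder
        break
    texts.append(m.group(2))
    rest = m.group(3)

# Marker plus terminator: capture lazily from the marker up to the next '<'.
by_marker = re.findall(r'tag_rss_rs_itdp_item_at">(.*?)<', tags_html)

print(texts)      # ['action', 'adventure']
print(by_marker)  # ['action', 'adventure']
```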


0
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 34932841
Try this modification to one of Superdave's earlier patterns:
.*?<a [^>]*?href=[^>]+>([^<]*).*

0
