Solved

Help creating a matching expression

Posted on 2011-02-18
Last Modified: 2012-08-14
Hi,

I am trying to extract some text from an HTML feed on Yahoo Pipes. I admit I am no good with regular expressions (they hate me, and secretly conspire against me).

It's basic: I am trying to extract the price between <span class="tgProductPrice"></span>.

I am using this expression, but it's gobbling up extra text: .*<span class="tgProductPrice">(.*)</span>.*


I just need the "$0.98" (dollar sign, digits, and decimal point).

Thanks,
Ryan
Output from the expression:

$0.98<br /> <span class="tgRssReviews">Customer Rating: <img src="http://g-ecx.images-amazon.com/images/G/01/x-locale/common/customer-reviews/stars-3-5._V192240731_.gif" width="64" alt="3.6" align="absbottom" height="12" border="0"/><br /></span> <br /> <span class="tgRssProductTag"></span> <span class="tgRssAllTags">Customer tags: <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/science%20fiction">science fiction</a>(96), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/will%20smith">will smith</a>(83), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/horror">horror</a>(53), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/action">action</a>(46), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/adventure">adventure</a>(36), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/fantasy">fantasy</a>(33), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/dvd">dvd</a>(30), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/movie">movie</a>(20), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/zombies">zombies</a>(15), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/i%20am%20legend">i am legend</a>(7), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/bad%20sci-fi">bad sci-fi</a>(4), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/last%20man%20on%20earth">last man on earth</a>(4)<br /></span>


Original Output, sans expression:

<div class="hreview" style="clear:both;"><div class="item"><div style="float:left;" class="tgRssImage"><a rel="nofollow" class="url" target="_blank" href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/B0013FDM7E"><img src="http://ecx.images-amazon.com/images/I/51MvRlzFlpL._SL160_SS160_.jpg" width="160" alt="I Am Legend (Widescreen Single-Disc Edition)" class="photo" height="160" border="0"/>  </a></div><span class="tgRssTitle fn summary">I Am Legend (Widescreen Single-Disc Edition) (<span class="tgRssBinding">DVD</span>)<br />By <span class="tgRssAuthor">Will Smith</span><br /></span> </div> <div class="description"> <br /> <span style="display:block;" class="tgRssPriceBlock"><span class="tgProductPriceLine"><a rel="nofollow" target="_blank" href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/B0013FDM7E">Buy new</a>: <span class="tgProductPrice">$5.49</span></span><br /><span class="tgProductUsedPrice"><a rel="nofollow" target="_blank" href="http://www.amazon.com/gp/offer-listing/B0013FDM7E" id="tag_rso_rs_eofr_used">294 used and new</a> from <span class="tgProductPrice">$0.98</span></span><br /></span> <span class="tgRssReviews">Customer Rating: <img src="http://g-ecx.images-amazon.com/images/G/01/x-locale/common/customer-reviews/stars-3-5._V192240731_.gif" width="64" alt="3.6" align="absbottom" height="12" border="0"/><br /></span> <br /> <span class="tgRssProductTag"></span> <span class="tgRssAllTags">Customer tags: <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/science%20fiction">science fiction</a>(96), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/will%20smith">will smith</a>(83), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/horror">horror</a>(53), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/action">action</a>(46), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/adventure">adventure</a>(36), <a rel="nofollow" target="_blank" 
href="http://www.amazon.com/tag/fantasy">fantasy</a>(33), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/dvd">dvd</a>(30), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/movie">movie</a>(20), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/zombies">zombies</a>(15), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/i%20am%20legend">i am legend</a>(7), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/bad%20sci-fi">bad sci-fi</a>(4), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/last%20man%20on%20earth">last man on earth</a>(4)<br /></span> </div></div>

Question by:rossryan
 

Author Comment

by:rossryan
ID: 34930632
Right, I am trying to get the data between a pair of html tags.
 
LVL 13

Expert Comment

by:Superdave
ID: 34930693
.*<span class="tgProductPrice">([^<]*)</span>.*

Most expression-matching languages/libraries wouldn't need the .* at the beginning or the end, either, but I don't know what you're using so I left it be.

The idea is to match anything except an open angle bracket, so you don't match too much.

Using .*? instead of .* is another way of doing that, if you have Perl/Python-compatible regular expressions.  [^<] should be even more portable though.
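To make the difference concrete, here is a minimal sketch in Python's `re` module. The Python setting is an assumption (the question is about Yahoo Pipes), and the `html` string is a shortened stand-in for the feed item, not the real data. It runs the original greedy expression, the negated-class version, and the lazy version against the same text.

```python
import re

# Shortened stand-in for the feed item (assumption: the real item is much longer).
html = (
    '<span class="tgProductPrice">$5.49</span></span><br />'
    '<span class="tgProductUsedPrice">from '
    '<span class="tgProductPrice">$0.98</span></span><br />'
    '<span class="tgRssReviews">Customer Rating</span>'
)

# Original expression: the capturing (.*) is greedy and runs to the LAST </span>.
greedy = re.compile(r'.*<span class="tgProductPrice">(.*)</span>.*')
# Negated class: the capture cannot cross a "<", so it stops at the price.
negated = re.compile(r'.*<span class="tgProductPrice">([^<]*)</span>.*')
# Lazy quantifier: the capture takes as little as possible before </span>.
lazy = re.compile(r'.*<span class="tgProductPrice">(.*?)</span>.*')

greedy_price = greedy.match(html).group(1)
negated_price = negated.match(html).group(1)
lazy_price = lazy.match(html).group(1)

print(repr(greedy_price))
print(repr(negated_price))
print(repr(lazy_price))
```

Because the leading `.*` is still greedy in all three, each one locks onto the last `tgProductPrice` span in the text, which is why the used price rather than the new price comes back.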
 

Author Comment

by:rossryan
ID: 34930755
Nicely done. I will up the points to 500 (you're already getting the 350), if you can craft two more expressions:

<span class="tgRssTitle fn summary">I Am Legend (Widescreen Single-Disc Edition) (<span class="tgRssBinding">DVD</span>

I need to extract that title.

And:

<span class="tgRssAllTags">Customer tags: <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/science%20fiction">science fiction</a>(96), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/will%20smith">will smith</a>(83), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/horror">horror</a>(53), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/action">action</a>(46), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/adventure">adventure</a>(36), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/fantasy">fantasy</a>(33), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/dvd">dvd</a>(30), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/movie">movie</a>(20), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/zombies">zombies</a>(15), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/i%20am%20legend">i am legend</a>(7), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/bad%20sci-fi">bad sci-fi</a>(4), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/last%20man%20on%20earth">last man on earth</a>(4)<br /></span>

And the extraction of those tags.


If you can respond quickly, I could create two extra questions out of those, if that would work for you.


Thanks,
Ryan
 
LVL 13

Expert Comment

by:Superdave
ID: 34930813
Try:

<span class="tgRssTitle[^>]*>([^<]*)

and

href="([^"]*)

 

Author Comment

by:rossryan
ID: 34930817
For some odd reason, I can never get these things to work properly. And I am under the gun to get this done in an hour or so.
 

Author Comment

by:rossryan
ID: 34930827
This is the output for <span class="tgRssTitle[^>]*>([^<]*).

<div class="hreview" style="clear:both;"><div class="item"><div style="float:left;" class="tgRssImage"><a rel="nofollow" class="url" target="_blank" href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/B0013FDM7E"><img src="http://ecx.images-amazon.com/images/I/51MvRlzFlpL._SL160_SS160_.jpg" width="160" alt="I Am Legend (Widescreen Single-Disc Edition)" class="photo" height="160" border="0"/>  </a></div>I Am Legend (Widescreen Single-Disc Edition) (<span class="tgRssBinding">DVD</span>)<br />By <span class="tgRssAuthor">Will Smith</span><br /> </div> <div class="description"> <br /> <span style="display:block;" class="tgRssPriceBlock"><span class="tgProductPriceLine"><a rel="nofollow" target="_blank" href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/B0013FDM7E">Buy new</a>: <span class="tgProductPrice">$5.49</span></span><br /><span class="tgProductUsedPrice"><a rel="nofollow" target="_blank" href="http://www.amazon.com/gp/offer-listing/B0013FDM7E" id="tag_rso_rs_eofr_used">295 used and new</a> from <span class="tgProductPrice">$0.98</span></span><br /></span> <span class="tgRssReviews">Customer Rating: <img src="http://g-ecx.images-amazon.com/images/G/01/x-locale/common/customer-reviews/stars-3-5._V192240731_.gif" width="64" alt="3.6" align="absbottom" height="12" border="0"/><br /></span> <br /> <span class="tgRssProductTag"></span> <span class="tgRssAllTags">Customer tags: <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/science%20fiction">science fiction</a>(96), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/will%20smith">will smith</a>(83), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/horror">horror</a>(53), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/action">action</a>(46), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/adventure">adventure</a>(36), <a rel="nofollow" target="_blank" 
href="http://www.amazon.com/tag/fantasy">fantasy</a>(33), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/dvd">dvd</a>(30), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/movie">movie</a>(20), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/zombies">zombies</a>(15), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/i%20am%20legend">i am legend</a>(7), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/bad%20sci-fi">bad sci-fi</a>(4), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/last%20man%20on%20earth">last man on earth</a>(4)<br /></span> </div></div>

 

Author Comment

by:rossryan
ID: 34930841
Switching to: .*<span class="tgRssTitle[^>]*>([^<]*)


gives me the below:
I Am Legend (Widescreen Single-Disc Edition) (<span class="tgRssBinding">DVD</span>)<br />By <span class="tgRssAuthor">Will Smith</span><br />  <div class="description"> <br /> <span style="display:block;" class="tgRssPriceBlock"><span class="tgProductPriceLine"><a rel="nofollow" target="_blank" href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/B0013FDM7E">Buy new</a>: <span class="tgProductPrice">$5.49</span></span><br /><span class="tgProductUsedPrice"><a rel="nofollow" target="_blank" href="http://www.amazon.com/gp/offer-listing/B0013FDM7E" id="tag_rso_rs_eofr_used">295 used and new</a> from <span class="tgProductPrice">$0.98</span></span><br /></span> <span class="tgRssReviews">Customer Rating: <img src="http://g-ecx.images-amazon.com/images/G/01/x-locale/common/customer-reviews/stars-3-5._V192240731_.gif" width="64" alt="3.6" align="absbottom" height="12" border="0"/><br /></span> <br /> <span class="tgRssProductTag"></span> <span class="tgRssAllTags">Customer tags: <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/science%20fiction">science fiction</a>(96), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/will%20smith">will smith</a>(83), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/horror">horror</a>(53), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/action">action</a>(46), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/adventure">adventure</a>(36), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/fantasy">fantasy</a>(33), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/dvd">dvd</a>(30), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/movie">movie</a>(20), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/zombies">zombies</a>(15), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/i%20am%20legend">i am legend</a>(7), <a rel="nofollow" target="_blank" 
href="http://www.amazon.com/tag/bad%20sci-fi">bad sci-fi</a>(4), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/last%20man%20on%20earth">last man on earth</a>(4)<br /></span> </div>

 

Author Comment

by:rossryan
ID: 34930849
My above comment was close, but I need to ditch the rest of that string.

Any ideas?
 

Author Comment

by:rossryan
ID: 34930917
.*<span class="tgRssTitle[^>]*>([^<]*) \(<span.* works.


Now I need that last expression to work.
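For anyone following along, here is a rough Python sketch of why the anchored version works. It assumes the tool replaces whatever the expression matches with the first captured group, which is what the outputs quoted in this thread suggest but is not confirmed. The `item` string is a shortened stand-in for the feed item.

```python
import re

# Shortened stand-in for the feed item (assumption, not the full data).
item = (
    '<div class="item"><span class="tgRssTitle fn summary">'
    'I Am Legend (Widescreen Single-Disc Edition) '
    '(<span class="tgRssBinding">DVD</span>)<br />By '
    '<span class="tgRssAuthor">Will Smith</span><br /></span> </div>'
)

unanchored = r'<span class="tgRssTitle[^>]*>([^<]*)'
anchored = r'.*<span class="tgRssTitle[^>]*>([^<]*) \(<span.*'

# Only the matched part is replaced, so the surrounding HTML survives.
partial = re.sub(unanchored, r'\1', item, count=1)
# The leading and trailing .* make the match cover the whole item,
# so replacing it with group 1 leaves only the title.
title = re.sub(anchored, r'\1', item, count=1)

print(partial)
print(title)
```

The ` \(<span` part also trims the trailing " (" that `[^<]*` would otherwise keep at the end of the title.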
 

Author Comment

by:rossryan
ID: 34930960
.*<a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/.*">([^<]*)</a>.* isn't working.
 
LVL 13

Expert Comment

by:Superdave
ID: 34930967
What does the last one need to extract, the hrefs or the text of the links?

If it's the hrefs, try this. The second captured group is the rest of the string, so you'll have to repeat the search on it unless you have a findall function:
.*href="([^"]*)(.*)

Otherwise, for the text:

.*href="[^"]* *>([^<]*)(.*)

 

Author Comment

by:rossryan
ID: 34930975
The text of the links.

<a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/dvd">dvd</a>


It should extract "dvd" from between the anchor's opening and closing tags.
 

Author Comment

by:rossryan
ID: 34930985
Yes, it hates: .*href="[^"]* *>([^<]*)(.*)


Gives me this output:
<div class="hreview" style="clear:both;"><div class="item"><div style="float:left;" class="tgRssImage"><a rel="nofollow" class="url" target="_blank" href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/B0013FDM7E"><img src="http://ecx.images-amazon.com/images/I/51MvRlzFlpL._SL160_SS160_.jpg" width="160" alt="I Am Legend (Widescreen Single-Disc Edition)" class="photo" height="160" border="0"/>  </a></div><span class="tgRssTitle fn summary">I Am Legend (Widescreen Single-Disc Edition) (<span class="tgRssBinding">DVD</span>)<br />By <span class="tgRssAuthor">Will Smith</span><br /></span> </div> <div class="description"> <br /> <span style="display:block;" class="tgRssPriceBlock"><span class="tgProductPriceLine"><a rel="nofollow" target="_blank" href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/B0013FDM7E">Buy new</a>: <span class="tgProductPrice">$5.49</span></span><br /><span class="tgProductUsedPrice"><a rel="nofollow" target="_blank" href="http://www.amazon.com/gp/offer-listing/B0013FDM7E" id="tag_rso_rs_eofr_used">295 used and new</a> from <span class="tgProductPrice">$0.98</span></span><br /></span> <span class="tgRssReviews">Customer Rating: <img src="http://g-ecx.images-amazon.com/images/G/01/x-locale/common/customer-reviews/stars-3-5._V192240731_.gif" width="64" alt="3.6" align="absbottom" height="12" border="0"/><br /></span> <br /> <span class="tgRssProductTag"></span> <span class="tgRssAllTags">Customer tags: <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/science%20fiction">science fiction</a>(96), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/will%20smith">will smith</a>(83), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/horror">horror</a>(53), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/action">action</a>(46), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/adventure">adventure</a>(36), <a rel="nofollow" target="_blank" 
href="http://www.amazon.com/tag/fantasy">fantasy</a>(33), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/dvd">dvd</a>(30), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/movie">movie</a>(20), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/zombies">zombies</a>(15), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/i%20am%20legend">i am legend</a>(7), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/bad%20sci-fi">bad sci-fi</a>(4), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/last%20man%20on%20earth">last man on earth</a>(4)<br /></span> </div></div>


 

Author Comment

by:rossryan
ID: 34931058
Mmmmmmmmmmm. Can you fix it?
 
LVL 13

Expert Comment

by:Superdave
ID: 34931065
Sorry, I missed a ".  Took me a while to figure it out!

'.*?href="[^"]*" *>([^<]*)(.*)'
 

Author Comment

by:rossryan
ID: 34931090
Hmmm.

This is not working.

If this is the html source:

<a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/science%20fiction">science fiction</a>(96)

I need to extract "science fiction" from between those tags. The href portion changes after /tag/, and it's not detecting it properly.
 

Author Comment

by:rossryan
ID: 34931097
With this expression: '.*?href="[^"]*" *>([^<]*)(.*)'


I get this output:
<div class="hreview" style="clear:both;"><div class="item"><div style="float:left;" class="tgRssImage"><a rel="nofollow" class="url" target="_blank" href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/B0013FDM7E/ref=tag_rso_rs_edpp_url"><img src="http://ecx.images-amazon.com/images/I/51MvRlzFlpL._SL160_SS160_.jpg" width="160" alt="I Am Legend (Widescreen Single-Disc Edition)" class="photo" height="160" border="0"/>  </a></div><span class="tgRssTitle fn summary">I Am Legend (Widescreen Single-Disc Edition) (<span class="tgRssBinding">DVD</span>)<br />By <span class="tgRssAuthor">Will Smith</span><br /></span> </div> <div class="description"> <br /> <span style="display:block;" class="tgRssPriceBlock"><span class="tgProductPriceLine"><a rel="nofollow" target="_blank" href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/B0013FDM7E/ref=tag_rso_rs_edpp_new">Buy new</a>: <span class="tgProductPrice">$5.49</span></span><br /><span class="tgProductUsedPrice"><a rel="nofollow" target="_blank" href="http://www.amazon.com/gp/offer-listing/B0013FDM7E/ref=tag_rso_rs_eofr_used" id="tag_rso_rs_eofr_used">292 used and new</a> from <span class="tgProductPrice">$0.98</span></span><br /></span> <span class="tgRssReviews">Customer Rating: <img src="http://g-ecx.images-amazon.com/images/G/01/x-locale/common/customer-reviews/stars-3-5._V192240731_.gif" width="64" alt="3.6" align="absbottom" height="12" border="0"/><br /></span> <br /> <span class="tgRssProductTag"></span> <span class="tgRssAllTags">Customer tags: <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/science%20fiction/ref=tag_rss_rs_itdp_item_at">science fiction</a>(96), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/will%20smith/ref=tag_rss_rs_itdp_item_at">will smith</a>(83), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/horror/ref=tag_rss_rs_itdp_item_at">horror</a>(53), <a rel="nofollow" target="_blank" 
href="http://www.amazon.com/tag/action/ref=tag_rss_rs_itdp_item_at">action</a>(46), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/adventure/ref=tag_rss_rs_itdp_item_at">adventure</a>(36), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/fantasy/ref=tag_rss_rs_itdp_item_at">fantasy</a>(33), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/dvd/ref=tag_rss_rs_itdp_item_at">dvd</a>(30), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/movie/ref=tag_rss_rs_itdp_item_at">movie</a>(20), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/zombies/ref=tag_rss_rs_itdp_item_at">zombies</a>(15), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/i%20am%20legend/ref=tag_rss_rs_itdp_item_at">i am legend</a>(7), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/bad%20sci-fi/ref=tag_rss_rs_itdp_item_at">bad sci-fi</a>(4), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/last%20man%20on%20earth/ref=tag_rss_rs_itdp_item_at">last man on earth</a>(4)<br /></span> </div></div>

 

Author Comment

by:rossryan
ID: 34931105
And with this expression: .*?href="[^"]*" *>([^<]*)(.*)

I get no output.
 

Author Comment

by:rossryan
ID: 34931111
So close, and yet so far. With that last expression, I am not getting any output...
 
LVL 13

Expert Comment

by:Superdave
ID: 34931121
Maybe try this, leaving out the optional blank I allowed for after the URL, since there doesn't seem to be one.  The first ones I typed in off the top of my head, but the last one I tested in Python and it works.  What language or R.E. library are you using?

.*?href="[^"]*">([^<]*)(.*)
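For reference, this is roughly how the "match, then repeat on the rest" idea looks in Python. It is only a sketch: `link_texts` is a hypothetical helper name, the `sample` string is a shortened stand-in for the tag list, and `re.DOTALL` is added as a precaution in case the item contains line breaks.

```python
import re

# Shortened stand-in for the "Customer tags" part of the item (assumption).
sample = (
    '<span class="tgRssAllTags">Customer tags: '
    '<a rel="nofollow" target="_blank" '
    'href="http://www.amazon.com/tag/science%20fiction">science fiction</a>(96), '
    '<a rel="nofollow" target="_blank" '
    'href="http://www.amazon.com/tag/will%20smith">will smith</a>(83), '
    '<a rel="nofollow" target="_blank" '
    'href="http://www.amazon.com/tag/horror">horror</a>(53)<br /></span>'
)

# Group 1 is the link text, group 2 is everything after it.
pattern = re.compile(r'.*?href="[^"]*">([^<]*)(.*)', re.DOTALL)


def link_texts(text):
    """Hypothetical helper: collect link texts by re-matching on the remainder."""
    found = []
    while True:
        m = pattern.match(text)
        if m is None:
            break
        found.append(m.group(1))
        text = m.group(2)  # continue with the rest of the string
    return found


tags = link_texts(sample)
print(tags)
```

Run against the whole item rather than just the tag list, the same loop would also pick up other links whose `href` is the last attribute, such as "Buy new", so it helps to cut the input down to the `tgRssAllTags` span first.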
 

Author Comment

by:rossryan
ID: 34931142
It's supposedly Perl-like. It's Yahoo's whatever you want to call it.

 

Author Comment

by:rossryan
ID: 34931157
No dice.

When I tell it to select from the second field, and I use this expression: .*<a ([^>]*)>([^<]*)</a>.*


I get this: last man on earth
 

Author Comment

by:rossryan
ID: 34931180
Hmm.

If this is the source HTML:

<div class="hreview" style="clear:both;"><div class="item"><div style="float:left;" class="tgRssImage"><a rel="nofollow" class="url" target="_blank" href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/B0013FDM7E/ref=tag_rso_rs_edpp_url"><img src="http://ecx.images-amazon.com/images/I/51MvRlzFlpL._SL160_SS160_.jpg" width="160" alt="I Am Legend (Widescreen Single-Disc Edition)" class="photo" height="160" border="0"/>  </a></div><span class="tgRssTitle fn summary">I Am Legend (Widescreen Single-Disc Edition) (<span class="tgRssBinding">DVD</span>)<br />By <span class="tgRssAuthor">Will Smith</span><br /></span> </div> <div class="description"> <br /> <span style="display:block;" class="tgRssPriceBlock"><span class="tgProductPriceLine"><a rel="nofollow" target="_blank" href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/B0013FDM7E/ref=tag_rso_rs_edpp_new">Buy new</a>: <span class="tgProductPrice">$5.49</span></span><br /><span class="tgProductUsedPrice"><a rel="nofollow" target="_blank" href="http://www.amazon.com/gp/offer-listing/B0013FDM7E/ref=tag_rso_rs_eofr_used" id="tag_rso_rs_eofr_used">292 used and new</a> from <span class="tgProductPrice">$0.98</span></span><br /></span> <span class="tgRssReviews">Customer Rating: <img src="http://g-ecx.images-amazon.com/images/G/01/x-locale/common/customer-reviews/stars-3-5._V192240731_.gif" width="64" alt="3.6" align="absbottom" height="12" border="0"/><br /></span> <br /> <span class="tgRssProductTag"></span> <span class="tgRssAllTags">Customer tags: <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/science%20fiction/ref=tag_rss_rs_itdp_item_at">science fiction</a>(96), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/will%20smith/ref=tag_rss_rs_itdp_item_at">will smith</a>(83), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/horror/ref=tag_rss_rs_itdp_item_at">horror</a>(53), <a rel="nofollow" target="_blank" 
href="http://www.amazon.com/tag/action/ref=tag_rss_rs_itdp_item_at">action</a>(46), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/adventure/ref=tag_rss_rs_itdp_item_at">adventure</a>(36), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/fantasy/ref=tag_rss_rs_itdp_item_at">fantasy</a>(33), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/dvd/ref=tag_rss_rs_itdp_item_at">dvd</a>(30), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/movie/ref=tag_rss_rs_itdp_item_at">movie</a>(20), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/zombies/ref=tag_rss_rs_itdp_item_at">zombies</a>(15), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/i%20am%20legend/ref=tag_rss_rs_itdp_item_at">i am legend</a>(7), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/bad%20sci-fi/ref=tag_rss_rs_itdp_item_at">bad sci-fi</a>(4), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/last%20man%20on%20earth/ref=tag_rss_rs_itdp_item_at">last man on earth</a>(4)<br /></span> </div></div>



Why is this backfiring so badly?
 

Author Comment

by:rossryan
ID: 34931192
Should be able to extract based on:

<a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/action/ref=tag_rss_rs_itdp_item_at">

Perhaps an expression that looks for: ref=tag_rss_rs_itdp_item_at
 

Author Comment

by:rossryan
ID: 34931202
Hmm.


tag_rss_rs_itdp_item_at(.*)


is just dumping me the source html.
 
LVL 13

Accepted Solution

by:
Superdave earned 500 total points
ID: 34931522
The one that you said gave "last man on earth" looks pretty close.  If the R.E.'s really are Perl/Python-like, you should be able to use *? to get the first one instead of the last one like this:

.*?<a ([^>]*)>([^<]*)</a>.*

Or, leave it like you had it and just capture the beginning:

(.*)<a ([^>]*)>([^<]*)</a>.*

Then loop, getting the title from the second value and matching again on the first value.  Or do the same thing forward instead of backward:

.*?<a ([^>]*)>([^<]*)</a>(.*)

In your last comment (tag_rss_rs_itdp_item_at(.*)) you're probably getting everything from the first tag_rss_rs_itdp_item_at till the end of the file; you need to match up to some kind of terminator like

tag_rss_rs_itdp_item_at">(.*?)<
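If a findall-style function is available, the looping can be skipped. Below is a small sketch with Python's `re.findall`; whether Yahoo Pipes offers an equivalent is not something I know. The `sample` string is a shortened stand-in for the tag list.

```python
import re

# Shortened stand-in for the tag list, using the longer hrefs from the later posts.
sample = (
    'Customer tags: '
    '<a rel="nofollow" target="_blank" '
    'href="http://www.amazon.com/tag/dvd/ref=tag_rss_rs_itdp_item_at">dvd</a>(30), '
    '<a rel="nofollow" target="_blank" '
    'href="http://www.amazon.com/tag/movie/ref=tag_rss_rs_itdp_item_at">movie</a>(20), '
    '<a rel="nofollow" target="_blank" '
    'href="http://www.amazon.com/tag/zombies/ref=tag_rss_rs_itdp_item_at">zombies</a>(15)<br />'
)

# No leading or trailing .* needed: findall scans for every occurrence itself.
by_anchor = re.findall(r'<a [^>]*>([^<]*)</a>', sample)

# Marker-based variant with a terminator: stop at the next "<".
by_marker = re.findall(r'tag_rss_rs_itdp_item_at">(.*?)<', sample)

# Without a terminator, the capture runs from the first marker to the end.
unterminated = re.search(r'tag_rss_rs_itdp_item_at(.*)', sample).group(1)

print(by_anchor)
print(by_marker)
print(repr(unterminated))
```

The last line shows the behaviour described above: with no terminator, everything after the first marker is captured.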


 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 34932841
Try this modification to one of Superdave's earlier patterns:
.*?<a [^>]*?href=[^>]+>([^<]*).*
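As a quick check of what this pattern captures, here is a sketch in Python. It assumes replace-with-first-group behaviour, which is unconfirmed for Yahoo Pipes, and uses a shortened stand-in string for the tag list.

```python
import re

# Shortened stand-in for the tag list (assumption, not the full item).
sample = (
    'Customer tags: '
    '<a rel="nofollow" target="_blank" '
    'href="http://www.amazon.com/tag/science%20fiction/ref=tag_rss_rs_itdp_item_at">'
    'science fiction</a>(96), '
    '<a rel="nofollow" target="_blank" '
    'href="http://www.amazon.com/tag/will%20smith/ref=tag_rss_rs_itdp_item_at">'
    'will smith</a>(83)<br />'
)

# href=[^>]+> accepts anything after the URL up to the closing ">",
# so it does not matter what follows /tag/ or which attributes come next.
pattern = re.compile(r'.*?<a [^>]*?href=[^>]+>([^<]*).*', re.DOTALL)

first_tag = pattern.sub(r'\1', sample, count=1)
print(first_tag)
```

Note that this yields only the first link's text. Getting every tag still needs a repeated match or a findall-style call.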
