• Status: Solved
  • Priority: Medium
  • Security: Public

Help creating a matching expression

Hi,

I am trying to extract some text from an HTML feed on Yahoo Pipes. I admit I am no good with regular expressions (they hate me, and secretly conspire against me).

It's basic: I am trying to extract the price between <span class="tgProductPrice"></span>.

I am using this expression, but it's gobbling up extra text: .*<span class="tgProductPrice">(.*)</span>.*


I just need the "$0.98" (dollar sign, digits, and decimal point).

Thanks,
Ryan
Output from the expression:

$0.98<br /> <span class="tgRssReviews">Customer Rating: <img src="http://g-ecx.images-amazon.com/images/G/01/x-locale/common/customer-reviews/stars-3-5._V192240731_.gif" width="64" alt="3.6" align="absbottom" height="12" border="0"/><br /></span> <br /> <span class="tgRssProductTag"></span> <span class="tgRssAllTags">Customer tags: <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/science%20fiction">science fiction</a>(96), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/will%20smith">will smith</a>(83), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/horror">horror</a>(53), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/action">action</a>(46), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/adventure">adventure</a>(36), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/fantasy">fantasy</a>(33), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/dvd">dvd</a>(30), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/movie">movie</a>(20), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/zombies">zombies</a>(15), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/i%20am%20legend">i am legend</a>(7), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/bad%20sci-fi">bad sci-fi</a>(4), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/last%20man%20on%20earth">last man on earth</a>(4)<br /></span>


Original Output, sans expression:

<div class="hreview" style="clear:both;"><div class="item"><div style="float:left;" class="tgRssImage"><a rel="nofollow" class="url" target="_blank" href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/B0013FDM7E"><img src="http://ecx.images-amazon.com/images/I/51MvRlzFlpL._SL160_SS160_.jpg" width="160" alt="I Am Legend (Widescreen Single-Disc Edition)" class="photo" height="160" border="0"/>  </a></div><span class="tgRssTitle fn summary">I Am Legend (Widescreen Single-Disc Edition) (<span class="tgRssBinding">DVD</span>)<br />By <span class="tgRssAuthor">Will Smith</span><br /></span> </div> <div class="description"> <br /> <span style="display:block;" class="tgRssPriceBlock"><span class="tgProductPriceLine"><a rel="nofollow" target="_blank" href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/B0013FDM7E">Buy new</a>: <span class="tgProductPrice">$5.49</span></span><br /><span class="tgProductUsedPrice"><a rel="nofollow" target="_blank" href="http://www.amazon.com/gp/offer-listing/B0013FDM7E" id="tag_rso_rs_eofr_used">294 used and new</a> from <span class="tgProductPrice">$0.98</span></span><br /></span> <span class="tgRssReviews">Customer Rating: <img src="http://g-ecx.images-amazon.com/images/G/01/x-locale/common/customer-reviews/stars-3-5._V192240731_.gif" width="64" alt="3.6" align="absbottom" height="12" border="0"/><br /></span> <br /> <span class="tgRssProductTag"></span> <span class="tgRssAllTags">Customer tags: <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/science%20fiction">science fiction</a>(96), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/will%20smith">will smith</a>(83), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/horror">horror</a>(53), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/action">action</a>(46), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/adventure">adventure</a>(36), <a rel="nofollow" target="_blank" 
href="http://www.amazon.com/tag/fantasy">fantasy</a>(33), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/dvd">dvd</a>(30), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/movie">movie</a>(20), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/zombies">zombies</a>(15), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/i%20am%20legend">i am legend</a>(7), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/bad%20sci-fi">bad sci-fi</a>(4), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/last%20man%20on%20earth">last man on earth</a>(4)<br /></span> </div></div>

Asked by: rossryan
 
rossryanAuthor Commented:
Right, I am trying to get the data between a pair of html tags.
 
SuperdaveCommented:
.*<span class="tgProductPrice">([^<]*)</span>.*

Most expression-matching languages/libraries wouldn't need the .* at the beginning or the end, either, but I don't know what you're using so I left it be.

The idea is to match anything except an open bracket, so you don't match too much.

Using .*? instead of .* is another way of doing that, if you have Perl/Python-compatible regular expressions.  [^<] should be even more portable though.
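
To make the difference concrete, here is a minimal Python sketch. It assumes a shortened copy of the feed HTML from the question and uses Python's re module only as a stand-in for whatever engine Yahoo Pipes runs. It compares the greedy (.*) capture with the [^<]* and .*? variants described above.

```python
import re

# Shortened sample of the feed HTML posted in the question.
html = (
    '<span class="tgProductPrice">$5.49</span></span><br />'
    '<span class="tgProductUsedPrice">from '
    '<span class="tgProductPrice">$0.98</span></span><br /></span> '
    '<span class="tgRssReviews">Customer Rating</span>'
)

# Greedy: (.*) runs on to the LAST </span> in the string.
greedy = re.compile(r'<span class="tgProductPrice">(.*)</span>')
# Negated class: the capture stops at the first "<".
negated = re.compile(r'<span class="tgProductPrice">([^<]*)</span>')
# Lazy quantifier: the capture stops at the first </span>.
lazy = re.compile(r'<span class="tgProductPrice">(.*?)</span>')

greedy_capture = greedy.search(html).group(1)
negated_prices = negated.findall(html)
lazy_prices = lazy.findall(html)

print(greedy_capture)   # starts with the price, then drags in trailing markup
print(negated_prices)
print(lazy_prices)
```

Both the negated-class and the lazy pattern return just the two prices, while the greedy capture swallows everything up to the final closing span.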
 
rossryanAuthor Commented:
Nicely done. I will up the points to 500 (you're already getting the 350), if you can craft two more expressions:

<span class="tgRssTitle fn summary">I Am Legend (Widescreen Single-Disc Edition) (<span class="tgRssBinding">DVD</span>

I need to extract that title.

And:

<span class="tgRssAllTags">Customer tags: <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/science%20fiction">science fiction</a>(96), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/will%20smith">will smith</a>(83), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/horror">horror</a>(53), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/action">action</a>(46), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/adventure">adventure</a>(36), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/fantasy">fantasy</a>(33), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/dvd">dvd</a>(30), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/movie">movie</a>(20), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/zombies">zombies</a>(15), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/i%20am%20legend">i am legend</a>(7), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/bad%20sci-fi">bad sci-fi</a>(4), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/last%20man%20on%20earth">last man on earth</a>(4)<br /></span>

And the extraction of those tags.


If you can respond quickly, I could create two extra questions out of those, if that would work for you.


Thanks,
Ryan
 
SuperdaveCommented:
Try:

<span class="tgRssTitle[^>]*>([^<]*)

and

href="([^"]*)

 
rossryanAuthor Commented:
For some odd reason, I can never get these things to work properly. And I am under the gun to get this done in an hour or so.
 
rossryanAuthor Commented:
This is the output for <span class="tgRssTitle[^>]*>([^<]*).

<div class="hreview" style="clear:both;"><div class="item"><div style="float:left;" class="tgRssImage"><a rel="nofollow" class="url" target="_blank" href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/B0013FDM7E"><img src="http://ecx.images-amazon.com/images/I/51MvRlzFlpL._SL160_SS160_.jpg" width="160" alt="I Am Legend (Widescreen Single-Disc Edition)" class="photo" height="160" border="0"/>  </a></div>I Am Legend (Widescreen Single-Disc Edition) (<span class="tgRssBinding">DVD</span>)<br />By <span class="tgRssAuthor">Will Smith</span><br /> </div> <div class="description"> <br /> <span style="display:block;" class="tgRssPriceBlock"><span class="tgProductPriceLine"><a rel="nofollow" target="_blank" href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/B0013FDM7E">Buy new</a>: <span class="tgProductPrice">$5.49</span></span><br /><span class="tgProductUsedPrice"><a rel="nofollow" target="_blank" href="http://www.amazon.com/gp/offer-listing/B0013FDM7E" id="tag_rso_rs_eofr_used">295 used and new</a> from <span class="tgProductPrice">$0.98</span></span><br /></span> <span class="tgRssReviews">Customer Rating: <img src="http://g-ecx.images-amazon.com/images/G/01/x-locale/common/customer-reviews/stars-3-5._V192240731_.gif" width="64" alt="3.6" align="absbottom" height="12" border="0"/><br /></span> <br /> <span class="tgRssProductTag"></span> <span class="tgRssAllTags">Customer tags: <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/science%20fiction">science fiction</a>(96), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/will%20smith">will smith</a>(83), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/horror">horror</a>(53), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/action">action</a>(46), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/adventure">adventure</a>(36), <a rel="nofollow" target="_blank" 
href="http://www.amazon.com/tag/fantasy">fantasy</a>(33), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/dvd">dvd</a>(30), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/movie">movie</a>(20), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/zombies">zombies</a>(15), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/i%20am%20legend">i am legend</a>(7), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/bad%20sci-fi">bad sci-fi</a>(4), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/last%20man%20on%20earth">last man on earth</a>(4)<br /></span> </div></div>

 
rossryanAuthor Commented:
Switching to: .*<span class="tgRssTitle[^>]*>([^<]*)


gives me the below:
I Am Legend (Widescreen Single-Disc Edition) (<span class="tgRssBinding">DVD</span>)<br />By <span class="tgRssAuthor">Will Smith</span><br />  <div class="description"> <br /> <span style="display:block;" class="tgRssPriceBlock"><span class="tgProductPriceLine"><a rel="nofollow" target="_blank" href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/B0013FDM7E">Buy new</a>: <span class="tgProductPrice">$5.49</span></span><br /><span class="tgProductUsedPrice"><a rel="nofollow" target="_blank" href="http://www.amazon.com/gp/offer-listing/B0013FDM7E" id="tag_rso_rs_eofr_used">295 used and new</a> from <span class="tgProductPrice">$0.98</span></span><br /></span> <span class="tgRssReviews">Customer Rating: <img src="http://g-ecx.images-amazon.com/images/G/01/x-locale/common/customer-reviews/stars-3-5._V192240731_.gif" width="64" alt="3.6" align="absbottom" height="12" border="0"/><br /></span> <br /> <span class="tgRssProductTag"></span> <span class="tgRssAllTags">Customer tags: <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/science%20fiction">science fiction</a>(96), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/will%20smith">will smith</a>(83), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/horror">horror</a>(53), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/action">action</a>(46), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/adventure">adventure</a>(36), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/fantasy">fantasy</a>(33), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/dvd">dvd</a>(30), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/movie">movie</a>(20), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/zombies">zombies</a>(15), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/i%20am%20legend">i am legend</a>(7), <a rel="nofollow" target="_blank" 
href="http://www.amazon.com/tag/bad%20sci-fi">bad sci-fi</a>(4), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/last%20man%20on%20earth">last man on earth</a>(4)<br /></span> </div>

 
rossryanAuthor Commented:
My above comment was close, but I need to ditch the rest of that string.

Any ideas?
 
rossryanAuthor Commented:
.*<span class="tgRssTitle[^>]*>([^<]*) \(<span.* works.
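
The outputs posted earlier in the thread suggest that the Pipes regex step replaces only the matched portion and keeps everything else, which would explain why the pattern needs .* on both sides. Here is a rough Python analogue under that assumption. re.sub stands in for the Pipes module (it is not the Pipes API), and the sample HTML is a shortened copy of the feed.

```python
import re

# Shortened sample of the title markup from the feed.
html = (
    '<div class="item">'
    '<span class="tgRssTitle fn summary">'
    'I Am Legend (Widescreen Single-Disc Edition) '
    '(<span class="tgRssBinding">DVD</span>)<br />'
    'By <span class="tgRssAuthor">Will Smith</span><br /></span> </div>'
)

# Without .* on either side, only the matched part is replaced,
# so the surrounding markup survives the substitution.
unanchored = r'<span class="tgRssTitle[^>]*>([^<]*) \(<span'
# With .* on both sides, the match covers the whole string,
# so replacing it with group 1 leaves just the title.
anchored = r'.*<span class="tgRssTitle[^>]*>([^<]*) \(<span.*'

partial = re.sub(unanchored, r'\1', html)
whole = re.sub(anchored, r'\1', html)

print(partial)
print(whole)
```

If the feed text contains line breaks, the leading and trailing .* would also need a flag that lets the dot match newlines (re.DOTALL in Python). Whether Pipes exposes such an option is not confirmed in this thread.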


Now I need that last expression to work.
 
rossryanAuthor Commented:
.*<a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/.*">([^<]*)</a>.* isn't working.
 
SuperdaveCommented:
What does the last one need to extract, the hrefs or the text of the links?

If it's the hrefs, try this. The second capture group is the rest of the string, so you'll have to repeat the search on it unless you have a findall function:
.*href="([^"]*)(.*)

Otherwise, for the text:

.*href="[^"]* *>([^<]*)(.*)

 
rossryanAuthor Commented:
The text of the links.

<a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/dvd">dvd</a>


It should extract dvd from between the anchor's opening and closing tags.
 
rossryanAuthor Commented:
Yes, it hates: .*href="[^"]* *>([^<]*)(.*)


Gives me this output:
<div class="hreview" style="clear:both;"><div class="item"><div style="float:left;" class="tgRssImage"><a rel="nofollow" class="url" target="_blank" href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/B0013FDM7E"><img src="http://ecx.images-amazon.com/images/I/51MvRlzFlpL._SL160_SS160_.jpg" width="160" alt="I Am Legend (Widescreen Single-Disc Edition)" class="photo" height="160" border="0"/>  </a></div><span class="tgRssTitle fn summary">I Am Legend (Widescreen Single-Disc Edition) (<span class="tgRssBinding">DVD</span>)<br />By <span class="tgRssAuthor">Will Smith</span><br /></span> </div> <div class="description"> <br /> <span style="display:block;" class="tgRssPriceBlock"><span class="tgProductPriceLine"><a rel="nofollow" target="_blank" href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/B0013FDM7E">Buy new</a>: <span class="tgProductPrice">$5.49</span></span><br /><span class="tgProductUsedPrice"><a rel="nofollow" target="_blank" href="http://www.amazon.com/gp/offer-listing/B0013FDM7E" id="tag_rso_rs_eofr_used">295 used and new</a> from <span class="tgProductPrice">$0.98</span></span><br /></span> <span class="tgRssReviews">Customer Rating: <img src="http://g-ecx.images-amazon.com/images/G/01/x-locale/common/customer-reviews/stars-3-5._V192240731_.gif" width="64" alt="3.6" align="absbottom" height="12" border="0"/><br /></span> <br /> <span class="tgRssProductTag"></span> <span class="tgRssAllTags">Customer tags: <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/science%20fiction">science fiction</a>(96), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/will%20smith">will smith</a>(83), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/horror">horror</a>(53), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/action">action</a>(46), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/adventure">adventure</a>(36), <a rel="nofollow" target="_blank" 
href="http://www.amazon.com/tag/fantasy">fantasy</a>(33), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/dvd">dvd</a>(30), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/movie">movie</a>(20), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/zombies">zombies</a>(15), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/i%20am%20legend">i am legend</a>(7), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/bad%20sci-fi">bad sci-fi</a>(4), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/last%20man%20on%20earth">last man on earth</a>(4)<br /></span> </div></div>

 
rossryanAuthor Commented:
Mmmmmmmmmmm. Can you fix it?
 
SuperdaveCommented:
Sorry, I missed a ".  Took me a while to figure it out!

'.*?href="[^"]*" *>([^<]*)(.*)'
 
rossryanAuthor Commented:
Hmmm.

This is not working.

If this is the html source:

<a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/science%20fiction">science fiction</a>(96)

I need to extract "science fiction" from between those tags. The href portion changes after /tag/, and it's not detecting it properly.
 
rossryanAuthor Commented:
With this expression: '.*?href="[^"]*" *>([^<]*)(.*)'


I get this output:
<div class="hreview" style="clear:both;"><div class="item"><div style="float:left;" class="tgRssImage"><a rel="nofollow" class="url" target="_blank" href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/B0013FDM7E/ref=tag_rso_rs_edpp_url"><img src="http://ecx.images-amazon.com/images/I/51MvRlzFlpL._SL160_SS160_.jpg" width="160" alt="I Am Legend (Widescreen Single-Disc Edition)" class="photo" height="160" border="0"/>  </a></div><span class="tgRssTitle fn summary">I Am Legend (Widescreen Single-Disc Edition) (<span class="tgRssBinding">DVD</span>)<br />By <span class="tgRssAuthor">Will Smith</span><br /></span> </div> <div class="description"> <br /> <span style="display:block;" class="tgRssPriceBlock"><span class="tgProductPriceLine"><a rel="nofollow" target="_blank" href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/B0013FDM7E/ref=tag_rso_rs_edpp_new">Buy new</a>: <span class="tgProductPrice">$5.49</span></span><br /><span class="tgProductUsedPrice"><a rel="nofollow" target="_blank" href="http://www.amazon.com/gp/offer-listing/B0013FDM7E/ref=tag_rso_rs_eofr_used" id="tag_rso_rs_eofr_used">292 used and new</a> from <span class="tgProductPrice">$0.98</span></span><br /></span> <span class="tgRssReviews">Customer Rating: <img src="http://g-ecx.images-amazon.com/images/G/01/x-locale/common/customer-reviews/stars-3-5._V192240731_.gif" width="64" alt="3.6" align="absbottom" height="12" border="0"/><br /></span> <br /> <span class="tgRssProductTag"></span> <span class="tgRssAllTags">Customer tags: <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/science%20fiction/ref=tag_rss_rs_itdp_item_at">science fiction</a>(96), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/will%20smith/ref=tag_rss_rs_itdp_item_at">will smith</a>(83), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/horror/ref=tag_rss_rs_itdp_item_at">horror</a>(53), <a rel="nofollow" target="_blank" 
href="http://www.amazon.com/tag/action/ref=tag_rss_rs_itdp_item_at">action</a>(46), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/adventure/ref=tag_rss_rs_itdp_item_at">adventure</a>(36), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/fantasy/ref=tag_rss_rs_itdp_item_at">fantasy</a>(33), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/dvd/ref=tag_rss_rs_itdp_item_at">dvd</a>(30), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/movie/ref=tag_rss_rs_itdp_item_at">movie</a>(20), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/zombies/ref=tag_rss_rs_itdp_item_at">zombies</a>(15), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/i%20am%20legend/ref=tag_rss_rs_itdp_item_at">i am legend</a>(7), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/bad%20sci-fi/ref=tag_rss_rs_itdp_item_at">bad sci-fi</a>(4), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/last%20man%20on%20earth/ref=tag_rss_rs_itdp_item_at">last man on earth</a>(4)<br /></span> </div></div>

 
rossryanAuthor Commented:
And with this expression: .*?href="[^"]*" *>([^<]*)(.*)

I get no output.
 
rossryanAuthor Commented:
So close, and yet so far. With that last expression, I am not getting any output...
 
SuperdaveCommented:
Maybe try this, leaving out the optional blank I allowed for after the URL, since there doesn't seem to be one.  The first ones I typed in off the top of my head, but the last one I tested in Python and it works.  What language or R.E. library are you using?

.*?href="[^"]*">([^<]*)(.*)
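
One possible reason for the empty result on the full feed is that the first href in the item belongs to the product image link, whose link text is empty because an img tag follows immediately. The sketch below reproduces that in Python with a shortened sample (the image file name cover.jpg is a placeholder, not from the feed). It then shows one way around it by requiring /tag/ in the URL, which is an added assumption and not something proposed in this thread.

```python
import re

# Shortened sample based on the feed in this thread.
# "cover.jpg" is a hypothetical stand-in for the long image URL.
html = (
    '<a rel="nofollow" class="url" target="_blank" '
    'href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc'
    '/dp/B0013FDM7E/ref=tag_rso_rs_edpp_url">'
    '<img src="cover.jpg"/>  </a>'
    '<span class="tgRssAllTags">Customer tags: '
    '<a rel="nofollow" target="_blank" '
    'href="http://www.amazon.com/tag/science%20fiction'
    '/ref=tag_rss_rs_itdp_item_at">science fiction</a>(96), '
    '<a rel="nofollow" target="_blank" '
    'href="http://www.amazon.com/tag/dvd'
    '/ref=tag_rss_rs_itdp_item_at">dvd</a>(30)<br /></span>'
)

# The pattern from the comment above: it stops at the FIRST href,
# which is the image link, so group 1 is an empty string.
first = re.match(r'.*?href="[^"]*">([^<]*)(.*)', html)
first_text = first.group(1)

# Assumed refinement: only accept hrefs whose URL contains /tag/.
tag_texts = re.findall(r'href="[^"]*/tag/[^"]*">([^<]*)', html)

print(repr(first_text))
print(tag_texts)
```

Run against only the tgRssAllTags portion of the item, the original pattern would capture the first tag name instead of an empty string.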
 
rossryanAuthor Commented:
It's supposedly Perl-like. It's Yahoo's whatever you want to call it.

 
rossryanAuthor Commented:
No dice.

When I tell it to select from the second field, and I use this expression: .*<a ([^>]*)>([^<]*)</a>.*


I get this: last man on earth
 
rossryanAuthor Commented:
Hmm.

If this is the source HTML:

<div class="hreview" style="clear:both;"><div class="item"><div style="float:left;" class="tgRssImage"><a rel="nofollow" class="url" target="_blank" href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/B0013FDM7E/ref=tag_rso_rs_edpp_url"><img src="http://ecx.images-amazon.com/images/I/51MvRlzFlpL._SL160_SS160_.jpg" width="160" alt="I Am Legend (Widescreen Single-Disc Edition)" class="photo" height="160" border="0"/>  </a></div><span class="tgRssTitle fn summary">I Am Legend (Widescreen Single-Disc Edition) (<span class="tgRssBinding">DVD</span>)<br />By <span class="tgRssAuthor">Will Smith</span><br /></span> </div> <div class="description"> <br /> <span style="display:block;" class="tgRssPriceBlock"><span class="tgProductPriceLine"><a rel="nofollow" target="_blank" href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/B0013FDM7E/ref=tag_rso_rs_edpp_new">Buy new</a>: <span class="tgProductPrice">$5.49</span></span><br /><span class="tgProductUsedPrice"><a rel="nofollow" target="_blank" href="http://www.amazon.com/gp/offer-listing/B0013FDM7E/ref=tag_rso_rs_eofr_used" id="tag_rso_rs_eofr_used">292 used and new</a> from <span class="tgProductPrice">$0.98</span></span><br /></span> <span class="tgRssReviews">Customer Rating: <img src="http://g-ecx.images-amazon.com/images/G/01/x-locale/common/customer-reviews/stars-3-5._V192240731_.gif" width="64" alt="3.6" align="absbottom" height="12" border="0"/><br /></span> <br /> <span class="tgRssProductTag"></span> <span class="tgRssAllTags">Customer tags: <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/science%20fiction/ref=tag_rss_rs_itdp_item_at">science fiction</a>(96), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/will%20smith/ref=tag_rss_rs_itdp_item_at">will smith</a>(83), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/horror/ref=tag_rss_rs_itdp_item_at">horror</a>(53), <a rel="nofollow" target="_blank" 
href="http://www.amazon.com/tag/action/ref=tag_rss_rs_itdp_item_at">action</a>(46), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/adventure/ref=tag_rss_rs_itdp_item_at">adventure</a>(36), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/fantasy/ref=tag_rss_rs_itdp_item_at">fantasy</a>(33), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/dvd/ref=tag_rss_rs_itdp_item_at">dvd</a>(30), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/movie/ref=tag_rss_rs_itdp_item_at">movie</a>(20), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/zombies/ref=tag_rss_rs_itdp_item_at">zombies</a>(15), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/i%20am%20legend/ref=tag_rss_rs_itdp_item_at">i am legend</a>(7), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/bad%20sci-fi/ref=tag_rss_rs_itdp_item_at">bad sci-fi</a>(4), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/last%20man%20on%20earth/ref=tag_rss_rs_itdp_item_at">last man on earth</a>(4)<br /></span> </div></div>



Why is this backfiring so badly?
 
rossryanAuthor Commented:
Should be able to extract based on:

<a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/action/ref=tag_rss_rs_itdp_item_at">

Perhaps an expression that looks for: ref=tag_rss_rs_itdp_item_at
 
rossryanAuthor Commented:
Hmm.


tag_rss_rs_itdp_item_at(.*)


is just dumping me the source html.
 
SuperdaveCommented:
The one that you said gave "last man on earth" looks pretty close.  If the R.E.'s really are Perl/Python-like, you should be able to use *? to get the first one instead of the last one like this:

.*?<a ([^>]*)>([^<]*)</a>.*

Or, leave it like you had it and just capture the beginning:

(.*)<a ([^>]*)>([^<]*)</a>.*

Then loop, getting the title from the second value and matching again on the first value.  Or do the same thing forward instead of backward:

.*?<a ([^>]*)>([^<]*)</a>(.*)

In your last comment (tag_rss_rs_itdp_item_at(.*)) you're probably getting everything from the first tag_rss_rs_itdp_item_at till the end of the file; you need to match up to some kind of terminator like

tag_rss_rs_itdp_item_at">(.*?)<
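
Both ideas can be sketched in Python as follows. This is an illustration under assumptions rather than what Pipes does internally: the sample is a shortened copy of the tag list, link_texts is a hypothetical helper name, and re.DOTALL is added so that a line break inside the feed would not stop the dot.

```python
import re

# Shortened sample of the tag list from the feed.
html = (
    '<span class="tgRssAllTags">Customer tags: '
    '<a rel="nofollow" target="_blank" '
    'href="http://www.amazon.com/tag/science%20fiction'
    '/ref=tag_rss_rs_itdp_item_at">science fiction</a>(96), '
    '<a rel="nofollow" target="_blank" '
    'href="http://www.amazon.com/tag/will%20smith'
    '/ref=tag_rss_rs_itdp_item_at">will smith</a>(83), '
    '<a rel="nofollow" target="_blank" '
    'href="http://www.amazon.com/tag/horror'
    '/ref=tag_rss_rs_itdp_item_at">horror</a>(53)<br /></span>'
)

# Forward version: group 2 is the link text, group 3 is the remainder.
LINK_RE = re.compile(r'.*?<a ([^>]*)>([^<]*)</a>(.*)', re.DOTALL)


def link_texts(text):
    """Collect link texts by re-matching on the remainder each time."""
    texts = []
    rest = text
    while True:
        m = LINK_RE.match(rest)
        if m is None:
            break
        texts.append(m.group(2))
        rest = m.group(3)  # continue after the link just consumed
    return texts


looped = link_texts(html)

# Terminator version: capture lazily up to the next "<".
terminated = re.findall(r'tag_rss_rs_itdp_item_at">(.*?)<', html)

print(looped)
print(terminated)
```

Where a findall-style function exists, the terminator pattern does the whole job in one call. The loop is the fallback for engines that only return a single match.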


 
käµfm³d 👽Commented:
Try this modification to one of SuperDave's earlier patterns:
.*?<a [^>]*?href=[^>]+>([^<]*).*

Question has a verified solution.
