rossryan

asked on

Help creating a matching expression

Hi,

I am trying to extract some text from an HTML feed on Yahoo Pipes. I admit I am no good with regular expressions (they hate me, and secretly conspire against me).

It's basic: I am trying to extract the price between <span class="tgProductPrice"></span>.

I am using this expression, but it's gobbling up extra text: .*<span class="tgProductPrice">(.*)</span>.*


I just need the "$0.98" (dollar sign, numbers, and decimal point).

Thanks,
Ryan
Output from the expression:

$0.98<br /> <span class="tgRssReviews">Customer Rating: <img src="http://g-ecx.images-amazon.com/images/G/01/x-locale/common/customer-reviews/stars-3-5._V192240731_.gif" width="64" alt="3.6" align="absbottom" height="12" border="0"/><br /></span> <br /> <span class="tgRssProductTag"></span> <span class="tgRssAllTags">Customer tags: <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/science%20fiction">science fiction</a>(96), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/will%20smith">will smith</a>(83), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/horror">horror</a>(53), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/action">action</a>(46), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/adventure">adventure</a>(36), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/fantasy">fantasy</a>(33), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/dvd">dvd</a>(30), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/movie">movie</a>(20), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/zombies">zombies</a>(15), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/i%20am%20legend">i am legend</a>(7), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/bad%20sci-fi">bad sci-fi</a>(4), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/last%20man%20on%20earth">last man on earth</a>(4)<br /></span>


Original Output, sans expression:

<div class="hreview" style="clear:both;"><div class="item"><div style="float:left;" class="tgRssImage"><a rel="nofollow" class="url" target="_blank" href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/B0013FDM7E"><img src="http://ecx.images-amazon.com/images/I/51MvRlzFlpL._SL160_SS160_.jpg" width="160" alt="I Am Legend (Widescreen Single-Disc Edition)" class="photo" height="160" border="0"/>  </a></div><span class="tgRssTitle fn summary">I Am Legend (Widescreen Single-Disc Edition) (<span class="tgRssBinding">DVD</span>)<br />By <span class="tgRssAuthor">Will Smith</span><br /></span> </div> <div class="description"> <br /> <span style="display:block;" class="tgRssPriceBlock"><span class="tgProductPriceLine"><a rel="nofollow" target="_blank" href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/B0013FDM7E">Buy new</a>: <span class="tgProductPrice">$5.49</span></span><br /><span class="tgProductUsedPrice"><a rel="nofollow" target="_blank" href="http://www.amazon.com/gp/offer-listing/B0013FDM7E" id="tag_rso_rs_eofr_used">294 used and new</a> from <span class="tgProductPrice">$0.98</span></span><br /></span> <span class="tgRssReviews">Customer Rating: <img src="http://g-ecx.images-amazon.com/images/G/01/x-locale/common/customer-reviews/stars-3-5._V192240731_.gif" width="64" alt="3.6" align="absbottom" height="12" border="0"/><br /></span> <br /> <span class="tgRssProductTag"></span> <span class="tgRssAllTags">Customer tags: <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/science%20fiction">science fiction</a>(96), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/will%20smith">will smith</a>(83), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/horror">horror</a>(53), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/action">action</a>(46), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/adventure">adventure</a>(36), <a rel="nofollow" target="_blank" 
href="http://www.amazon.com/tag/fantasy">fantasy</a>(33), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/dvd">dvd</a>(30), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/movie">movie</a>(20), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/zombies">zombies</a>(15), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/i%20am%20legend">i am legend</a>(7), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/bad%20sci-fi">bad sci-fi</a>(4), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/last%20man%20on%20earth">last man on earth</a>(4)<br /></span> </div></div>


rossryan

ASKER

Right, I am trying to get the data between a pair of html tags.
.*<span class="tgProductPrice">([^<]*)</span>.*

Most expression-matching languages/libraries wouldn't need the .* at the beginning or the end, either, but I don't know what you're using so I left it be.

The idea is to match anything except an open angle bracket (<), so you don't match too much.

Using .*? instead of .* is another way of doing that, if you have Perl/Python-compatible regular expressions.  [^<] should be even more portable though.
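To make the difference concrete, here is a minimal Python sketch comparing the three variants on a shortened stand-in for the feed item. The sample string is an assumption trimmed from the HTML posted above, and Python's re module is used only for illustration; Yahoo Pipes may behave differently in the details.

```python
import re

# Shortened stand-in for the feed item (assumption: trimmed from the
# HTML in the question, kept on a single line).
html = (
    '<span class="tgProductPrice">$5.49</span></span><br />'
    '<span class="tgProductUsedPrice">from '
    '<span class="tgProductPrice">$0.98</span></span><br /></span> '
    '<span class="tgRssReviews">Customer Rating</span>'
)

# Original attempt: the greedy (.*) runs on to the last </span>.
greedy = re.compile(r'.*<span class="tgProductPrice">(.*)</span>.*')
# Negated class: the capture cannot cross the next "<".
negated = re.compile(r'.*<span class="tgProductPrice">([^<]*)</span>.*')
# Lazy quantifier: the capture stops at the first </span> it can.
lazy = re.compile(r'.*<span class="tgProductPrice">(.*?)</span>.*')

greedy_price = greedy.match(html).group(1)
negated_price = negated.match(html).group(1)
lazy_price = lazy.match(html).group(1)

print(repr(greedy_price))
print(repr(negated_price))
print(repr(lazy_price))
```

Note that the leading .* is greedy in all three patterns, so each one lands on the last tgProductPrice span in the string, which is why the used price rather than the new price comes back.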
Nicely done. I will up the points to 500 (you're already getting the 350), if you can craft two more expressions:

<span class="tgRssTitle fn summary">I Am Legend (Widescreen Single-Disc Edition) (<span class="tgRssBinding">DVD</span>

I need to extract that title.

And:

<span class="tgRssAllTags">Customer tags: <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/science%20fiction">science fiction</a>(96), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/will%20smith">will smith</a>(83), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/horror">horror</a>(53), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/action">action</a>(46), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/adventure">adventure</a>(36), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/fantasy">fantasy</a>(33), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/dvd">dvd</a>(30), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/movie">movie</a>(20), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/zombies">zombies</a>(15), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/i%20am%20legend">i am legend</a>(7), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/bad%20sci-fi">bad sci-fi</a>(4), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/last%20man%20on%20earth">last man on earth</a>(4)<br /></span>

And the extraction of those tags.


If you can respond quickly, I could create two extra questions out of those, if that would work for you.


Thanks,
Ryan
Try:

<span class="tgRssTitle[^>]*>([^<]*)

and

href="([^"]*)

For some odd reason, I can never get these things to work properly. And I am under the gun to get this done in an hour or so.
This is the output for <span class="tgRssTitle[^>]*>([^<]*).

<div class="hreview" style="clear:both;"><div class="item"><div style="float:left;" class="tgRssImage"><a rel="nofollow" class="url" target="_blank" href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/B0013FDM7E"><img src="http://ecx.images-amazon.com/images/I/51MvRlzFlpL._SL160_SS160_.jpg" width="160" alt="I Am Legend (Widescreen Single-Disc Edition)" class="photo" height="160" border="0"/>  </a></div>I Am Legend (Widescreen Single-Disc Edition) (<span class="tgRssBinding">DVD</span>)<br />By <span class="tgRssAuthor">Will Smith</span><br /> </div> <div class="description"> <br /> <span style="display:block;" class="tgRssPriceBlock"><span class="tgProductPriceLine"><a rel="nofollow" target="_blank" href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/B0013FDM7E">Buy new</a>: <span class="tgProductPrice">$5.49</span></span><br /><span class="tgProductUsedPrice"><a rel="nofollow" target="_blank" href="http://www.amazon.com/gp/offer-listing/B0013FDM7E" id="tag_rso_rs_eofr_used">295 used and new</a> from <span class="tgProductPrice">$0.98</span></span><br /></span> <span class="tgRssReviews">Customer Rating: <img src="http://g-ecx.images-amazon.com/images/G/01/x-locale/common/customer-reviews/stars-3-5._V192240731_.gif" width="64" alt="3.6" align="absbottom" height="12" border="0"/><br /></span> <br /> <span class="tgRssProductTag"></span> <span class="tgRssAllTags">Customer tags: <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/science%20fiction">science fiction</a>(96), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/will%20smith">will smith</a>(83), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/horror">horror</a>(53), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/action">action</a>(46), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/adventure">adventure</a>(36), <a rel="nofollow" target="_blank" 
href="http://www.amazon.com/tag/fantasy">fantasy</a>(33), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/dvd">dvd</a>(30), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/movie">movie</a>(20), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/zombies">zombies</a>(15), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/i%20am%20legend">i am legend</a>(7), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/bad%20sci-fi">bad sci-fi</a>(4), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/last%20man%20on%20earth">last man on earth</a>(4)<br /></span> </div></div>


Switching to: .*<span class="tgRssTitle[^>]*>([^<]*)


gives me the below:
I Am Legend (Widescreen Single-Disc Edition) (<span class="tgRssBinding">DVD</span>)<br />By <span class="tgRssAuthor">Will Smith</span><br />  <div class="description"> <br /> <span style="display:block;" class="tgRssPriceBlock"><span class="tgProductPriceLine"><a rel="nofollow" target="_blank" href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/B0013FDM7E">Buy new</a>: <span class="tgProductPrice">$5.49</span></span><br /><span class="tgProductUsedPrice"><a rel="nofollow" target="_blank" href="http://www.amazon.com/gp/offer-listing/B0013FDM7E" id="tag_rso_rs_eofr_used">295 used and new</a> from <span class="tgProductPrice">$0.98</span></span><br /></span> <span class="tgRssReviews">Customer Rating: <img src="http://g-ecx.images-amazon.com/images/G/01/x-locale/common/customer-reviews/stars-3-5._V192240731_.gif" width="64" alt="3.6" align="absbottom" height="12" border="0"/><br /></span> <br /> <span class="tgRssProductTag"></span> <span class="tgRssAllTags">Customer tags: <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/science%20fiction">science fiction</a>(96), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/will%20smith">will smith</a>(83), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/horror">horror</a>(53), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/action">action</a>(46), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/adventure">adventure</a>(36), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/fantasy">fantasy</a>(33), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/dvd">dvd</a>(30), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/movie">movie</a>(20), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/zombies">zombies</a>(15), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/i%20am%20legend">i am legend</a>(7), <a rel="nofollow" target="_blank" 
href="http://www.amazon.com/tag/bad%20sci-fi">bad sci-fi</a>(4), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/last%20man%20on%20earth">last man on earth</a>(4)<br /></span> </div>


My above comment was close, but I need to ditch the rest of that string.

Any ideas?
.*<span class="tgRssTitle[^>]*>([^<]*) \(<span.* works.
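For anyone checking this title expression outside Pipes, here is a rough Python sketch. It assumes the Pipes Regex module acts like a substitution that swaps the whole match for the first group, which is what the outputs posted above suggest; the sample string is trimmed from the item HTML.

```python
import re

# Trimmed stand-in for the title part of the feed item (assumption).
html = (
    '</a></div><span class="tgRssTitle fn summary">'
    'I Am Legend (Widescreen Single-Disc Edition) '
    '(<span class="tgRssBinding">DVD</span>)<br />By '
    '<span class="tgRssAuthor">Will Smith</span><br /></span> </div>'
)

pattern = re.compile(r'.*<span class="tgRssTitle[^>]*>([^<]*) \(<span.*')

# Replace the whole match with group 1, the way "$1" would in a
# Perl-style replacement field.
title = pattern.sub(r'\1', html)
print(title)
```

The trailing " \(<span" forces the capture to give back the " (" that precedes the binding span, and the final .* swallows the rest of the string so nothing else is left behind after the replacement.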


Now I need that last expression to work.
.*<a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/.*">([^<]*)</a>.* isn't working.
What does the last one need to extract, the hrefs or the text of the links?

If it's the hrefs, try this. The second captured group is the rest of the string, so you'll have to repeat the search on it unless you have a findall function:
.*href="([^"]*)(.*)

Otherwise, for the text:

.*href="[^"]* *>([^<]*)(.*)
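Where a findall function is available, the repeat-the-search step goes away. A minimal sketch in Python, using a shortened copy of the tag list as sample input (an assumption); note that the patterns here include the closing quote after the URL and drop the leading .* since findall scans the string itself.

```python
import re

# Shortened copy of the "Customer tags" span (assumption: three links only).
html = (
    '<span class="tgRssAllTags">Customer tags: '
    '<a rel="nofollow" target="_blank" '
    'href="http://www.amazon.com/tag/science%20fiction">science fiction</a>(96), '
    '<a rel="nofollow" target="_blank" '
    'href="http://www.amazon.com/tag/will%20smith">will smith</a>(83), '
    '<a rel="nofollow" target="_blank" '
    'href="http://www.amazon.com/tag/horror">horror</a>(53)<br /></span>'
)

# Every href value: everything between the quotes after href=.
hrefs = re.findall(r'href="([^"]*)"', html)

# Every link text: skip the quoted URL, optional blanks, the ">",
# then capture up to the next "<".
texts = re.findall(r'href="[^"]*"\s*>([^<]*)', html)

print(hrefs)
print(texts)
```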

The text of the links.

<a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/dvd">dvd</a>


It should extract "dvd" from between the anchor link's opening and closing tags.
Yes, it hates: .*href="[^"]* *>([^<]*)(.*)


Gives me this output:
<div class="hreview" style="clear:both;"><div class="item"><div style="float:left;" class="tgRssImage"><a rel="nofollow" class="url" target="_blank" href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/B0013FDM7E"><img src="http://ecx.images-amazon.com/images/I/51MvRlzFlpL._SL160_SS160_.jpg" width="160" alt="I Am Legend (Widescreen Single-Disc Edition)" class="photo" height="160" border="0"/>  </a></div><span class="tgRssTitle fn summary">I Am Legend (Widescreen Single-Disc Edition) (<span class="tgRssBinding">DVD</span>)<br />By <span class="tgRssAuthor">Will Smith</span><br /></span> </div> <div class="description"> <br /> <span style="display:block;" class="tgRssPriceBlock"><span class="tgProductPriceLine"><a rel="nofollow" target="_blank" href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/B0013FDM7E">Buy new</a>: <span class="tgProductPrice">$5.49</span></span><br /><span class="tgProductUsedPrice"><a rel="nofollow" target="_blank" href="http://www.amazon.com/gp/offer-listing/B0013FDM7E" id="tag_rso_rs_eofr_used">295 used and new</a> from <span class="tgProductPrice">$0.98</span></span><br /></span> <span class="tgRssReviews">Customer Rating: <img src="http://g-ecx.images-amazon.com/images/G/01/x-locale/common/customer-reviews/stars-3-5._V192240731_.gif" width="64" alt="3.6" align="absbottom" height="12" border="0"/><br /></span> <br /> <span class="tgRssProductTag"></span> <span class="tgRssAllTags">Customer tags: <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/science%20fiction">science fiction</a>(96), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/will%20smith">will smith</a>(83), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/horror">horror</a>(53), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/action">action</a>(46), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/adventure">adventure</a>(36), <a rel="nofollow" target="_blank" 
href="http://www.amazon.com/tag/fantasy">fantasy</a>(33), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/dvd">dvd</a>(30), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/movie">movie</a>(20), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/zombies">zombies</a>(15), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/i%20am%20legend">i am legend</a>(7), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/bad%20sci-fi">bad sci-fi</a>(4), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/last%20man%20on%20earth">last man on earth</a>(4)<br /></span> </div></div>


Mmmmmmmmmmm. Can you fix it?
Sorry, I missed a ".  Took me a while to figure it out!

'.*?href="[^"]*" *>([^<]*)(.*)'
Hmmm.

This is not working.

If this is the html source:

<a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/science%20fiction">science fiction</a>(96)

I need to extract "science fiction" from between those tags. The href portion changes after /tag/, and it's not detecting it properly.
With this expression: '.*?href="[^"]*" *>([^<]*)(.*)'


I get this output:
<div class="hreview" style="clear:both;"><div class="item"><div style="float:left;" class="tgRssImage"><a rel="nofollow" class="url" target="_blank" href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/B0013FDM7E/ref=tag_rso_rs_edpp_url"><img src="http://ecx.images-amazon.com/images/I/51MvRlzFlpL._SL160_SS160_.jpg" width="160" alt="I Am Legend (Widescreen Single-Disc Edition)" class="photo" height="160" border="0"/>  </a></div><span class="tgRssTitle fn summary">I Am Legend (Widescreen Single-Disc Edition) (<span class="tgRssBinding">DVD</span>)<br />By <span class="tgRssAuthor">Will Smith</span><br /></span> </div> <div class="description"> <br /> <span style="display:block;" class="tgRssPriceBlock"><span class="tgProductPriceLine"><a rel="nofollow" target="_blank" href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/B0013FDM7E/ref=tag_rso_rs_edpp_new">Buy new</a>: <span class="tgProductPrice">$5.49</span></span><br /><span class="tgProductUsedPrice"><a rel="nofollow" target="_blank" href="http://www.amazon.com/gp/offer-listing/B0013FDM7E/ref=tag_rso_rs_eofr_used" id="tag_rso_rs_eofr_used">292 used and new</a> from <span class="tgProductPrice">$0.98</span></span><br /></span> <span class="tgRssReviews">Customer Rating: <img src="http://g-ecx.images-amazon.com/images/G/01/x-locale/common/customer-reviews/stars-3-5._V192240731_.gif" width="64" alt="3.6" align="absbottom" height="12" border="0"/><br /></span> <br /> <span class="tgRssProductTag"></span> <span class="tgRssAllTags">Customer tags: <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/science%20fiction/ref=tag_rss_rs_itdp_item_at">science fiction</a>(96), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/will%20smith/ref=tag_rss_rs_itdp_item_at">will smith</a>(83), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/horror/ref=tag_rss_rs_itdp_item_at">horror</a>(53), <a rel="nofollow" target="_blank" 
href="http://www.amazon.com/tag/action/ref=tag_rss_rs_itdp_item_at">action</a>(46), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/adventure/ref=tag_rss_rs_itdp_item_at">adventure</a>(36), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/fantasy/ref=tag_rss_rs_itdp_item_at">fantasy</a>(33), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/dvd/ref=tag_rss_rs_itdp_item_at">dvd</a>(30), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/movie/ref=tag_rss_rs_itdp_item_at">movie</a>(20), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/zombies/ref=tag_rss_rs_itdp_item_at">zombies</a>(15), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/i%20am%20legend/ref=tag_rss_rs_itdp_item_at">i am legend</a>(7), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/bad%20sci-fi/ref=tag_rss_rs_itdp_item_at">bad sci-fi</a>(4), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/last%20man%20on%20earth/ref=tag_rss_rs_itdp_item_at">last man on earth</a>(4)<br /></span> </div></div>


And with this expression: .*?href="[^"]*" *>([^<]*)(.*)

I get no output.
So close, and yet so far. With that last expression, I am not getting any output...
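One possible explanation for the empty result, sketched in Python: the first href in the item belongs to the image link, which is followed directly by an <img> tag, so the captured group is an empty string. The sample below is trimmed from the posted HTML, and the /tag/ restriction in the second pattern is a suggestion of this sketch, not something proposed in the thread.

```python
import re

# Trimmed stand-in for the feed item: the image link first, then two
# tag links (assumption: shortened from the posted HTML).
html = (
    '<div class="tgRssImage"><a rel="nofollow" class="url" target="_blank" '
    'href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/'
    'B0013FDM7E/ref=tag_rso_rs_edpp_url">'
    '<img src="http://ecx.images-amazon.com/images/I/51MvRlzFlpL._SL160_SS160_.jpg"/>'
    '  </a></div><span class="tgRssAllTags">Customer tags: '
    '<a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/'
    'science%20fiction/ref=tag_rss_rs_itdp_item_at">science fiction</a>(96), '
    '<a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/'
    'dvd/ref=tag_rss_rs_itdp_item_at">dvd</a>(30)<br /></span>'
)

# The expression from the thread stops at the FIRST href it finds.
first = re.match(r'.*?href="[^"]*" *>([^<]*)(.*)', html)
first_text = first.group(1)  # nothing sits between ">" and "<img"

# Hypothetical refinement: only accept hrefs whose URL contains /tag/.
tag_texts = re.findall(r'href="[^"]*/tag/[^"]*" *>([^<]*)', html)

print(repr(first_text))
print(tag_texts)
```

If Pipes replaces the whole match with the first group, an empty group would show up as no output at all, which would fit what is described here.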
Maybe try this, leaving out the optional blank I allowed for after the URL, since there doesn't seem to be one.  The first ones I typed in off the top of my head, but the last one I tested in Python and it works.  What language or R.E. library are you using?

.*?href="[^"]*">([^<]*)(.*)
It's supposedly Perl-like. It's Yahoo's whatever you want to call it.

No dice.

When I tell it to select from the second field, and I use this expression: .*<a ([^>]*)>([^<]*)</a>.*


I get this: last man on earth
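Getting only the last tag is what a greedy leading .* would be expected to produce: it consumes as much as it can, so the rest of the pattern lines up with the final link. A small Python sketch of that effect, with a findall call for comparison; the sample string is a shortened assumption and Pipes itself may offer no direct findall equivalent.

```python
import re

# Last three tag links from the item (assumption: shortened sample).
html = (
    '<a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/'
    'zombies/ref=tag_rss_rs_itdp_item_at">zombies</a>(15), '
    '<a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/'
    'bad%20sci-fi/ref=tag_rss_rs_itdp_item_at">bad sci-fi</a>(4), '
    '<a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/'
    'last%20man%20on%20earth/ref=tag_rss_rs_itdp_item_at">last man on earth</a>(4)'
    '<br /></span>'
)

# Substitution with the expression from the thread, keeping group 2.
# The greedy ".*" in front backs off only as far as the last "<a ".
last_only = re.sub(r'.*<a ([^>]*)>([^<]*)</a>.*', r'\2', html)

# Without the surrounding ".*", findall returns every link text.
all_texts = re.findall(r'<a [^>]*>([^<]*)</a>', html)

print(last_only)
print(all_texts)
```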
Hmm.

If this is the source HTML:

<div class="hreview" style="clear:both;"><div class="item"><div style="float:left;" class="tgRssImage"><a rel="nofollow" class="url" target="_blank" href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/B0013FDM7E/ref=tag_rso_rs_edpp_url"><img src="http://ecx.images-amazon.com/images/I/51MvRlzFlpL._SL160_SS160_.jpg" width="160" alt="I Am Legend (Widescreen Single-Disc Edition)" class="photo" height="160" border="0"/>  </a></div><span class="tgRssTitle fn summary">I Am Legend (Widescreen Single-Disc Edition) (<span class="tgRssBinding">DVD</span>)<br />By <span class="tgRssAuthor">Will Smith</span><br /></span> </div> <div class="description"> <br /> <span style="display:block;" class="tgRssPriceBlock"><span class="tgProductPriceLine"><a rel="nofollow" target="_blank" href="http://www.amazon.com/I-Am-Legend-Widescreen-Single-Disc/dp/B0013FDM7E/ref=tag_rso_rs_edpp_new">Buy new</a>: <span class="tgProductPrice">$5.49</span></span><br /><span class="tgProductUsedPrice"><a rel="nofollow" target="_blank" href="http://www.amazon.com/gp/offer-listing/B0013FDM7E/ref=tag_rso_rs_eofr_used" id="tag_rso_rs_eofr_used">292 used and new</a> from <span class="tgProductPrice">$0.98</span></span><br /></span> <span class="tgRssReviews">Customer Rating: <img src="http://g-ecx.images-amazon.com/images/G/01/x-locale/common/customer-reviews/stars-3-5._V192240731_.gif" width="64" alt="3.6" align="absbottom" height="12" border="0"/><br /></span> <br /> <span class="tgRssProductTag"></span> <span class="tgRssAllTags">Customer tags: <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/science%20fiction/ref=tag_rss_rs_itdp_item_at">science fiction</a>(96), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/will%20smith/ref=tag_rss_rs_itdp_item_at">will smith</a>(83), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/horror/ref=tag_rss_rs_itdp_item_at">horror</a>(53), <a rel="nofollow" target="_blank" 
href="http://www.amazon.com/tag/action/ref=tag_rss_rs_itdp_item_at">action</a>(46), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/adventure/ref=tag_rss_rs_itdp_item_at">adventure</a>(36), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/fantasy/ref=tag_rss_rs_itdp_item_at">fantasy</a>(33), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/dvd/ref=tag_rss_rs_itdp_item_at">dvd</a>(30), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/movie/ref=tag_rss_rs_itdp_item_at">movie</a>(20), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/zombies/ref=tag_rss_rs_itdp_item_at">zombies</a>(15), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/i%20am%20legend/ref=tag_rss_rs_itdp_item_at">i am legend</a>(7), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/bad%20sci-fi/ref=tag_rss_rs_itdp_item_at">bad sci-fi</a>(4), <a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/last%20man%20on%20earth/ref=tag_rss_rs_itdp_item_at">last man on earth</a>(4)<br /></span> </div></div>



Why is this backfiring so badly?
Should be able to extract based on:

<a rel="nofollow" target="_blank" href="http://www.amazon.com/tag/action/ref=tag_rss_rs_itdp_item_at">

Perhaps an expression that looks for: ref=tag_rss_rs_itdp_item_at
Hmm.


tag_rss_rs_itdp_item_at(.*)


is just dumping me the source html.
ASKER CERTIFIED SOLUTION
Superdave
kaufmed
Try this modification to one of SuperDave's earlier patterns:
.*?<a [^>]*?href=[^>]+>([^<]*).*
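A quick way to try this pattern outside Pipes is the Python sketch below. The sample link is copied from the posted source, including the line break that sits inside the tag just before href; everything else is an assumption for illustration.

```python
import re

# One tag link from the posted source, with the line break that appears
# inside the <a ...> tag before href (assumption: isolated for testing).
link = (
    '<a rel="nofollow" target="_blank" \n'
    'href="http://www.amazon.com/tag/action/ref=tag_rss_rs_itdp_item_at">'
    'action</a>(46)'
)

pattern = re.compile(r'.*?<a [^>]*?href=[^>]+>([^<]*).*')

match = pattern.match(link)
link_text = match.group(1)
print(link_text)
```

Because [^>] is a negated class, it also matches the newline inside the tag, and the pattern does not depend on the exact text of the URL. Applied to the whole item it would presumably reach the image link first, so it seems best suited to a field that holds only the tag links.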
