Solved

Stripping a block of HTML into JUST the images used in the IMG tags

Posted on 2006-06-10
Last Modified: 2013-12-24
I am looking to strip a large block of HTML into just the <IMG> tags. For example, I'd like to turn the following code:

<p>
<img src="images/top.jpg" width="100" alt="hey there!">
<br>
Hey check this out!<br>
<img src="gallery/checkthisout.jpg" width="500" border="0">
</p>

Into a list of just the image files referenced, i.e.

"images/top.jpg", "gallery/checkthisout.jpg"

If anyone could help me on the road to doing this, I'd be really appreciative.

Question by:bombrider
12 Comments
 
LVL 25

Expert Comment

by:dgrafx
ID: 16877153
First, Read your file:<br>
<CFFILE ACTION="READ" file="d:\_web\path to a file\test.html" variable="str">

<cfset startstring="<img">
<cfset endstring=">">
<cfset parsed="">
<cfset images="">
<cfloop list="#str#" index="ii" delimiters="#chr(10)##chr(13)#">
<cfif listvaluecountnocase(ii,startstring,"#chr(32)##chr(9)#") gt 1>
      <cfloop list="#ii#" index="jj" delimiters="<">
      <CFSET start = findnocase("img",jj)>
      <cfif start>
      <cfset end = findnocase(endstring,jj,start)+len(endstring)>
      <cfif end gt start>
      <cfset parsed = ListAppend(parsed,"<" & trim(MID(jj,start,end-start)))>
      </cfif>
      </cfif>
      </cfloop>
<cfelse>
      <CFSET start = findnocase(startstring,ii)>
      <cfif start>
      <cfset end = findnocase(endstring,ii,start)+len(endstring)>
      <cfif end gt start>
      <cfset parsed = ListAppend(parsed,trim(MID(ii,start,end-start)))>
      </cfif>
      </cfif>
      </cfif>            
</cfloop>
Here are your img tags:<br>
<br>#replace(htmlcodeformat(parsed),",","<br>","all")#<br>

<cfset startstring="src=#chr(34)#">
<cfset endstring="#chr(34)#">
<cfloop list="#parsed#" index="kk">
<CFSET start = findnocase(startstring,kk)+len(startstring)>
<cfif start>
<cfset end = findnocase(endstring,kk,start)>
<cfif end gt start>
<cfset images = ListAppend(images,trim(MID(kk,start,end-start)))>
</cfif>
</cfif>
</cfloop>
And Here is your image list:<br>
#listqualify(images,chr(34))#

By "appreciative", do you mean increasing the points?
:)
 

Author Comment

by:bombrider
ID: 16877217
That almost works. It only displays one image from my HTML code which contains 3 images.

Almost there I guess!
 
LVL 25

Expert Comment

by:dgrafx
ID: 16877268
Are the 3 images right next to each other with no spaces like:
<img src="xyz.jpg"><img src="wer.jpg"><img src="abc.jpg">

Author Comment

by:bombrider
ID: 16879414
For my purposes, the images may or may not be next to each other in that format, so the script needs to accommodate both. I am parsing HTML, and this script is intended to pull out just the value of the image source (IMG SRC="") from large blocks of HTML that will include tables, font tags, CSS, etc.

There may be scenarios where images are directly next to each other, as in your example. Does that make sense? :D

Thanks!
 
LVL 25

Assisted Solution

by:dgrafx
dgrafx earned 125 total points
ID: 16879457
The reason I asked is that I believe it won't parse correctly if the tags are stacked together like above, without something in between (a space, a tab, etc.).

Try this:

Right after the <CFFILE ACTION="READ" file="d:\_web\path to a file\test.html" variable="str">
put this:
<cfset str=replacenocase(str,"<img"," <img","all")>
 
LVL 7

Accepted Solution

by:aseusainc
aseusainc earned 125 total points
ID: 16882809
You can replace the cfhttp with a cffile if that is the method you are using. It returns a comma-delimited list of all images used on a page. I did not code any duplicate checking, but otherwise it does what you asked for.

Try this:

<CFHTTP Method="GET"
 URL="http://www.experts-exchange.com"
 UserAgent="Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
 Redirect="No">
</cfhttp>
 
<cfset start = 1>
<cfset loopstop = 0>
<cfset imagelist = "">

<CFLOOP condition="loopstop EQ 0">
  <cfset start = findnocase('<img src="',CFHTTP.FileContent,start)>
  <cfif start EQ 0>
    <cfset loopstop = 1>
  <cfelse>
    <cfset start = start + 10>
    <cfset end = findnocase('"',CFHTTP.FileContent,start)>
    <cfset count = end - start>
    <cfset image = mid(CFHTTP.FileContent,start,count)>
      <cfset imagelist = listappend(imagelist,image,',')>
  </cfif>
</cfloop>
<cfoutput>#imagelist#</cfoutput>
 
LVL 25

Expert Comment

by:dgrafx
ID: 16884748
Yes, that's a good way of going about it, except that one cannot count on the image tag starting with <img src=.
It can just as easily be <img alt=, <img height=, <img id=, etc.,
and that is the main reason I went about it the way I did.
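
To illustrate the point, here is a minimal sketch (the variable names sample, strict and loose are made up for this example). It only shows that a search for the literal '<img src="' misses a tag whose first attribute is something other than src, while a search for '<img' alone still finds it:

```cfml
<!--- Hypothetical sample tag: alt comes before src --->
<cfset sample = '<img alt="logo" src="images/top.jpg">'>
<!--- The literal '<img src="' never occurs in this tag, so findnocase returns 0 --->
<cfset strict = findnocase('<img src="', sample)>
<!--- Searching for '<img' alone still finds the tag at position 1 --->
<cfset loose = findnocase('<img', sample)>
<cfoutput>strict=#strict# loose=#loose#</cfoutput>
```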
 
LVL 7

Expert Comment

by:aseusainc
ID: 16884815
So would changing

<cfset start = findnocase('<img src="',CFHTTP.FileContent,start)>

to

<cfset start = findnocase('src="',CFHTTP.FileContent,start)>

fix it? Are there any other tags that use "src="?
 
LVL 25

Expert Comment

by:dgrafx
ID: 16885196
No, what I did (if you look at my code) is find "<img" first (all img tags start with "<img").
Then, from that point, find 'src="'.
Works every time!

I like your condition loop!
I crawl directories using that method - wish I would have thought of it this time :)
 
LVL 7

Expert Comment

by:aseusainc
ID: 16885345
Fixed!  I changed it to find "<IMG" 1st, then "src=" from there.  Give it a whirl :)



<CFHTTP Method="GET"
 URL="http://www.experts-exchange.com"
 UserAgent="Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
 Redirect="No">
</cfhttp>
 
<cfset start = 1>
<cfset loopstop = 0>
<cfset imagelist = "">

<CFLOOP condition="loopstop EQ 0">
  <cfset start = findnocase('<img',CFHTTP.FileContent,start)>
  <cfif start EQ 0>
    <cfset loopstop = 1>
  <cfelse>
    <cfset start = findnocase('src="',CFHTTP.FileContent,start)>
    <cfset start = start + 5>
    <cfset end = findnocase('"',CFHTTP.FileContent,start)>
    <cfset count = end - start>
    <cfset image = mid(CFHTTP.FileContent,start,count)>
      <cfset imagelist = listappend(imagelist,image,',')>
  </cfif>
</cfloop>
<cfoutput>#imagelist#</cfoutput>
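
One edge case the loop above does not appear to guard against: if an <img tag has no src=" after it (a single-quoted src, for instance), findnocase returns 0, start is reset to 5, and the search begins again near the top of the page. The following is only a sketch of one way to guard against that, not a replacement for the accepted code. The names html, pos, tagStart, tagEnd, tag, srcStart and srcEnd are made up for illustration, and the sample string stands in for CFHTTP.FileContent or the CFFILE variable. It isolates each tag first, looks for src=" only inside that tag, and always moves past the tag afterwards:

```cfml
<!--- Hypothetical input; the middle tag deliberately has no src attribute --->
<cfset html = '<p><img alt="x" src="images/top.jpg"><img width="5"><img src="gallery/checkthisout.jpg" border="0"></p>'>
<cfset pos = 1>
<cfset imagelist = "">

<cfloop condition="pos GT 0">
  <!--- Find the next img tag at or after pos; stop when there are none left --->
  <cfset tagStart = findnocase('<img', html, pos)>
  <cfif tagStart EQ 0>
    <cfbreak>
  </cfif>
  <!--- Find the end of this tag so the src search cannot spill into a later tag --->
  <cfset tagEnd = find('>', html, tagStart)>
  <cfif tagEnd EQ 0>
    <cfbreak>
  </cfif>
  <cfset tag = mid(html, tagStart, tagEnd - tagStart + 1)>
  <!--- Look for a double-quoted src inside this one tag only --->
  <cfset srcStart = findnocase('src="', tag)>
  <cfif srcStart GT 0>
    <cfset srcStart = srcStart + 5>
    <cfset srcEnd = find('"', tag, srcStart)>
    <cfif srcEnd GT srcStart>
      <cfset imagelist = listappend(imagelist, mid(tag, srcStart, srcEnd - srcStart))>
    </cfif>
  </cfif>
  <!--- Always advance past this tag, even when it had no usable src --->
  <cfset pos = tagEnd + 1>
</cfloop>
<cfoutput>#imagelist#</cfoutput>
```

With the sample string, the middle tag has no src and is simply skipped. This is still plain string searching, so it assumes double-quoted src values and no ">" character inside an attribute value. It would also pick up an attribute whose name merely ends in src.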
 
LVL 7

Expert Comment

by:aseusainc
ID: 17051557
Suggest assist between aseusainc and dgrafx as a correct answer was provided.