Solved

Stripping a block of HTML into JUST the images used in the IMG tags

Posted on 2006-06-10
12
264 Views
Last Modified: 2013-12-24
I am looking to strip a large block of HTML into just the <IMG> tags. For example, I'd like to turn the following code:

<p>
<img src="images/top.jpg" width="100" alt="hey there!">
<br>
Hey check this out!<br>
<img src="gallery/checkthisout.jpg" width="500" border="0">
</p>

Into a list of just the image files referenced, ie.

"images/top.jpg", "gallery/checkthisout.jpg"

If anyone could help me on the road to doing this, I'd be really appreciative.

0
Comment
Question by:bombrider
  • 5
  • 4
  • 2
12 Comments
 
LVL 24

Expert Comment

by:dgrafx
ID: 16877153
First, Read your file:<br>
<CFFILE ACTION="READ" file="d:\_web\path to a file\test.html" variable="str">

<cfset startstring="<img">
<cfset endstring=">">
<cfset parsed="">
<cfset images="">
<cfloop list="#str#" index="ii" delimiters="#chr(10)##chr(13)#">
<cfif listvaluecountnocase(ii,startstring,"#chr(32)##chr(9)#") gt 1>
      <cfloop list="#ii#" index="jj" delimiters="<">
      <CFSET start = findnocase("img",jj)>
      <cfif start>
      <cfset end = findnocase(endstring,jj,start)+len(endstring)>
      <cfif end gt start>
      <cfset parsed = ListAppend(parsed,"<" & trim(MID(jj,start,end-start)))>
      </cfif>
      </cfif>
      </cfloop>
<cfelse>
      <CFSET start = findnocase(startstring,ii)>
      <cfif start>
      <cfset end = findnocase(endstring,ii,start)+len(endstring)>
      <cfif end gt start>
      <cfset parsed = ListAppend(parsed,trim(MID(ii,start,end-start)))>
      </cfif>
      </cfif>
      </cfif>            
</cfloop>
Here are your img tags:<br>
<br>#replace(htmlcodeformat(parsed),",","<br>","all")#<br>

<cfset startstring="src=#chr(34)#">
<cfset endstring="#chr(34)#">
<cfloop list="#parsed#" index="kk">
<CFSET start = findnocase(startstring,kk)+len(startstring)>
<cfif start>
<cfset end = findnocase(endstring,kk,start)>
<cfif end gt start>
<cfset images = ListAppend(images,trim(MID(kk,start,end-start)))>
</cfif>
</cfif>
</cfloop>
And Here is your image list:<br>
#listqualify(images,chr(34))#

by appreciative, do you mean increasing points?
:)
0
 

Author Comment

by:bombrider
ID: 16877217
That almost works. It only displays one image from my HTML code which contains 3 images.

Almost there I guess!
0
 
LVL 24

Expert Comment

by:dgrafx
ID: 16877268
Are the 3 images right next to each other with no spaces like:
<img src="xyz.jpg"><img src="wer.jpg"><img src="abc.jpg">
0
 

Author Comment

by:bombrider
ID: 16879414
For the purposes I am needing the script for, they may or may not be next to eachother in that format, so the script needs to accommodate both.. I am parsing HTML, this script is intended to weed out just the value of the image source (IMG SRC="") from large blocks of HTML that will include tables, font tags, css, etc.

There may be scenarios where images are directly next to eachother as you have put in your example. Does that make sense? :D

Thanks!
0
 
LVL 24

Assisted Solution

by:dgrafx
dgrafx earned 125 total points
ID: 16879457
the reason i asked is because i believe it won't parse correctly if stacked together like above - without something in between - space, tab etc

try this:

right after the <CFFILE ACTION="READ" file="d:\_web\path to a file\test.html" variable="str">
Put this:
<cfset str=replacenocase(str,"<img"," <img","all")>
0
Why You Should Analyze Threat Actor TTPs

After years of analyzing threat actor behavior, it’s become clear that at any given time there are specific tactics, techniques, and procedures (TTPs) that are particularly prevalent. By analyzing and understanding these TTPs, you can dramatically enhance your security program.

 
LVL 7

Accepted Solution

by:
aseusainc earned 125 total points
ID: 16882809
You can replace the cfhttp with a cffile is that is the method you are using.  It will return a comma delimited list of all images used on a page.  I did not code any dupe checking, but it works exactly as you want.

Try this:

<CFHTTP Method="GET"
 URL="http://www.experts-exchange.com"
 UserAgent="Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
 Redirect="No">
</cfhttp>
 
<cfset start = 1>
<cfset loopstop = 0>
<cfset imagelist = "">

<CFLOOP condition="loopstop EQ 0">
  <cfset start = findnocase('<img src="',CFHTTP.FileContent,start)>
  <cfif start EQ 0>
    <cfset loopstop = 1>
  <cfelse>
    <cfset start = start + 10>
    <cfset end = findnocase('"',CFHTTP.FileContent,start)>
    <cfset count = end - start>
    <cfset image = mid(CFHTTP.FileContent,start,count)>
      <cfset imagelist = listappend(imagelist,image,',')>
  </cfif>
</cfloop>
<cfoutput>#imagelist#</cfoutput>
0
 
LVL 24

Expert Comment

by:dgrafx
ID: 16884748
yes, that's a good way of going about it - except that where it fails is that one cannot count on the image tag being <img src=.
It can easily be <img alt= or <img height= or <img id= etcetera...
and that is the main reason for going about it the way I did.
0
 
LVL 7

Expert Comment

by:aseusainc
ID: 16884815
So would changing

<cfset start = findnocase('<img src="',CFHTTP.FileContent,start)>

to

<cfset start = findnocase('src="',CFHTTP.FileContent,start)>

fix it?  There any other tags that use "src="?
0
 
LVL 24

Expert Comment

by:dgrafx
ID: 16885196
no, what I used (if you look at my code) is to find "<img" (all img tags start with "<img".
Then from that point find 'src="'
works everytime!

I like your condition loop!
I crawl directories using that method - wish I would have thought of it this time :)
0
 
LVL 7

Expert Comment

by:aseusainc
ID: 16885345
Fixed!  I changed it to find "<IMG" 1st, then "src=" from there.  Give it a whirl :)



<CFHTTP Method="GET"
 URL="http://www.experts-exchange.com"
 UserAgent="Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
 Redirect="No">
</cfhttp>
 
<cfset start = 1>
<cfset loopstop = 0>
<cfset imagelist = "">

<CFLOOP condition="loopstop EQ 0">
  <cfset start = findnocase('<img',CFHTTP.FileContent,start)>
  <cfif start EQ 0>
    <cfset loopstop = 1>
  <cfelse>
    <cfset start = findnocase('src="',CFHTTP.FileContent,start)>
    <cfset start = start + 5>
    <cfset end = findnocase('"',CFHTTP.FileContent,start)>
    <cfset count = end - start>
    <cfset image = mid(CFHTTP.FileContent,start,count)>
      <cfset imagelist = listappend(imagelist,image,',')>
  </cfif>
</cfloop>
<cfoutput>#imagelist#</cfoutput>
0
 
LVL 7

Expert Comment

by:aseusainc
ID: 17051557
Suggest assist between aseusainc and dgrafx as a correct answer was provided.
0

Featured Post

What Security Threats Are You Missing?

Enhance your security with threat intelligence from the web. Get trending threat insights on hackers, exploits, and suspicious IP addresses delivered to your inbox with our free Cyber Daily.

Join & Write a Comment

One of the typical problems I have experienced is when you have to move a web server from one hosting site to another. You normally prepare all on the new host, transfer the site, change DNS and cross your fingers hoping all will be ok on new server…
If you don't have the right permissions set for your WordPress location in IIS, you won't be able to perform automatic updates. Here's how to fix the problem.
Here's a very brief overview of the methods PRTG Network Monitor (https://www.paessler.com/prtg) offers for monitoring bandwidth, to help you decide which methods you´d like to investigate in more detail.  The methods are covered in more detail in o…
When you create an app prototype with Adobe XD, you can insert system screens -- sharing or Control Center, for example -- with just a few clicks. This video shows you how. You can take the full course on Experts Exchange at http://bit.ly/XDcourse.

706 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

19 Experts available now in Live!

Get 1:1 Help Now