• Status: Solved
  • Priority: Medium
  • Security: Public

DOMDocument::loadHTML seems to remove any JavaScript code

I am trying to use the DOMDocument::loadHTML method to parse HTML, and for some reason, whenever I parse the HTML string, the JavaScript code seems to be missing. Could someone shed light on what could be occurring?
Asked by: rawcoder
1 Solution
 
woepwobin Commented:
Can you post the HTML you're trying to parse?
 
Ray Paseur Commented:
@woepwobin:  Exactly what you said.  The best answer would be the URL of the resource that returns the HTML string.  It amazes me how many questions are posted here at EE with no test data and no example of the expected outputs.

http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/A_7830-A-Quick-Tour-of-Test-Driven-Development.html
 
rawcoder (Author) Commented:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-gb" lang="en-gb" dir="ltr" >
<head>
	  <meta http-equiv="content-type" content="text/html; charset=utf-8" />
  <meta name="robots" content="index, follow" />
  <meta name="keywords" content="" />
  <meta name="rights" content="" />
  <meta name="language" content="en-GB" />
  <meta name="title" content="Gallery" />
  <meta name="author" content="Super User" />
  <meta name="generator" content="Joomla! 1.6 - Open Source Content Management" />
  <title></title>
  <link rel="stylesheet" href="/plugins/content/artsexylightbox/artsexylightbox/css/oldsexylightbox.css" type="text/css" />
  <script src="/media/system/js/core.js" type="text/javascript"></script>
  <script src="/media/system/js/mootools-core.js" type="text/javascript"></script>
  <script src="/media/system/js/caption.js" type="text/javascript"></script>
  <script src="/plugins/content/artsexylightbox/artsexylightbox/js/jquery.js" type="text/javascript"></script>
  <script src="/plugins/content/artsexylightbox/artsexylightbox/js/jquery.easing.1.3.js" type="text/javascript"></script>
  <script src="/plugins/content/artsexylightbox/artsexylightbox/js/sexylightbox.v2.2.jquery.min.js" type="text/javascript"></script>
  <script src="/plugins/content/artsexylightbox/artsexylightbox/js/jquery.flickr.js" type="text/javascript"></script>
  <script src="/plugins/content/artsexylightbox/artsexylightbox/js/jquery.nc.js" type="text/javascript"></script>
  <script src="/media/system/js/mootools-more.js" type="text/javascript"></script>

    <link rel="stylesheet" href="/css/style.css" type="text/css" />
    <link rel="stylesheet" href="/css/innerTemplate.css" type="text/css" />
    <link rel="stylesheet" href="/css/buildTheLV.css" type="text/css" />
    <link rel="stylesheet" href="/css/planTheLV.css" type="text/css" />
    <link rel="stylesheet" href="/css/jquery.lightbox-0.5.css" type="text/css" />
</head>
<body>
	<div id="divOuterContainer">
    	<div id="divInnerContainerTop">
        	<div id="divInnerContainerMain">
            	<div id="divInnerContainerChild" style="position: relative">
                    <div class="item-page">
	<h1>
	Articles	</h1>





	

	<p class="lvNavigation">
	<a href="javascript:slidePage(690);"> BUILD</a>&nbsp;|&nbsp;<a href="javascript:slidePage(691);">EXTERIOR</a></p>
<p>
	<div class="artsexylightbox_container" id="container_4ea173b7650fe"><a href='/build/three.jpg' rel='sexylightbox[artgallery_4ea173b765180]' class='artsexylightboxpreview'  title=''><img alt='image' class='artsexylightbox'  src='/build/three.jpg' /></a><a href='/build/IMGP0055.JPG' rel='sexylightbox[artgallery_4ea173b765180]' class='artsexylightboxpreview'  title=''><img alt='image' class='artsexylightbox'  src='/build/IMGP0055.JPG' /></a></div><script type="text/javascript" charset="utf-8">asljQuery(function(){asljQuery(document).ready(function(){if (!window.sexylightboxEnabled) {SexyLightbox.initialize({"path":"images/LVGhomesgallery/johnrobertLVG/build","name":"SLB","zIndex":65555,"color":"black","find":"sexylightbox","imagesdir":"/plugins/content/artsexylightbox/artsexylightbox/images","background":"bgSexy.png","backgroundIE":"bgSexy.gif","closeButton":"SexyClose.png","showDuration":200,"showEffect":"linear","closeDuration":400,"closeEffect":"linear","moveDuration":800,"resizeDuration":800,"moveEffect":"easeOutBack","resizeEffect":"easeOutBack","loadJQuery":1,"reflHeight":56,"reflGap":2,"yRadius":40,"xPos":285,"yPos":120,"titleBox":".cloud_carousel_title"});} if (!window.sexylightboxEnabled) {window.sexylightboxEnabled = true;}})});</script></p>
	
	</div>
                    <div id="divZoomPan">
                    
                        <span id="spnInnerContainerPrev" class="spanstyle"><a id="lnkPrevious" href="#">&lt;</a><span id="spnPrevious">&lt;</span></span><span id="spnInnerContainerZoom"><span id="spnZoom" class="spanstyle" style="overflow: hidden"><a id="lnkZoom" href="#" style="height: 30px">zoom<br/><br/>
                        
                       <img id="imgZoom" src="/images/blank.gif" alt="" style="margin-top: 40px"/></a></span></span><span id="spnInnerContainerPan"><span id="spnPan" class="spanstyle" style="overflow: hidden"><a id="lnkPan" href="#" style="style: height: 30px">&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;&nbsp;pan<br/>
                       
                       <img id="imgPan" src="/images/blank.gif" alt="" style="margin-top: 40px"/></a></span></span><span id="spnInnerContainerNext" class="spanstyle"><a id="lnkNext" href="#">&gt;</a><span id="spnNext">&gt;</span></span>
                       
                       <img id="imgLike" src="/images/blank.gif" alt="" style="margin-top: 40px"/></a></span></span><span id="spnInnerContainerPan"><span id="spnPan" class="spanstyle" style="overflow: hidden"><a id="lnkPan" href="javascript:popUp('/index.php?option=com_content&view=article&id=333')" style="height: 30px">like&nbsp;&nbsp;</a><br/><br/>
                    <img id="imgSearch" src="/images/blank.gif" alt="" style="margin-top: 40px"/></a></span></span><span id="spnInnerContainerPan"><span id="spnPan" class="spanstyle" style="overflow: hidden"><a id="lnkPan" href="#" style="style: height: 30px">|&nbsp;&nbsp;search<br/><br/>
                    <img id="imgShare" src="/images/blank.gif" alt="" style="margin-top: 40px"/></a></span></span><span id="spnInnerContainerPan"><span id="spnPan" class="spanstyle" style="overflow: hidden"><a id="lnkPan" href="#" style="height: 30px">&nbsp;&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;&nbsp;share<br/><br/></div>
                    </div>
              	</div>
            </div>
        </div>
        <div id="divInnerContainerBottom">
        	<div id="divInnerContainerBottom1">   	
            </div>
            <div id="divInnerContainerBottom2">
            </div>
        </div>
    </div>
</body>
</html>

 
Ray Paseur Commented:
I stored the XML here: http://www.laprbass.com/RAY_temp_rawcoder.xml
I removed the XML tag and stored only the HTML here: http://www.laprbass.com/RAY_temp_rawcoder.html

Please see: http://www.laprbass.com/RAY_temp_rawcoder.php  Neither file came anywhere close to rendering a usable object.  

<?php // RAY_temp_rawcoder.php
error_reporting(E_ALL);
echo "<pre>";

$xml = file_get_contents('RAY_temp_rawcoder.xml');
$doc = new DomDocument;
$doc->loadHTML($xml);
var_dump($doc);

$htm = file_get_contents('RAY_temp_rawcoder.html');
$doc = new DomDocument;
$doc->loadHTML($htm);
var_dump($doc);

 
rawcoder (Author) Commented:
I did remove some of the path-related items due to security concerns. Is there a structural problem with the HTML code?
 
Ray Paseur Commented:
There may be; I didn't study it closely -- I just assumed it was your minimum test case required to illustrate the issue, so I copied it and used it verbatim.

You can look at both the XML version and the HTML version at the links I posted above.  Those files will stay on my server for a while.
 
rawcoder (Author) Commented:
Could the issue be in the doctype or somewhere else?
 
rawcoder (Author) Commented:
The strange thing is that I can parse the body of the page, just not the script tags, even if I put the script tags in the body of the web page.
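
One way to test that claim in isolation is to load a tiny document and count the script elements the parser kept. The following is a minimal sketch, not the poster's code: the HTML string, the `/js/a.js` path, and the `main` id are made up for illustration.

```php
<?php
// Hypothetical test case: one external script in <head>, one inline script in <body>.
$html = '<html><head><script src="/js/a.js" type="text/javascript"></script></head>'
      . '<body><div id="main">Hello</div>'
      . '<script type="text/javascript">alert("Hello");</script></body></html>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Count the <script> elements that are present in the parsed tree.
$scripts = $doc->getElementsByTagName('script');
echo "script elements found: " . $scripts->length . "\n";

foreach ($scripts as $script) {
    $src = $script->getAttribute('src'); // empty string when the attribute is absent
    if ($src !== '') {
        echo "external: " . $src . "\n";
    } else {
        echo "inline: " . $script->textContent . "\n";
    }
}
```

If the count matches the number of script tags in the input, the parser did not drop them, and the loss is more likely happening in a later step such as node selection or output.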
 
Ray Paseur Commented:
Can you set up a minimalist example - just an A:B comparison that isolates the issue, please?  Thanks, ~Ray
 
rawcoder (Author) Commented:
I am getting closer. I am now trying to use file_get_contents to return the HTML from the PHP page, then look for the html tag and return everything within it as a string, but I am receiving the following error: Argument 1 passed to DOMDocument::importNode() must be an instance of DOMNode, null given .....
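
One common cause of that message, though not necessarily the cause here, is passing the result of `item(0)` straight to `importNode()` when the lookup matched nothing, because `item()` returns null in that case. A minimal sketch of a guard, using a made-up HTML string:

```php
<?php
// Hypothetical input; the real page would come from file_get_contents().
$html = '<html><body><p>Hi</p></body></html>';

libxml_use_internal_errors(true);
$source = new DOMDocument();
$source->loadHTML($html);
libxml_clear_errors();

// item(0) returns null when no element matches the tag name.
$node = $source->getElementsByTagName('body')->item(0);

$target = new DOMDocument();
if ($node === null) {
    echo "No matching element; nothing to import\n";
} else {
    // Deep-copy the node into the target document, then attach it.
    $copy = $target->importNode($node, true);
    $target->appendChild($copy);
    echo $target->saveHTML();
}
```

Checking for null before the import turns the fatal error into a message that shows which lookup failed.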
 
rawcoder (Author) Commented:
Is there a way to get all of the HTML back out of a DOMDocument after a call to loadHTML?
 
Ray Paseur Commented:
Yes.   Here is a simplified example.
<?php // RAY_temp_rawcoder.php
error_reporting(E_ALL);
echo "<pre>";

// READ THE SIMPLE HTML
$htm = file_get_contents('RAY_temp_rawcoder.html');

// SHOW THE TEST DATA
echo htmlentities($htm);

// LOAD A DOCUMENT
if ($doc = new DOMDocument('1.0'))
{
    $doc->formatOutput = TRUE;
    if ($doc->loadHTML($htm))
    {
        if ($doc->saveHTMLFile('RAY_temp_rawcoder.txt'))
        {
            echo "SEE HTML: <a href=RAY_temp_rawcoder.txt>HERE</a>";
        }
        else
        {
            echo "saveHTMLFile Failed";
        }
    }
    else
    {
         echo "loadHTML Failed";
    }
}
else
{
    echo "new DOMDocument Failed";
}


The HTML string will be shown in the output, and you can do View Source here.
http://www.laprbass.com/RAY_temp_rawcoder.html
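
As an alternative to writing a file, `saveHTML()` returns the serialized document as a string, which can be easier to inspect or pass along. A minimal sketch with a made-up input string:

```php
<?php
// Hypothetical input containing one inline script.
$html = '<html><head><script type="text/javascript">var x = 1;</script></head>'
      . '<body><p>Hi</p></body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Serialize the whole document back to an HTML string.
$out = $doc->saveHTML();

// Escape it so the markup is visible in a browser instead of being rendered.
echo "<pre>" . htmlentities($out) . "</pre>";
```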

Can you please step back from the technical details and just tell us what you are trying to accomplish with this exercise?  Not sure, but maybe there is a better way.
 
rawcoder (Author) Commented:
I had to create a site that loads the body of one page (a surrounding div) with the content of another web page. At first, I was parsing out the body section and passing that back via Ajax, but I noticed that there is JavaScript that needs to be refreshed when the new page is brought back. Therefore, instead of loading only the HTML section, I also need to parse out the script tags in the head and load them as well. I think this would work if I could get the div that contains the body, plus the JavaScript references, and pass them both back to the Ajax caller.
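
The approach described above could be sketched roughly as follows. This is an illustration under assumptions, not a tested solution for the poster's site: the `content` id, the `/js/gallery.js` path, and the JSON keys are all hypothetical.

```php
<?php
// Hypothetical page standing in for the one fetched on the server.
$html = '<html><head><script src="/js/gallery.js" type="text/javascript"></script></head>'
      . '<body><div id="content"><p>Gallery</p></div></body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Separate external script URLs from inline script bodies.
$external = array();
$inline = array();
foreach ($doc->getElementsByTagName('script') as $script) {
    if ($script->hasAttribute('src')) {
        $external[] = $script->getAttribute('src');
    } else {
        $inline[] = $script->textContent;
    }
}

// XPath is used here as one way to locate the wrapper div by its id.
$xpath = new DOMXPath($doc);
$div = $xpath->query('//div[@id="content"]')->item(0);

// Passing a node to saveHTML() serializes only that node (PHP 5.3.6 or later).
$fragment = ($div !== null) ? $doc->saveHTML($div) : '';

// Return both parts to the Ajax caller as JSON.
echo json_encode(array('html' => $fragment, 'scripts' => $external, 'inline' => $inline));
```

Note that script tags inserted into a page through innerHTML are not executed by the browser, so the client would still need to load the returned script URLs itself, for example by creating script elements.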
0
 
Ray Paseur Commented:
Wow, that sounds like an R&D project.  Maybe you should make a simple JS file with something like alert('Hello'); and put it into the simple HTML file.  You might be able to see if it works that way.