kingent85
asked on
Parse Html File
I have an HTML file that I need to parse. The file itself must not be changed; I simply need to pull information out of its tags. It contains multiple tables and many repeated tags. I have found a parser class that I think may work, but I'm not sure it will.
When the file contains just one table the information is pulled just fine, but as soon as I add a second table it crashes.
My goal is to import certain information from this HTML file into an array so that I can load it into my database.
I'm attaching the main code, which references a test file that I created. The test file is very small, so I'll paste it in here:
----------------------code----------------------
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Untitled Document</title>
</head>
<body>
Hello World!
<p>Hello Dustin!</p>
<table id="CDFTradeDetailFull1" cellspacing="0" cellpadding="0" width="650" summary="" border="0">
<tbody>
<tr>
<td class="arial11Black" valign="bottom"><br />
<strong>ALLTEL
COMMUNICATIONS </strong></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
</body>
</html>
----------------------end code----------------------
Please help
<?php
/**
 * HTML/XML Parser Class
 *
 * This is a helper class that is used to parse HTML and XML. A unique feature of this parsing class
 * is the fact that it includes support for innerHTML (which isn't easy to do).
 *
 * @author Dennis Pallett
 * @copyright Dennis Pallett 2006
 * @package HTML_Parser
 * @version 1.0
 */
// Helper Class
// To parse HTML/XML
Class HTML_Parser {
    // Private properties
    var $_parser;
    var $_tags = array();
    var $_html;
    var $output = array();
    var $strXmlData;
    var $_level = 0;
    var $_outline;
    var $_tagcount = array();
    var $xml_error = false;
    var $xml_error_code;
    var $xml_error_string;
    var $xml_error_line_number;

    function get_html () {
        return $this->_html;
    }

    function parse($strInputXML) {
        $this->output = array();
        // Translate entities
        $strInputXML = $this->translate_entities($strInputXML);
        $this->_parser = xml_parser_create ();
        xml_parser_set_option($this->_parser, XML_OPTION_CASE_FOLDING, true);
        xml_set_object($this->_parser,$this);
        xml_set_element_handler($this->_parser, "tagOpen", "tagClosed");
        xml_set_character_data_handler($this->_parser, "tagData");
        $this->strXmlData = xml_parse($this->_parser,$strInputXML );
        if (!$this->strXmlData) {
            $this->xml_error = true;
            $this->xml_error_code = xml_get_error_code($this->_parser);
            $this->xml_error_string = xml_error_string(xml_get_error_code($this->_parser));
            $this->xml_error_line_number = xml_get_current_line_number($this->_parser);
            return false;
        }
        return $this->output;
    }

    function tagOpen($parser, $name, $attr) {
        // Increase level
        $this->_level++;
        // Create tag:
        $newtag = $this->create_tag($name, $attr);
        // Build tag
        $tag = array("name"=>$name,"attr"=>$attr, "level"=>$this->_level);
        // Add tag
        array_push ($this->output, $tag);
        // Add tag to this level
        $this->_tags[$this->_level] = $tag;
        // Add to HTML
        $this->_html .= $newtag;
        // Add to outline
        $this->_outline .= $this->_level . $newtag;
    }

    function create_tag ($name, $attr) {
        // Create tag:
        # Begin with name
        $tag = '<' . strtolower($name) . ' ';
        # Create attribute list
        foreach ($attr as $key=>$val) {
            $tag .= strtolower($key) . '="' . htmlentities($val) . '" ';
        }
        # Finish tag
        $tag = trim($tag);
        switch(strtolower($name)) {
            case 'br':
            case 'input':
                $tag .= ' /';
                break;
        }
        $tag .= '>';
        return $tag;
    }

    function tagData($parser, $tagData) {
        if(trim($tagData)) {
            if(isset($this->output[count($this->output)-1]['tagData'])) {
                $this->output[count($this->output)-1]['tagData'] .= $tagData;
            } else {
                $this->output[count($this->output)-1]['tagData'] = $tagData;
            }
        }
        $this->_html .= htmlentities($tagData);
        $this->_outline .= htmlentities($tagData);
    }

    function tagClosed($parser, $name) {
        // Add to HTML and outline
        switch (strtolower($name)) {
            case 'br':
            case 'input':
                break;
            default:
                $this->_outline .= $this->_level . '</' . strtolower($name) . '>';
                $this->_html .= '</' . strtolower($name) . '>';
        }
        // Get tag that belongs to this end
        $tag = $this->_tags[$this->_level];
        $tag = $this->create_tag($tag['name'], $tag['attr']);
        // Try to get innerHTML
        $regex = '%' . preg_quote($this->_level . $tag, '%') . '(.*?)' . preg_quote($this->_level . '</' . strtolower($name) . '>', '%') . '%is';
        preg_match ($regex, $this->_outline, $matches);
        // Get innerHTML
        if (isset($matches['1'])) {
            $innerhtml = $matches['1'];
        }
        // Remove level identifiers
        $this->_outline = str_replace($this->_level . $tag, $tag, $this->_outline);
        $this->_outline = str_replace($this->_level . '</' . strtolower($name) . '>', '</' . strtolower($name) . '>', $this->_outline);
        // Add innerHTML
        if (isset($innerhtml)) {
            $this->output[count($this->output)-1]['innerhtml'] = $innerhtml;
        }
        // Fix tree
        $this->output[count($this->output)-2]['children'][] = $this->output[count($this->output)-1];
        array_pop($this->output);
        // Decrease level
        $this->_level--;
    }

    function translate_entities($xmlSource, $reverse =FALSE) {
        static $literal2NumericEntity;
        if (empty($literal2NumericEntity)) {
            $transTbl = get_html_translation_table(HTML_ENTITIES);
            foreach ($transTbl as $char => $entity) {
                if (strpos('&"<>', $char) !== FALSE) continue;
                $literal2NumericEntity[$entity] = '&#'.ord($char).';';
            }
        }
        if ($reverse) {
            return strtr($xmlSource, array_flip($literal2NumericEntity));
        } else {
            return strtr($xmlSource, $literal2NumericEntity);
        }
    }
}
//#####################################
// get contents of a file into a string
$filename = "testfile.html";
$handle = fopen($filename, "r");
$html = fread($handle, filesize($filename));
fclose($handle);
//#####################################
// To be used like this
$parser = new HTML_Parser;
$output = $parser->parse($html);
$tag = $output['0'];
$text = $tag['children']['1']['tagData'];
$text3 = $tag['innerhtml']; // key must be quoted; a bare innerhtml is read as an undefined constant
$text2 = $tag['children']['1']['children']['1']['children']['0']['children']['0']['children']['0']['children']['1']['tagData'];
echo "Text is $text<br>$text2<br>";
//echo "$text3<br>";
echo "<pre>";
print_r ($output);
echo "</pre>";
?>
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
What does $this need to be set to? When I put that in, it just prints "$this" instead of the variable's value. I'd like to see where it reports the errors.
Also, the problem is that the HTML is generated by going to File and using Save As to save the page as an HTML document. We then want to parse that saved file. We are on a secure server and can't access the page directly, so we have to save it first and then parse it. Editing the file manually for everyone is a bit hectic, but I will do it if I have to.
Thanks
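On the $this question: $this only exists inside the methods of a class. Outside the class you read the same properties through the object variable, for example $parser->xml_error_string and $parser->xml_error_line_number after $parser->parse($html) has returned false. The sketch below shows the same error lookup using PHP's XML functions directly. The function name describe_xml_error and the sample markup are made up for illustration; they are not part of HTML_Parser.

```php
<?php
// Hypothetical helper (not part of HTML_Parser): returns a description of the
// first XML error found in $markup, or null if the markup is well-formed.
function describe_xml_error($markup) {
    $parser = xml_parser_create();
    $ok = xml_parse($parser, $markup, true); // true = this is the final chunk
    if ($ok) {
        return null;
    }
    $code = xml_get_error_code($parser);
    return sprintf(
        'XML error %d (%s) on line %d',
        $code,
        xml_error_string($code),
        xml_get_current_line_number($parser)
    );
}

// A <br> with no closing slash is fine for a browser but fatal for an XML parser.
$broken = "<table>\n<tr><td>ALLTEL<br></td></tr>\n</table>";
$error = describe_xml_error($broken);
echo $error === null ? "Parsed cleanly\n" : $error . "\n";
```

Printing the line number this way should point at the first place in the saved file that the XML parser cannot accept.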
ASKER
Alright, I've gone through the HTML and I see that there are tons of mistakes in the code. What is a better way to export to HTML from the web browser so that this works?
SOLUTION
ASKER
I'm trying to get data that is displayed on the web. I need to get Company name, Date Opened, etc. What is the best way to save the source code? Should I parse it as a text file, or what exactly? There may be 5 of the same type.
------Example-----
Company name: Crednology
Date Opened: 08/24/2006
Date Closed: NA
-----End------
That's the kind of information I'm trying to pull. I've played with setting up variables named Company name etc., and it worked, except that I couldn't figure out how to handle multiple Company Names and so on. Also, because there were tables inside of tables, it tended not to work.
I'd like to be able to get past the missing end tags and other mistakes, because we can't edit the HTML file every time.
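One way to tolerate missing end tags, sketched here as an alternative to the XML-based class above, is PHP's DOM extension: DOMDocument::loadHTML uses an HTML parser that repairs sloppy markup instead of stopping at the first mistake. The markup below is invented for the example (only the table id and the sample values come from this thread), and it assumes the DOM extension is enabled.

```php
<?php
// Deliberately sloppy, made-up markup: unclosed <td> and <tr>, table inside a table.
$html = '<table id="outer"><tr><td>
  <table id="CDFTradeDetailFull1"><tr><td><strong>Crednology</strong><td>08/24/2006
  </table>
</td></tr></table>';

libxml_use_internal_errors(true); // collect markup warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);            // the HTML parser fills in the missing end tags
libxml_clear_errors();

// Pick out only the cells of the inner table, addressed by its id.
$xpath = new DOMXPath($doc);
$cells = array();
foreach ($xpath->query('//table[@id="CDFTradeDetailFull1"]//td') as $td) {
    // Collapse runs of whitespace and trim the cell text.
    $cells[] = trim(preg_replace('/\s+/', ' ', $td->textContent));
}
print_r($cells);
```

Because the query is anchored on the table id, an outer layout table does not get in the way, which may help with the tables-inside-tables problem.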
SOLUTION
ASKER
Well, I have it parsed now, and it goes out to an array where I can specify the element via [children][0][tag] etc., depending on where it is in the array.
Now that I have it filtered, what I'd like to do is something like this:
----------------------example----------------------
<table>
<tr>
<td>
Creditor: American Express</td>
<br> Date Opened: 08/05/08</br>
</tr>
<tr>
<td>
Creditor: Discover</td>
<br> Date Opened: 08/06/08</br>
Date Closed: na
</tr>
</table>
----------------------end----------------------
I have cleaned up the HTML so that it is proper and there won't be any errors. What's the best way to search through it and pull out Creditor and Date Opened so that the result looks like this?
----------------------output----------------------
American Express
08/05/08
Discover
08/06/08
----------------------end----------------------
That pulls out exactly what's needed. Notice that I put Date Closed in the example but left it out of the output. That is because I'd like to be able to specify which fields to look for and have the code search through and pull out only those. I've set up an array that kind of did this, but it didn't really pan out. The problem was that I didn't know how to loop so that it pulls multiple things, like the creditor and date opened, for every entry that exists, as in the example above. I need to be able to attach a distinct value to each one so that I can pull it into the database.
Any ideas?
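A rough sketch of one way to do this: loop over the table rows, split each cell at the first colon into a label and a value, and keep only the labels you ask for. It assumes each "Label: value" pair sits in its own <td>, which is a tidied-up version of the example above rather than the exact markup, and the function name extract_fields is made up for illustration.

```php
<?php
// Assumed, tidied-up version of the example: one "Label: value" pair per cell.
$html = '<table>
<tr><td>Creditor: American Express</td><td>Date Opened: 08/05/08</td></tr>
<tr><td>Creditor: Discover</td><td>Date Opened: 08/06/08</td><td>Date Closed: na</td></tr>
</table>';

// Hypothetical helper: returns one associative array per table row,
// holding only the labels listed in $wanted.
function extract_fields($html, array $wanted) {
    libxml_use_internal_errors(true); // keep markup warnings quiet
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $records = array();
    foreach ($doc->getElementsByTagName('tr') as $row) {
        $record = array();
        foreach ($row->getElementsByTagName('td') as $cell) {
            // Split "Label: value" at the first colon only.
            $parts = explode(':', $cell->textContent, 2);
            if (count($parts) === 2 && in_array(trim($parts[0]), $wanted, true)) {
                $record[trim($parts[0])] = trim($parts[1]);
            }
        }
        if ($record) {
            $records[] = $record; // one entry per row that had a wanted field
        }
    }
    return $records;
}

$records = extract_fields($html, array('Creditor', 'Date Opened'));
foreach ($records as $record) {
    echo implode("\n", $record), "\n";
}
```

Each element of $records is keyed by label, so a row can be inserted into the database as one record, and the list of wanted labels can be changed without touching the loop.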
ASKER
I'll split the points