Grabbing HTML embedded in XML elements

Hi,

I have the following XML code that I want to parse using PHP 4:

<?xml version="1.0" standalone="yes" ?>
<!DOCTYPE GlossarXML [
  <!ELEMENT EXPLANATION (SHORT, DETAILED, BIBLIOGRAPHY, HLINKS)>
    <!ELEMENT SHORT (#PCDATA)>
    <!ELEMENT DETAILED (#PCDATA)>
    <!ELEMENT BIBLIOGRAPHY (SOURCE)>
      <!ELEMENT SOURCE (#PCDATA)>
    <!ELEMENT HLINKS (HLINK)>
      <!ELEMENT HLINK (HLINKTITLE, HLINKREF)>
        <!ELEMENT HLINKTITLE (#PCDATA)>
        <!ELEMENT HLINKREF (#PCDATA)>  
]>
 
<EXPLANATION>
  <SHORT>
     <p>this is a <b><font color="red">short</font></b> explanation showed on WAP devices.</p>
  </SHORT>
  <DETAILED>
     <p>this is the <b><font color="green">detailed</font></b> text displayed on non-WAP devices.</p>
  </DETAILED>
  <BIBLIOGRAPHY>
     <SOURCE>
        Berliner Abendblatt, 3. Ausgabe, Seite 3, Absatz 4
     </SOURCE>
  </BIBLIOGRAPHY>
  <HLINKS>
     <HLINK>
        <HLINKTITLE>Heise Verlag</HLINKTITLE>
        <HLINKREF>http://www.heise.de</HLINKREF>
     </HLINK>
  </HLINKS>
</EXPLANATION>


So, what I need is a PHP script that parses the XML code (assume it is contained in a string $xml_content).

The parser has to grab the content of the XML elements and sub-elements and put it into an associative array.

BUT for the elements SHORT and DETAILED it should return only the content between the surrounding XML tags as a whole, not the content of every HTML tag nested inside them!

Who can help?

Asked by: WebFerret
 
hernst42 commented:
With an XML parser it would look like this; you only have to rebuild the nested tags as text while you are inside SHORT or DETAILED:


function MENUstartElement($parser, $name, $attrs) {
    switch (strtoupper($name)) {
        case 'SHORT':
            $GLOBALS['inTag']=true;
            $GLOBALS['TagType']='s';
            break;

        case 'DETAILED':
            $GLOBALS['inTag']=true;
            $GLOBALS['TagType']='d';
            break;

        default:
            if ($GLOBALS['inTag']) {
                $GLOBALS['tmpTagContent'] .= "<$name";
                if ( count($attrs) >0) {
                    foreach ($attrs as $n => $v) {
                        $GLOBALS['tmpTagContent'] .= " $n=\"$v\"";
                    }
                }
                $GLOBALS['tmpTagContent'] .= '>';
            }
    }
}

function MENUendElement($parser, $name) {
    switch (strtoupper($name)) {
        case 'SHORT':
        case 'DETAILED':
            $GLOBALS['inTag']=false;
            $GLOBALS['extractedText'][$GLOBALS['TagType']][] = $GLOBALS['tmpTagContent'];
            $GLOBALS['tmpTagContent'] = '';
            break;

        default:
            if ($GLOBALS['inTag']) {
                $GLOBALS['tmpTagContent'] .= "</$name>";
            }
    }
}

function MENUcharacterData($parser, $data) {
    if ($GLOBALS['inTag']) {
        $GLOBALS['tmpTagContent'] .= $data;
    }
}

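// Initialise the shared state so the handlers never read undefined globals.
$GLOBALS['inTag'] = false;
$GLOBALS['TagType'] = '';
$GLOBALS['tmpTagContent'] = '';
$GLOBALS['extractedText'] = array();

$xml_parser = xml_parser_create();
// Keep element names exactly as written in the document (no upper-casing).
xml_parser_set_option($xml_parser, XML_OPTION_CASE_FOLDING, 0);
xml_set_element_handler($xml_parser, "MENUstartElement", "MENUendElement");
xml_set_character_data_handler($xml_parser, "MENUcharacterData");

// Report an error only when parsing actually fails.
if (!xml_parse($xml_parser, $xml_content, true)) {
    printf("XML error: %s at line %d",
           xml_error_string(xml_get_error_code($xml_parser)),
           xml_get_current_line_number($xml_parser));
}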

xml_parser_free($xml_parser);

var_dump($GLOBALS['extractedText']);
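
The handlers above collect only SHORT and DETAILED. The question also asks for the other elements in an associative array, so here is a rough sketch of how the same event-based approach might be extended. All names in it (glossStart, glossEnd, glossData, $gloss) and the shortened sample input are made up for illustration. It assumes SHORT and DETAILED are never nested inside themselves, and that storing each leaf element's trimmed text under its element name is an acceptable array layout.

```php
<?php
// Hypothetical shared state for the sketch.
$GLOBALS['gloss'] = array(
    'raw'    => array('SHORT' => true, 'DETAILED' => true), // kept as whole HTML
    'rawTag' => '',      // name of the raw element we are inside, or ''
    'buffer' => '',      // text or markup collected for the current element
    'result' => array(), // final associative array: element name => list of values
);

function glossStart($parser, $name, $attrs) {
    global $gloss;
    if ($gloss['rawTag'] !== '') {
        // Inside SHORT/DETAILED: rebuild the opening tag as text.
        $gloss['buffer'] .= "<$name";
        foreach ($attrs as $n => $v) {
            $gloss['buffer'] .= ' ' . $n . '="' . htmlspecialchars($v, ENT_COMPAT) . '"';
        }
        $gloss['buffer'] .= '>';
        return;
    }
    if (isset($gloss['raw'][$name])) {
        $gloss['rawTag'] = $name;
    }
    $gloss['buffer'] = ''; // every ordinary element starts with an empty buffer
}

function glossEnd($parser, $name) {
    global $gloss;
    if ($gloss['rawTag'] !== '' && $name !== $gloss['rawTag']) {
        // Closing tag of nested HTML: keep it as text.
        $gloss['buffer'] .= "</$name>";
        return;
    }
    $text = trim($gloss['buffer']);
    if ($text !== '') {
        // Pure container elements have no text of their own and are skipped.
        $gloss['result'][$name][] = $text;
    }
    $gloss['buffer'] = '';
    $gloss['rawTag'] = '';
}

function glossData($parser, $data) {
    global $gloss;
    // The parser decodes entities, so re-escape text that goes back into markup.
    $gloss['buffer'] .= ($gloss['rawTag'] !== '')
        ? htmlspecialchars($data, ENT_NOQUOTES)
        : $data;
}

// Shortened, hypothetical sample input based on the XML in the question.
$xml_content = '<EXPLANATION>'
    . '<SHORT><p>a <b><font color="red">short</font></b> text</p></SHORT>'
    . '<BIBLIOGRAPHY><SOURCE> Berliner Abendblatt </SOURCE></BIBLIOGRAPHY>'
    . '<HLINKS><HLINK><HLINKTITLE>Heise Verlag</HLINKTITLE>'
    . '<HLINKREF>http://www.heise.de</HLINKREF></HLINK></HLINKS>'
    . '</EXPLANATION>';

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
xml_set_element_handler($parser, 'glossStart', 'glossEnd');
xml_set_character_data_handler($parser, 'glossData');

if (!xml_parse($parser, $xml_content, true)) {
    printf("XML error: %s at line %d",
           xml_error_string(xml_get_error_code($parser)),
           xml_get_current_line_number($parser));
}

var_dump($GLOBALS['gloss']['result']);
```

Each key holds a list because an element such as HLINK may occur more than once. The flat layout drops the parent/child relationship (which HLINKTITLE belongs to which HLINKREF), so a nested array would need extra bookkeeping, for example a stack of open element names.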

 
hernst42 commented:
Try using this regular expression for that case:

preg_match_all('/<(SHORT|DETAILED)>(.*)<\/\1>/iUs', $xml_content, $m);
var_dump($m);

$m has the following structure:
count($m[1]): number of SHORT and DETAILED tags found
$m[1][$i]: name of the $i-th tag found (SHORT or DETAILED)
$m[2][$i]: content of the $i-th tag
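
As a small illustration of that structure, the matches could be folded into an associative array keyed by tag name. This is only a sketch: the $result variable and the shortened sample input are assumptions, not part of the original question.

```php
<?php
// Hypothetical sample input; the real $xml_content comes from the question.
$xml_content = '<EXPLANATION>'
    . '<SHORT><p>a <b>short</b> text</p></SHORT>'
    . '<DETAILED><p>a <b>detailed</b> text</p></DETAILED>'
    . '</EXPLANATION>';

preg_match_all('/<(SHORT|DETAILED)>(.*)<\/\1>/iUs', $xml_content, $m);

// Group the captured contents by upper-cased tag name.
$result = array();
for ($i = 0; $i < count($m[1]); $i++) {
    $key = strtoupper($m[1][$i]);
    $result[$key][] = trim($m[2][$i]);
}

var_dump($result);
```

Note that the pattern only matches opening tags without attributes, and it does not decode entities or validate the document the way a real XML parser would.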
 
WebFerret (author) commented:
LOL, okay, I forgot to mention that I don't want a RegEx solution but a generic one that recognizes HTML code and returns it as a single object, not split into separate HTML tags...

 
merwetta1 commented:
I think your solution will require some RegEx. What's wrong with a RegEx solution, and what is a "generic solution"?
 
hernst42 commented:
   Accept: hernst42 {http:#11849343}