Solved

PHP XML Dom and HTML Entities

Posted on 2006-07-11
Last Modified: 2012-08-13
Hello,

I am trying to load an HTML document into the DOM using PHP and then convert it into my own XML format (built by string concatenation). I only look for specific tags inside the document, and those tags become part of the resulting XML (wrapped in CDATA).

I am facing a problem with entities. The script can load any document from the web, so the HTML documents may contain any of these entities:

&nbsp;
&iexcl;
&cent;
&pound;
&curren;
....

My problem is that DOM converts those entities into their printable characters.
I tried setting
$doc->substituteEntities = false;

If the XML I output is encoded as UTF-8 there is no problem as such: it displays well in the browser and is saved to a file correctly. The only issue is that the entities are converted, for example &nbsp; becomes a non-breaking space character. I want DOM to leave the entities untouched as I traverse the document. DOM returns all the entities in a document as XML_TEXT_NODE nodes.

So if I try to use PHP's htmlentities($nodeValue) to convert them back to their entity equivalents, it attaches meaningless characters to them. For example:

Â 
Â¡
Â¢

is the result when passed through htmlentities. See the Â added.

So this is my problem. I have tried for a few hours but haven't found a solution.

Also, is there a simpler way to dump everything inside an element rather than traversing each child node recursively?

Below is example output when a DOM text node containing all the HTML entities (converted to printable characters by DOM, I don't know why) is passed through PHP's htmlentities function. Notice the ?s, Ã and Â added:

I noticed that when I post, things like ?s and boxes are converted to � here on the forum.

 
¡
¢
£
¤
Â¥
¦
§
¨
©
ª
«
¬
­
®
¯
°
±
²
³
´
µ
¶
·
¸
¹
º
»
¼
½
¾
¿
�
�
�
�
�
�
�
�
�
�
�
�
�
�
�
�
�
�
�
�
�
�
�
�
�
�
�
�
�
�
�
�
à
á
â
ã
ä
Ã¥
æ
ç
è
é
ê
ë
ì
í
î
ï
ð
ñ
ò
ó
ô
õ
ö
÷
ø
ù
ú
û
ü
ý
þ
ÿ

Question by:Sukhwinder Singh
LVL 29

Expert Comment

by:TeRReF
ID: 17082014
If you are using PHP5, give simplexml a try
http://php.net/simplexml

What you could try is to convert all ampersands to &amp;
So:
&nbsp;
would become
&amp;nbsp;

That should take care of your problem...
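
A minimal sketch of that suggestion, assuming the markup is available as a string before it is handed to DOMDocument::loadHTML. The sample markup and the variable names are made up for illustration; they are not from the original script.

```php
<?php
// Hypothetical sample input; in the thread the HTML comes from an arbitrary web page.
$html = '<html><body><p>Price:&nbsp;&pound;5</p></body></html>';

// Escape every ampersand so the parser sees "&amp;nbsp;" and keeps the
// entity name as literal text instead of decoding it to a character.
$escaped = str_replace('&', '&amp;', $html);

$doc = new DOMDocument();
// @ silences libxml warnings about imperfect real-world markup.
@$doc->loadHTML($escaped);

$p = $doc->getElementsByTagName('p')->item(0);
echo $p->nodeValue, "\n"; // Price:&nbsp;&pound;5
```

Note that this blanket replacement also escapes ampersands inside attribute values (for example query strings in links) and any &amp; already present, so those would need separate handling if they matter.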

Author Comment

by:Sukhwinder Singh
ID: 17082376
But DOM gives me all the entities already converted, as XML_TEXT_NODE nodes:

function dump_element ($el)
{
    global $url;
    $nodeType = $el->nodeType;
    switch ($nodeType)
    {
        case XML_TEXT_NODE:
        {
            $fp = fopen("entities.txt", "a");
            fwrite($fp, $el->nodeValue); // writes the printable versions of the entities
            fclose($fp);
            // Adds a strange Â character (and others) before &nbsp; and the rest
            $temp = htmlentities($el->nodeValue, ENT_QUOTES);
            $str .= $temp;

            break;
        }
...........
recursive function

This is what DOM writes to entities.txt (see the code above) for the entities:

¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ­ ® ¯ ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿ À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ

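when the input in the HTML document is:

&nbsp; &iexcl; &cent; &pound; &curren; &yen; &brvbar; &sect; &uml; &copy; &ordf; &laquo; &not; &shy; &reg; &macr; &deg; &plusmn; &sup2; &sup3; &acute; &micro; &para; &middot; &cedil; &sup1; &ordm; &raquo; &frac14; &frac12; &frac34; &iquest; &Agrave; &Aacute; &Acirc; &Atilde; &Auml; &Aring; &AElig; &Ccedil; &Egrave; &Eacute; &Ecirc; &Euml; &Igrave; &Iacute; &Icirc; &Iuml; &ETH; &Ntilde; &Ograve; &Oacute; &Ocirc; &Otilde; &Ouml; &times; &Oslash; &Ugrave; &Uacute; &Ucirc; &Uuml; &Yacute; &THORN; &szlig; &agrave; &aacute; &acirc; &atilde; &auml; &aring; &aelig; &ccedil; &egrave; &eacute; &ecirc; &euml; &igrave; &iacute; &icirc; &iuml; &eth; &ntilde; &ograve; &oacute; &ocirc; &otilde; &ouml; &divide; &oslash; &ugrave; &uacute; &ucirc; &uuml; &yacute; &thorn; &yuml;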

And when that text goes through PHP's htmlentities, the output is:

  ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ­ ® ¯ ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿ � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ

Author Comment

by:Sukhwinder Singh
ID: 17089733
PLEASE DELETE THIS QUESTION. I have found the answer.
LVL 29

Expert Comment

by:TeRReF
ID: 17089826
If you do not share the answer here, your points will not be refunded. Of course, when you share your solution, the points will be added to your credit again :)

Author Comment

by:Sukhwinder Singh
ID: 17098433
Because the question was going to be deleted, I thought there was no benefit in posting the answer here.

The fix was passing the third parameter to htmlentities, which is the encoding. I passed 'utf' and the result seemed to be OK.
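
A small sketch of why the third parameter matters, under a few assumptions: the text coming out of DOM is UTF-8, the charset is written with PHP's documented name 'UTF-8' (the post above says 'utf'), and 'ISO-8859-1' is passed explicitly to reproduce what a single-byte default charset does. The sample string is made up for illustration.

```php
<?php
// UTF-8 bytes for a non-breaking space, a pound sign and e-acute,
// i.e. the kind of text DOM hands back after decoding the entities.
$text = "\xC2\xA0\xC2\xA3\xC3\xA9";

// Treating the UTF-8 bytes as ISO-8859-1 encodes each byte separately,
// which is where the stray Â and Ã come from.
$wrong = htmlentities($text, ENT_QUOTES, 'ISO-8859-1');

// Telling htmlentities the real encoding yields one entity per character.
$right = htmlentities($text, ENT_QUOTES, 'UTF-8');

echo $wrong, "\n"; // &Acirc;&nbsp;&Acirc;&pound;&Atilde;&copy;
echo $right, "\n"; // &nbsp;&pound;&eacute;
```

Each character in this range takes two bytes in UTF-8, and the first byte (0xC2 or 0xC3) is what shows up as Â or Ã when the wrong charset is assumed.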

Accepted Solution

by:
ee_ai_construct earned 0 total points
ID: 17306139
PAQ / Refund
ee ai construct, community support moderator