agrarian3

How to get rid of ¿ in html?

I have an XML -> HTML process using a proprietary XSL file which I can view, but not change. Some of the resulting HTML includes characters which will not display in a browser except as "¿".

My first question is: how can I find out exactly which character lies behind this ¿? I think one of the ¿s is the Unicode character
&#8224;

which displays as a footnote dagger (†). But how can I be sure, since there are others as well?

My second question is: how can I then search for these characters and replace them with their HTML entity equivalents? My code
$str = preg_replace('/\#8224',	'†',	$str);	// dagger

isn't working.
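
One way to answer the first question is to look at the raw bytes rather than at what the browser renders. The sketch below is only an illustration: `hex_dump_bytes()` is a hypothetical helper, and the sample string assumes the stray byte is a Windows-1252 dagger (0x86), which may not match the real data. A browser reading a page as UTF-8 shows the replacement glyph for bytes that are not valid UTF-8, so a validity check is a useful first test.

```php
<?php
// Hypothetical helper: list each byte of a string as two hex digits.
function hex_dump_bytes(string $s): string
{
    return implode(' ', str_split(bin2hex($s), 2));
}

// Assumed sample: the word "Note" followed by a lone 0x86 byte,
// which is the dagger in Windows-1252 but is not valid UTF-8 on its own.
$suspect = "Note\x86";

echo hex_dump_bytes($suspect), "\n";

// preg_match() with the /u modifier fails on input that is not valid UTF-8,
// so an empty pattern works as a simple validity test.
$isValidUtf8 = preg_match('//u', $suspect) === 1;
echo $isValidUtf8 ? "valid UTF-8\n" : "not valid UTF-8\n";

// For comparison, a real UTF-8 dagger (U+2020) is three bytes.
echo hex_dump_bytes("\u{2020}"), "\n";
```

If the dump shows a single byte in the 0x80 to 0x9F range where the odd character sits, the data is probably Windows-1252 rather than UTF-8.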
Ray Paseur

You almost certainly have a character code collision of some sort.  This article explains the background you'll need to understand.
http://www.joelonsoftware.com/articles/Unicode.html

Is the document being displayed in a UTF-8 HTML page?  If not, try that.  And if you can post a link to the XML file that contains some of the errant characters, I'll be glad to have a look.
agrarian3

ASKER

Thank you for the link to the article. It was a good read.

But it doesn't help me find out which underlying code appears as a question mark in a black diamond (displayed as an upside-down question mark above).

I'm dealing with xml provided by government databases created by individual consumers using various programs, which the government xsl file translates. These characters are created by the government xsl file. It appears to be the decimal unicode character I mentioned above (from looking at the xsl file).

What I want to do is "clean up" the resultant html file by replacing certain characters with their html entity equivalents.

The document is being displayed in a UTF-8 HTML page. I've put a copy of my resultant html out on: http://pdr3d.reedfax.com/tst/test.html Although I was originally writing about the black diamonds in Table 2, you will notice several where bullets belong.
Doing a search+replace to find these characters is almost always a losing battle.  You'll find more and more of them, and eventually the code that's supposed to make the page readable gets broken and makes the page a complete disaster.

Generally this problem can be fixed by setting the proper encoding.  Presumably others consume this message without getting such errors, so you'll need to find out what type of encoding they're using.

For more information about encoding, I would HIGHLY suggest reading this: [ http://www.joelonsoftware.com/articles/Unicode.html ].
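
As a rough sketch of what "setting the proper encoding" can mean in PHP: leave strings that are already valid UTF-8 alone and convert everything else from an assumed source encoding. The helper name `to_utf8()` and the default of Windows-1252 are assumptions for illustration; the real source encoding has to be confirmed with whoever produces the data.

```php
<?php
// Hypothetical helper: return $s unchanged if it is already valid UTF-8,
// otherwise treat it as $assumedSource (a guess) and convert it.
function to_utf8(string $s, string $assumedSource = 'Windows-1252'): string
{
    // The /u modifier makes preg_match() fail on invalid UTF-8 input.
    if (preg_match('//u', $s) === 1) {
        return $s;
    }

    $converted = iconv($assumedSource, 'UTF-8', $s);
    if ($converted === false) {
        throw new RuntimeException("Could not convert from $assumedSource");
    }
    return $converted;
}

// A trademark sign stored as the single Windows-1252 byte 0x99.
echo bin2hex(to_utf8("SLIM\x99")), "\n";
```

Converting once, at the point where the data enters the application, avoids chasing individual characters later.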
"xml provided by government databases..."
Not exactly an unexpected problem; the governments have no reason to comply with standards or be accountable to any particular constituency, unless there is marked political pressure.  Let's try it another way.  What is your original source of the data, and what is your expected output from processing the data?

Also, have you considered changing the doctype to a standards-compliant expression?  When you render in quirks-mode there are many things that can go wrong.
Still wondering... What is your original source of the data?  If you can give us a link to the data (before it is processed further by your scripts) we may be able to suggest something.
Thanks, Ray, for your persistence. My project got interrupted by our annual self evaluations and filling out forms.

The original sources of data are XML files provided by the government (however, they are created by private companies and may contain non-standard MS Word characters). Using XMLSpy to view two different source documents, you can see that the data can use all sorts of values:
<name>Meridian Medical Technologies™, Inc.</name>

where the TM is the ANSI character 0x99 (hex 99)
BIOS LIFE&#xae; SLIM&#x2122;

in this case it looks like we're being given UTF-8 coding.
These names get stored in a SQL database and when retrieved look like ansi characters (meaning that they look like the (TM) and (R) symbols).
BIOS LIFE® SLIM™

where the TM is the superscripted trademark code.
Problem is when I try to use these values in my web script, it breaks at these characters saying they are invalid.

In my application, I have to display the information two different ways:
1) Use the source xml files to display a complete document using PHP and XSLTProcessor to transform the xml to a DOM object and then save to html to display.
2) Use the xml snippets that are saved to my MS SQL database to display part of the document after transforming to DOM and then html as above.

This encoding thing is driving me crazy.
Let me try my question another way:

I have data that occasionally includes various Windows ANSI characters (e.g., 0x99 for TM, 0xA9 for copyright, 0xAE for registered trademark, etc.). These characters do not display properly in web pages.

Is there a function in PHP that will allow me to automatically convert them properly? Or, do I have to have a ton of str_replace() calls that make each substitution separately? If I do have to use str_replace(), what is the proper way to match the ANSI code? I'm thinking something like:
$str = str_replace('/\x99','&trade;',$str); // trademark symbol

Is there a better way?
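
One possible alternative to a long list of replacements is to convert the text to UTF-8 and let `htmlentities()` produce the named entities. This is a minimal sketch that assumes the input really is single-byte Windows-1252; the sample string is made up from the product name quoted earlier in the thread.

```php
<?php
// Assumption: $raw holds single-byte Windows-1252 text.
$raw = "BIOS LIFE\xAE SLIM\x99";

// Step 1: convert to UTF-8 so every character has one unambiguous meaning.
$utf8 = iconv('Windows-1252', 'UTF-8', $raw);

// Step 2: replace every character that has a named HTML 4.01 entity.
$html = htmlentities($utf8, ENT_QUOTES | ENT_HTML401, 'UTF-8');

echo $html, "\n";
```

Note that `htmlentities()` also escapes `<`, `>` and `&`, so it suits text values such as names, not a complete HTML document whose markup must survive.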
I've tried using PHP's strtr() using an array of character=>replacement pairs and have come across a glitch.

When I use the ascii codes
$clean_string = strtr($str, array(
	'\x99'   =>	'&trade;',		// trademark symbol
	'\xAE'   => 	'&reg;',		// registered trademark
	'\xA0'   => 	'&nbsp;'));	// non-breaking space

the substitution does not seem to work. However, if I use the characters themselves
	'™'   =>	'&trade;',		// trademark symbol
	'®'   => 	'&reg;',		// registered trademark
	' '   => 	'&nbsp;'));	// non-breaking space

the substitution seems to work fine.

I want to be able to use the codes so I can make sure I have them all covered. Does anyone know how I can do this?
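
A likely cause of the glitch: PHP only interprets `\x` escapes inside double-quoted strings, so `'\x99'` in single quotes is the four literal characters backslash, x, 9, 9. The sketch below uses double-quoted keys. It assumes `$str` is single-byte Windows-1252 text; on UTF-8 text these byte values also occur inside multi-byte characters (0xA0 is the second byte of "à", for example), so a byte-level replacement would corrupt it. The sample string is invented.

```php
<?php
// Assumption: $str is single-byte Windows-1252 text, not UTF-8.
$str = "Acme\x99\xA0Corp\xAE";

// Double quotes make PHP turn each \xNN escape into one raw byte.
$map = [
    "\x99" => '&trade;', // trademark symbol
    "\xAE" => '&reg;',   // registered trademark
    "\xA0" => '&nbsp;',  // non-breaking space
];

$clean_string = strtr($str, $map);
echo $clean_string, "\n";

// The difference between the two quoting styles, measured in bytes.
echo strlen('\x99'), " vs ", strlen("\x99"), "\n";
```

The literal characters probably worked because the script file happened to be saved in the same encoding as the data, which is fragile if either one changes.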
ASKER CERTIFIED SOLUTION
Ray Paseur
Thank you, Ray! That seems to have done it!

I wish I could have accomplished the task just using the correct character code, but I wasn't making any progress that way. Hopefully this make-shift work-around will do the job!

My other option is to see if I can make friends with the government programmers and see what they do with this data to make it display properly.
Thanks for the points. I suggest you go with that last option and see if you can figure out what they're doing. My guess (only a guess, but an educated one) is that some parts of the documents are using ISO-8859-1 or an equivalent. It collides with UTF-8 when you get into some of the special characters.

Best regards, ~Ray
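
The collision Ray describes can be seen by decoding one byte under different assumptions. This is an illustrative sketch only: it takes the byte 0x99 mentioned earlier in the thread and shows what it becomes when treated as Windows-1252, as ISO-8859-1, and as UTF-8.

```php
<?php
$byte = "\x99";

// Read as Windows-1252, 0x99 is the trademark sign U+2122.
$asCp1252 = iconv('Windows-1252', 'UTF-8', $byte);

// Read as ISO-8859-1, the same byte is the control character U+0099.
$asLatin1 = iconv('ISO-8859-1', 'UTF-8', $byte);

echo bin2hex($asCp1252), "\n";
echo bin2hex($asLatin1), "\n";

// Read as UTF-8, a lone 0x99 is not a valid sequence at all.
$validAsUtf8 = preg_match('//u', $byte) === 1;
echo $validAsUtf8 ? "valid UTF-8\n" : "not valid UTF-8\n";
```

The same stored byte therefore gives a trademark sign, an invisible control character, or a replacement glyph, depending only on which encoding the reader assumes.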