Solved

How to get rid of ¿ in html?

Posted on 2013-01-15
Last Modified: 2013-02-02
I have an XML -> HTML process using a proprietary XSL file, which I can view but not change. Some of the resulting HTML includes characters that will not display in a browser except as "¿".

My first question is how can I find out exactly what character lies behind this ¿? I think one of the ¿s is the unicode character
&#8224;

which displays as a footnote dagger (†). But how can I be sure, since there are others as well?

My second question is how can I then search and replace these characters with their HTML entity equivalents? My code
$str = preg_replace('/\#8224', '&dagger;', $str); // dagger

isn't working.
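
One way to find out what lies behind a replacement glyph is to dump the raw bytes instead of trusting what the browser draws. The following is only a sketch, assuming the generated HTML is available as a PHP string; the helper name `dump_non_ascii` and the sample string are made up for illustration.

```php
<?php
// Hypothetical helper: list every non-ASCII byte in a string with its
// offset and hex value, so the bytes behind a "?" glyph can be identified.
function dump_non_ascii(string $s): array
{
    $found = array();
    $len = strlen($s);
    for ($i = 0; $i < $len; $i++) {
        $byte = ord($s[$i]);
        if ($byte > 0x7F) {   // anything outside 7-bit ASCII
            $found[] = sprintf('offset %d: 0x%02X', $i, $byte);
        }
    }
    return $found;
}

// Invented sample: a lone 0x99 byte, which is not a valid UTF-8 sequence.
$sample = "Slim\x99";
print_r(dump_non_ascii($sample));

// preg_match() with the /u modifier fails on input that is not valid UTF-8.
var_dump(preg_match('//u', $sample) === 1);   // bool(false)
```

If the dump shows isolated bytes in the 0x80 to 0x9F range, the text is probably a single-byte Windows encoding rather than UTF-8, which would explain the replacement glyphs on a UTF-8 page.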
Question by:agrarian3
12 Comments
 
LVL 109

Expert Comment

by:Ray Paseur
ID: 38780624
You almost certainly have a character code collision of some sort.  This article explains the background you'll need to understand.
http://www.joelonsoftware.com/articles/Unicode.html

Is the document being displayed in a UTF-8 HTML page?  If not, try that.  And if you can post a link to the XML file that contains some of the errant characters, I'll be glad to have a look.
 

Author Comment

by:agrarian3
ID: 38782827
Thank you for the link to the article. It was a good read.

But it doesn't help me find out what is the underlying code that appears as a question mark in a black diamond (displayed as an upside-down question mark above.)

I'm dealing with xml provided by government databases created by individual consumers using various programs, which the government xsl file translates. These characters are created by the government xsl file. It appears to be the decimal unicode character I mentioned above (from looking at the xsl file).

What I want to do is "clean up" the resultant html file by replacing certain characters with their html entity equivalents.

The document is being displayed in a UTF-8 HTML page. I've put a copy of my resultant html out on: http://pdr3d.reedfax.com/tst/test.html Although I was originally writing about the black diamonds in Table 2, you will notice several where bullets belong.
 
LVL 9

Expert Comment

by:crazedsanity
ID: 38782918
Doing a search+replace to find these characters is almost always a losing battle.  You'll find more and more of them, and eventually the code that's supposed to make the page readable gets broken and makes the page a complete disaster.

Generally this problem can be fixed by setting the proper encoding.  Presumably others consume this message without getting such errors, so you'll need to find out what type of encoding they're using.

For more information about encoding, I would HIGHLY suggest reading this: [ http://www.joelonsoftware.com/articles/Unicode.html ].
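
As a hedged illustration of "setting the proper encoding": if the stray bytes turn out to be Windows-1252 (an assumption that has to be checked against the real data), the text can be converted to UTF-8 once, before the HTML is produced, instead of being patched character by character. The helper name `to_utf8` is invented for this sketch.

```php
<?php
// Sketch: convert a string to UTF-8 only when it is not already valid UTF-8.
// ASSUMPTION: any input that is not valid UTF-8 is Windows-1252.
function to_utf8(string $s): string
{
    // preg_match() with the /u modifier fails on invalid UTF-8 input.
    if (preg_match('//u', $s) === 1) {
        return $s;   // already valid UTF-8, leave untouched
    }
    $converted = iconv('Windows-1252', 'UTF-8', $s);
    if ($converted === false) {
        throw new RuntimeException('Input is not convertible from Windows-1252');
    }
    return $converted;
}

// 0x99 (Windows-1252 trademark sign) becomes the UTF-8 bytes E2 84 A2.
echo bin2hex(to_utf8("SLIM\x99")), "\n";   // 534c494de284a2
```

The validity check is a heuristic: a short Windows-1252 string can occasionally also be valid UTF-8, so knowing the real source encoding is more reliable than guessing it.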
 
LVL 9

Expert Comment

by:crazedsanity
ID: 38782942
 
LVL 109

Expert Comment

by:Ray Paseur
ID: 38785509
xml provided by government databases...
Not exactly an unexpected problem; the governments have no reason to comply with standards or be accountable to any particular constituency, unless there is marked political pressure.  Let's try it another way.  What is your original source of the data, and what is your expected output from processing the data?

Also, have you considered changing the doctype to a standards-compliant expression?  When you render in quirks-mode there are many things that can go wrong.
 
LVL 109

Expert Comment

by:Ray Paseur
ID: 38786831
Still wondering... What is your original source of the data?  If you can give us a link to the data (before it is processed further by your scripts) we may be able to suggest something.
 

Author Comment

by:agrarian3
ID: 38828971
Thanks, Ray, for your persistence. My project got interrupted by our annual self evaluations and filling out forms.

The original source of data is XML files provided by the government (however, they are created by private companies and may contain non-standard MS Word characters). Using XMLSpy to view two different source documents, you can see that the data can use all sorts of values:
<name>Meridian Medical Technologies™, Inc.</name>

where TM is the ANSI character x99
BIOS LIFE&#xae; SLIM&#x2122;

in this case it looks like we're being given UTF-8 coding.
These names get stored in a SQL database and when retrieved look like ansi characters (meaning that they look like the (TM) and (R) symbols).
BIOS LIFE® SLIM™

where the TM is the superscripted trademark code.
Problem is when I try to use these values in my web script, it breaks at these characters saying they are invalid.

In my application, I have to display the information two different ways:
1) Use the source xml files to display a complete document using PHP and XSLTProcessor to transform the xml to a DOM object and then save to html to display.
2) Use the xml snippets that are saved to my MS SQL database to display part of the document after transforming to DOM and then html as above.

This encoding thing is driving me crazy.
 

Author Comment

by:agrarian3
ID: 38831303
Let me try my question another way:

I have data that occasionally includes various Windows ANSI characters (e.g., x99 for TM, xA9 for copyright, xAE for registered trademark, etc.). These characters do not display properly in web pages.

Is there a function in PHP that will allow me to automatically convert them properly? Or, do I have to have a ton of str_replace() calls that make each substitution separately? If I do have to use str_replace(), what is the proper way to match the ANSI code? I'm thinking something like:
$str = str_replace('/\x99','&trade;',$str); // trademark symbol

Is there a better way?
 

Author Comment

by:agrarian3
ID: 38832957
I've tried using PHP's strtr() using an array of character=>replacement pairs and have come across a glitch.

When I use the ascii codes
$clean_string = strtr($str, array(
	'\x99'   =>	'&trade;',		// trademark symbol
	'\xAE'   => 	'&reg;',		// registered trademark
	'\xA0'   => 	'&nbsp;'));	// non-breaking space

the substitution does not seem to work. However, if I use the characters themselves
	'™'   =>	'&trade;',		// trademark symbol
	'®'   => 	'&reg;',		// registered trademark
	' '   => 	'&nbsp;'));	// non-breaking space

the substitution seems to work fine.

I want to be able to use the codes so I can make sure I have them all covered. Any one know how I can do this?
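
A likely reason the first version fails: PHP does not interpret escape sequences such as \x99 inside single-quoted strings, so '\x99' is the four literal characters backslash, x, 9, 9. Double-quoted strings do decode them. A small sketch, assuming the data really contains single Windows-1252 bytes (the sample string is invented):

```php
<?php
// Single quotes keep the backslash sequence literal; double quotes decode it.
var_dump(strlen('\x99'));   // int(4)
var_dump(strlen("\x99"));   // int(1)

// Same replacement table, with double-quoted keys so each key is one byte.
$str = "BIOS LIFE\xAE SLIM\x99";
$clean_string = strtr($str, array(
    "\x99" => '&trade;',   // trademark symbol
    "\xAE" => '&reg;',     // registered trademark
    "\xA0" => '&nbsp;',    // non-breaking space
));
echo $clean_string, "\n";   // BIOS LIFE&reg; SLIM&trade;
```

One caution: this byte-wise replacement is only safe on single-byte text. In UTF-8 data the same byte values occur inside multi-byte characters (® is C2 AE, for example), and replacing them individually would corrupt those characters.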
 
LVL 109

Accepted Solution

by:
Ray Paseur earned 500 total points
ID: 38844729
Try using the chr() function like this.
<?php // RAY_temp_agrarian3.php.php
error_reporting(E_ALL);
echo chr(0x99); // SHOWS TM SYMBOL

Best regards, ~Ray
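
Building on the chr() idea, here is a sketch of how the replacement table could be keyed by numeric codes, which makes it easier to check that every code is covered. The function name `ansi_to_entities` is hypothetical, the codes are the ones mentioned in this thread, and the input is assumed to be single-byte Windows-1252 text rather than UTF-8.

```php
<?php
// Hypothetical helper: replace selected Windows-1252 bytes with HTML entities.
// ASSUMPTION: $str is single-byte (Windows-1252) text, not UTF-8.
function ansi_to_entities(string $str): string
{
    $codes = array(
        0x99 => '&trade;',   // trademark symbol
        0xA9 => '&copy;',    // copyright
        0xAE => '&reg;',     // registered trademark
        0xA0 => '&nbsp;',    // non-breaking space
    );

    // Turn each numeric code into the one-byte string strtr() needs as a key.
    $map = array();
    foreach ($codes as $code => $entity) {
        $map[chr($code)] = $entity;
    }
    return strtr($str, $map);
}

echo ansi_to_entities('SLIM' . chr(0x99)), "\n";   // SLIM&trade;
```

Keeping the table as code => entity pairs means a new character only needs one extra line, and strtr() applies all replacements in a single pass.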
 

Author Comment

by:agrarian3
ID: 38846853
Thank you, Ray! That seems to have done it!

I wish I could have accomplished the task just using the correct character code, but I wasn't making any progress that way. Hopefully this make-shift work-around will do the job!

My other option is to see if I can make friends with the government programmers and see what they do with this data to make it display properly.
 
LVL 109

Expert Comment

by:Ray Paseur
ID: 38846916
Thanks for the points.  Suggest you go with that last option and see if you can figure out what they're doing.  My guess (only a guess but an educated one) is that some parts of the documents are using ISO-8859-1 or equivalent.  It collides with UTF-8 when you get into some of the special characters.

Best regards, ~Ray