Solved

How to get rid of ¿ in html?

Posted on 2013-01-15
Last Modified: 2013-02-02
I have an XML -> HTML process that uses a proprietary XSL file, which I can view but not change. Some of the resulting HTML includes characters that a browser displays only as "¿".

My first question is: how can I find out exactly what character lies behind this ¿? I think one of the ¿s is the Unicode character
&#8224;

which displays as a footnote dagger (†). But how can I be sure, since there are others as well?

My second question is: how can I then search for these characters and replace them with their HTML entity equivalents? My code
$str = preg_replace('/\#8224',	'†',	$str);	// dagger

isn't working.
Question by:agrarian3
12 Comments
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 38780624
You almost certainly have a character code collision of some sort.  This article explains the background you'll need to understand.
http://www.joelonsoftware.com/articles/Unicode.html

Is the document being displayed in a UTF-8 HTML page?  If not, try that.  And if you can post a link to the XML file that contains some of the errant characters, I'll be glad to have a look.
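
One way to see exactly which bytes sit behind a replacement glyph is to dump the non-ASCII bytes of the generated HTML in hex. The following is only a minimal sketch, assuming the HTML is already loaded into a PHP string; the function name dump_non_ascii() and the sample string are hypothetical and do not come from the thread.

```php
<?php
// Hypothetical helper: list each run of non-ASCII bytes in $html
// together with its byte offset and hex value.
function dump_non_ascii(string $html): array
{
    $found = [];
    // No /u flag, so the pattern matches raw bytes in the range 0x80-0xFF.
    if (preg_match_all('/[\x80-\xFF]+/', $html, $matches, PREG_OFFSET_CAPTURE)) {
        foreach ($matches[0] as [$bytes, $offset]) {
            $found[] = ['offset' => $offset, 'hex' => bin2hex($bytes)];
        }
    }
    return $found;
}

// "\x99" is a lone Windows-1252 trademark byte;
// "\xE2\x80\xA0" is the UTF-8 encoding of the dagger (U+2020).
$sample = "Slim\x99 and a dagger \xE2\x80\xA0";
foreach (dump_non_ascii($sample) as $hit) {
    echo $hit['offset'], ': ', $hit['hex'], PHP_EOL;
}
```

A single byte such as 99 or ae on its own suggests single-byte (ANSI) data, while a sequence such as e280a0 is a well-formed UTF-8 character.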
 

Author Comment

by:agrarian3
ID: 38782827
Thank you for the link to the article. It was a good read.

But it doesn't help me find out what underlying code appears as a question mark in a black diamond (displayed as an upside-down question mark above).

I'm dealing with xml provided by government databases created by individual consumers using various programs, which the government xsl file translates. These characters are created by the government xsl file. It appears to be the decimal unicode character I mentioned above (from looking at the xsl file).

What I want to do is "clean up" the resultant html file by replacing certain characters with their html entity equivalents.

The document is being displayed in a UTF-8 HTML page. I've put a copy of my resultant html out on: http://pdr3d.reedfax.com/tst/test.html Although I was originally writing about the black diamonds in Table 2, you will notice several where bullets belong.
 
LVL 9

Expert Comment

by:crazedsanity
ID: 38782918
Doing a search+replace to find these characters is almost always a losing battle.  You'll find more and more of them, and eventually the code that's supposed to make the page readable gets broken and makes the page a complete disaster.

Generally this problem can be fixed by setting the proper encoding.  Presumably others consume this message without getting such errors, so you'll need to find out what type of encoding they're using.

For more information about encoding, I would HIGHLY suggest reading this: [ http://www.joelonsoftware.com/articles/Unicode.html ].
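
As a rough illustration of fixing the encoding instead of patching individual characters, the sketch below converts the text to UTF-8 before it is output. It assumes that any input which is not already valid UTF-8 is Windows-1252; the thread never confirms the real source encoding, and the function name to_utf8() is hypothetical. It relies on PHP's mbstring extension.

```php
<?php
// Hypothetical helper: return $text as UTF-8.
// ASSUMPTION: input that is not valid UTF-8 is Windows-1252.
function to_utf8(string $text): string
{
    // Leave the text alone if it is already valid UTF-8.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    // Maps 0x99 to U+2122 (trademark), 0xAE to U+00AE (registered), and so on.
    return mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
}

// Two single-byte ANSI characters become proper UTF-8 sequences.
echo to_utf8("BIOS LIFE\xAE SLIM\x99"), PHP_EOL;
```

Converting once, at the point where the data enters the script, avoids maintaining a growing list of per-character replacements.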
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 38785509
"xml provided by government databases..."
Not exactly an unexpected problem; the governments have no reason to comply with standards or be accountable to any particular constituency, unless there is marked political pressure.  Let's try it another way.  What is your original source of the data, and what is your expected output from processing the data?

Also, have you considered changing the doctype to a standards-compliant expression?  When you render in quirks-mode there are many things that can go wrong.
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 38786831
Still wondering... What is your original source of the data?  If you can give us a link to the data (before it is processed further by your scripts) we may be able to suggest something.
 

Author Comment

by:agrarian3
ID: 38828971
Thanks, Ray, for your persistence. My project got interrupted by our annual self evaluations and filling out forms.

The original sources of the data are XML files provided by the government (although they are created by private companies and may contain non-standard MS Word characters). Viewing two different source documents in XMLSpy shows that the data can use all sorts of values:
<name>Meridian Medical Technologies™, Inc.</name>

where the TM is the ANSI character x99
BIOS LIFE&#xae; SLIM&#x2122;

In this case it looks like we're being given UTF-8 coding.
These names get stored in a SQL database and, when retrieved, look like ANSI characters (meaning that they look like the (TM) and (R) symbols).
BIOS LIFE® SLIM™

where the TM is the superscripted trademark code.
The problem is that when I try to use these values in my web script, it breaks at these characters, saying they are invalid.

In my application, I have to display the information two different ways:
1) Use the source xml files to display a complete document using PHP and XSLTProcessor to transform the xml to a DOM object and then save to html to display.
2) Use the xml snippets that are saved to my MS SQL database to display part of the document after transforming to DOM and then html as above.

This encoding thing is driving me crazy.
 

Author Comment

by:agrarian3
ID: 38831303
Let me try my question another way:

I have data that occasionally includes various Windows ANSI characters (e.g., x99 for TM, xA9 for copyright, xAE for registered trademark). These characters do not display properly in web pages.

Is there a function in PHP that will allow me to automatically convert them properly? Or, do I have to have a ton of str_replace() calls that make each substitution separately? If I do have to use str_replace(), what is the proper way to match the ANSI code? I'm thinking something like:
$str = str_replace('/\x99','&trade;',$str); // trademark symbol

Is there a better way?
 

Author Comment

by:agrarian3
ID: 38832957
I've tried using PHP's strtr() using an array of character=>replacement pairs and have come across a glitch.

When I use the ascii codes
$clean_string = strtr($str, array(
	'\x99'   =>	'&trade;',		// trademark symbol
	'\xAE'   => 	'&reg;',		// registered trademark
	'\xA0'   => 	'&nbsp;'));	// non-breaking space

the substitution does not seem to work. However, if I use the characters themselves
	'™'   =>	'&trade;',		// trademark symbol
	'®'   => 	'&reg;',		// registered trademark
	' '   => 	'&nbsp;'));	// non-breaking space

the substitution seems to work fine.

I want to be able to use the codes so I can make sure I have them all covered. Does anyone know how I can do this?
 
LVL 108

Accepted Solution

by:Ray Paseur (earned 500 total points)
ID: 38844729
Try using the chr() function like this.
<?php // RAY_temp_agrarian3.php.php
error_reporting(E_ALL);
echo chr(0x99); // SHOWS TM SYMBOL
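
The same idea can be carried into a strtr() replacement map by building the keys with chr(). This is only a sketch: the function name ansi_to_entities() is hypothetical, and it assumes the input is single-byte Windows ANSI text. It also shows why the earlier attempt with '\x99' failed: PHP expands \x escapes only inside double-quoted strings.

```php
<?php
// Hypothetical helper: swap selected Windows ANSI bytes for HTML entities.
// ASSUMPTION: $str is single-byte (Windows-1252) text, not UTF-8.
function ansi_to_entities(string $str): string
{
    return strtr($str, [
        chr(0x99) => '&trade;', // trademark symbol
        chr(0xAE) => '&reg;',   // registered trademark
        chr(0xA9) => '&copy;',  // copyright
        chr(0xA0) => '&nbsp;',  // non-breaking space
    ]);
}

// "\x99" in double quotes is the same single byte as chr(0x99);
// '\x99' in single quotes is the four literal characters \, x, 9, 9.
echo ansi_to_entities("BIOS LIFE\xAE SLIM\x99"), PHP_EOL;
```

A byte-level map like this is not safe on text that is already UTF-8, because bytes such as xA0 also occur inside multi-byte sequences (the UTF-8 dagger is xE2 x80 xA0) and would be corrupted.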

Best regards, ~Ray
 

Author Comment

by:agrarian3
ID: 38846853
Thank you, Ray! That seems to have done it!

I wish I could have accomplished the task just using the correct character code, but I wasn't making any progress that way. Hopefully this make-shift work-around will do the job!

My other option is to see if I can make friends with the government programmers and see what they do with this data to make it display properly.
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 38846916
Thanks for the points.  I suggest you go with that last option and see if you can figure out what they're doing.  My guess (only a guess, but an educated one) is that some parts of the documents are using ISO-8859-1 or equivalent.  It collides with UTF-8 when you get into some of the special characters.
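
The collision can be seen with a single character. The sketch below is illustrative only and uses the registered-trademark sign, which ISO-8859-1 stores as one byte and UTF-8 stores as two; it relies on PHP's mbstring extension. Note that x99 is the trademark sign only in Windows-1252; in strict ISO-8859-1 that byte is a control character.

```php
<?php
// The registered-trademark sign (U+00AE) as stored in each encoding.
$latin1 = "\xAE";     // ISO-8859-1: one byte
$utf8   = "\xC2\xAE"; // UTF-8: two bytes

// A lone 0xAE byte is not valid UTF-8, so a UTF-8 page cannot display it.
var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($utf8, 'UTF-8'));   // bool(true)

// Converting from ISO-8859-1 yields the two-byte UTF-8 form.
var_dump(mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1') === $utf8); // bool(true)
```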

Best regards, ~Ray