
PHP and HTML format errors


I have a PHP script reading from a MySQL database where teachers have entered narrative comments for students in association with their report card.

I made the mistake, perhaps, of allowing the PHP scripts to accept rich text format. That way, teachers could copy and paste tables of grades, entries from essays in Microsoft Word, etc. They wanted the formatting preserved.

When viewed in the editor, the text looks fine. In the MySQL table, it is cluttered with tags.

When I parse it using a PHP script to output to the screen for parents I run into bizarre formatting errors.

I get the following character where there should be blank spaces or apostrophes: ý

Here is an example.

The output on one line looks like: "ýýýýýýýýýýý Fredýs writing has become smoother since August"

The HTML source code after it has been parsed to the HTML site looks like:
<p class="MsoNormal"><span lang="EN-GB"><span style="mso-tab-count:1">ýýýýýýýýýýý </span>Fredýs
writing has become smoother since August..<span style="mso-spacerun: yes">ý </span>

The MySQL as it appears in phpMyAdmin looks like:
<p class="MsoNormal"><span lang="EN-GB"><span style="mso-tab-count:1">            </span>Fred's
writing has become smoother since August, and she organizes her thoughts much
better now than then.<span style="mso-spacerun: yes">  </span>


It looks like a little box in IE, a question mark in a diamond in Firefox. Ironically, it looks perfect in Google Chrome. I don't want to tell all the parents that they need to download Google Chrome in order to make it look good though. There would be angry parents.

Can I add something to the PHP code so that it parses all of the formatting for spaces and apostrophes without the weird characters?

1 Solution
I've seen this happen... and it was a pain figuring out what it was... but now I know!  Here was my response to the problem:

This is probably happening because you're writing your text in Microsoft Word, then copying and pasting it into the rich-text editor.

When you copy text from Microsoft Word, Word adds a bunch of invisible extra junk.  To avoid any future similar issues, I have 3 suggestions (in order of efficiency).

  1. Type the comments directly into the textbox.
  2. Type the comments in Notepad (Start -> All Programs -> Accessories -> Notepad). Then copy/paste from there.
  3. You could also type the text in Word, then copy/paste to Notepad, then copy/paste to the textarea, but that would mean an extra step.

If you're using TinyMCE, you can add a Paste from Word button to the toolbar.  After running a few tests, I've found that it's not as efficient as I'd hoped.  It will reduce the junk, but I can't guarantee that it will remove all of it.

I would assume/hope that other rich-text editors have something similar.
jkeagle13 (Author) commented:

Thanks for the advice. I am slowly learning!

The problem is that I now have ~ 2000 records that have been copied and pasted from Word and tagged up in the MySQL. I need to have them in readable format by midnight for public access.

I would think there should be some way to parse it without the formatting problems. The bizarre part is that Google Chrome reads it fine, no extra characters!

Any ideas?

You're going to need some HTML Purifier magic!

HTML Purifier is a free HTML-cleaner-upper PHP library.  You can write a little script that goes through your 2,000 records, passes each record's text through the purifier, and updates the record with the clean HTML.
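
A rough sketch of such a script is below. The table and column names (`comments`, `id`, `body`), the database name, and the credentials are all made up for illustration, so swap in your own, and back up the table before running anything like this. The function takes the cleaner as a callback, so you can try it with something harmless before plugging HTML Purifier in.

```php
<?php
declare(strict_types=1);

/**
 * Runs every stored comment through $clean and saves the ones that changed.
 * The table "comments" and its columns "id" and "body" are hypothetical names.
 * Returns the number of records that were updated.
 */
function cleanAllComments(PDO $pdo, callable $clean): int
{
    // About 2,000 rows fit in memory comfortably, so fetch them all up front.
    $rows = $pdo->query('SELECT id, body FROM comments')->fetchAll(PDO::FETCH_ASSOC);
    $update = $pdo->prepare('UPDATE comments SET body = :body WHERE id = :id');

    $changed = 0;
    $pdo->beginTransaction();
    foreach ($rows as $row) {
        $original = (string) $row['body'];
        $cleaned = $clean($original);
        if ($cleaned !== $original) {
            $update->execute([':body' => $cleaned, ':id' => $row['id']]);
            $changed++;
        }
    }
    $pdo->commit();

    return $changed;
}

// Usage sketch (assumes HTML Purifier is installed; adjust the include path,
// DSN, user name and password, all of which are placeholders):
//
// require_once 'HTMLPurifier.auto.php';
// $purifier = new HTMLPurifier(HTMLPurifier_Config::createDefault());
// $pdo = new PDO('mysql:host=localhost;dbname=school;charset=utf8mb4', 'user', 'password');
// $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
// echo cleanAllComments($pdo, [$purifier, 'purify']), " records updated\n";
```

Only records whose text actually changes are written back, so running it a second time should be harmless.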

It's easy to use and there's a lot of documentation on their site.

You should be able to pull it off before midnight!
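
One more thing that might be worth trying, though nobody in this thread has confirmed it is the cause: a box in IE and a question mark in a diamond in Firefox are what you typically see when a page is read as UTF-8 but the text contains bytes from another encoding. Word's curly apostrophes and non-breaking spaces are common culprits. The sketch below assumes the stored text is Windows-1252 (a guess) and that the mbstring extension is available; `toUtf8` is just an illustrative helper name.

```php
<?php
declare(strict_types=1);

/**
 * Converts text to UTF-8, ASSUMING anything that is not already valid UTF-8
 * is Windows-1252 (the usual encoding of text pasted from Word on Windows).
 * That assumption may not hold for your data, so test on a few records first.
 */
function toUtf8(string $text): string
{
    // Leave valid UTF-8 untouched so the function is safe to apply twice.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }

    return mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
}
```

You would call it on the text just before output, for example `echo toUtf8($row['body']);` (with `body` again being a made-up column name), and have the page declare its encoding, for example by sending `Content-Type: text/html; charset=UTF-8`. If the pages declare no encoding at all, that alone could explain why Chrome guesses right and the other browsers do not.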
