How to store this character in MySQL correctly, clean input field with PHP

I'm having trouble developing a PHP function which allows users to copy paste content into a text area and have it inserted into a database and spit out on another page. Semantically I'm fine however there are some characters where users have copy pasted that somehow are not encoded into their respective html entities and when I spit them back out, I get the ¿ character instead of what was supposed to be there.

Mainly it seems that the ---->    ’     <------ is the problem, as well as the -----> “ <-------- . These seem to just be right and left single and double quotes however I have no found a single PHP function that will convert these into the correct html entity!

Any help would be greatly appreciated.
LVL 1
TheFuteballerAsked:
Who is Participating?
 
hexer4uConnect With a Mentor Commented:
@TheFuteballer
Use the meta tag:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
to have the browser treat the characters as utf8.
This is only usefull if the file IS saved as utf8.
@rjdown: if the file is saved ISO8859-1 and you use utf8 meta tag, the browser will treat all encoding as utf8, even though they are ISO. The meta tag should reflect the file encoding.

@TheFuteballer
Also, take a look at the database tables and see the encoding for the entire database, the encoding for each table and for each field that requires it (varchar, text, etc). Make sure they are correct. Usually you should use utf8_unicode_ci if you don't require anything else.
If you have phpmyadmin you ca use that to see the encodings.

Again! carefull when changing encodings. Make an export before doing anything as conversions may mess up your characters.
Also, take a look at http://php.net/manual/en/function.utf8-encode.php
0
 
themrrobertCommented:
Maybe when you are pasting you are actually pasting 'Smart' [stupid] quotes?
0
 
themrrobertCommented:
These 'smart' characters are typically utf-8/16 / unicode and may be 16bit which php can't handle without mbstring. Even with mbstring, you need to make sure functions are unicode compatible.

You are better off just erasing the quotes and re-entering them, but you can try and find the asc code that is returned from each of the 4 characters and try to replace with normal quotes based on that.

SMART QUOTES ARE NOT PROGRAMMERS FRIENDS!!! =p

There may be a better solution, but if php wont do it natively, you will need to recover the asc data and possibly other data and replace via

str_replace(array("\x66","\x67"), '"',  $myString);

and again for the ` and ' single quotes. (note: the numbers i gave are not the correct ascii code, just an example to use once you find the asc code
0
Cloud Class® Course: CompTIA Healthcare IT Tech

This course will help prep you to earn the CompTIA Healthcare IT Technician certification showing that you have the knowledge and skills needed to succeed in installing, managing, and troubleshooting IT systems in medical and clinical settings.

 
TheFuteballerAuthor Commented:
Is there anyway to convert smart quotes to their html entity? I've even tried doing preg replace (and more simply str_replace) to swap out the characters before I send it over to the database and it still doesn't seem to work.

As an example, the text on this page http://bit.ly/bSqP2l gives me issues with the quotes as well.
0
 
TheFuteballerAuthor Commented:
Thanks themrrobert, I'll try out your solution and hopefully it'll rid me of this horrible nightmare :)
0
 
themrrobertCommented:
Here is a function with the correct codes: (also replaces the word double hyphen)

<?php
 function convert_smart_quotes($string) {
  //converts smart quotes to normal quotes.
  $search = array(chr(145), chr(146), chr(147), chr(148), chr(151));
  $replace = array("'", "'", '"', '"', '-');

  return str_replace($search, $replace, $string);
 }
?>

Open in new window

0
 
themrrobertCommented:
Use this function to keep smart quotes

<?php
 function convert_smart_quotes($string) {
  //converts smart quotes to normal quotes.
  $search = array(chr(145), chr(146), chr(147), chr(148), chr(151));
  $replace = array("&lsquo;", "&rsquo;", '&ldquo;', '&rdquo;', '-');

  return str_replace($search, $replace, $string);
 }
?>

Open in new window

0
 
Dave BaldwinFixer of ProblemsCommented:
Use 'addslashes()' to escape quotes before sending them to the database and 'stripslashes()" to get rid of them after you retreive them.
0
 
themrrobertCommented:
Here is same code with the correct comments:

This will keep smart quotes on your page, and be php / html friendly.

<?php
 function convert_smart_quotes($string) {
  //converts smart quotes to html entities
  $search = array(chr(145), chr(146), chr(147), chr(148), chr(151));
  $replace = array('&lsquo;', '&rsquo;', '&ldquo;', '&rdquo;', '-');

  return str_replace($search, $replace, $string);
 }
?>

Open in new window

0
 
themrrobertCommented:
@DaveBaldwin That won't work with smart quotes, and his problem is with smart quotes.
0
 
Dave BaldwinFixer of ProblemsCommented:
Here are the HTML entity codes http://www.alanwood.net/demos/ent4_frame.html for the right and left quotes. And here's the page on PHP 'htmlentities': http://us2.php.net/manual/en/function.htmlentities.php
0
 
TheFuteballerAuthor Commented:
@DaveBaldwin, thanks for your suggestions but as themrrobert mentioned, smart quotes aren't processed by php's htmlentity() function. It's a solution I've tried already with no success.
0
 
themrrobertCommented:
@DaveBaldwin I already gave him the HTML entities quit trying to steal my points =p

If you have a valid suggestion, no problem, but it looks like your trying to claim points when you don't even know the situation
0
 
Dave BaldwinFixer of ProblemsCommented:
I misread the question the first.  After research, your solutions are the correct ones.
0
 
themrrobertCommented:
@author / TheFuteballer:

You should have all you need to make your script work now.

Be sure that AFTER you run my convert_smart_quotes() function, you run addslashes to make sure that REGULAR quotes won't mess up your SQL commands.

My solution does exactly what you wanted. If it doesn't work , then you don't have mbstring installed, and you will need to fix that (but thats only if you have an old install)
0
 
themrrobertCommented:
Thank you :)
0
 
TheFuteballerAuthor Commented:
@ themrrobert,

Unfortunately your solution did not work. I tried multiple chr values as well as none of them are converting the charactering. I've even tried chr(ord('’')) to give me the correct ascii value if I'm getting it wrong and it's still not being replaced.
0
 
rjdownCommented:
No need to convert, simply ensure your page is capable of handling such characters by putting this in your <head> section:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

Open in new window

0
 
rjdownCommented:
Just out of curiosity...

“test”
0
 
TheFuteballerAuthor Commented:
@rjdown, I'm beginning to think it has something to do with charset issues as well however I already have that declaration in all my pages headers!
0
 
rjdownConnect With a Mentor Commented:
What's your database field's character set and collation set to?

Try utf8 for the character set and utf8_unicode_ci for the collation.
0
 
TheFuteballerAuthor Commented:
It was at utf8_general_ci and I just changed it to unicode and still displays the same thing... in my database however I don't actually have the htmlentity being stored, it's still storing the actual character itself.
0
 
hexer4uCommented:
you might want to see the encoding of the file in which data will be displayed. make sure the encodings are OK.
user notepad++ if you don't have other options. Open the file in notepad++ and go to "Encoding". Usually, even if the data is stored OK in the DB, the file encoding is the problem.
Make sure that the encoding of the file that writes into DB is the same as the one who loads and views. Oh, and don't use BOM
0
 
TheFuteballerAuthor Commented:
@hexer4u,

Thanks for your response, this might be the issue. Do you know of anyway to check file encoding within OS X?

Also what is BOM?
0
 
TheFuteballerAuthor Commented:
@hexer4u,

Okay found out the files are encoded in us-ascii by using "file -I {filename}" command in terminal. What's the best way to go about re-encoding the files to UTF-8 (and how do I avoid BOM).

Could this be the cause of the problem?
0
 
rjdownCommented:
This shouldn't affect web documents - even if it did you would see the BOM displayed at the beginning if you viewed the page source from the browser. It would appear like this: 

I just created a simple page with a textbox, the data from which is entered directly into a database field with the character encoding set as suggested above, then retrieved and displayed directly on another simple HTML page with the META tag above. It all seemed to work fine :( Perhaps you could try the same, to try and eliminate unseen issues.
0
 
TheFuteballerAuthor Commented:
Just checked on my windows machine with notepad++ and it's on ANSI encoding according to it. I just changed it to "convert to UTF-8" (the  UTF-8 no-BOM selection still kept the encoding as ANSI for some reason) but now it's broken the layout of my page on Safari...

Could it be my server that has an encoding issue? How can I check this?
0
 
rjdownCommented:
There is no server setting that affects the encoding - that is handled by the browser based on META tags.

How are you inserting your data into the database? I assume you're avoiding SQL injection attacks by using mysql_real_escape_string? Note that mysql_escape_string (note the missing _real_) does not preserve the character set during transport so that could be one cause.

Also, how are you displaying your data? Are you escaping it to prevent XSS exploits? Some functions (e.g. htmlentities) will break UTF strings, while others (e.g. htmlspecialchars) will not...
0
 
TheFuteballerAuthor Commented:
I'm using Codeigniter's active record class, it performs all the XSS filtering and SQL safe insertion. This isn't a codeigniter issue though as no one else that I've been able to find is having this issue (or has documented it anyways).
0
 
hexer4uCommented:
when you save a file in an encoding, if the file doesn't contain any characters that require that encoding it will revert to a less memory consuming one.
this means that if you use utf8 encoding and you don't have any characters that require it, it will save them as ANSI as it uses less memory for storing. Just write a php comment that contains a UTF8 character, convert and save. It should keept the format.

Before any encoding change, make a backup as some conversions might mess up your characters and thus mess up some code
0
 
TheFuteballerAuthor Commented:
So I managed to fix the issue through a combination of things. I first ran the script below to make sure I converted each and every field in the database to UTF8:

[code]<?php
// Script written by Vladislav "FractalizeR" Rastrusny
// http://www.fractalizer.ru

//MySQL connection settings
$db_server = 'localhost';
$db_user="user";
$db_password="pass";

mysql_connect($db_server, $db_user, $db_password) or die(mysql_error());

//Put here a list of databases you need to change charset at or leave array empty to change all existing
$dblist=array();

//If changing at all databases, which databases to skip? information_schema is mysql system databse and no need to change charset on it.
$skip_db_list = array('information_schema', 'mysql');

//Which charset to convert to?
$charset="utf8";

//Which collation to convert to?
$collation="utf8_unicode_ci";

//Only print queries without execution?
$printonly=false;

//Getting database names if they are not specified
$skip_db_text = '"'.implode('", "', $skip_db_list).'"';
if(count($dblist)<1) {
    $sql="SELECT GROUP_CONCAT(`SCHEMA_NAME` SEPARATOR ',') AS FRST FROM `information_schema`.`SCHEMATA` WHERE `SCHEMA_NAME` NOT IN ($skip_db_text)";
    $result = mysql_query($sql) or die(mysql_error());
    $data = mysql_fetch_assoc ($result);
    $dblist=explode(",", $data["FRST"]);
}

//Iterating databases
foreach ($dblist as $dbname) {
    $sql="SELECT CONCAT('ALTER TABLE `', t.`TABLE_SCHEMA`, '`.`', t.`TABLE_NAME`, '` CONVERT TO CHARACTER SET $charset COLLATE $collation;') as FRST FROM `information_schema`.`TABLES` t WHERE t.`TABLE_SCHEMA` = '$dbname' ORDER BY 1";

    $result = mysql_query($sql) or die(mysql_error());
    while ($row = mysql_fetch_assoc($result)) {
        echo $row["FRST"]."\r\n";
        if(!$printonly) {
            mysql_query($row["FRST"]) or die(mysql_error());
        }
    }
}
?> [/code]

Then I went through and anywhere I was inserting anything into the database I made sure I added
[code]mysql_query("SET NAMES 'UTF8'");  [/code]
before I inserted or updated anything in the database

and finally I added htmlentities to all my posted data
[code]htmlentities($input, ENT_QUOTES, "UTF-8", FALSE),[/code]

And then my problem magically went away!
0
 
TheFuteballerAuthor Commented:
Suggestions didn't actually fix my issues but helped me go pinpoint possible issues of where it was coming from
0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.