Cleaning up input strings to mySQL

When I copy paste text from a web page that uses smart quotes, etc., it ends up as garbase in my DB, e.g.,

Lehman Channeled Risks Through ‘Alter Ego’ Firm

How do I clean that stuff, covert it to regular quotes, for example? I am using the mysqli_real_escape_string function in my sql inserts and updates, bu tthat's all.
metalaureateAsked:
Who is Participating?
 
Steve BinkConnect With a Mentor Commented:
I just want to point out that Michael701's approach assumes English.  Specifically, it assumes the latin character set.  With the increasing amount of globalization/internationalization, there is more and more data being provided in UTF8 or other international sets.  The example you provided in your question looks like a classic example of misplacing the encoding, such as receiving data as UTF8 and pasting into an application using a regional encoding.

If you're only targeting English-speaking audiences, then Michael701's approach is a good start, but you will consistently find more and more lines to add to the clean_post() function.  A better short term solution would be to use conversion functions to re-encode the data into your application's character set.  The long term solution is to familiarize yourself with internationalization, and create your applications to deal with data that use those encodings.
0
 
Michael701Connect With a Mentor Commented:
Over the years I've refined this set of replacements to clean up user submitted data.

There is one very important note: I store data in HTML format within the MySQL tables. Why you ask? Because when I retrieve the data 99.99% of the time it's to display on a web page. This way I don't have to worry about always converting it back to HTML valid code.


To call the function I'd have lines like this at the top of my php processing script.

if (isset($_POST['submit']))
{
  $customer['name'] = clean_post($_POST['customer_name']);
  $customer['address1'] = clean_post($POST['customer_address1']);
....

function clean_post($string, $br=null)
{
  $string = trim(htmlspecialchars(stripslashes($string), ENT_QUOTES));
  $string = str_replace('“','"',$string);
  $string = str_replace(chr(226).chr(128).chr(156),'"',$string);
  $string = str_replace('”','"',$string);
  $string = str_replace(chr(226).chr(128).chr(157),'"',$string);
  $string = str_replace(chr(194).chr(178),'"',$string); // mac smart quotes
  $string = str_replace(chr(194).chr(179),'"',$string); // mac smart quotes
  $string = str_replace('‘',"'",$string);
  $string = str_replace(chr(226).chr(128).chr(152),''',$string);
  $string = str_replace('’',"'",$string);
  $string = str_replace(chr(226).chr(128).chr(153),''',$string);
  $string = str_replace(chr(194).chr(185),"'",$string); // mac smart quotes
  $string = str_replace('—',"-",$string);
  $string = str_replace(chr(239).chr(191).chr(189),'-',$string);
  $string = str_replace('…',"...",$string);
  $string = str_replace(chr(226).chr(128).chr(166),'...',$string);
  if ($br != null)
      $string = nl2br($string);
  $string = str_replace("\n",'',$string); 
  $string = str_replace("\r",'',$string); 
  return $string;
}

Open in new window

0
 
rdivilbissConnect With a Mentor Commented:
If routinet hadn't said that, I would have.  That is a very expensive operation (the cleaning function) and you will find something you'll have to add making it even more expensive.

Now you very well have a valid reason to use only ASCII - Latin characters, but you still need to filter that input against malicious input.

I happen to have a world wide audience for many of my sites so I can't clean data as you are discussing, but use UTF-8 both for all my web pages but also for the database.  Can't clean https://www.webloginproject.com/login-project/Viet/ into ASCII - Latin.

Why not keep the smart quotes?  They are not malicious input and you'll learn a lot by learning to deal with them.

Just for consideration.
0
Free Tool: Port Scanner

Check which ports are open to the outside world. Helps make sure that your firewall rules are working as intended.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

 
KysolConnect With a Mentor Commented:
It's not always different languages that cause this. Copying data from MS Word documents can cause character conflicts when re-displaying the content on a page not using the same character encoding.

This is something I used a long long time ago, not sure if it works still but worth a try if this is what you are experiencing (not sure where I got this code from sorry).
function ascii_to_entities($str)
   {
      $count   = 1;
      $out  = '';
      $temp = array();

      for ($i = 0, $s = strlen($str); $i < $s; $i++)
      {
         $ordinal = ord($str[$i]);

         if ($ordinal < 128)
         {
            $out .= $str[$i];
         }
         else
         {
            if (count($temp) == 0)
            {
               $count = ($ordinal < 224) ? 2 : 3;
            }

            $temp[] = $ordinal;

            if (count($temp) == $count)
            {
               $number  = ($count == 3) ?
                          (($temp['0'] % 16) * 4096) + (($temp['1'] % 64) * 64) + ($temp['2'] % 64) :
                          (($temp['0'] % 32) * 64) + ($temp['1'] % 64);

               $out .= '&#'.$number.';';
               $count = 1;
               $temp = array();
            }
         }
      }

      return $out;
   }

Open in new window

0
 
metalaureateAuthor Commented:
Thanks everyone. Where do I learn what I need to make web work in utf 8?
0
 
KysolCommented:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
0
 
metalaureateAuthor Commented:
Thanks. I'll ask a follow up question.
0
 
rdivilbissCommented:
If you have ASP pages you'll also need:

<%
Option Explicit
Session.CodePage=65001
Response.Charset="UTF-8"
%>

and for PHP you need:

<?PHP
setlocale(LC_ALL, 'English_United States.65001');
?>

where English_United States may be one of many language strings and country/region strings combinations.

http://msdn.microsoft.com/en-us/library/39cwe7zf%28vs.71%29.aspx
http://msdn.microsoft.com/en-us/library/cdax410z%28vs.71%29.aspx

What's important above id the 65001 codepage is that it specifies UTF-8 and as ASP or PHP is often executed before your meta tag is written to the browser you need to let the browser know character set what to use.

Of course you can also use

Response.AddHeader "Content-Type", "text/html;charset=UTF-8"

or

header("Content-Type: text/html;charset=utf-8);

if your going to be outputting data to the browser before the meta tag.

Other languages will have similar commands.

http://www.joelonsoftware.com/articles/Unicode.html
0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.