Cleaning up input strings to mySQL

When I copy paste text from a web page that uses smart quotes, etc., it ends up as garbase in my DB, e.g.,

Lehman Channeled Risks Through ‘Alter Ego’ Firm

How do I clean that stuff, covert it to regular quotes, for example? I am using the mysqli_real_escape_string function in my sql inserts and updates, bu tthat's all.
Who is Participating?

[Product update] Infrastructure Analysis Tool is now available with Business Accounts.Learn More

I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

Over the years I've refined this set of replacements to clean up user submitted data.

There is one very important note: I store data in HTML format within the MySQL tables. Why you ask? Because when I retrieve the data 99.99% of the time it's to display on a web page. This way I don't have to worry about always converting it back to HTML valid code.

To call the function I'd have lines like this at the top of my php processing script.

if (isset($_POST['submit']))
  $customer['name'] = clean_post($_POST['customer_name']);
  $customer['address1'] = clean_post($POST['customer_address1']);

function clean_post($string, $br=null)
  $string = trim(htmlspecialchars(stripslashes($string), ENT_QUOTES));
  $string = str_replace('“','"',$string);
  $string = str_replace(chr(226).chr(128).chr(156),'"',$string);
  $string = str_replace('”','"',$string);
  $string = str_replace(chr(226).chr(128).chr(157),'"',$string);
  $string = str_replace(chr(194).chr(178),'"',$string); // mac smart quotes
  $string = str_replace(chr(194).chr(179),'"',$string); // mac smart quotes
  $string = str_replace('‘',"'",$string);
  $string = str_replace(chr(226).chr(128).chr(152),''',$string);
  $string = str_replace('’',"'",$string);
  $string = str_replace(chr(226).chr(128).chr(153),''',$string);
  $string = str_replace(chr(194).chr(185),"'",$string); // mac smart quotes
  $string = str_replace('—',"-",$string);
  $string = str_replace(chr(239).chr(191).chr(189),'-',$string);
  $string = str_replace('…',"...",$string);
  $string = str_replace(chr(226).chr(128).chr(166),'...',$string);
  if ($br != null)
      $string = nl2br($string);
  $string = str_replace("\n",'',$string); 
  $string = str_replace("\r",'',$string); 
  return $string;

Open in new window

Steve BinkCommented:
I just want to point out that Michael701's approach assumes English.  Specifically, it assumes the latin character set.  With the increasing amount of globalization/internationalization, there is more and more data being provided in UTF8 or other international sets.  The example you provided in your question looks like a classic example of misplacing the encoding, such as receiving data as UTF8 and pasting into an application using a regional encoding.

If you're only targeting English-speaking audiences, then Michael701's approach is a good start, but you will consistently find more and more lines to add to the clean_post() function.  A better short term solution would be to use conversion functions to re-encode the data into your application's character set.  The long term solution is to familiarize yourself with internationalization, and create your applications to deal with data that use those encodings.

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
If routinet hadn't said that, I would have.  That is a very expensive operation (the cleaning function) and you will find something you'll have to add making it even more expensive.

Now you very well have a valid reason to use only ASCII - Latin characters, but you still need to filter that input against malicious input.

I happen to have a world wide audience for many of my sites so I can't clean data as you are discussing, but use UTF-8 both for all my web pages but also for the database.  Can't clean into ASCII - Latin.

Why not keep the smart quotes?  They are not malicious input and you'll learn a lot by learning to deal with them.

Just for consideration.
Newly released Acronis True Image 2019

In announcing the release of the 15th Anniversary Edition of Acronis True Image 2019, the company revealed that its artificial intelligence-based anti-ransomware technology – stopped more than 200,000 ransomware attacks on 150,000 customers last year.

It's not always different languages that cause this. Copying data from MS Word documents can cause character conflicts when re-displaying the content on a page not using the same character encoding.

This is something I used a long long time ago, not sure if it works still but worth a try if this is what you are experiencing (not sure where I got this code from sorry).
function ascii_to_entities($str)
      $count   = 1;
      $out  = '';
      $temp = array();

      for ($i = 0, $s = strlen($str); $i < $s; $i++)
         $ordinal = ord($str[$i]);

         if ($ordinal < 128)
            $out .= $str[$i];
            if (count($temp) == 0)
               $count = ($ordinal < 224) ? 2 : 3;

            $temp[] = $ordinal;

            if (count($temp) == $count)
               $number  = ($count == 3) ?
                          (($temp['0'] % 16) * 4096) + (($temp['1'] % 64) * 64) + ($temp['2'] % 64) :
                          (($temp['0'] % 32) * 64) + ($temp['1'] % 64);

               $out .= '&#'.$number.';';
               $count = 1;
               $temp = array();

      return $out;

Open in new window

metalaureateAuthor Commented:
Thanks everyone. Where do I learn what I need to make web work in utf 8?
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
metalaureateAuthor Commented:
Thanks. I'll ask a follow up question.
If you have ASP pages you'll also need:

Option Explicit

and for PHP you need:

setlocale(LC_ALL, 'English_United States.65001');

where English_United States may be one of many language strings and country/region strings combinations.

What's important above id the 65001 codepage is that it specifies UTF-8 and as ASP or PHP is often executed before your meta tag is written to the browser you need to let the browser know character set what to use.

Of course you can also use

Response.AddHeader "Content-Type", "text/html;charset=UTF-8"


header("Content-Type: text/html;charset=utf-8);

if your going to be outputting data to the browser before the meta tag.

Other languages will have similar commands.
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today

From novice to tech pro — start learning today.