elepil

PHP encoding/decoding problem

Assume my web page to be something like this:

<?php
    // Assume that the 'firstname' value from the session is the name O'Malley
    $firstname = $_SESSION['firstname'];
?>

<html>
    <input id="txtFirstName" value="<?php echo $firstname; ?>" />
</html>


The above will work just fine by displaying O'Malley in the text field.

But if the first name value from the session were O"Malley (NOTE, it's a double-quote this time), it will just display "O". I understand why this is happening because the HTML now will look like this after it is rendered:

<html>
    <input id="txtFirstName" value="O"Malley" />
</html>


I need to be able to display quotation marks in fields if the user chooses to input such. Maybe the user wants to enter O'Malley "The Man" as the first name value, it will mess up when displayed.

Can anyone show me how to make this work please?

Thanks.
Ray Paseur

Understanding how PHP handles quotes may help give you a foundation.  I know this is a lot to read and understand, but it is what we deal with every day in web development.
https://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/A_12241-Quotation-Marks-in-PHP.html

External input should be "kept" in the original format.  When your PHP scripts echo the original input to the browser, we typically choose to use htmlentities() so that we do not send toxic JavaScript to our community.  Bad guys often post bad things to "open relay" forums, and we want to prevent that from hurting our online communities.
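
Applied to the snippet in the question, escaping at output time might look like the sketch below. The helper name esc_attr_value() is hypothetical, the hard-coded $firstname stands in for the session value, and UTF-8 is assumed as the page encoding. htmlspecialchars() is used here; htmlentities() would behave the same way for the quote characters.

```php
<?php
// Hypothetical helper (not part of the original thread): escape a value
// before placing it inside a quoted HTML attribute.
function esc_attr_value(string $value): string
{
    // ENT_QUOTES converts both " and ' so the value cannot end the attribute early.
    return htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
}

// Stand-in for $_SESSION['firstname']; the stored value itself is left untouched.
$firstname = 'O"Malley';

echo '<input id="txtFirstName" value="' . esc_attr_value($firstname) . '" />', "\n";
```

The browser turns the entity back into a double quote when it parses the attribute, so the text field shows O"Malley exactly as entered.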

A smart way of dealing with external oddities like double quotes in place of apostrophes might be to remove the inappropriate double quotes and replace them with apostrophes.  In my experience this is an "edge case" that has little value in the overall design of the application.  It's all about choosing what constitutes acceptable external input.

If you want some filtering and sanitization examples, please post a list of the external inputs and the corresponding acceptable values.  I'll be glad to show you how the external data can be made safe for use in your PHP scripts.
elepil

ASKER

Ray, thanks for responding.

I am more interested in preserving the fidelity of my users' inputs rather than substituting quotation marks for apostrophes.

Based on your response, I assume you are saying there is no "to-the-point" solution to this kind of problem and would recommend work-arounds (e.g. substituting single quotes for double quotes)? Because I'm sure this problem has vexed developers for decades, and by now, I'm assuming a solution would've been found.
Dave Baldwin
The 'solution' is to process all input and output as needed.  In a previous question, I mentioned Percent Encoding  https://en.wikipedia.org/wiki/Percent-encoding .  It is used as part of urlencoding/decoding when you send information as a GET in a URL or in POST information.  Browsers already do this automatically and transparently to "preserve the fidelity of your user's inputs".
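
As a rough illustration of what percent encoding does to the name in the question, here is a small sketch using PHP's own URL functions. Note that this encoding is meant for URLs and form transport; it is shown only to make the mechanism visible, not as the way to escape an HTML attribute.

```php
<?php
$name = 'O"Malley';

// rawurlencode() follows RFC 3986: the double quote becomes %22.
$encoded = rawurlencode($name);
$decoded = rawurldecode($encoded);

echo $encoded, "\n";   // the percent-encoded form
echo $decoded, "\n";   // the original text again

// urlencode() is the form-style variant: a space becomes + instead of %20.
echo rawurlencode('Foo Bar'), ' vs ', urlencode('Foo Bar'), "\n";
```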

This page http://www.w3schools.com/tags/ref_urlencode.asp shows the percent encoding for both Windows-1252 and UTF-8.  Note that it does not show anything less than a space character.  Things like tabs, bare newlines and carriage returns will not display on their own in an HTML page.  They are simply not recognized by the browsers.

HTML was created as a fairly simple way of displaying information.  If you really need all characters to be recognized, then you need something other than HTML.
I should also note one of the more irritating problems: users copy and paste from a Word document with 'smart quotes', the ones that slant in different directions.  'Smart quotes' are not ASCII characters; in the Windows-1252 character set they sit above character code 127.  If your web page is declared as UTF-8 and the text arrives as Windows-1252 bytes, they won't display properly even though you have 'delivered' them to the web page.  There are a number of other characters that also cause problems.

I use this page http://www.alanwood.net/ as a reference for those sorts of problems.
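
One possible way to deal with Windows-1252 'smart quotes' on a UTF-8 page is to convert the bytes before output. The sketch below assumes the iconv extension is available and that any input which is not valid UTF-8 really is Windows-1252; the function name to_utf8() is made up for this example.

```php
<?php
// Hypothetical helper: return the text as UTF-8, converting from Windows-1252
// only when the bytes are not already valid UTF-8.
function to_utf8(string $bytes): string
{
    // preg_match() with the /u modifier fails on invalid UTF-8 input.
    if (preg_match('//u', $bytes) === 1) {
        return $bytes;
    }
    // Assumption: anything that is not valid UTF-8 is Windows-1252.
    return iconv('Windows-1252', 'UTF-8', $bytes);
}

// Bytes 0x93 and 0x94 are the curly double quotes in Windows-1252.
$fromWord = "\x93The Man\x94";

echo to_utf8($fromWord), "\n";
```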
elepil

ASKER

Dave, thanks for responding.

I made sure my code snippet is as simple and concise as it can be to show my issue. Is there a way you can just modify it to show me the functions you would use?
Part 1.

Your code sample is actually too simple because it allows the user's browser to decide what the character set is.  Yes, all browsers do that.  There is always a character set in use.  In Firefox, you can click on View->Text Encoding to see the current character set for a page.  If you want to use a different character set for your page, you must declare it in your page.

Working in conjunction with character sets are fonts.  Fonts only support a limited number of the characters in a character set.  Unicode/UTF-8 defines over 110,000 characters.  http://www.alanwood.net/unicode/index.html

In addition to the character set and fonts or maybe beside them are the 252 HTML 4.01 Character Entity References http://www.alanwood.net/demos/ent4_frame.html .  They can be referred to by number or name like diamond suit (&#9830; or &diams;).  If you look on that page, you will see about 11 different kinds of quote marks.  
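
To see that a named reference and a numeric reference stand for the same character, something like the following sketch can be run. It assumes UTF-8 output and uses the diamond suit from the paragraph above plus the double quote from the original question.

```php
<?php
// The named and numeric references decode to the same character (U+2666).
$named   = html_entity_decode('&diams;', ENT_QUOTES | ENT_HTML401, 'UTF-8');
$numeric = html_entity_decode('&#9830;', ENT_QUOTES | ENT_HTML401, 'UTF-8');
var_dump($named === $numeric);

// The same holds for the plain double quote.
$quoteNamed   = html_entity_decode('&quot;', ENT_QUOTES | ENT_HTML401, 'UTF-8');
$quoteNumeric = html_entity_decode('&#34;', ENT_QUOTES | ENT_HTML401, 'UTF-8');
var_dump($quoteNamed === $quoteNumeric);
```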

This page gives a little bit of history about HTML character sets: http://www.w3schools.com/charsets/default.asp
Part 2.

I'm writing two scripts, one is POST and the other uses GET so you can see what happens.  I recommend getting the Live HTTPHeaders addon for Firefox so you can see what the browser is actually sending to the server.  https://addons.mozilla.org/en-us/firefox/addon/live-http-headers/
Part 3.

Here are the first versions of the scripts.  You can use Live HTTPHeaders to see what Firefox actually sends to the server.  You can see that in both versions, POST and GET, some characters are Percent encoded.  You can also do 'View Source' to see what is being returned to the browser.  And of course you can edit it to see what happens.

PHP-CharTestPost.php
<?php
error_reporting(E_ALL);
ini_set('display_errors','1');

# some settings of POST vars
if (!isset($_POST['submit']))  $submit = ''; else $submit = $_POST['submit'];
if (!isset($_POST['msgText'])) $msgText = ''; else $msgText = $_POST['msgText'];

if ($submit == "") {
    $title="Test Page";
    $announce="---";
}
else {
    $title="Character Test Page";
    $announce="Your text is below!";
}
?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
 "http://www.w3.org/TR/html4/loose.dtd">

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<title><?php echo($title)?></title>
</head>

<body bgcolor="#ddeedd">
<div align="center">
<table border="0" cellpadding="0" cellspacing="0" summary="" width="580">
<tr><td align="center">

<p><b><font color="#000000" size="5"><?php echo($title)?></font></b></p>

<form method="POST" action="#">
    <p><b>Message Text:</b> <input type="text" name="msgText" value="<?php echo $msgText;?>" /></p>
    <p><font size="3"><input type="submit" value="submit" name="submit" style="font-family: Arial; font-size: 12pt; font-weight: bold"></font></p>
  </form>
  <b><font face="Arial" size="4" color="#e00000"><?php echo($announce)?></font></b><br><br>

</td></tr>
</table> 
<table border="1" cellpadding="2" cellspacing="0" summary="">
<tr><td><p>Plain Text:</p></td><td><p><?php echo $msgText;?></p></td><td>&nbsp;</td></tr>
<tr><td><p>Using htmlspecialchars():</p></td><td><p><?php echo htmlspecialchars($msgText);?></p><input type="text" name="msgText" value="<?php echo htmlspecialchars($msgText);?>" /></td>
<td><a href="http://php.net/manual/en/function.htmlspecialchars.php">http://php.net/manual/en/function.htmlspecialchars.php</a>
</td></tr>
<tr><td><p>Using htmlentities():</p></td><td><p><?php echo htmlentities($msgText);?></p>
<input type="text" name="msgText" value="<?php echo htmlentities($msgText);?>" /></td>
<td><a href="http://php.net/manual/en/function.htmlentities.php">http://php.net/manual/en/function.htmlentities.php</a></td></tr>
</table>
<p>Note that the pages on both functions refer to the character set you want to use.</p>

</div>

</body>
</html>


PHP-CharTestGet.php
<?php
error_reporting(E_ALL);
ini_set('display_errors','1');

# some settings of GET vars
if (!isset($_GET['submit']))  $submit = ''; else $submit = $_GET['submit'];
if (!isset($_GET['msgText'])) $msgText = ''; else $msgText = $_GET['msgText'];

if ($submit == "") {
    $title="Test Page";
    $announce="---";
}
else {
    $title="Character Test Page";
    $announce="Your text is below!";
}
?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
 "http://www.w3.org/TR/html4/loose.dtd">

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<title><?php echo($title)?></title>
</head>

<body bgcolor="#ddeedd">
<div align="center">
<table border="0" cellpadding="0" cellspacing="0" summary="" width="580">
<tr><td align="center">

<p><b><font color="#000000" size="5"><?php echo($title)?></font></b></p>

<form method="GET" action="#">
    <p><b>Message Text:</b> <input type="text" name="msgText" value="<?php echo $msgText;?>" /></p>
    <p><font size="3"><input type="submit" value="submit" name="submit" style="font-family: Arial; font-size: 12pt; font-weight: bold"></font></p>
  </form>
  <b><font face="Arial" size="4" color="#e00000"><?php echo($announce)?></font></b><br><br>

</td></tr>
</table> 
<table border="1" cellpadding="2" cellspacing="0" summary="">
<tr><td><p>Plain Text:</p></td><td><p><?php echo $msgText;?></p></td><td>&nbsp;</td></tr>
<tr><td><p>Using htmlspecialchars():</p></td><td><p><?php echo htmlspecialchars($msgText);?></p><input type="text" name="msgText" value="<?php echo htmlspecialchars($msgText);?>" /></td>
<td><a href="http://php.net/manual/en/function.htmlspecialchars.php">http://php.net/manual/en/function.htmlspecialchars.php</a>
</td></tr>
<tr><td><p>Using htmlentities():</p></td><td><p><?php echo htmlentities($msgText);?></p>
<input type="text" name="msgText" value="<?php echo htmlentities($msgText);?>" /></td>
<td><a href="http://php.net/manual/en/function.htmlentities.php">http://php.net/manual/en/function.htmlentities.php</a></td></tr>
</table>
<p>Note that the pages on both functions refer to the character set you want to use.</p>

</div>

</body>
</html>


There are some other ideas related to character sets in this article.  Since PHP has changed its posture on character encoding at V5.4, this is something we all need to understand.
https://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/A_11880-Unicode-PHP-and-Character-Collisions.html
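
The default encoding of htmlentities() changed from ISO-8859-1 to UTF-8 in PHP 5.4, which is one reason to pass the encoding explicitly. The sketch below is only an illustration of what can go wrong when the declared encoding does not match the bytes; the sample string is invented.

```php
<?php
// "café" as ISO-8859-1 bytes: the final byte 0xE9 is not valid UTF-8 on its own.
$latin1 = "caf\xE9";

// Declared correctly, the byte is recognised and converted to a named entity.
$ok = htmlentities($latin1, ENT_QUOTES, 'ISO-8859-1');

// Declared as UTF-8, the input is an invalid sequence and an empty string comes
// back (the ENT_SUBSTITUTE or ENT_IGNORE flags would change that behaviour).
$mismatch = htmlentities($latin1, ENT_QUOTES, 'UTF-8');

var_dump($ok, $mismatch);
```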
"preserving the fidelity of my users' inputs"
That is NOT something you want to do.  Please read this link, including the user-contributed notes, then come back to the rest of this comment.
http://php.net/manual/en/language.variables.external.php

For at least a decade our security mantra has been Filter Input, Escape Output.  Let me try to explain how that works in practice.

On the input side of things, you'll receive some sort of external input from some external source (HTTP Cookies, client input from an HTML form, etc).  These external inputs are, by definition, tainted because you do not know what they contain.  It could be the data you want, or useless gibberish, or toxic JavaScript.  Whatever it is, you want to filter it before you use it for any purpose in your PHP script, other than the single purpose of storing it verbatim as a record of the input you received.

Inside your scripts you use the filtered data, and you never make reference to the external data again.  It is tainted - don't touch it.
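
A minimal sketch of the "filter input" half, assuming a made-up acceptance rule for a first name (non-empty, at most 100 bytes, no control characters). The function name and the rule are illustrative only; quotation marks are deliberately allowed through, because they are dealt with later when the value is escaped for output.

```php
<?php
// Hypothetical filter: returns the accepted value, or null when the input is rejected.
function filter_first_name(string $raw): ?string
{
    $name = trim($raw);

    // Reject empty values, overly long values, and ASCII control characters.
    if ($name === '' || strlen($name) > 100 || preg_match('/[\x00-\x1F\x7F]/', $name) === 1) {
        return null;
    }
    return $name;
}

var_dump(filter_first_name(' O"Malley '));   // accepted, surrounding spaces trimmed
var_dump(filter_first_name("bad\x00name"));  // rejected

// PHP's filter extension covers common cases such as integers.
var_dump(filter_var('42', FILTER_VALIDATE_INT));
var_dump(filter_var('forty-two', FILTER_VALIDATE_INT));
```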

On the output side of things, you would never want to send unescaped output to a client.  Again, the unescaped data may be the data you want, or useless gibberish, or toxic JavaScript.  Unless you created the data and / or filtered the data, you don't know what it contains.  And it may contain markup that can affect the behavior of the client's browser.  So to avoid adverse effects on the client browser, we use something like htmlentities() to make the data safe before we echo the data to the client browser.  This is called "escaping" the output data.

When you create a simple test that takes input from one client and regurgitates it to the same client, you can see the character-by-character behavior as the data is received, stored and returned.  But this test is too simple to stand up in the real world.  In the real world, data from one client will be shown to other clients and the levels of trust between the clients and your web site are vitally important.  That is why we filter input and escape output -- to keep and secure these levels of trust.

The information that is shown to the client can take more than one form.  Let's look at the left-wicket character: <  Used all by itself, it can mean "less than" or it can have the meta-meaning that starts an HTML tag or JavaScript string.  The context and surrounding data determine its meaning or meta-meaning.  If you echo a left-wicket character to the client browser you're at risk that the meta-meaning may be invoked by the browser and it may set off a cascade of unpleasant JavaScript events.  So instead of accepting that risk, we escape the left-wicket character and it is turned from a single character into this string (spaces inserted for readability): & l t ;

The & l t ; is the character entity for the left-wicket.  It will be displayed by the browser as a left-wicket: < but it will never be misinterpreted as the meta-character, so it cannot damage browser display or trigger an attack with toxic JavaScript.
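
The same idea in a short sketch, using an invented piece of hostile input and assuming UTF-8. The point is only that the wickets arrive at the browser as entities, so they are displayed rather than interpreted.

```php
<?php
// Invented example of input that contains markup.
$comment = '1 < 2 <script>alert(1)</script>';

// Each < becomes &lt; and each > becomes &gt; before the text reaches the browser.
$escaped = htmlspecialchars($comment, ENT_QUOTES, 'UTF-8');

echo '<p>', $escaped, '</p>', "\n";
```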
elepil

ASKER

To Dave and Ray, I appreciate your responses.

I believe though that both of you are making my issue more complex than it should be. Dave, note that I am making no form submission, which your examples somehow involve. I explicitly demonstrated in my code snippet how I am getting the value from the SESSION, and then embedding short PHP scriptlets within the HTML. This much I know for a fact: if I had used AJAX to pull in the user information again and used jQuery/JavaScript to populate the HTML text fields, it would work without a problem. But AJAX involves a server call AFTER the page has been rendered; that's why it works.

My issue is a bit different. By using scriptlets, PHP will now try to render every single scriptlet on the server before sending the page back to the browser. So when it sees <input value="<?php echo $firstname;?>">, it will plug <input value="O"Malley"> into the page. What I was hoping was that I could apply a function to the data (e.g. rawurlencode()) as it is pulled out of the session, and then on the scriptlet, there would be a function to decode it correctly.

And Ray, I don't think I will debate you on why it's important to preserve the fidelity of my users' inputs. If I provide a notes field, and they want to put in "This O'Malley guy is an <HTML> fanatic", imagine what would happen if it became something else after he saves it. If I were writing forum software like the one used at EE, it would be important to preserve what the user inputs because they typed in their text for a reason. If they typed in a name as John "The Hammer" O'Malley, I as a developer feel compelled to display the text the way he typed it in the first place.

In short, all I was hoping for by posting on this forum is a simple answer to the simple question I presented. I get a little overwhelmed when I receive link after link of lengthy 'haystack' tutorials, general comments without specifics, or attempts to challenge my UI philosophy, all of which does not help me one bit.
ASKER CERTIFIED SOLUTION
Dave Baldwin

This solution is only available to members.
We still seem to be miles apart in understanding how characters get encoded for HTTP transfer, stored in SQL databases, etc.  Let's look at this:
If they typed in a name as John "The Hammer" O'Malley, I as a developer feel compelled to display the text the way he typed it in the first place.
The key word here is "display."  Your scripts should receive external input and store it unalloyed in the database.  In order to do this, you must escape the character strings before putting them into the database.  When you get them out of the database, you must prepare them for browser display with htmlentities() or similar.  Browsers display strings with special attention to two types of data: metacharacters and character entities.  Metacharacters have special meanings that affect the behavior of the browser.  Metacharacters include quotes, wickets, and things like that.  Character entities are used to represent metacharacters in such a way that these characters do not have any effect on the browser's behavior.
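
To make the point about "display" concrete, here is a small round-trip sketch. The stored value is hard-coded rather than read from a database, UTF-8 is assumed, and html_entity_decode() is used only as a rough stand-in for what the browser does when it parses the entities.

```php
<?php
// Stand-in for a value read back from the database, stored exactly as typed.
$stored = 'John "The Hammer" O\'Malley';

// What the script sends to the browser (visible with "view source").
$html = htmlspecialchars($stored, ENT_QUOTES, 'UTF-8');

// Approximation of what the browser shows after parsing the entities.
$seen = html_entity_decode($html, ENT_QUOTES, 'UTF-8');

echo $html, "\n";
var_dump($seen === $stored);
```

The escaped form exists only on the wire; the stored value and the displayed text are identical, so the user's input keeps its fidelity.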

Most browsers have a "view source" feature.  This is your friend.  If you believe what you see on the screen, you're only getting a fragment of the story.  You will only see what the browser wants you to see, not the data that caused the browser to create the visible display.

You think this is simple, but it's not simple, and it's got years of research and security development behind it.  We can show you the best practices (and we are trying to do that) but if you do not read the linked articles, carefully, for comprehension, you will continue to oversimplify and misunderstand the central issues that create havoc for the users of web sites that do not understand the risks and remedies.  Or as the great firefighter Red Adair said, "If you think it's expensive to hire a professional, just wait till you hire an amateur!"

I'll try to put together a script that shows you the moving parts of data-in and data-out, but please try to read and understand the background information we are holding out for you.  You can look at an apple pie and appreciate many of its qualities, but you cannot learn to bake an apple pie by looking at an apple pie.  You cannot learn the (decade old) theory of Filter Input Escape Output security by looking at our code examples, either.  You gotta know the "why."
SOLUTION
This solution is only available to members.
Here is the process we use to make external data safe for display in the browser.  It is worth reading the man page references; some of the behaviors are release-dependent.
http://iconoun.com/demo/temp_elepil_2.php
<?php // demo/temp_elepil_2.php

/**
 * http://www.experts-exchange.com/questions/28711795/PHP-encoding-decoding-problem.html#a40970122
 *
 * Using external request data in an HTML display document that is produced by a PHP script
 *
 * Man Page References:
 * http://php.net/manual/en/function.htmlentities.php
 * http://php.net/manual/en/ini.core.php#ini.default-charset
 *
 * Note: You CANNOT trust the browser display, neither in the address bar URL, nor in the body
 *       of the document you see in the viewport.  You MUST use "view source" to know what the
 *       browser is using to create the viewport display you see.
 *
 * Test with these, or just type in your own tests:
 * http://iconoun.com/demo/temp_elepil_2.php
 * http://iconoun.com/demo/temp_elepil_2.php?q=Hello
 * http://iconoun.com/demo/temp_elepil_2.php?q=Foo+Bar
 * http://iconoun.com/demo/temp_elepil_2.php?q=Foo%20Bar
 * http://iconoun.com/demo/temp_elepil_2.php?q=O%27Malley
 * http://iconoun.com/demo/temp_elepil_2.php?q=O%22Malley
 * http://iconoun.com/demo/temp_elepil_2.php?q=%22O%22Malley%22
 * http://iconoun.com/demo/temp_elepil_2.php?q=%22O%27Malley%22
 */
error_reporting(E_ALL);


// SET THE INTERNAL VARIABLE TO AN EXACT COPY OF THE EXTERNAL REQUEST VARIABLE
$q
= !empty($_GET['q'])
? $_GET['q']
: ''
;

// MAKE THE VARIABLE SAFE FOR BROWSER DISPLAY
$safe_for_display_q = htmlentities($q);


// PRODUCE THE HTML DOCUMENT
$htm = <<<EOD
<p>HERE IS THE SAFE VERSION OF THE REQUEST VARIABLE: <b><i>$safe_for_display_q</i></b></p>
<form>
<input name="q" value="$safe_for_display_q" />
<input type="submit" />
</form>
EOD;


echo $htm;


Here is the process we use to make external data safe for use in a query string.  Add your own DB connection information and you can run it to see the effects on the external data.
http://iconoun.com/demo/temp_elepil_3.php
<?php // demo/temp_elepil_3.php

/**
 * http://www.experts-exchange.com/questions/28711795/PHP-encoding-decoding-problem.html#a40970122
 *
 * Storing external request data in a database
 *
 * Man Page References:
 * http://php.net/manual/en/book.mysqli.php
 *
 * Test with these, or just type in your own tests:
 * http://iconoun.com/demo/temp_elepil_3.php
 * http://iconoun.com/demo/temp_elepil_3.php?q=Hello
 * http://iconoun.com/demo/temp_elepil_3.php?q=Foo+Bar
 * http://iconoun.com/demo/temp_elepil_3.php?q=Foo%20Bar
 * http://iconoun.com/demo/temp_elepil_3.php?q=O%27Malley
 * http://iconoun.com/demo/temp_elepil_3.php?q=O%22Malley
 * http://iconoun.com/demo/temp_elepil_3.php?q=%22O%22Malley%22
 * http://iconoun.com/demo/temp_elepil_3.php?q=%22O%27Malley%22
 */
error_reporting(E_ALL);


// DATABASE CONNECTION AND SELECTION VARIABLES - GET THESE FROM YOUR HOSTING COMPANY
$db_host = "localhost"; // PROBABLY THIS IS OK
$db_name = "??";
$db_user = "??";
$db_word = "??";

// OPEN A CONNECTION TO THE DATA BASE SERVER AND SELECT THE DB
$mysqli = new mysqli($db_host, $db_user, $db_word, $db_name);

// DID THE CONNECT/SELECT WORK OR FAIL?
if ($mysqli->connect_errno)
{
    $err
    = "CONNECT FAIL: "
    . $mysqli->connect_errno
    . ' '
    . $mysqli->connect_error
    ;
    trigger_error($err, E_USER_ERROR);
}

// ACTIVATE THIS TO SHOW WHAT THE DB CONNECTION OBJECT LOOKS LIKE
// var_dump($mysqli);


// SET THE INTERNAL VARIABLE TO AN EXACT COPY OF THE EXTERNAL REQUEST VARIABLE
$q
= !empty($_GET['q'])
? $_GET['q']
: ''
;

// MAKE THE VARIABLE SAFE FOR BROWSER DISPLAY
$safe_for_display_q = htmlentities($q);

// MAKE THE VARIABLE SAFE FOR USE IN THE DATABASE
$safe_for_database_q = $mysqli->real_escape_string($q);

// MAKE A DISPLAY-SAFE VERSION OF THE ESCAPED VARIABLE
$safe_for_display_safe_for_database_q = htmlentities($safe_for_database_q);


// PRODUCE THE HTML DOCUMENT
$htm = <<<EOD
<p>HERE IS THE DBMS-SAFE VERSION OF THE REQUEST VARIABLE: <b><i>$safe_for_display_safe_for_database_q</i></b></p>
<form>
<input name="q" value="$safe_for_display_q" />
<input type="submit" />
</form>
EOD;

echo $htm;


HTH, ~Ray
elepil

ASKER

The problem I was having was that the two operations I was doing, extracting the data from the session and trying to display this extracted data via scriptlets, are both being handled at the same time by the PHP interpreter on the server. While the page was being rendered, it was impossible to insert anything containing double quotes into the "value=" attribute.

So what I wound up doing was reading the record from the database again and passing it to the page AFTER the page has already been rendered.

Thanks for the help.
You're welcome, glad you got it worked out.