Pete C asked:

comparing strings that contain line breaks

Hi - I am trying to compare two strings that should be identical; the only difference is that I retrieve one from the database and the other from a text control.  They look the same when they are output to the screen, and the text within the page source is exactly the same, but they do not evaluate as equal when using '==' or PHP's strcmp() function.

Any thoughts?

Thanks,
Pete


i.e.
string_fromdatabase != string_fromcontrol (i.e. post array)

string:
123 Springfield Avenue
NY, NY
Ray Paseur

You might consider normalizing the whitespace.  See Example #4.  Next trim() the strings.  Then try the comparison again, and if that doesn't cure the problem, we need to see the exact  data from both sources.
Pete C (ASKER)

Thanks Ray, I tried that solution and it converted the carriage return in the control/post data to a space, but it did not do the same to the data that I retrieved from the database.

I thought that perhaps I could place each row of text in an array and then compare the row data with respect to each array.
To do so, I converted the strings to arrays as below.

$string_fromdatabasearray = explode ("\n", $string_fromdatabase);
$string_fromcontrolarray = explode ("\n", $string_fromcontrol);

I then outputted the arrays to the screen, and it shows that only the string from the database is being placed in separate array elements, i.e. the string from the control/POST data is being placed within an array with a single element.  I think that this shows that the line break from the POST array does not correspond to "\n".

Do you have any idea what it does equate to?  If so, I can use my above approach and compare each array element (unless there is a better approach).

Thanks,
Pete
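
One way to see what a line break really consists of is to make the control characters visible before comparing. The following is only a sketch: the helper name show_invisible() and the two sample values are assumptions for illustration, not data from this thread.

```php
<?php
// Hypothetical helper: make control characters visible so two strings
// that look identical on screen can be compared character by character.
function show_invisible($str)
{
    // Replace each control character with its printable escape notation
    return strtr($str, array("\r" => '\r', "\n" => '\n', "\t" => '\t'));
}

// Assumed sample values: a Unix-style break and a Windows-style break
$from_database = "123 Springfield Avenue\nNY, NY";
$from_control  = "123 Springfield Avenue\r\nNY, NY";

echo show_invisible($from_database), PHP_EOL; // 123 Springfield Avenue\nNY, NY
echo show_invisible($from_control), PHP_EOL;  // 123 Springfield Avenue\r\nNY, NY

// bin2hex() shows the raw bytes: CR is 0d, LF is 0a
echo bin2hex("\r\n"), PHP_EOL;                // 0d0a
```

If the control data shows \r\n where the database value shows \n, then explode("\n", ...) still splits it, but each piece keeps a trailing \r, which is enough to make the comparison fail.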
Pete C (ASKER)

i.e. here was the output from the array:

string_fromdatabasearray:
Array ( [0] => 123 Springfield Avenue [1] => NY, NY  )

$string_fromcontrolarray:
Array ( [0] => 123 Springfield Avenue NY, NY )
This might be a matter of how the end-of-line characters are created and stored.  Anything you send to the MySQL engine will be stored verbatim.  In some environments, the EOL character is \n.  In others it's \r.  In still others (IIRC Windows) it's \r\n.  Normalizing the whitespace and then calling trim() should work for all of them, but you would want to normalize both fields with the same strategy.

If you want to copy / paste both of the sample data elements here in the code snippet, I'll be glad to show you the code that can work for this.
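
As a rough illustration of "the same strategy for both fields", the sketch below converts every line ending to \n before comparing. The helper name normalize_eol() and the sample values are assumptions made for the example; this is not the hidden accepted solution.

```php
<?php
// Hypothetical helper: convert Windows (\r\n) and old Mac (\r) line
// endings to Unix (\n), then trim leading and trailing whitespace.
function normalize_eol($str)
{
    // Order matters: replace the two-character \r\n first so that it
    // does not turn into two separate line breaks.
    return trim(str_replace(array("\r\n", "\r"), "\n", $str));
}

// Assumed sample values
$string_fromdatabase = "123 Springfield Avenue\nNY, NY";
$string_fromcontrol  = "123 Springfield Avenue\r\nNY, NY";

// Apply the same normalization to both sides, then compare strictly
$same = normalize_eol($string_fromdatabase) === normalize_eol($string_fromcontrol);
var_dump($same); // bool(true)
```

Because the line breaks are kept (only their encoding changes), an added or removed line break would still show up as a difference.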
Pete C (ASKER)

Thanks, Ray - that does sound like the issue.

$string_fromdatabase = "123 Springfield Avenue\nNew York, NY";   // double quotes, so \n is a real line break
$string_fromcontrol = '123 Springfield Avenue New York, NY';     // note: the space prior to New York corresponds to a line break

I am trying to determine if the string submitted via the control differs from the string that is retrieved from the database, so consistent with what you mentioned, it seems as though I need to convert the line breaks in each variable to the same thing (e.g. \n or something else) and then compare the strings.

Thanks for your help.
Pete
ASKER CERTIFIED SOLUTION
Ray Paseur

This solution is only available to members.
Sidebar note: PHP has a predefined constant, PHP_EOL, that gives you the correct end-of-line sequence for the operating system PHP is running on.
SOLUTION
Julian Hansen

This solution is only available to members.
I found bin2hex() to be a little limited in its exposition, especially with multi-byte character sets.  FWIW, here is how I solved the problem in this question.  It comes in handy, especially if there is malformed UTF-8, as often happens when someone wants to JSON-encode European characters.
http://iconoun.com/demo/hexdump_unicode_v.php?q=Data:%E7%81%AB%E8%BD%A6%E7%A5%A8!!
<?php // demo/hexdump_unicode_v.php
/**
 * Expand and display a string variable in hexadecimal notation
 * Note: This may not look right without a monospace font!
 * http://php.net/manual/en/function.mb-split.php#99851
 *
 * http://iconoun.com/demo/hexdump_unicode_v.php?q=Data:%E7%81%AB%E8%BD%A6%E7%A5%A8!!
 *
 * Useful: http://www.utf8-chartable.de/unicode-utf8-table.pl?start=1536&number=1024&utf8=0x&unicodeinhtml=hex
 *
 * @param string $str The variable to expand and display
 * @return void (direct browser output)
 */
error_reporting(E_ALL);
mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');


class Letter
{
    public $chr;            // THE ORIGINAL (POSSIBLY MULTI-BYTE) CHARACTER
    public $hex = array();  // THE HEX NIBBLES OF EVERY BYTE IN THE CHARACTER

    public function __construct($chr)
    {
        $this->chr = $chr;
        $bytes     = $this->usplit($chr);
        foreach ($bytes as $byte)
        {
            $this->hex = array_merge($this->hex, $this->gethex($byte));
        }
    }

    public function usplit ($chr)
    {
        // SPLIT THE CHARACTER INTO ITS INDIVIDUAL BYTES
        $arr = array();
        $len = strlen($chr);
        while ($len) {
            $arr[] = substr($chr, 0, 1);
            $chr   = substr($chr, 1, $len);
            $len   = strlen($chr);
        }
        return $arr;
    }

    public function gethex($chr)
    {
        // GET THE HEX NIBBLE VALUES IN AN ARRAY
        $ret = str_split(implode('', unpack('H*', $chr)));
        return $ret;
    }
}

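class Hexdump
{
    public $str;            // THE ORIGINAL STRING
    public $arr;            // THE STRING SPLIT INTO CHARACTERS
    public $len;            // THE LENGTH IN CHARACTERS
    public $dat = array();  // ONE Letter OBJECT PER CHARACTER

    public function __construct($str)
    {
        $this->str = $str;
        $this->arr = $this->mb_str_split($str);
        $this->len = mb_strlen($str);
        foreach ($this->arr as $uchr)
        {
            $this->dat[] = new Letter($uchr);
        }
    }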

    public function mb_str_split($ustr)
    {
        return preg_split('/(?<!^)(?!$)/u', $ustr);
    }

    public function render($br = PHP_EOL)
    {
        foreach ($this->dat as $poz => $chr)
        {
            echo $br;
            echo str_pad($poz, 4, ' ', STR_PAD_LEFT);
            echo ' ';
            echo $chr->chr;
            echo "\t";
            echo implode('', $chr->hex);
        }
        echo $br;
    }
}

// DEMONSTRATE IT WITH THE REQUEST ARGUMENT
echo '<meta charset="utf-8" />';
echo '<pre>';

$q = !empty($_GET['q']) ? $_GET['q'] : 'Vöila';
var_dump($q);

$y = new Hexdump($q);
$y->render();

Pete C (ASKER)

Thanks for the additional thoughts; I have been caught up with a few other things and will get back to this in a few days - I will let you know what my final approach is.  thanks again!
Pete C (ASKER)

Ray - I used a variant of your suggestion and simply replaced each single whitespace character with a space; this seems to eliminate the issue that was being caused by the end-of-line characters.  Thanks again for the feedback!


$rgx
= '#'        // REGEX DELIMITER
. '\s'       // SINGLE WHITESPACE CHARACTERS
. '#'        // REGEX DELIMITER
;

// convert whitespace
$input1 = trim(preg_replace($rgx, ' ', $input1));
$input2 = trim(preg_replace($rgx, ' ', $input2));

// determine if strings are same (=== avoids PHP's loose comparison of numeric-looking strings)
$same = ($input1 === $input2);
Pete C (ASKER)

Hi - not sure if anyone is still looking at this, but I spoke too soon in providing the above solution.  If I use the approach above and replace each single whitespace character (via preg_replace), and the EOL sequence is \r\n (which is the case since I am using a Windows test server), the result does not match data that I load directly into the database (which uses \n for line breaks) because of the additional \r.

If, however, I also replace consecutive whitespace characters (by using \s\s+|\s within the preg_replace function), it does not recognize that a change has been made if, for example, I insert an additional line break within the data displayed within a text control.

I also tried the below statement to replace the EOL with \n prior to saving it, but that did not produce the desired result.
$streetaddress = str_replace(PHP_EOL, "\n", $streetaddress);

Ray - not sure if you are still looking at this, but I see that you provided some additional logic within your above discussion of the bin2hex statement.  Frankly, I did not follow all of it, but is that the approach that I should be using to compare two strings?

Thanks,
Pete
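
The trade-off described in this post can be reproduced with a few lines. This is a sketch with assumed sample values; only the two patterns come from the thread.

```php
<?php
// Assumed sample values
$original = "123 Springfield Avenue\nNY, NY";    // stored value
$edited   = "123 Springfield Avenue\n\nNY, NY";  // user added a blank line

// Collapsing runs of whitespace: both strings end up identical,
// so the extra line break goes unnoticed.
$collapse = '/\s\s+|\s/';
$a = trim(preg_replace($collapse, ' ', $original));
$b = trim(preg_replace($collapse, ' ', $edited));
var_dump($a === $b); // bool(true)

// Replacing each whitespace character individually keeps the count,
// so the added line break is detected...
$single = '/\s/';
$c = trim(preg_replace($single, ' ', $original));
$d = trim(preg_replace($single, ' ', $edited));
var_dump($c === $d); // bool(false)

// ...but \r\n becomes two spaces where \n becomes one, so the same
// text with Windows and Unix line endings no longer matches.
$windows = preg_replace($single, ' ', "a\r\nb");
$unix    = preg_replace($single, ' ', "a\nb");
var_dump($windows === $unix); // bool(false)
```

Neither pattern gives both properties at once, which is why normalizing only the line breaks tends to be the better fit for this comparison.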
Pete C (ASKER)

I made a few changes to my approach and below seems to work fine:

$rgx
= '#'        // REGEX DELIMITER
. '\R'       // LINE BREAKS
. '#'        // REGEX DELIMITER
;

// convert line breaks (double quotes, so "\n" is a real line feed)
$input1 = trim(preg_replace($rgx, "\n", $input1));
$input2 = trim(preg_replace($rgx, "\n", $input2));

// determine if strings are same
$same = ($input1 === $input2);


Thanks again for your help, and please feel free to add any comments if you think there are any issues with this approach.

Pete
Not sure I am understanding the problem, but can you give us a sample of the data - what it looks like and what you want it to look like?
To get rid of the \r you can do something like this: preg_replace('/\r\n/', "\n", $str); but without seeing your data that is only a guess.
Pete C (ASKER)

Hi - thanks, the data contains both \r\n and \n

I had previously tried replacing whitespace by using the below statements but I was not getting the desired result:
preg_replace('/\s/', ' ', $str)
preg_replace('/\s\s+|\s/', ' ', $str)

But, yes, consistent with what you had mentioned, I think that the below approach works fine.
preg_replace('/\R/', "\n", $str) - note, \R matches all variations of line breaks.

Thanks again,
Pete
Not sure you want to do this
preg_replace('/\R/', "\n", $str);

That will create double line breaks
More like
preg_replace('/\r/', '', $str);

Alternatively, look at your import process to make it flexible about \n vs \r\n
Pete C (ASKER)

Hi - thanks, my understanding is that replacing \R (as opposed to \r) with \n will replace all line breaks with a single line break, so it seems like the appropriate approach.

i.e. it replaces \r\n, \r and \n with \n, such that if data is imported from different servers (Mac, Windows, or Unix), the line breaks will be normalized such that two strings can be compared.
You are correct
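
A quick check of the \R behaviour discussed above, as a sketch with an assumed sample string: \R treats \r\n as a single line break, so the replacement does not produce doubled line breaks.

```php
<?php
// One string containing all three line-break styles (assumed sample data)
$mixed = "line1\r\nline2\rline3\nline4";

// Double quotes matter for the replacement: "\n" is a real line feed,
// while '\n' would insert a literal backslash followed by the letter n.
$normalized = preg_replace('/\R/', "\n", $mixed);

var_dump($normalized === "line1\nline2\nline3\nline4"); // bool(true)
var_dump(substr_count($normalized, "\n"));              // int(3)
var_dump(strpos($normalized, "\r"));                    // bool(false)
```

Three line breaks go in and three come out, each as a single \n, so two strings normalized this way can be compared with === regardless of which platform produced them.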