comparing strings that contain line breaks

Hi - I am trying to compare two strings. They are identical; the only difference is that I am retrieving one from the database and the other from a text control. Although they look the same when output to the screen, and the text in the page source is exactly the same, they do not evaluate as equal when using '==' or PHP's strcmp() function.

Any thoughts?

Thanks,
Pete

i.e.
string_fromdatabase != string_fromcontrol (i.e. post array)

string:
123 Springfield Avenue
NY, NY

Ray Paseur

You might consider normalizing the whitespace. See Example #4. Next trim() the strings. Then try the comparison again, and if that doesn't cure the problem, we need to see the exact data from both sources.

Pete C

ASKER

Thanks Ray, I tried that solution and it converted the carriage return in the control/post data to a space, but it did not do the same to the data that I retrieved from the database.

I thought that perhaps I could place each row of text in an array and then compare the row data with respect to each array.
To do so, I converted the strings to arrays as below.

$string_fromdatabasearray = explode ("\n", $string_fromdatabase);
$string_fromcontrolarray = explode ("\n", $string_fromcontrol);

I then outputted the arrays to the screen, and it shows that only the string from the database is being placed in separate array elements, i.e. the string from the control/POST data is being placed within an array with a single element. I think that this shows that the line break from the POST array does not correspond to "\n".

Do you have any idea what it does equate to? If so, I can use my above approach and compare each array element (unless there is a better approach).

Thanks,
Pete

Pete C

ASKER

i.e. here was the output from the array:

string_fromdatabasearray:
Array ( [0] => 123 Springfield Avenue [1] => NY, NY )

$string_fromcontrolarray:
Array ( [0] => 123 Springfield Avenue NY, NY )
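One way to approach the "what does it equate to?" question is to make the invisible characters visible before comparing. A minimal sketch, assuming two made-up sample values that stand in for the database and POST strings (these are not the actual data from this thread; browsers typically submit textarea line breaks as \r\n):

```php
<?php
// Hypothetical sample values: the database copy uses "\n",
// the form copy uses "\r\n".
$string_fromdatabase = "123 Springfield Avenue\nNY, NY";
$string_fromcontrol  = "123 Springfield Avenue\r\nNY, NY";

// json_encode() escapes control characters, so \r and \n become visible.
echo json_encode($string_fromdatabase), PHP_EOL;
echo json_encode($string_fromcontrol), PHP_EOL;

// bin2hex() shows the raw bytes: 0a is \n, 0d is \r.
echo bin2hex($string_fromdatabase), PHP_EOL;
echo bin2hex($string_fromcontrol), PHP_EOL;

// The lengths differ by one byte, which is enough for == and strcmp()
// to report a mismatch even though both strings look alike on screen.
var_dump(strlen($string_fromdatabase), strlen($string_fromcontrol));
```

Once the actual bytes are known, the delimiter passed to explode() (or the normalization step) can be chosen to match them.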

Ray Paseur

This might be a matter of how the end-of-line characters are created and stored. Anything you send to the MySQL engine will be stored verbatim. In some environments, the EOL character is \n. In others it's \r. In still others (IIRC Windows) it's \r\n. The whitespace removal plus trim() should work for all of them, but you would want to normalize both fields with the same strategy.

If you want to copy / paste both of the sample data elements here in the code snippet, I'll be glad to show you the code that can work for this.
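The "same strategy for both fields" idea can be sketched as follows. This is only an illustration, assuming that line-ending differences (plus stray leading or trailing whitespace) are the sole discrepancy; normalize_eol() is a hypothetical helper name, and the sample values are invented.

```php
<?php
/**
 * Hypothetical helper: convert \r\n and lone \r line endings to \n.
 */
function normalize_eol($str)
{
    // Order matters: replace \r\n first so it does not become two line breaks.
    return str_replace(array("\r\n", "\r"), "\n", $str);
}

// Invented sample data: one value with \n, one with \r\n and a trailing space.
$from_database = "123 Springfield Avenue\nNY, NY";
$from_control  = "123 Springfield Avenue\r\nNY, NY ";

// Apply the same normalization and trim() to both sides, then compare strictly.
$same = trim(normalize_eol($from_database)) === trim(normalize_eol($from_control));
var_dump($same);
```

Because only the line endings are rewritten, an added or removed line break in either value still shows up as a difference.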

Pete C

ASKER

Thanks, Ray - that does sound like the issue.

$string_fromdatabase = "123 Springfield Avenue\nNew York, NY";
$string_fromcontrol = "123 Springfield Avenue New York, NY"; // note: the space before "New York" corresponds to a line break

I am trying to determine if the string submitted via the control differs from the string that is retrieved from the database, so consistent with what you mentioned, it seems as though I need to convert the line breaks in each variable to the same thing (e.g. \n or something else) and then compare the strings.

Thanks for your help.
Pete

Ray Paseur

Sidebar note... PHP has a self-aware and predefined constant, PHP_EOL, that will give you the correct end-of-line character for the operating system you're using.

Ray Paseur

I found bin2hex() to be a little limited in its exposition, especially with multi-byte character sets. FWIW, here is how I solved the problem in this question. It comes in handy, especially if there is malformed UTF-8, as often happens when someone wants to JSON-encode European characters.
http://iconoun.com/demo/hexdump_unicode_v.php?q=Data:%E7%81%AB%E8%BD%A6%E7%A5%A8!!

<?php // demo/hexdump_unicode_v.php
/**
 * Expand and display a string variable in hexadecimal notation
 * Note: This may not look right without a unispace font!
 * http://php.net/manual/en/function.mb-split.php#99851
 *
 * http://iconoun.com/demo/hexdump_unicode_v.php?q=Data:%E7%81%AB%E8%BD%A6%E7%A5%A8!!
 *
 * Useful: http://www.utf8-chartable.de/unicode-utf8-table.pl?start=1536&number=1024&utf8=0x&unicodeinhtml=hex
 *
 * @param string $str The variable to expand and display
 * @return none (direct browser output)
 */
error_reporting(E_ALL);
mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');


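Class Letter
{
    // DECLARE THE PROPERTIES (DYNAMIC PROPERTIES ARE DEPRECATED AS OF PHP 8.2)
    public $chr;
    public $hex;
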
    public function __construct($chr)
    {
        $this->chr = $chr;
        $this->hex = array();
        $bytes     = $this->usplit($chr);
        foreach ($bytes as $byte)
        {
            $this->hex = array_merge($this->hex, $this->gethex($byte));
        }
        return $this;
    }

    public function usplit ($chr)
    {
        $arr = array(); // AVOID AN UNDEFINED VARIABLE FOR AN EMPTY STRING
        $len = strlen($chr);
        while ($len) {
            $arr[] = substr($chr, 0, 1);
            $chr   = substr($chr, 1, $len);
            $len   = strlen($chr);
        }
        return $arr;
    }

    public function gethex($chr)
    {
        // GET THE HEX NIBBLE VALUES IN AN ARRAY
        $ret = str_split(implode('', unpack('H*', $chr)));
        return $ret;
    }
}

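Class Hexdump
{
    // DECLARE THE PROPERTIES (DYNAMIC PROPERTIES ARE DEPRECATED AS OF PHP 8.2)
    public $str;
    public $arr;
    public $len;
    public $dat = array();
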
    public function __construct($str)
    {
        $this->str = $str;
        $this->arr = $this->mb_str_split($str);
        $this->len = mb_strlen($str);
        foreach ($this->arr as $uchr)
        {
            $this->dat[] = new Letter($uchr);
        }
        return $this;
    }

    public function mb_str_split($ustr)
    {
        return preg_split('/(?<!^)(?!$)/u', $ustr);
    }

    public function render($br = PHP_EOL)
    {
        foreach ($this->dat as $poz => $chr)
        {
            echo $br;
            echo str_pad($poz, 4, ' ', STR_PAD_LEFT);
            echo ' ';
            echo $chr->chr;
            echo "\t";
            echo implode('', $chr->hex);
        }
        echo $br;
    }
}

// DEMONSTRATE IT WITH THE REQUEST ARGUMENT
echo '<meta charset="utf-8" />';
echo '<pre>';

$q = !empty($_GET['q']) ? $_GET['q'] : 'Vöila';
var_dump($q);

$y = new Hexdump($q);
$y->render();

Pete C

ASKER

Thanks for the additional thoughts; I have been caught up with a few other things and will get back to this in a few days - I will let you know what my final approach is. thanks again!

Pete C

ASKER

Ray - I used a variant of your suggestion and simply replaced each single whitespace character with a space. This seems to eliminate the issue that was being caused by the end-of-line characters. Thanks again for the feedback!

$rgx
= '#' // REGEX DELIMITER
. '\s' // SINGLE WHITESPACE CHARACTERS
. '#' // REGEX DELIMITER
;

// convert whitespace
$input1 = trim(preg_replace($rgx, ' ', $input1));
$input2 = trim(preg_replace($rgx, ' ', $input2));

// determine if strings are same
if ($input1 == $input2) $same = true;
else $same = false;

Pete C

ASKER

Hi - not sure if anyone is still looking at this, but I spoke too soon in providing the above solution. If I use the approach mentioned above and simply replace each single whitespace character (via the preg_replace routine), and the EOL character is \r\n (which is the case, since I am using a Windows test server), the result does not match data that I load directly into the database (which uses \n for line breaks) because of the additional \r.

If, however, I also replace consecutive whitespace characters (by using \s\s+ | \s within the preg_replace function), the comparison does not recognize that a change has been made if, for example, I insert an additional line break within the data displayed in a text control.

I also tried the below statement to replace the EOL with \n prior to saving it, but that did not produce the desired result.
$streetaddress = str_replace(PHP_EOL, "\n", $streetaddress);

Ray - not sure if you are still looking at this, but I see that you provided some additional logic within your above discussion of the bin2hex statement. Frankly, I did not follow all of it, but is that the approach that I should be using to compare two strings?

Thanks,
Pete
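The two failure modes described in this post can be reproduced with short test strings. A minimal sketch, using invented sample values rather than the actual address data:

```php
<?php
$unix    = "Line one\nLine two";     // as loaded directly into the database
$windows = "Line one\r\nLine two";   // as posted from the form
$edited  = "Line one\n\nLine two";   // the user added an extra line break

// Replacing each single whitespace character: \r\n turns into TWO spaces,
// \n into one, so the strings still differ.
$a = preg_replace('/\s/', ' ', $unix);
$b = preg_replace('/\s/', ' ', $windows);
var_dump($a === $b);

// Collapsing runs of whitespace makes \r\n and \n equal, but it also makes
// the edited string equal, so the real change goes undetected.
$c = preg_replace('/\s+/', ' ', $unix);
$d = preg_replace('/\s+/', ' ', $windows);
$e = preg_replace('/\s+/', ' ', $edited);
var_dump($c === $d, $c === $e);

// PHP_EOL reflects the server's operating system, not the data, so this
// replacement only helps when the data uses the server's own convention.
$f = str_replace(PHP_EOL, "\n", $windows);
var_dump($f === $unix);
```

The result of the last comparison depends on the platform the script runs on, which is why a PHP_EOL-based replacement is not a reliable way to normalize data that may come from elsewhere.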

Pete C

ASKER

I made a few changes to my approach and below seems to work fine:

$rgx
= '#' // REGEX DELIMITER
. '\R' // LINE BREAKS
. '#' // REGEX DELIMITER
;

// convert line breaks
$input1 = trim(preg_replace($rgx, "\n", $input1));
$input2 = trim(preg_replace($rgx, "\n", $input2));

// determine if strings are same
if ($input1 == $input2) $same = true;
else $same = false;

Thanks again for your help, and please feel free to add any comments if you think there are any issues with this approach.

Pete

Julian Hansen

Not sure I am understanding the problem, but can you give us a sample of the data - what it looks like and what you want it to look like?
To get rid of the \r you can do something like this: preg_replace('/\r\n/', "\n", $str); but without seeing your data that is only a guess.

Pete C

ASKER

Hi - thanks, the data contains both \r\n and \n

I had previously tried replacing whitespace by using the below statements but I was not getting the desired result:
preg_replace('/\s/', ' ', $str)
preg_replace('/\s\s+|\s/', ' ', $str)

But, yes, consistent with what you had mentioned, I think that the below approach works fine.
preg_replace('/\R/', "\n", $str) - note: \R apparently matches all variations of line breaks.

Thanks again,
Pete

Julian Hansen

Not sure you want to do this

preg_replace('/\R/', "\n", $str);

That will create double line breaks
More like

preg_replace('/\r/', '', $str);

Alternatively, look at your import process to make it flexible about \n vs \r\n

Pete C

ASKER

Hi - thanks, my understanding is that replacing \R (as opposed to \r) with \n will replace all line breaks with a single line break, so it seems like the appropriate approach.

i.e. it replaces \r\n, \r and \n with \n, such that if data is imported from different servers (Mac, Windows, or Unix), the line breaks will be normalized such that two strings can be compared.

Julian Hansen

You are correct
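To illustrate the conclusion, here is a minimal sketch of the \R approach. compare_ignoring_eol() is a hypothetical name introduced for this example, and the test strings are invented. Note the double-quoted "\n" in the replacement: in single quotes, PHP would insert a literal backslash followed by the letter n rather than a line feed.

```php
<?php
/**
 * Hypothetical helper: compare two strings, ignoring only the style of line break.
 */
function compare_ignoring_eol($left, $right)
{
    // \R matches \r\n as one unit, as well as a lone \r or \n, so each
    // line break becomes exactly one "\n" (no doubled breaks).
    $left  = preg_replace('/\R/', "\n", $left);
    $right = preg_replace('/\R/', "\n", $right);

    // Strict comparison after trimming leading and trailing whitespace.
    return trim($left) === trim($right);
}

// Same text, different line-break styles.
var_dump(compare_ignoring_eol("Line one\r\nLine two", "Line one\nLine two"));

// An extra line break is a real change and is still detected.
var_dump(compare_ignoring_eol("Line one\n\nLine two", "Line one\nLine two"));
```

Unlike the whitespace-collapsing variants tried earlier in the thread, this keeps the number of line breaks intact, so an inserted blank line still counts as a difference.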