Solved

comparing strings that contain line breaks

Posted on 2016-11-04
Last Modified: 2016-11-19
Hi - I am trying to compare two strings that should be identical; the only difference is that I retrieve one from the database and the other from a text control.  They look the same when output to the screen, and the text in the page source is exactly the same, but they do not evaluate as equal when I use '==' or PHP's strcmp() function.

Any thoughts?

Thanks,
Pete


i.e.
string_fromdatabase != string_fromcontrol (i.e. post array)

string:
123 Springfield Avenue
NY, NY
Question by:shafer23
18 Comments
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 41874667
You might consider normalizing the whitespace.  See Example #4.  Next trim() the strings.  Then try the comparison again, and if that doesn't cure the problem, we need to see the exact  data from both sources.
 

Author Comment

by:shafer23
ID: 41874720
Thanks Ray, I tried that solution and it converted the carriage return in the control/post data to a space, but it did not do the same to the data that I retrieved from the database.

I thought that perhaps I could place each row of text in an array and then compare the row data with respect to each array.
To do so, I converted the strings to arrays as below.

$string_fromdatabasearray = explode ("\n", $string_fromdatabase);
$string_fromcontrolarray = explode ("\n", $string_fromcontrol);

I then outputted the arrays to the screen, and it shows that only the string from the database is being placed in separate array elements, i.e. the string from the control/POST data is being placed within an array with a single element.  I think that this shows that the line break from the POST array does not correspond to "\n".

Do you have any idea what it does equate to?  If so, I can use my above approach and compare each array element (unless there is a better approach).

Thanks,
Pete
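
A note on the explode() result above: explode("\n", ...) only splits on a bare line feed, so a string whose breaks are "\r" alone comes back as a single element, and "\r\n" breaks leave a trailing "\r" on each piece. Below is a minimal sketch of an alternative, assuming the goal is one array element per line whatever the convention. The variable names and sample values are illustrative only, not the data from this question.

```php
<?php
// Hypothetical sample values; the real data would come from the database and $_POST.
$fromDatabase = "123 Springfield Avenue\nNY, NY";
$fromControl  = "123 Springfield Avenue\r\nNY, NY";

// Try "\r\n" first so it counts as one break, then a lone "\r" or a lone "\n".
$pattern = '/\r\n|\r|\n/';

$linesDatabase = preg_split($pattern, $fromDatabase);
$linesControl  = preg_split($pattern, $fromControl);

// The two arrays can now be compared element by element, or as a whole.
var_dump($linesDatabase === $linesControl); // bool(true)
```

If the arrays still differ after this, the difference is in the visible text or in other whitespace rather than in the line-break convention.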
 

Author Comment

by:shafer23
ID: 41874736
i.e. here was the output from the array:

string_fromdatabasearray:
Array ( [0] => 123 Springfield Avenue [1] => NY, NY  )

$string_fromcontrolarray:
Array ( [0] => 123 Springfield Avenue NY, NY )
LVL 111

Expert Comment

by:Ray Paseur
ID: 41874776
This might be a matter of how the end-of-line characters are created and stored.  Anything you send to the MySQL engine will be stored verbatim.  In some environments, the EOL character is \n.  In others it's \r.  In still others (IIRC Windows) it's \r\n.  The whitespace removal plus trim() should work for all of them, but you would want to normalize both fields with the same strategy.

If you want to copy / paste both of the sample data elements here in the code snippet, I'll be glad to show you the code that can work for this.
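
To make the three end-of-line conventions concrete, here is a small sketch that rewrites all of them as "\n" before comparing. The helper name normalize_eol() and the sample strings are assumptions for illustration; nothing here is taken from the poster's actual data.

```php
<?php
// Hypothetical helper: rewrite every line-break convention as a single "\n".
function normalize_eol(string $text): string
{
    // Order matters: convert "\r\n" first so it does not turn into two breaks.
    return str_replace(["\r\n", "\r"], "\n", $text);
}

$unix    = "123 Springfield Avenue\nNY, NY";
$oldMac  = "123 Springfield Avenue\rNY, NY";
$windows = "123 Springfield Avenue\r\nNY, NY";

var_dump($unix === $windows);                               // bool(false)
var_dump(normalize_eol($unix) === normalize_eol($windows)); // bool(true)
var_dump(normalize_eol($unix) === normalize_eol($oldMac));  // bool(true)
```

Unlike collapsing all whitespace, this leaves spaces and blank lines alone, so it only hides differences in the line-break convention itself.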
 

Author Comment

by:shafer23
ID: 41874809
Thanks, Ray - that does sound like the issue.

$string_fromdatabase = "123 Springfield Avenue\nNew York, NY";
$string_fromcontrol = '123 Springfield Avenue New York, NY';     // note: the space before "New York" corresponds to a line break

I am trying to determine if the string submitted via the control differs from the string that is retrieved from the database, so consistent with what you mentioned, it seems as though I need to convert the line breaks in each variable to the same thing (e.g. \n or something else) and then compare the strings.

Thanks for your help.
Pete
 
LVL 111

Accepted Solution

by:
Ray Paseur earned 1600 total points
ID: 41874837
Glad to help.  Here's a little thought experiment with test data that may make it clear (and I think your instincts are right about the issue).
https://iconoun.com/demo/temp_shafer23.php
<?php // demo/temp_shafer23.php
/**
 * https://www.experts-exchange.com/questions/28981124/comparing-strings-that-contain-line-breaks.html
 *
 * https://www.experts-exchange.com/articles/7830/A-Quick-Tour-of-Test-Driven-Development.html
 */
error_reporting(E_ALL);

// COLLECTION OF TEST DATA STRINGS WITH DIFFERENT EOL CHARACTERS
$old =
[ "123 Springfield Avenue NY, NY"
, "123 Springfield Avenue\nNY, NY"
, "123 Springfield Avenue\rNY, NY"
, "123 Springfield Avenue\r\nNY, NY\n"
, "123 Springfield Avenue\r\nNY, NY\r\n"
]
;

// SHOW HOW THEY LOOK IN THE BROWSER WINDOW
foreach ($old as $str)
{
    echo PHP_EOL . "<br>$str";
}

// NOW SWITCH TO PREFORMATTED DISPLAY AND COMPARE TO BROWSER DISPLAY
echo '<pre>';
var_dump($old);

// A REGULAR EXPRESSION TO NORMALIZE WHITESPACE
$rgx
= '#'        // REGEX DELIMITER
. '\s\s+'    // CONSECUTIVE WHITESPACE CHARACTERS
. '|'        // OR
. '\s'       // SINGLE WHITESPACE CHARACTERS
. '#'        // REGEX DELIMITER
;

// NORMALIZE AND TRIM THEM
$new = [];
foreach ($old as $str)
{
    $str = preg_replace($rgx, ' ', $str);
    $str = trim($str);
    $new[] = $str;
}

// SHOW THE NORMALIZED STRINGS
var_dump($new);

 
LVL 111

Expert Comment

by:Ray Paseur
ID: 41874845
Sidebar note... PHP has a self-aware and predefined constant, PHP_EOL, that will give you the correct end-of-line character for the operating system you're using.
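
One caveat worth illustrating: PHP_EOL describes the platform the script runs on, not the line breaks inside data that arrived from elsewhere. Browsers generally submit textarea line breaks as \r\n whatever the visitor's operating system, so the value below is a plausible (but hypothetical) form submission. The result of this sketch depends on the server, which is why no expected output is shown.

```php
<?php
// Hypothetical form value with CRLF line breaks.
$posted = "123 Springfield Avenue\r\nNY, NY";

// On a Unix-like server PHP_EOL is "\n", so this call leaves the "\r" in place.
// On a Windows server PHP_EOL is "\r\n", so the "\r" is removed.
$converted = str_replace(PHP_EOL, "\n", $posted);

// Show which bytes PHP_EOL holds on this machine, and whether a "\r" survived.
var_dump(bin2hex(PHP_EOL));
var_dump(strpos($converted, "\r") !== false);
```

For data comparison it is usually safer to replace the explicit sequences ("\r\n", "\r", "\n") than to rely on PHP_EOL.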
 
LVL 59

Assisted Solution

by:Julian Hansen
Julian Hansen earned 400 total points
ID: 41875957
Try running bin2hex() on each of the strings and sending the result to the screen. That way you will see what is hiding in the string that might be causing the comparison to fail.
For str1 === str2 to be true, the hex dumps of both strings must be identical.
If unsure, paste the dumps back here.
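
As a quick sketch of what such a dump looks like, here are two short, made-up strings that render identically in a browser but differ by one byte:

```php
<?php
// Hypothetical values: same visible text, different line-break bytes.
$a = "NY\nNY";
$b = "NY\r\nNY";

echo bin2hex($a), PHP_EOL; // 4e590a4e59
echo bin2hex($b), PHP_EOL; // 4e590d0a4e59

var_dump($a === $b);       // bool(false)
```

The extra 0d in the second dump is the carriage return; 0a is the line feed.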
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 41876066
I found bin2hex() to be a little limited in its exposition, especially with multi-byte character sets.  FWIW, here is how I solved the problem in this question.  It comes in handy, especially if there is malformed UTF-8, as often happens when someone wants to JSON-encode European characters.
http://iconoun.com/demo/hexdump_unicode_v.php?q=Data:%E7%81%AB%E8%BD%A6%E7%A5%A8!!
<?php // demo/hexdump_unicode_v.php
/**
 * Expand and display a string variable in hexadecimal notation
 * Note: This may not look right without a monospaced font!
 * http://php.net/manual/en/function.mb-split.php#99851
 *
 * http://iconoun.com/demo/hexdump_unicode_v.php?q=Data:%E7%81%AB%E8%BD%A6%E7%A5%A8!!
 *
 * Useful: http://www.utf8-chartable.de/unicode-utf8-table.pl?start=1536&number=1024&utf8=0x&unicodeinhtml=hex
 *
 * @param string $str The variable to expand and display
 * @return none (direct browser output)
 */
error_reporting(E_ALL);
mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');


Class Letter
{
    // DECLARED EXPLICITLY; DYNAMIC PROPERTIES ARE DEPRECATED AS OF PHP 8.2
    public $chr;
    public $hex;

    public function __construct($chr)
    {
        $this->chr = $chr;
        $this->hex = array();
        $bytes     = $this->usplit($chr);
        foreach ($bytes as $byte)
        {
            $this->hex = array_merge($this->hex, $this->gethex($byte));
        }
        return $this;
    }

    public function usplit ($chr)
    {
        $arr = [];   // AVOID AN UNDEFINED VARIABLE WHEN $chr IS EMPTY
        $len = strlen($chr);
        while ($len) {
            $arr[] = substr($chr, 0, 1);
            $chr   = substr($chr, 1, $len);
            $len   = strlen($chr);
        }
        return $arr;
    }

    public function gethex($chr)
    {
        // GET THE HEX NIBBLE VALUES IN AN ARRAY
        $ret = str_split(implode('', unpack('H*', $chr)));
        return $ret;
    }
}

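Class Hexdump
{
    // DECLARED EXPLICITLY; DYNAMIC PROPERTIES ARE DEPRECATED AS OF PHP 8.2
    public $str;
    public $arr;
    public $len;
    public $dat = [];
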
    public function __construct($str)
    {
        $this->str = $str;
        $this->arr = $this->mb_str_split($str);
        $this->len = mb_strlen($str);
        foreach ($this->arr as $uchr)
        {
            $this->dat[] = new Letter($uchr);
        }
        return $this;
    }

    public function mb_str_split($ustr)
    {
        return preg_split('/(?<!^)(?!$)/u', $ustr);
    }

    public function render($br = PHP_EOL)
    {
        foreach ($this->dat as $poz => $chr)
        {
            echo $br;
            echo str_pad($poz, 4, ' ', STR_PAD_LEFT);
            echo ' ';
            echo $chr->chr;
            echo "\t";
            echo implode('', $chr->hex);
        }
        echo $br;
    }
}

// DEMONSTRATE IT WITH THE REQUEST ARGUMENT
echo '<meta charset="utf-8" />';
echo '<pre>';

$q = !empty($_GET['q']) ? $_GET['q'] : 'Vöila';
var_dump($q);

$y = new Hexdump($q);
$y->render();

 

Author Comment

by:shafer23
ID: 41879226
Thanks for the additional thoughts; I have been caught up with a few other things and will get back to this in a few days - I will let you know what my final approach is.  thanks again!
 

Author Comment

by:shafer23
ID: 41888123
Ray - I used a variant of your suggestion and simply removed the single whitespace characters, this seems to eliminate the issue that was being caused by the end of line characters.  Thanks again for the feedback!


$rgx
= '#'        // REGEX DELIMITER
. '\s'       // SINGLE WHITESPACE CHARACTERS
. '#'        // REGEX DELIMITER
;

// convert whitespace
$input1 = trim(preg_replace($rgx, ' ', $input1));
$input2 = trim(preg_replace($rgx, ' ', $input2));

// determine if strings are same (strict comparison avoids numeric-string type juggling)
$same = ($input1 === $input2);
 

Author Comment

by:shafer23
ID: 41893981
Hi - not sure if anyone is still looking at this, but I spoke too soon in providing the above solution.  If I use the approach mentioned above and simply replace each single whitespace character (via the preg_replace routine), and the EOL sequence is \r\n (which is the case since I am using a Windows test server), the result does not match data that I load directly into the database (which uses \n for line breaks) because of the additional \r.

If, however, I also replace consecutive whitespace characters (by using \s\s+|\s within the preg_replace function), it does not recognize that a change has been made if, for example, I insert an additional line break within the data displayed within a text control.

I also tried the below statement to replace the EOL with \n prior to saving it, but that did not produce the desired result.
$streetaddress = str_replace(PHP_EOL, "\n", $streetaddress);

Ray - not sure if you are still looking at this, but I see that you provided some additional logic within your above discussion of the bin2hex statement.  Frankly, I did not follow all of it, but is that the approach that I should be using to compare two strings?

Thanks,
Pete
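
The two behaviours described in this comment can be reproduced with a short sketch. The sample strings are made up for illustration. Note that the pattern \s\s+|\s behaves the same as \s+, which is what the second part uses.

```php
<?php
$fromDb     = "Line one\nLine two";         // stored with "\n"
$fromForm   = "Line one\r\nLine two";       // same text, "\r\n" breaks
$editedForm = "Line one\r\n\r\nLine two";   // user added a blank line

// Replacing each whitespace character: "\r\n" becomes two spaces, "\n" only one.
$single = '/\s/';
var_dump(preg_replace($single, ' ', $fromDb) === preg_replace($single, ' ', $fromForm)); // bool(false)

// Collapsing whitespace runs: the unchanged text now matches, but so does the edited text.
$runs = '/\s+/';
var_dump(preg_replace($runs, ' ', $fromDb) === preg_replace($runs, ' ', $fromForm));     // bool(true)
var_dump(preg_replace($runs, ' ', $fromDb) === preg_replace($runs, ' ', $editedForm));   // bool(true)
```

So the first pattern is too strict for mixed line-break conventions, and the second is too lenient to detect an added blank line.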
 

Author Comment

by:shafer23
ID: 41893989
I made a few changes to my approach and below seems to work fine:

$rgx
= '#'        // REGEX DELIMITER
. '\R'       // LINE BREAKS
. '#'        // REGEX DELIMITER
;

// convert every line break variant to a single "\n"
$input1 = trim(preg_replace($rgx, "\n", $input1));
$input2 = trim(preg_replace($rgx, "\n", $input2));

// determine if strings are same
$same = ($input1 === $input2);


Thanks again for your help, and please feel free to add any comments if you think there are any issues with this approach.

Pete
 
LVL 59

Expert Comment

by:Julian Hansen
ID: 41893990
Not sure I am understanding the problem but can you give us a sample of the data - what it looks like and what you want it to look like.
To get rid of the \r you can do something like this: preg_replace('/\r\n/', "\n", $str); but without seeing your data that is only a guess.
 

Author Comment

by:shafer23
ID: 41893998
Hi - thanks, the data contains both \r\n and \n

I had previously tried replacing whitespace by using the below statements but I was not getting the desired result:
preg_replace('/\s/', ' ', $str)
preg_replace('/\s\s+|\s/', ' ', $str)

But, yes, consistent with what you had mentioned, I think that the below approach works fine.
preg_replace('/\R/', "\n", $str) - note, \R apparently matches all variations of line breaks.

Thanks again,
Pete
 
LVL 59

Expert Comment

by:Julian Hansen
ID: 41894009
Not sure you want to do this
preg_replace('/\R/', "\n", $str);


That will create double line breaks
More like
preg_replace('/\r/', '', $str);


Alternatively, look at your import process to make it flexible about \n vs \r\n.
 

Author Comment

by:shafer23
ID: 41894346
Hi - thanks, my understanding is that replacing \R (as opposed to \r) with \n will replace all line breaks with a single line break, so it seems like the appropriate approach.

i.e. it replaces \r\n, \r and \n with \n, such that if data is imported from different servers (Mac, Windows, or Unix), the line breaks will be normalized such that two strings can be compared.
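
Pulling that together, here is a hedged sketch of the \R approach wrapped in two helper functions. The function names are hypothetical, and the u modifier assumes the data is valid UTF-8; without it, \R can also match the single byte 0x85, which may occur inside multi-byte characters.

```php
<?php
// Hypothetical helper: rewrite every line-break variant as a single "\n".
function normalize_line_breaks(string $text): string
{
    // \R matches "\r\n" as one unit, as well as a lone "\r" or "\n".
    // The u modifier treats the subject as UTF-8 (an assumption about the data).
    $result = preg_replace('/\R/u', "\n", $text);
    if ($result === null) {
        throw new InvalidArgumentException('preg_replace failed; is the input valid UTF-8?');
    }
    return $result;
}

// Hypothetical helper: compare two strings while ignoring line-break convention
// and leading/trailing whitespace.
function same_text(string $a, string $b): bool
{
    return trim(normalize_line_breaks($a)) === trim(normalize_line_breaks($b));
}

var_dump(same_text("123 Springfield Avenue\nNY, NY", "123 Springfield Avenue\r\nNY, NY"));     // bool(true)
var_dump(same_text("123 Springfield Avenue\nNY, NY", "123 Springfield Avenue\r\n\r\nNY, NY")); // bool(false)
```

Because only the line breaks are rewritten, an added blank line is still detected as a change, which was the problem with collapsing all whitespace.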
 
LVL 59

Expert Comment

by:Julian Hansen
ID: 41894546
You are correct