Solved

comparing strings that contain line breaks

Posted on 2016-11-04
Last Modified: 2016-11-19
Hi - I am trying to compare two strings that should be identical; the only difference is that I am retrieving one from the database and the other from a text control.  Although they look the same when they are output to the screen, and the text within the page source is exactly the same, they do not evaluate as equal when using '==' or PHP's strcmp() function.

Any thoughts?

Thanks,
Pete


i.e.
string_fromdatabase != string_fromcontrol (i.e. post array)

string:
123 Springfield Avenue
NY, NY
Question by:shafer23
18 Comments
 

Expert Comment

by:Ray Paseur
ID: 41874667
You might consider normalizing the whitespace.  See Example #4.  Next trim() the strings.  Then try the comparison again, and if that doesn't cure the problem, we need to see the exact  data from both sources.

Author Comment

by:shafer23
ID: 41874720
Thanks Ray, I tried that solution and it converted the carriage return in the control/post data to a space, but it did not do the same to the data that I retrieved from the database.

I thought that perhaps I could place each row of text in an array and then compare the row data with respect to each array.
To do so, I converted the strings to arrays as below.

$string_fromdatabasearray = explode ("\n", $string_fromdatabase);
$string_fromcontrolarray = explode ("\n", $string_fromcontrol);

I then outputted the arrays to the screen, and it shows that only the string from the database is being placed in separate array elements, i.e. the string from the control/POST data is being placed within an array with a single element.  I think that this shows that the line break from the POST array does not correspond to "\n".

Do you have any idea what it does equate to?  If so, I can use my above approach and compare each array element (unless there is a better approach).

Thanks,
Pete
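
One possible explanation for the single-element array is that the control data contains no "\n" at all, for example because its line break is a lone "\r". A minimal sketch of splitting on any line-break style with preg_split() and \R follows; the sample values and variable names are illustrative assumptions, not the poster's actual data.

```php
<?php
// Illustrative values: "\r" stands in for whatever break the control really sent.
$fromDatabase = "123 Springfield Avenue\nNY, NY";
$fromControl  = "123 Springfield Avenue\rNY, NY";

// explode("\n", ...) finds nothing to split on when the break is a lone "\r".
var_dump(count(explode("\n", $fromControl))); // int(1)

// \R matches "\r\n", "\r" and "\n", so both strings split into two rows.
// The "u" modifier assumes the input is valid UTF-8.
$rowsDatabase = preg_split('/\R/u', $fromDatabase);
$rowsControl  = preg_split('/\R/u', $fromControl);

var_dump($rowsDatabase === $rowsControl); // bool(true)
```

Once both sides are split the same way, the rows can be compared element by element or, as here, as whole arrays.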

Author Comment

by:shafer23
ID: 41874736
i.e. here was the output from the array:

string_fromdatabasearray:
Array ( [0] => 123 Springfield Avenue [1] => NY, NY  )

$string_fromcontrolarray:
Array ( [0] => 123 Springfield Avenue NY, NY )

Expert Comment

by:Ray Paseur
ID: 41874776
This might be a matter of how the end-of-line characters are created and stored.  Anything you send to the MySQL engine will be stored verbatim.  In some environments, the EOL character is \n.  In others it's \r.  In still others (IIRC Windows) it's \r\n.  The whitespace removal plus trim() should work for all of them, but you would want to normalize both fields with the same strategy.

If you want to copy / paste both of the sample data elements here in the code snippet, I'll be glad to show you the code that can work for this.

Author Comment

by:shafer23
ID: 41874809
Thanks, Ray - that does sound like the issue.

$string_fromdatabase = "123 Springfield Avenue\nNew York, NY";
$string_fromcontrol = "123 Springfield Avenue New York, NY";     // note: the space before "New York" corresponds to a line break

I am trying to determine if the string submitted via the control differs from the string that is retrieved from the database, so consistent with what you mentioned, it seems as though I need to convert the line breaks in each variable to the same thing (e.g. \n or something else) and then compare the strings.

Thanks for your help.
Pete

Accepted Solution

by:Ray Paseur
Ray Paseur earned 400 total points
ID: 41874837
Glad to help.  Here's a little thought experiment with test data that may make it clear (and I think your instincts are right about the issue).
https://iconoun.com/demo/temp_shafer23.php
<?php // demo/temp_shafer23.php
/**
 * https://www.experts-exchange.com/questions/28981124/comparing-strings-that-contain-line-breaks.html
 *
 * https://www.experts-exchange.com/articles/7830/A-Quick-Tour-of-Test-Driven-Development.html
 */
error_reporting(E_ALL);

// COLLECTION OF TEST DATA STRINGS WITH DIFFERENT EOL CHARACTERS
$old =
[ "123 Springfield Avenue NY, NY"
, "123 Springfield Avenue\nNY, NY"
, "123 Springfield Avenue\rNY, NY"
, "123 Springfield Avenue\r\nNY, NY\n"
, "123 Springfield Avenue\r\nNY, NY\r\n"
]
;

// SHOW HOW THEY LOOK IN THE BROWSER WINDOW
foreach ($old as $str)
{
    echo PHP_EOL . "<br>$str";
}

// NOW SWITCH TO PREFORMATTED DISPLAY AND COMPARE TO BROWSER DISPLAY
echo '<pre>';
var_dump($old);

// A REGULAR EXPRESSION TO NORMALIZE WHITESPACE
$rgx
= '#'        // REGEX DELIMITER
. '\s\s+'    // CONSECUTIVE WHITESPACE CHARACTERS
. '|'        // OR
. '\s'       // SINGLE WHITESPACE CHARACTERS
. '#'        // REGEX DELIMITER
;

// NORMALIZE AND TRIM THEM
$new = [];
foreach ($old as $str)
{
    $str = preg_replace($rgx, ' ', $str);
    $str = trim($str);
    $new[] = $str;
}

// SHOW THE NORMALIZED STRINGS
var_dump($new);


Expert Comment

by:Ray Paseur
ID: 41874845
Sidebar note... PHP has a self-aware and predefined constant, PHP_EOL, that will give you the correct end-of-line character for the operating system you're using.
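
One caveat worth sketching: PHP_EOL describes the platform the script runs on, not the line breaks a particular string happens to contain, so data arriving from a browser or another server may use a different sequence. The sketch below replaces the explicit sequences instead of PHP_EOL. The helper name to_unix_line_breaks() is hypothetical and not part of this thread.

```php
<?php
// Hypothetical helper: replaces explicit sequences rather than PHP_EOL,
// so the result does not depend on which OS the script runs on.
function to_unix_line_breaks(string $text): string
{
    // Order matters: "\r\n" is replaced first, then any remaining lone "\r".
    return str_replace(["\r\n", "\r"], "\n", $text);
}

$windowsStyle = "line one\r\nline two";
$oldMacStyle  = "line one\rline two";

var_dump(to_unix_line_breaks($windowsStyle) === "line one\nline two"); // bool(true)
var_dump(to_unix_line_breaks($oldMacStyle) === "line one\nline two");  // bool(true)
```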

Assisted Solution

by:Julian Hansen
Julian Hansen earned 100 total points
ID: 41875957
Try running bin2hex() on each of the strings and sending the result to the screen. That way you will see what is hiding in the string that might be causing the comparison to fail
For
str1 === str2

The hex dump of both must be identical.
If unsure paste the dump back here
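
As a rough illustration of this suggestion, the sketch below dumps two strings that look identical on screen but end with different line breaks. The sample values are assumptions; real data would come from the database and the POST array.

```php
<?php
// Illustrative values only.
$fromDatabase = "NY, NY\n";
$fromControl  = "NY, NY\r\n";

$hexDatabase = bin2hex($fromDatabase);
$hexControl  = bin2hex($fromControl);

echo $hexDatabase, PHP_EOL; // 4e592c204e590a
echo $hexControl, PHP_EOL;  // 4e592c204e590d0a

// The extra "0d" (carriage return) is what makes the comparison fail.
var_dump($hexDatabase === $hexControl); // bool(false)
```

In the dump, 0a is "\n" and 0d is "\r", so a trailing 0d0a points to a Windows-style line break.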

Expert Comment

by:Ray Paseur
ID: 41876066
I found bin2hex() to be a little limited in its exposition, especially with multi-byte character sets.  FWIW, here is how I solved the problem in this question.  It comes in handy, especially if there is malformed UTF-8, as often happens when someone wants to JSON-encode European characters.
http://iconoun.com/demo/hexdump_unicode_v.php?q=Data:%E7%81%AB%E8%BD%A6%E7%A5%A8!!
<?php // demo/hexdump_unicode_v.php
/**
 * Expand and display a string variable in hexadecimal notation
 * Note: This may not look right without a unispace font!
 * http://php.net/manual/en/function.mb-split.php#99851
 *
 * http://iconoun.com/demo/hexdump_unicode_v.php?q=Data:%E7%81%AB%E8%BD%A6%E7%A5%A8!!
 *
 * Useful: http://www.utf8-chartable.de/unicode-utf8-table.pl?start=1536&number=1024&utf8=0x&unicodeinhtml=hex
 *
 * @param string $str The variable to expand and display
 * @return none (direct browser output)
 */
error_reporting(E_ALL);
mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');


class Letter
{
    // DECLARED EXPLICITLY; DYNAMIC PROPERTIES ARE DEPRECATED AS OF PHP 8.2
    public $chr;
    public $hex;

    public function __construct($chr)
    {
        $this->chr = $chr;
        $this->hex = array();
        $bytes     = $this->usplit($chr);
        foreach ($bytes as $byte)
        {
            $this->hex = array_merge($this->hex, $this->gethex($byte));
        }
        return $this;
    }

    public function usplit ($chr)
    {
        $len = strlen($chr);
        while ($len) {
            $arr[] = substr($chr, 0, 1);
            $chr   = substr($chr, 1, $len);
            $len   = strlen($chr);
        }
        return $arr;
    }

    public function gethex($chr)
    {
        // GET THE HEX NIBBLE VALUES IN AN ARRAY
        $ret = str_split(implode('', unpack('H*', $chr)));
        return $ret;
    }
}

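class Hexdump
{
    // DECLARED EXPLICITLY; DYNAMIC PROPERTIES ARE DEPRECATED AS OF PHP 8.2
    public $str;
    public $arr;
    public $len;
    public $dat = [];
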
    public function __construct($str)
    {
        $this->str = $str;
        $this->arr = $this->mb_str_split($str);
        $this->len = mb_strlen($str);
        foreach ($this->arr as $uchr)
        {
            $this->dat[] = new Letter($uchr);
        }
        return $this;
    }

    public function mb_str_split($ustr)
    {
        return preg_split('/(?<!^)(?!$)/u', $ustr);
    }

    public function render($br = PHP_EOL)
    {
        foreach ($this->dat as $poz => $chr)
        {
            echo $br;
            echo str_pad($poz, 4, ' ', STR_PAD_LEFT);
            echo ' ';
            echo $chr->chr;
            echo "\t";
            echo implode('', $chr->hex);
        }
        echo $br;
    }
}

// DEMONSTRATE IT WITH THE REQUEST ARGUMENT
echo '<meta charset="utf-8" />';
echo '<pre>';

$q = !empty($_GET['q']) ? $_GET['q'] : 'Vöila';
var_dump($q);

$y = new Hexdump($q);
$y->render();


Author Comment

by:shafer23
ID: 41879226
Thanks for the additional thoughts; I have been caught up with a few other things and will get back to this in a few days - I will let you know what my final approach is.  thanks again!

Author Comment

by:shafer23
ID: 41888123
Ray - I used a variant of your suggestion and simply replaced each single whitespace character with a space; this seems to eliminate the issue that was being caused by the end-of-line characters.  Thanks again for the feedback!


$rgx
= '#'        // REGEX DELIMITER
. '\s'       // SINGLE WHITESPACE CHARACTERS
. '#'        // REGEX DELIMITER
;

// convert whitespace
$input1 = trim(preg_replace($rgx, ' ', $input1));
$input2 = trim(preg_replace($rgx, ' ', $input2));

// determine if strings are same (=== avoids PHP's loose numeric-string comparison)
$same = ($input1 === $input2);

Author Comment

by:shafer23
ID: 41893981
Hi - not sure if anyone is still looking at this, but I spoke too soon in providing the above solution.  If I use the approach above and replace only single whitespace characters (via preg_replace), and the EOL sequence is \r\n (which is the case since I am using a Windows test server), the result does not match data that I load directly into the database (which uses \n for line breaks), because of the additional \r.

If, however, I also replace consecutive whitespace characters as well (by inserting \s\s+ | \s within the preg_replace function), it does not recognize that a change has been made if, for example, I insert an additional line break within the data displayed within a text control.  

I also tried the below statement to replace the EOL with \n prior to saving it, but that did not produce the desired result.
$streetaddress = str_replace(PHP_EOL, "\n", $streetaddress);

Ray - not sure if you are still looking at this, but I see that you provided some additional logic within your above discussion of the bin2hex statement.  Frankly, I did not follow all of it, but is that the approach that I should be using to compare two strings?

Thanks,
Pete
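
The trade-off described in this comment can be sketched as follows: collapsing every whitespace run hides an added line break, while rewriting only the line-break style keeps it detectable. The closures and sample values are hypothetical, written only to illustrate the difference.

```php
<?php
$saved  = "123 Springfield Avenue\nNY, NY";
$edited = "123 Springfield Avenue\r\n\r\nNY, NY"; // user added a blank line

// Collapsing each whitespace run to one space hides the extra line break.
$collapse = function (string $s): string {
    return trim(preg_replace('/\s+/', ' ', $s));
};
var_dump($collapse($saved) === $collapse($edited)); // bool(true)

// Rewriting only the line-break style keeps the number of breaks intact.
$normalize = function (string $s): string {
    return trim(preg_replace('/\r\n|\r|\n/', "\n", $s));
};
var_dump($normalize($saved) === $normalize($edited)); // bool(false)
```

Which behavior is wanted depends on whether an added blank line should count as a change to the stored value.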

Author Comment

by:shafer23
ID: 41893989
I made a few changes to my approach and below seems to work fine:

$rgx
= '#'        // REGEX DELIMITER
. '\R'       // LINE BREAKS
. '#'        // REGEX DELIMITER
;

// convert line breaks to "\n" (double quotes, so PHP produces a real newline)
$input1 = trim(preg_replace($rgx, "\n", $input1));
$input2 = trim(preg_replace($rgx, "\n", $input2));

// determine if strings are same (=== avoids PHP's loose numeric-string comparison)
$same = ($input1 === $input2);


Thanks again for your help, and please feel free to add any comments if you think there are any issues with this approach.

Pete

Expert Comment

by:Julian Hansen
ID: 41893990
I'm not sure I understand the problem. Can you give us a sample of the data: what it looks like and what you want it to look like?
To get rid of the \r you can do something like preg_replace('/\r\n/', "\n", $str); but without seeing your data that is only a guess.

Author Comment

by:shafer23
ID: 41893998
Hi - thanks, the data contains both \r\n and \n

I had previously tried replacing whitespace by using the below statements but I was not getting the desired result:
preg_replace('/\s/', ' ', $str)
preg_replace('/\s\s+|\s/', ' ', $str)

But, yes, consistent with what you had mentioned, I think that the below approach works fine.
preg_replace('/\R/', "\n", $str) - note, \R apparently matches all variations of line breaks.

Thanks again,
Pete

Expert Comment

by:Julian Hansen
ID: 41894009
Not sure you want to do this
preg_replace('/\R/', "\n", $str);

That will create double line breaks
More like
preg_replace('/\r/', '', $str);

Alternatively, look at your import process to make it flexible about \n vs \r\n

Author Comment

by:shafer23
ID: 41894346
Hi - thanks, my understanding is that replacing \R (as opposed to \r) with \n will replace all line breaks with a single line break, so it seems like the appropriate approach.

i.e. it replaces \r\n, \r and \n with \n, such that if data is imported from different servers (Mac, Windows, or Unix), the line breaks will be normalized such that two strings can be compared.
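
Putting the thread's conclusion together, a minimal sketch of the comparison might look like the following. The function names normalize_line_breaks() and same_text() are hypothetical. The "u" modifier and the fallback pattern are additions here, not something the thread specifies: without "u", \R also matches the single byte 0x85, which can occur inside multi-byte UTF-8 characters.

```php
<?php
// Hypothetical helpers; names are not from the thread.
function normalize_line_breaks(string $text): string
{
    // \R matches "\r\n", "\r", "\n" and a few rarer line separators.
    // With "u", preg_replace() returns null when the input is not valid
    // UTF-8, so fall back to a byte-safe pattern in that case.
    $normalized = preg_replace('/\R/u', "\n", $text);
    if ($normalized === null) {
        $normalized = preg_replace('/\r\n|\r|\n/', "\n", $text);
    }
    return trim($normalized);
}

function same_text(string $a, string $b): bool
{
    // Strict comparison, so numeric-looking strings are not juggled.
    return normalize_line_breaks($a) === normalize_line_breaks($b);
}

$fromDatabase = "123 Springfield Avenue\nNY, NY";
$fromControl  = "123 Springfield Avenue\r\nNY, NY\r\n";

var_dump(same_text($fromDatabase, $fromControl));                        // bool(true)
var_dump(same_text($fromDatabase, "123 Springfield Avenue\n\nNY, NY")); // bool(false)
```

Because only the line-break style is rewritten, an extra blank line typed into the control still registers as a change.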

Expert Comment

by:Julian Hansen
ID: 41894546
You are correct