  • Status: Solved
  • Priority: Medium
  • Security: Public

preg_match bug?

I have been trying to replace some text within a large string using preg_replace, but I can't get it working.
<?php
ini_set("memory_limit",'1024M'); 
ini_set("max_execution_time",'5000');
ini_set("display_errors",true);
ini_set("pcre.backtrack_limit", 100000);
ini_set("pcre.recursion_limit", 100000);
error_reporting(E_ALL);
$sString    =file_get_contents("text.txt"); 
$sString    =preg_replace('~(\\\r\\\n){3,}~','\r\n',$sString);
 var_export($sString);
?>



I'm running it on an HP ProLiant ML350 G5 server (5 GB memory, Xeon 2.33 GHz) with openSUSE Linux 11.1 x64.

PHP version:  5.2.13
preg.zip
Asked by: Ludwig Diehl

sjklein42 Commented:
Don't know if this will work, but some notes.

Use double quotes so that the \r and \n escape sequences in the replacement string are translated into real characters. (I'm guessing you want real newlines, but I may be wrong.) Single quotes do not translate them.

The first argument to preg_replace can also be a pattern (as opposed to a string which is what you were passing).  I think this will work better.  I'm guessing you want to find runs of at least three \r\n in the string and replace with a true newline?

$sString    =preg_replace(/(\\r\\n){3,}/,"\r\n",$sString);
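
To make the quoting point concrete, here is a small sketch (not part of the original thread) showing how PHP treats \r\n in single versus double quotes. The variable names are illustrative only.

```php
<?php
// Single quotes keep the backslashes: four literal characters \ r \ n.
$literal = '\r\n';
// Double quotes translate the escapes: two characters, CR and LF.
$real = "\r\n";

echo strlen($literal), "\n";  // 4
echo strlen($real), "\n";     // 2
echo bin2hex($literal), "\n"; // 5c725c6e
echo bin2hex($real), "\n";    // 0d0a
```

So a single-quoted '\r\n' replacement writes the four-character text back into the string, while a double-quoted "\r\n" writes a real line break.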

Ray Paseur Commented:
Could you do us a favor, please.  Post some of the original data, and post some of the desired output.  Like what was in "text.txt" and what did you expect to find in $sString after the processing completed.  Armed with that and a good explanation of your rules for data transformation we will probably be able to show you the code that will achieve your objective.

Terry Woods (IT Guru) Commented:
Also, is it a one-off job? If so, using Perl could be a backup option. I've run Perl regex substitution scripts over files gigabytes in size with far more complicated patterns, and the performance and reliability were excellent.

Ray Paseur Commented:
@TerryAtOpus:  Just between you and me, we might want to use the PHP function nl2br() or the strip_tags() function or some combination of replacement of PHP_EOL with NULL.  It's easier when we see the input and are told about the desired output, eh!

Terry Woods (IT Guru) Commented:
Ray, I agree. And generally it's better to avoid trying to learn a new tool if possible, but it's also reassuring to know that there's a viable backup option - I've seen people spend days on problems that didn't really need to be solved as they could be worked around easily.

Ludwig Diehl (Systems Architect, Author) Commented:
Just try this simple example:

ini_set("memory_limit",'3024M'); 
ini_set("max_execution_time",'5000');
ini_set( 'pcre.backtrack_limit', 10000000);
ini_set( 'pcre.recursion_limit', 10000000);
$nMultiplier=6371;
$sString=str_repeat('\r\n',$nMultiplier).'This is the first string I want to get'.str_repeat('\r\n',$nMultiplier).'This is another string I want'.str_repeat('\r\n',$nMultiplier).'This is the last string';
echo 'Length: '.strlen($sString).'<br/>'.$sString,'<hr/>';
$sString    =preg_replace('~(\\\r\\\n){3,}~',"\r\n",$sString);               
echo $sString,'<br/>';

Now, if you increment $nMultiplier by 1, you'll notice that it no longer works. I have tried ereg_replace and it does work, but I want to use preg_replace.
The underlying problem is that data was wrongly stored in the database double-escaped, so line breaks ended up as the literal text "\r\n", and in some cases several thousand of these sequences were stored in a row. I want to replace them wherever the pattern matches.
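
When preg_replace() fails on a subject like this, it returns NULL instead of the modified string, and preg_last_error() (available since PHP 5.2.0) reports the reason. The sketch below is not from the thread: the helper preg_error_name() is hypothetical, and whether a given input actually trips a limit depends on the PHP/PCRE build and the pcre.* ini settings. A hard crash such as a segfault cannot be caught this way.

```php
<?php
// Hypothetical helper: map a preg_last_error() code to a readable name.
function preg_error_name($code)
{
    $names = array(
        PREG_NO_ERROR              => 'PREG_NO_ERROR',
        PREG_INTERNAL_ERROR        => 'PREG_INTERNAL_ERROR',
        PREG_BACKTRACK_LIMIT_ERROR => 'PREG_BACKTRACK_LIMIT_ERROR',
        PREG_RECURSION_LIMIT_ERROR => 'PREG_RECURSION_LIMIT_ERROR',
        PREG_BAD_UTF8_ERROR        => 'PREG_BAD_UTF8_ERROR',
    );
    return isset($names[$code]) ? $names[$code] : 'unknown (' . $code . ')';
}

// Same shape of data as the example above: runs of the literal text \r\n.
$nMultiplier = 6372;
$sString = str_repeat('\r\n', $nMultiplier) . 'first'
         . str_repeat('\r\n', $nMultiplier) . 'last';

// '\\\\r' in single quotes reaches PCRE as \\r, i.e. a literal backslash then r.
$result = preg_replace('~(\\\\r\\\\n){3,}~', "\r\n", $sString);

if ($result === null) {
    // The engine hit a limit or failed; $sString itself is unchanged.
    echo 'preg_replace failed: ', preg_error_name(preg_last_error()), "\n";
} else {
    echo 'OK, new length: ', strlen($result), "\n";
}
```

Checking for NULL before echoing the result distinguishes "nothing matched" from "the engine gave up", which look the same if the return value is printed directly.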

sjklein42 Commented:
I post this again - I believe you should be using a pattern, not a string, as the first argument to preg_replace. Start with something simple like this, which should replace all literal \r\n with newlines.  If this works, we can deal with collapsing the long runs.

 $sString    =preg_replace(/\\r\\n/,"\n",$sString); 

Ray Paseur Commented:
@ludwigDiehl:  Let me try this again.

"Could you do us a favor, please.  Post some of the original data, and post some of the desired output.  Like what was in "text.txt" and what did you expect to find in $sString after the processing completed.  Armed with that and a good explanation of your rules for data transformation we will probably be able to show you the code that will achieve your objective."

It might be as simple as something like this little function.  Example on the web here:
http://www.laprbass.com/RAY_temp_ludwigdiehl.php
<?php // RAY_temp_ludwigdiehl.php
error_reporting(E_ALL);


// SHOW HOW TO GET BACK TO ONE NORMAL \r\n SEQUENCE


// SOME TEST DATA
$str = "This is a test string with \r\n\r\n\r\n\r\n\r\r\r\r\n\n\n too many strange CR/LF sequences";

// SHOW THE TEST DATA PREFORMATTED SO WE CAN SEE THE SEQUENCES
echo "<pre>";
echo $str;
echo PHP_EOL;
echo PHP_EOL;


// FIX THE STRING
$new = fix_str($str);
echo $new;
echo PHP_EOL;
echo PHP_EOL;



// A FUNCTION TO RESTORE SANITY TO TOO MANY CR/LF SEQUENCES
function fix_str($s)
{
    // WILL STOP IF THE STRING GOES EMPTY
    while ($s)
    {
        // WILL STOP IF THE STRING IS SANE
        $retry = FALSE;

        // REMOVE DOUBLED EOL CHARACTERS
        if (strpos($s, "\n\n") !== FALSE)
        {
            $retry = TRUE;
            $s = str_replace("\n\n", "\n", $s);
        }

        // REMOVE DOUBLED CR CHARACTERS
        if (strpos($s, "\r\r") !== FALSE)
        {
            $retry = TRUE;
            $s = str_replace("\r\r", "\r", $s);
        }

        // REMOVE DOUBLED WINDOWS CR/EOL CHARACTER SETS
        if (strpos($s, "\r\n\r\n") !== FALSE)
        {
            $retry = TRUE;
            $s = str_replace("\r\n\r\n", "\r\n", $s);
        }

        // SHOULD WE STOP OR TRY MORE REDUCTIONS
        if (!$retry) break;
    }

    return $s;
}

Ray Paseur Commented:
PS: you might also want to trim($s) before line 60.

Ludwig Diehl (Systems Architect, Author) Commented:
sjklein42: the given example does not work at all. It expects a string.

Ray_Paseur: Thanks for the proposal, but I want to use preg_match and also to know why this is happening.
As I said before, I tried ereg_replace and it works perfectly, so there is no need for an alternate method. Thanks anyway for your example.

Ray Paseur Commented:
OK, so are you not going to post the test data and the desired results set?  It would be so much easier for you to get a timely answer if you posted that.

If you want a solution, choose the code at ID:35085860.  It is a fully tested solution to the data-related problems.  It works.

And if it's just your personal learning exercise, best of luck learning to use preg_match.  Since you have a solution it's appropriate to close this question now.

Over and out, ~Ray

sjklein42 Commented:
Sorry.  Once more with feeling.  Made it a string and a pattern!

 $sString    = preg_replace("/\\r\\n/", "\n", $sString);  

Ludwig Diehl (Systems Architect, Author) Commented:
Ray_Paseur: post ID: 35084639 reproduces the test data exactly. I don't want to post the whole text, as it is a 65K-character string. In my first post I didn't ask for an alternate solution. I said "I can't get it working".
Again, as I said before, I have a solution using ereg or something like what you posted; however, with this question I'm trying to figure out what is happening. I've been using preg_match for more than 5 years and never had a problem like this, so that's the point. Thanks either way.

sjklein42: it doesn't work ;)
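
One likely reason the last suggestion has no effect on this data: a PHP pattern is unescaped twice, first by the PHP string literal and then by the regex engine. The sketch below (illustrative variable names, not from the thread) prints what PCRE actually receives for two spellings and tests each against the literal four-character text.

```php
<?php
// The stored data contains the literal four characters \ r \ n.
$subject = 'a\r\nb';

// "/\\r\\n/" reaches PCRE as /\r\n/, which matches a real CR+LF pair.
$matchesRealCrLf = "/\\r\\n/";
// "/\\\\r\\\\n/" reaches PCRE as /\\r\\n/, which matches the literal text \r\n.
$matchesLiteralText = "/\\\\r\\\\n/";

echo $matchesRealCrLf, "\n";    // /\r\n/
echo $matchesLiteralText, "\n"; // /\\r\\n/

var_dump(preg_match($matchesRealCrLf, $subject));    // int(0)
var_dump(preg_match($matchesLiteralText, $subject)); // int(1)
var_dump(preg_match($matchesRealCrLf, "a\r\nb"));    // int(1)
```

In a double-quoted pattern, four backslashes are needed for each literal backslash in the data.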

Terry Woods (IT Guru) Commented:
Your example code worked fine for me, even when I added 1 to $nMultiplier. Adding 1000, however, caused a segfault when running from the Linux command line, but not when run through Apache (it still worked).

This fixed it from the command line:
$sString    = preg_replace("/\\\\r\\\\n/", "\r\n", $sString);
$sString    = preg_replace("/(\r\n){3,1000}/", "\r\n", $sString);
$sString    = preg_replace("/(\r\n){3,}/", "\r\n", $sString);

Terry Woods (IT Guru) Commented:
The {3,} repeat seems to be the cause of the problem. Giving it a limit of 1000 repetitions obviously prevented the problem, but it's hardly an elegant solution.
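
The three calls above can be wrapped so that a failed step is noticed instead of silently passing NULL along. This is only a sketch: the function name collapse_escaped_newlines() is hypothetical, and the bound of 1000 is simply the value used above, not a documented PCRE limit.

```php
<?php
// Hypothetical wrapper around the three-step approach shown above.
function collapse_escaped_newlines($s)
{
    $steps = array(
        "/\\\\r\\\\n/"     => "\r\n", // literal text \r\n becomes a real CR+LF
        "/(\r\n){3,1000}/" => "\r\n", // collapse long runs in bounded chunks
        "/(\r\n){3,}/"     => "\r\n", // collapse whatever runs remain
    );
    foreach ($steps as $pattern => $replacement) {
        $s = preg_replace($pattern, $replacement, $s);
        if ($s === null) {
            // preg_replace() returns NULL on error; report the PCRE error code.
            throw new RuntimeException('preg_replace failed, code ' . preg_last_error());
        }
    }
    return $s;
}

// 2500 escaped pairs, then text, then a short run of 2 that is left alone.
$in = str_repeat('\r\n', 2500) . 'first' . str_repeat('\r\n', 2) . 'last';
echo bin2hex(collapse_escaped_newlines($in)), "\n";
```

The bounded pass shortens every long run to a few line breaks, so the unbounded {3,} pass only ever sees short runs. Runs shorter than three are converted to real line breaks but not collapsed.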

Ludwig Diehl (Systems Architect, Author) Commented:
Yes, your example does indeed work; however, this is not a documented preg_match limitation. Moreover, isn't PCRE (preg) supposed to perform better than POSIX (ereg)?

Terry Woods (IT Guru) Commented:
I'd guess yes, and I know that the ereg functions are now deprecated, but I don't have much knowledge on the limits of PCRE. (I think your original question has at least been answered!)

Ludwig Diehl (Systems Architect, Author) Commented:
you are right my friend!

Ludwig Diehl (Systems Architect, Author) Commented:
The solution is not exactly what I was looking for; however, it still uses preg_match.
Question has a verified solution.