DrDamnit asked:

preg_match_all causing children to die.

I have a TinyMCE form that accepts copy / paste of images. When you're done putting in your comment, you press "Post" and it submits the form, the content is parsed, and displayed in the on-screen conversation.

I am trying to use preg_match_all to pull the base64 encoded data from the POSTed vars. Most of my patterns work, but this one is causing Apache to crash.

<img src="data:image/(png|PNG|gif|GIF|jpg|JPG|jpeg|JPEG);base64,([a-zA-Z0-9+/=])*

The crash is silent, and the only hint I am getting from Apache / PHP is a single line in the error.log file:
[error] child died with signal 11

I have narrowed it down to this pattern in preg_match_all, and specifically to the * after the second group. That group is a character class meant to follow the base64 characters up to the quote that terminates them.

The sample image is attached below as a text file.

The only thing I can think of is that the "*" is too greedy and is consuming too much memory. But, there are two problems with that:

1. I increased the memory_limit in php.ini from 128M to 256M with no effect, and
2. The file size is only 198K.

System:
Apache v2.2.22 on Debian Wheezy 7.9
PHP Version: 5.6.16 compiled from source using the following configure:

./configure --with-config-file-path=/etc/php5/apache2 \
--with-pear=/usr/share/php \
--with-bz2 \
--with-curl \
--with-gd \
--enable-calendar \
--enable-mbstring \
--enable-bcmath \
--enable-sockets \
--with-libxml-dir \
--with-mysqli \
--with-mysql \
--with-openssl \
--with-regex=php \
--with-readline \
--with-zlib \
--with-apxs2=/usr/bin/apxs2 \
--enable-soap \
--with-freetype-dir=/usr/include/freetype2/ \
--with-freetype \
--with-mcrypt=/usr/src/mcrypt-2.6.8 \
--with-jpeg-dir=/usr/lib/x86_64-linux-gnu/ \
--with-png-dir=/usr/lib/x86_64-linux-gnu/

Attachment: base64.txt

gr8gonzo replied:

I honestly have never been able to figure out what "--with-regex" does, since PCRE is a default, core part of the engine. I would suggest recompiling without that option, especially since the configure help warns:

  --with-regex=TYPE       Regex library type: system, php. TYPE=php
                         WARNING: Do NOT use unless you know what you are doing!

You'll need to make clean before you re-configure and re-make.
ASKER CERTIFIED SOLUTION
DrDamnit
This solution is only available to members.
Even if that gets you past that error, I would still strongly recommend my previous suggestion. A segfault is a sign of a more serious problem, and if you don't know why you selected that "--with-regex" option in your configure statement, you should just leave it out. If a regex is invalid, it should throw a proper error message or warning that you can catch and deal with.
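
As a rough sketch of what "catch and deal with" could look like: preg_match_all() returns false when PCRE reports a failure, and preg_last_error() gives the reason code. The wrapper name match_all_or_fail() and the recursion limit value are assumptions made up for this example. This only covers errors that PCRE reports itself; it cannot intercept a segfault, although a lower pcre.recursion_limit may let PCRE stop with an error before the process stack runs out.

```php
<?php
// Assumed value: keep PCRE's recursion depth well below what the process
// stack can hold, so PCRE can stop and report an error instead of overflowing.
ini_set('pcre.recursion_limit', '10000');

// Hypothetical wrapper (not part of PHP): run preg_match_all() and raise an
// exception when PCRE reports a failure.
function match_all_or_fail($pattern, $subject)
{
    $matches = array();
    $count = preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER);
    if ($count === false) {
        throw new RuntimeException(
            'preg_match_all() failed, preg_last_error() returned ' . preg_last_error()
        );
    }
    return $matches;
}

try {
    $found = match_all_or_fail(
        '#data:image/(png|gif|jpe?g);base64,([a-zA-Z0-9+/=]*)#i',
        '<img src="data:image/png;base64,FOOBAR" />'
    );
    foreach ($found as $set) {
        // $set[1] is the image type, $set[2] is the base64 payload
        echo $set[1] . ' => ' . $set[2] . PHP_EOL;
    }
} catch (RuntimeException $e) {
    echo 'Regex error: ' . $e->getMessage() . PHP_EOL;
}
```

The numeric code in the exception message can be compared against constants such as PREG_BACKTRACK_LIMIT_ERROR and PREG_RECURSION_LIMIT_ERROR to decide how to react.
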
As a general rule, the parentheses in regular expressions denote "capture groups" enabling us to pluck substring parts out of the matched expression string.  In the examples above, the capture group that looks like this:

(png|PNG|gif|GIF|jpg|JPG|jpeg|JPEG)

...would match one of the substrings like png, PNG, etc.

And the capture group like this:

([a-zA-Z0-9+/=]*)

...might be rewritten like this, to add comments that help clarify the intent:
$rgx
= '#'           // REGEX DELIMITER
. '('           // START CAPTURE GROUP
. '['           // START CHARACTER CLASS
. 'a-zA-Z0-9'   // ALPHA-NUMERIC
. '+/='         // SOME PUNCTUATION, MAYBE?
. ']'           // END OF CHARACTER CLASS
. '*'           // ZERO OR MORE OF THE CLASS
. ')'           // END OF CAPTURE GROUP
. '#'           // REGEX DELIMITER
;

The act of deconstructing the REGEX is useful because it allows us to ask certain questions that might be buried when the string is concatenated. For example, the plus sign is a regex meta-character usually meaning "one or more of the preceding item." Inside a character class, however, it loses that role and matches a literal plus sign. For something like this, I would deliberately code an escape character and add a comment to explain that I really wanted to match a literal plus sign. And the asterisk at the end of the capture group says "zero or more of these characters," which may not make a lot of sense in the context of a permissive capture group.

Taken by itself, this subset of the REGEX will match a great many things! Because the asterisk permits zero repetitions, the REGEX engine is satisfied by any input at all, even an empty string. There is no restriction on count or ending delimiter.
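
One detail worth checking with a quick experiment: the pattern in the question places the asterisk outside the parentheses, ([a-zA-Z0-9+/=])*, while the rewritten version above places it inside. A minimal sketch of the difference follows; the sample string is made up. In PCRE builds without JIT (as in PHP 5.x), repeating a capture group once per character over a very long subject can also use a lot of stack, so this may be related to the signal 11, though that is an assumption and not something confirmed in this thread.

```php
<?php
$str = 'data+MORE/data=';

// Quantifier OUTSIDE the group: the group is repeated once per character,
// and only the last repetition is kept in the capture.
preg_match('#([a-zA-Z0-9+/=])*#', $str, $outside);

// Quantifier INSIDE the group: the group is entered once and the character
// class does the repeating, so the whole run is captured.
preg_match('#([a-zA-Z0-9+/=]*)#', $str, $inside);

echo $outside[1] . PHP_EOL; // prints "="
echo $inside[1] . PHP_EOL;  // prints "data+MORE/data="

// Inside a character class the plus sign is a literal character.
var_dump(preg_match('#[+]#', '3+5')); // int(1)
var_dump(preg_match('#[+]#', '35'));  // int(0)
```
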

I find that a test-driven approach to regular expressions is often helpful.
https://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/A_7830-A-Quick-Tour-of-Test-Driven-Development.html

Here are some sample test cases:
http://iconoun.com/demo/temp_drdammit.php
<?php // demo/temp_drdammit.php
/**
 * http://www.experts-exchange.com/questions/28897308/preg-match-all-causing-children-to-die.html
 *
 * http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/A_7830-A-Quick-Tour-of-Test-Driven-Development.html
 */
error_reporting(E_ALL);
echo '<pre>';

// A REGULAR EXPRESSION TO MATCH SOME CHARACTER STRINGS
$rgx
= '#'                     // REGEX DELIMITER
. '\<'                    // ESCAPED LITERAL LEFT-WICKET
. 'img src="data:image/'  // LITERAL STRING
. '('                     // START CAPTURE GROUP
. 'png|PNG|'              // LIST
. 'gif|GIF|'              //  OF
. 'jpg|JPG|'              //   ALTERNATIVE
. 'jpeg|JPEG'             //    SUFFIXES
. ')'                     // END CAPTURE GROUP
. ';base64,'              // LITERAL STRING
. '('                     // START CAPTURE GROUP
. '['                     // START CHARACTER CLASS
. 'a-zA-Z0-9'             // ALPHA-NUMERIC
. '+/='                   // SOME PUNCTUATION, MAYBE?
. ']'                     // END OF CHARACTER CLASS
. '*'                     // ZERO OR MORE OF THE CLASS
. ')'                     // END OF CAPTURE GROUP
. '#'                     // REGEX DELIMITER
;

// SHOW THE REGULAR EXPRESSION
echo PHP_EOL . '<b>REGEX: ' . htmlentities($rgx) . '</b>';

// SOME TEST CASES
$dat =
[ 'Hello World'
, '3+5=43'
, '<img src="data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjx0ZXh0IHg9IjUiIHk9IjE4IiBmb250LXNpemU9IjE4IiBmb250LWZhbWlseT0idmVyZGFuYSIgZm9udC13ZWlnaHQ9ImJvbGQiIGZpbGw9ImdyYXkiPlg8L3RleHQ+PC9zdmc+'
, '<img src="data:image/png;base64,FOOBAR" />'
, '<img src="data:image/PNG;base64,FOOBAR" />'
, '<img  src="data:image/png;base64,FOOBAR" />'
, '<img src="data:image/Png;base64,FOOBAR" />'
, '<img src="data:image/png;base64,FOOBAR+urst+wust+plud gort+ibbly=nonsense />'
, '<img src="data:image/png;base64,FOOBAR" alt="foobar" />'
, '<img alt="foobar" src="data:image/png;base64,FOOBAR" />'
]
;

// RUN SOME TESTS
foreach ($dat as $str)
{
    echo PHP_EOL;
    if (preg_match($rgx, $str, $mat))
    {
        echo PHP_EOL . '<b>REGEX MATCHES:</b>';
        echo PHP_EOL . htmlentities($str);
        print_rr($mat);
    }
    else
    {
        echo PHP_EOL . '<i>REGEX DOES NOT MATCH:</i>';
        echo PHP_EOL . htmlentities($str);
    }
    unset($mat);
}

function print_rr($arr)
{
    foreach ($arr as $key => $str)
    {
        echo PHP_EOL . '[' . htmlentities($key) . '] ';
        echo htmlentities($str);
    }
}
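
The test cases above can also carry an expected outcome, so that a change to the pattern is checked automatically. What follows is only a sketch: the tightened pattern (case-insensitive flag, optional attributes before src, a possessive quantifier, and a required closing quote) is one possible variation and not the asker's accepted solution, and the shortened svg sample is made up.

```php
<?php
// A TIGHTENED VARIATION OF THE PATTERN (HYPOTHETICAL, FOR ILLUSTRATION)
$rgx
= '#'                     // REGEX DELIMITER
. '<img\b[^>]*?'          // TAG NAME, THEN ANY OTHER ATTRIBUTES
. '\bsrc="data:image/'    // LITERAL STRING
. '(png|gif|jpe?g)'       // CAPTURE THE IMAGE TYPE
. ';base64,'              // LITERAL STRING
. '([a-zA-Z0-9+/=]++)'    // ONE OR MORE BASE64 CHARACTERS, NO BACKTRACKING
. '"'                     // CLOSING QUOTE IS REQUIRED
. '#i'                    // REGEX DELIMITER AND CASE-INSENSITIVE FLAG
;

// EACH TEST CASE IS A PAIR: INPUT STRING, EXPECTED TO MATCH?
$cases =
[ [ 'Hello World', false ]
, [ '<img src="data:image/svg+xml;base64,PHN2Zz4=" />', false ]
, [ '<img src="data:image/png;base64,FOOBAR" />', true ]
, [ '<img  src="data:image/png;base64,FOOBAR" />', true ]
, [ '<img src="data:image/Png;base64,FOOBAR" />', true ]
, [ '<img src="data:image/png;base64,FOOBAR+urst+wust+plud gort+ibbly=nonsense />', false ]
, [ '<img alt="foobar" src="data:image/png;base64,FOOBAR" />', true ]
]
;

// RUN THE TESTS AND COUNT SURPRISES
$failures = 0;
foreach ($cases as $case)
{
    list($str, $expected) = $case;
    $actual = (preg_match($rgx, $str, $mat) === 1);
    if ($actual !== $expected)
    {
        $failures++;
        echo 'UNEXPECTED RESULT FOR: ' . $str . PHP_EOL;
    }
}
echo $failures . ' unexpected result(s)' . PHP_EOL;
```
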


You may also want to see this highly informative cautionary statement about parsing XML or HTML with regular expressions! It's usually less brittle when the code employs a state engine instead of a regular expression.
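
As an illustration of that idea, here is a minimal sketch that lets PHP's DOM extension find the img tags and then splits the data URI with plain string functions, so no regular expression touches the large payload. The function name extract_data_images() and the sample markup are made up for this example, and it assumes the DOM extension is enabled.

```php
<?php
// Hypothetical helper: collect base64 image payloads from <img> tags using
// PHP's DOM extension instead of a regular expression.
function extract_data_images($html)
{
    $found = array();
    if ($html === '') {
        return $found; // nothing to parse
    }

    $previous = libxml_use_internal_errors(true); // keep sloppy-markup warnings quiet
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $prefix = 'data:image/';
    $marker = ';base64,';
    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        if (strncasecmp($src, $prefix, strlen($prefix)) !== 0) {
            continue; // not an inline image
        }
        $pos = strpos($src, $marker);
        if ($pos === false) {
            continue; // inline image, but not base64 encoded
        }
        $type    = strtolower(substr($src, strlen($prefix), $pos - strlen($prefix)));
        $payload = substr($src, $pos + strlen($marker));
        if (base64_decode($payload, true) === false) {
            continue; // payload has characters outside the base64 alphabet
        }
        $found[] = array('type' => $type, 'data' => $payload);
    }
    return $found;
}

$html   = '<p>Hi</p><img alt="foobar" src="data:image/PNG;base64,Rk9PQkFS" /><img src="/logo.gif">';
$images = extract_data_images($html);
foreach ($images as $image) {
    echo $image['type'] . ': ' . strlen($image['data']) . ' base64 characters' . PHP_EOL;
}
```

Because the parser handles attribute order, spacing, and quoting, the variations in the test cases above (extra spaces, alt before src) need no special handling here.
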