ext2

regex: $x =~ /\G/gc and pos ($x)

==== example: =====
use strict;
use Data::Dumper;

my $x = 'asdf';

print Dumper(pos $x);
print "OK\n" if $x =~ /\G/gc;
print Dumper(pos $x);
print "OK\n" if $x =~ /\G/gc;
print Dumper(pos $x);
pos $x = pos $x;
print "OK\n" if $x =~ /\G/gc;


==== actual output:=====
$VAR1 = undef;
OK
$VAR1 = '0';
$VAR1 = '0';
OK

===== output I expect: =====
$VAR1 = undef;
OK
$VAR1 = '0';
OK                   <---- NOTE
$VAR1 = '0';
OK
===== end =====

What is going on?  And why does "pos $x = pos $x;" seem to have a side effect?
holli

Because pos $x = pos $x is an assignment, not a comparison. What you need is pos $x == pos $x.
What are you trying to match?
\G is used AFTER you've matched something (usually in a loop) and want to continue matching from where that match ended. In your snippet, none of your regexps match anything in $x.
ext2

ASKER

The code is correct.  It is not intended to be useful, but it is a test case for regex behavior.

After rereading the sections on pos, \G, and /../gc in Mastering Regular Expressions, 2nd ed., Jeffrey Friedl (pp. 313, 129), what I think is occurring is a somewhat obscure "forced bump-ahead" behavior of /../g that Perl invokes to prevent an infinite loop (p. 129).  For example,

  my $x = 'abcde';
   $x =~ s/x?/!/g;
   print $x;

does in fact complete.  It prints "!a!b!c!d!e!".

On the other hand,

   my $x = 'abcde';
   $x =~ s/\Gx?/!/g;
   print $x;

prints "!abcde".  This is because \G matches only at the end of the previous match rather than the beginning of the next match.  When the forced bump-ahead occurs (as it does here), these two locations are not equivalent, so the match on \G fails for all but the first iteration.
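
The same difference can be made visible without s///g by recording pos() after each successful match in a while loop. This is only a sketch on a shorter string; the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Unanchored: after each zero-length match the engine is forced one
# character ahead, so the empty pattern matches once per offset.
my $x = 'abc';
my @free;
push @free, pos($x) while $x =~ /x?/g;

# Anchored with \G: the forced bump-ahead moves away from the end of the
# previous match, so \G can no longer match and the loop stops.
my $y = 'abc';
my @anchored;
push @anchored, pos($y) while $y =~ /\Gx?/g;

print "free:     @free\n";       # free:     0 1 2 3
print "anchored: @anchored\n";   # anchored: 0
```

The four offsets in the first list correspond to the four "!" that s/x?/!/g would insert into 'abc'.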

The question I then have is what exactly does pos($x) mean under such circumstances?  Consider:

  my $x = 'abcde';
  pos($x) = 0; # just to be sure (not needed)
  $x =~ s/(?{ print "A" . pos($x) })\G(?{ print "B" . pos($x) })x?/!/g;
  print $x;

prints "A0B0A0B0A1A2A3A4A5!abcde".

It's interesting that the first and second iterations of the loop give exactly the same results "A0B0".  Therefore, the behavior of the regex must be depending on *some internal state* other than pos($x).

If you don't think it's valid to put something before the "\G", then try this:

  my $x = 'abcde';
  $x =~ s/\G(?{ print "B" . pos($x) })x?/!/g;
  print $x;

which prints "B0B0!abcde".

How about this:

  my $x = 'abcde';
  while($x =~ /\G(?{ print "B" . pos($x) })x?/g) {
     # pos($x) = pos($x);
     print '*';
  }
  print $x;

This prints "B0*B0abcde".  However, remove that comment character, and the loop becomes infinite.  Therefore, pos($x) = pos($x) actually does have some effect: it resets some internal variable.
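
A bounded version of that loop, as a sketch, shows the effect without hanging the program. The counter and the limit of 5 are additions for illustration only; they are not part of the original test.

```perl
use strict;
use warnings;

my $x = 'abcde';
my $iterations = 0;

while ($x =~ /\Gx?/g) {
    pos($x) = pos($x);            # same offset, but the internal state is reset
    last if ++$iterations >= 5;   # guard: without it the loop never ends
}

print "iterations: $iterations\n";   # iterations: 5
```

Every pass matches the empty string at offset 0 again, so only the guard stops the loop.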
ext2

ASKER

If anyone really wants to know why this issue came up: I'm writing a simple lexer, and it happened that a test case failed when I added the code

  $x =~ /\G/gc;

to the very beginning of the program.  I did this because I was concerned that pos($x) was undefined rather than zero.  This code caused the expected side effect of setting pos($x) equal to zero.  However, it also suddenly caused the test cases to fall into an infinite loop.  As a simplified example,

  my $x = '';

  # $x =~ /\G/gc;
  while(not $x =~ /\G\z/gc) {
    print "pos=", pos($x), "\n";
    if($x =~ /\G([a-z])/gc) { print "$1"; }
    else {
      $x =~ /\G[^a-z]+/gc;
    }
  }

becomes an infinite loop printing "pos=0" if the comment character is removed.

The obvious solution is to "not do that" and instead use the much more obvious "pos(x) = 0", which does not have the unintended side-effect.  However, I'm interested in what's really going on here.
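
For reference, a minimal sketch of the same loop with the explicit assignment. The sub name letters and the returned token list are hypothetical, added only so the behavior is easy to check; the three patterns are the ones from the loop above.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lowercase letters of $x, skipping
# everything else, using the \G ... /gc idiom.
sub letters {
    my ($x) = @_;
    my @tokens;

    pos($x) = 0;   # explicit start; unlike a priming /\G/gc match,
                   # this leaves no zero-length-match state behind

    until ($x =~ /\G\z/gc) {
        if ($x =~ /\G([a-z])/gc) {
            push @tokens, $1;
        }
        else {
            $x =~ /\G[^a-z]+/gc;   # skip a run of non-letters
        }
    }
    return @tokens;
}

print join(',', letters('ab-c')), "\n";   # a,b,c
print scalar(letters('')), "\n";          # 0
```

The empty string is the interesting case: /\G\z/gc has to succeed with a zero-length match at offset 0 on the very first test, which it can only do if nothing has matched zero-length there before.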
ext2

ASKER

correction: pos($x) not pos(x)

roee_f

The truth is that on every change to the string (creation, assignment to it, and so on), pos resets to undef, which means the beginning of the string. So even assigning 0 to pos at the beginning is redundant.

About your question:
pos itself remains unchanged; only \G gets bumped along by one character. Remember that pos is assigned only on a successful match, so there is no reason to update it on failure.
Since \G is part of the regexp engine, it has to be bumped in order to avoid infinite loops, but pos is not part of the engine, so updating it would be useless.
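
The reset rules for pos can be checked directly. A small sketch; the variable names are made up for illustration, and it also covers a failed match with and without /c, since that is the other common way pos gets reset:

```perl
use strict;
use warnings;

my $x = 'abcde';

$x =~ /ab/g;
my $after_match = pos($x);       # 2: end of the successful match

$x =~ /zzz/gc;                   # failed match with /c keeps pos
my $after_failed_c = pos($x);    # still 2

$x =~ /zzz/g;                    # failed match without /c resets pos
my $after_failed = pos($x);      # undef

$x =~ /ab/g;                     # pos is 2 again
$x = 'abcde';                    # assigning to the string resets pos
my $after_assign = pos($x);      # undef

for my $value ($after_match, $after_failed_c, $after_failed, $after_assign) {
    print defined $value ? $value : 'undef', "\n";
}
```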
ext2

ASKER

roee_f,

I reread the sections in Mastering Regular Expressions again ;)

One thing is that the text doesn't seem to ever suggest that \G bumps along.  Rather, it mentions that \G is the location of the end of the previous match (regardless of bumping), and pos is the thing that can bump along.  So, I wrote this test:

  my $x = "abcde";
  $x =~ /a/gc;
  print pos($x);
  print "A" if $x =~ /\G(?{print 'X'})/gc; print pos($x);
  print "B" if $x =~ /\G(?{print 'Y'})/gc; print pos($x);
  print "C" if $x =~ /\G(?{print 'Z'})/gc; print pos($x);
  $x =~ /\G(.)/gc and print $1;

This prints "1XA1Y1Z1b".  This implies that in lines 4, 5, and 6, both \G and pos remain at string index 1 and do not bump along.  So, something else must be causing the behavior of line 5 and 6.

Reaching into the forgotten regex debugger:

  perl -Mre=debug testre.pl

Produces

===
...
Guessing start of match, REx `a' against `abcde'...
Found anchored substr `a' at offset 0...
Guessed: match at offset 0
Matching REx `\G(?{print 'X'})' against `bcde'
  Setting an EVAL scope, savestack=17
   1 <a> <bcde>           |  1:  GPOS
   1 <a> <bcde>           |  2:  EVAL
  re_eval 0x10146480
   1 <a> <bcde>           |  4:  END
Match successful!
Matching REx `\G(?{print 'Y'})' against `bcde'
  Setting an EVAL scope, savestack=17
   1 <a> <bcde>           |  1:  GPOS
   1 <a> <bcde>           |  2:  EVAL
  re_eval 0x10146540
   1 <a> <bcde>           |  4:  END
Match possible, but length=0 is smaller than requested=1, failing!
  Clearing an EVAL scope, savestack=17..20
Match failed
Matching REx `\G(?{print 'Z'})' against `bcde'
  Setting an EVAL scope, savestack=17
   1 <a> <bcde>           |  1:  GPOS
   1 <a> <bcde>           |  2:  EVAL
  re_eval 0x10146600
   1 <a> <bcde>           |  4:  END
Match possible, but length=0 is smaller than requested=1, failing!
  Clearing an EVAL scope, savestack=17..20
Match failed
Matching REx `\G(.)' against `bcde'
  Setting an EVAL scope, savestack=5
   1 <a> <bcde>           |  1:  GPOS
   1 <a> <bcde>           |  2:  OPEN1
   1 <a> <bcde>           |  4:  REG_ANY
   2 <ab> <cde>           |  5:  CLOSE1
   2 <ab> <cde>           |  7:  END
Match successful!
...
===

So, what is happening is that once a regex matches a string of length zero, the next match at that position is required to be of length at least one.  *Right after the next match completes*, the length of the match is checked.  No bump along occurs here.  If the length is zero again, the return value of this match is overridden to be false.  Secondly, doing "pos($x) = pos($x)" seems to be an obscure way to reset this condition.
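
The whole sequence fits in a few lines. This is a sketch with made-up variable names. As far as I can tell it agrees with the perlre section "Repeated Patterns Matching a Zero-length Substring" and with the perlfunc entry for pos, which says that setting pos also resets the matched-with-zero-length flag.

```perl
use strict;
use warnings;

my $x = 'abcde';
$x =~ /a/gc;                                   # pos($x) is now 1

my $first  = ($x =~ /\G/gc)    ? 1 : 0;        # 1: zero-length match at offset 1
my $second = ($x =~ /\G/gc)    ? 1 : 0;        # 0: a second zero-length match there is refused

pos($x) = pos($x);                             # same offset, flag cleared
my $third  = ($x =~ /\G/gc)    ? 1 : 0;        # 1: zero-length match allowed again

my $char   = ($x =~ /\G(.)/gc) ? $1 : undef;   # 'b': a match of length one is always fine

print "$first $second $third $char\n";         # 1 0 1 b
```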

ext2

ASKER

probably a refund--answered my own question.
ASKER CERTIFIED SOLUTION
modulo
