Asked by tel2 (New Zealand)

Perl regex to replace within boundaries

Hi Experts,

I'm using Perl and want to know if a single regex can be used to replace certain patterns if they appear between certain markers.  For example, taking this input data:

    1 a b c d e 2 b
    d

I want to convert all lowercase "b"s and "d"s to upper case, but only those that appear between the "1" and "2" markers.  So I want this output:

    1 a B c D e 2 b
    d

I'd like to do this in a Perl one-liner, and here's what I've got so far:

    perl -pe 's/(?<=1)(.+?)(b|d)(.+)(?=2)/$1.uc($2).$3/msge' infile

But that gives this output:

    1 a B c d e 2 b
    d

As you can see, the "b" between "1" & "2" has been capitalised, but the "d" has not.

Can this be done with just a regex, or am I going to need more (e.g. a while loop)?

Thanks.
tel2
Avatar of Terry Woods
Terry Woods
Flag of New Zealand image

@kaufmed, would you please explain how that works? Thanks!

Update: his solution appears to have disappeared...
Damn Terry. You are too freakin' quick for me  ; )

It's close. I have to adjust it a bit. I'll explain once it's done  : )
Actually, let me ask for a clarification:  Can there be any b's or d's before the "1"? Perhaps my previous offering will work.
Avatar of tel2

ASKER

Hi kaufmed,

They can be before and after, but they should not be matched/replaced unless they appear between the "1" & "2".

Thanks.
I can't check it right now but I think this will almost work:
perl -pe 's/(?<=1)(?:(.*?)(b|d)(.*?))+(?=2)/$1.uc($2).$3/msge' infile

The problem is that $x are not consistent any more.  When I'm at a computer later, I'll see what revision I can come up with (probably using named matches).
This appears to work:

perl -pe 's/((?:[^1]*1|\G)[^2bd]*)([bd])/$1.uc($2)/ge' infile

The "[^1]*1" should match everything from the beginning of the string until the 1, and then including the 1. Then we match zero or more of any characters that are not a "2", a "b", or a "d". Then we match a "b" or a "d". That gets replaced. Unfortunately, at this point I'm not fully sure how to explain what "\G" is doing other than to say the pattern doesn't appear to work if I remove it. I understand the intent of "\G," but not when used inline in this manner. Including it effectively tells the regex engine that if we cannot find the "[^1]*1", then we should expect to pick up at the last match, and attempt to match the rest of the pattern. The process basically repeats until we hit the first "2".
Avatar of tel2

ASKER

Thanks for that, wilcoxon.
You're right - it did almost work.  Your code:
    perl -pe 's/(?<=1)(?:(.*?)(b|d)(.*?))+(?=2)/$1.uc($2).$3/msge' infile
gave me this:
    1 c D e 2 b
    d
Just lost the "a B".
When you say "$x", you're talking about $1-$3, right?  Why is $x not consistent anymore?
Do named matches work on Perl 5.10?  That's what I'm using.

And thanks for your work, kaufmed.
When I run yours:
    perl -pe 's/((?:[^1]*1|\G)[^2bd]*)([bd])/$1.uc($2)/ge' infile
I get this:
    1 a B c D e 2 b
    D
which is perfect except for the "D" at the end which should not have been replaced.  You've obviously put a lot of effort in, kaufmed, and at the risk of making you cry into your keyboard, would this be a bad time to say that the example I gave is not as complex as the real data will be, because the real data will have words, instead of single chars?  Sorry I didn't make this clear originally when I wrote "patterns", which should probably have been "strings".  I'd totally understand if you give up at this point, and thanks for your efforts.
I suppose it would be better to say that with the regex I gave, $1-$3 still exist, but since the group around the captures is repeated, they only hold the values from the last repetition, so the earlier repetitions (your "a B") are overwritten and lost.  I'll relook at the regex later tonight or tomorrow and see if I can fix it.
Avatar of tel2

ASKER

OK - thanks wilcoxon.
ASKER CERTIFIED SOLUTION
Avatar of wilcoxon
wilcoxon
Flag of United States of America image

which is perfect except for the "D" at the end which should not have been replaced.
OK, I misunderstood your data. Let's change the flags up a bit:

perl -0 -pe 's/((?:[^1]*1|\G)[^2bd]*)([bd])/$1.uc($2)/ge' infile

at the risk of making you cry into your keyboard, would this be a bad time to say that the example I gave is not as complex as the real data will be, because the real data will have words, instead of single chars.
Believe me, I've been doing this long enough to know when someone has simplified a regex problem in the interest of explanation. I had a pretty good feeling that "b" and "d" were going to end up being words.
Avatar of tel2

ASKER

Hi guys,
Before I opened this question, I had used the '-0' (or officially '-0777') Perl switch to slurp the whole file, but I forgot to include this when I posted the question.  Looks as if kaufmed has decided it is useful to have it in this situation.  Potentially the start and end markers could appear on separate lines, so '-0' makes it easier to handle that situation.


Hi wilcoxon,
Have you tested your latest solution?
    perl -pe 'if (/^(.*1)(.*?)(2.*)$/) { ($pre, $dat, $post) = ($1, $2, $3); $dat =~ s{(b|d)}{\u$1}g; $_ = "$pre$dat$post"; }' infile
gives me this output:
    1 a B c D e 2 b  d
which is very close, except the lines have been joined.
If I add the '-0' switch, and '/s' modifier:
    perl -0 -pe 'if (/^(.*1)(.*?)(2.*)$/s) { ($pre, $dat, $post) = ($1, $2, $3); $dat =~ s{(b|d)}{\u$1}g; $_ = "$pre$dat$post"; }' infile
I get this:
    1 a B c D e 2 b
    d
which is correct!
Now if I change the input to this:
    two
    four Start four one two three two four five End two
    four
and change your script accordingly:
    perl -0 -pe 'if (/^(.*Start)(.*?)(End.*)$/s) { ($pre, $dat, $post) = ($1, $2, $3); $dat =~ s{(two|four)}{\u$1}g; $_ = "$pre$dat$post"; }' infile
I get this:
    two
    four Start Four one Two three Two Four five End two
    four
Which is also correct, thanks!


Hi kaufmed,
Good work so far - your script seems to work with the sample data I've supplied.
I'm glad to hear you had a good idea that "b" and "d" were going to end up being words.  So now the big question is, how can your script be changed to handle words, eg:
    two
    four Start four one two three two four five End two
    four
In the above case, "two" and "four" should be capitalised, but only those that appear between "Start" and "End", so the output should be:
    two
    four Start Four one Two three Two Four five End two
    four
Your use of character classes (e.g. [^2bd]) makes this look tricky to me.
Anyway, we seem to have a solution (i.e. wilcoxon's with a couple of minor tweaks), so I don't need your solution to work anymore, but feel free if you're keen to show how it could work.

Thanks.
SOLUTION
Avatar of tel2

ASKER

Thanks kaufmed,
It seems to me that it would be quite complex for me to change something like that to handle input data like this:
    two
    four Start four one two three two four five End two
    four
where "Start" & "End" are the markers, and "two" and "four" are the words to capitalise.
Agreed?
I think the overall complexity of the pattern warrants breaking the logic up into something more readable  : )
Avatar of tel2

ASKER

Me too.  I think I'll go with the solution wilcoxon supplied + my minor adjustments.
But I appreciate all your efforts, kaufmed.
SOLUTION
Avatar of tel2

ASKER

Thank you both.