Asked by tel2
Perl regex to replace within boundaries
Hi Experts,
I'm using Perl and want to know if a single regex can be used to replace certain patterns if they appear between certain markers. For example, taking this input data:
1 a b c d e 2 b
d
I want to convert all lowercase "b"s and "d"s to upper case, but only those which appear between the "1" and "2" markers. So, this is the output I want:
1 a B c D e 2 b
d
I'd like to do this in a Perl one-liner, and here's what I've got so far:
perl -pe 's/(?<=1)(.+?)(b|d)(.+)(?=2)/$1.uc($2).$3/msge' infile
But that gives this output:
1 a B c d e 2 b
d
As you can see, the "b" between "1" & "2" has been capitalised, but the "d" has not.
Can this be done with just a regex, or am I going to need more (e.g. a while loop)?
Thanks.
tel2
Damn Terry. You are too freakin' quick for me ; )
It's close. I have to adjust it a bit. I'll explain once it's done : )
Actually, let me ask for a clarification: Can there be any b's or d's before the "1"? Perhaps my previous offering will work.
ASKER
Hi kaufmed,
They can be before and after, but they should not be matched/replaced unless they appear between the "1" & "2".
Thanks.
I can't check it right now but I think this will almost work:
perl -pe 's/(?<=1)(?:(.*?)(b|d)(.*?))+(?=2)/$1.uc($2).$3/msge' infile
The problem is that $x are not consistent any more. When I'm at a computer later, I'll see what revision I can come up with (probably using named matches).
This appears to work:
perl -pe 's/((?:[^1]*1|\G)[^2bd]*)([bd])/$1.uc($2)/ge' infile
The "[^1]*1" matches everything from the beginning of the string up to and including the first "1". Then we match zero or more characters that are not a "2", a "b", or a "d", followed by a "b" or a "d", which gets replaced. Unfortunately, at this point I'm not fully sure how to explain what "\G" is doing, other than to say the pattern doesn't appear to work if I remove it. I understand the intent of "\G", but not when used inline in this manner. Including it effectively tells the regex engine that if it cannot find the "[^1]*1", it should pick up where the last match ended and try to match the rest of the pattern from there. The process repeats until we hit the first "2".
ASKER
Thanks for that, wilcoxon.
You're right - it did almost work. Your code:
perl -pe 's/(?<=1)(?:(.*?)(b|d)(.*?))+(?=2)/$1.uc($2).$3/msge' infile
gave me this:
1 c D e 2 b
d
Just lost the "a B".
When you say "$x", you're talking about $1-$3, right? Why is $x not consistent anymore?
Do named matches work on Perl 5.10? That's what I'm using.
And thanks for your work, kaufmed.
When I run yours:
perl -pe 's/((?:[^1]*1|\G)[^2bd]*)([bd])/$1.uc($2)/ge' infile
I get this:
1 a B c D e 2 b
D
which is perfect except for the "D" at the end which should not have been replaced.
You've obviously put a lot of effort in, kaufmed, and at the risk of making you cry into your keyboard, would this be a bad time to say that the example I gave is not as complex as the real data will be? The real data will have words instead of single chars. Sorry I didn't make this clear originally; where I wrote "patterns" I should probably have written "strings". I'd totally understand if you give up at this point, and thanks for your efforts.
I suppose it would be better to say that with the regex I gave, $1-$3 are consistent but there is also the potential for $4-$6, $7-$9, etc since the group around the capture is repeated. I'll relook at the regex later tonight or tomorrow and see if I can fix it.
ASKER
OK - thanks wilcoxon.
ASKER CERTIFIED SOLUTION
(answer hidden behind site membership)
> which is perfect except for the "D" at the end which should not have been replaced.
OK, I misunderstood your data. Let's change the flags up a bit:
perl -0 -pe 's/((?:[^1]*1|\G)[^2bd]*)([bd])/$1.uc($2)/ge' infile
> at the risk of making you cry into your keyboard, would this be a bad time to say that the example I gave is not as complex as the real data will be, because the real data will have words, instead of single chars.
Believe me, I've been doing this long enough to know when someone has simplified a regex problem in the interest of explanation. I had a pretty good feeling that "b" and "d" were going to end up being words.
ASKER
Hi guys,
Before I opened this question, I had used the '-0' Perl switch to slurp the whole file (officially that's '-0777'; a bare '-0' splits on NUL characters, which amounts to the same thing for ordinary text), but I forgot to include this when I posted the question. It looks as if kaufmed has decided it is useful in this situation. Potentially the start and end markers could appear on separate lines, so slurping makes it easier to handle that.
Hi wilcoxon,
Have you tested your latest solution?
perl -pe 'if (/^(.*1)(.*?)(2.*)$/) { ($pre, $dat, $post) = ($1, $2, $3); $dat =~ s{(b|d)}{\u$1}g; $_ = "$pre$dat$post"; }' infile
gives me this output:
1 a B c D e 2 b d
which is very close, except the lines have been joined.
If I add the '-0' switch, and '/s' modifier:
perl -0 -pe 'if (/^(.*1)(.*?)(2.*)$/s) { ($pre, $dat, $post) = ($1, $2, $3); $dat =~ s{(b|d)}{\u$1}g; $_ = "$pre$dat$post"; }' infile
I get this:
1 a B c D e 2 b
d
which is correct!
Now if I change the input to this:
two
four Start four one two three two four five End two
four
and change your script accordingly:
perl -0 -pe 'if (/^(.*Start)(.*?)(End.*)$/s) { ($pre, $dat, $post) = ($1, $2, $3); $dat =~ s{(two|four)}{\u$1}g; $_ = "$pre$dat$post"; }' infile
I get this:
two
four Start Four one Two three Two Four five End two
four
Which is also correct, thanks!
Hi kaufmed,
Good work so far - your script seems to work with the sample data I've supplied.
I'm glad to hear you had a good idea that "b" and "d" were going to end up being words. So now the big question is, how can your script be changed to handle words, eg:
two
four Start four one two three two four five End two
four
In the above case, "two" and "four" should be capitalised, but only those that appear between "Start" and "End", so the output should be:
two
four Start Four one Two three Two Four five End two
four
Your use of character classes (e.g. [^2bd]) makes this look tricky to me.
Anyway, we seem to have a solution (i.e. wilcoxon's with a couple of minor tweaks), so I don't need your solution to work anymore, but feel free if you're keen to show how it could work.
Thanks.
SOLUTION
(answer hidden behind site membership)
ASKER
Thanks kaufmed,
It seems to me that it would be quite complex for me to change something like that to handle input data like this:
two
four Start four one two three two four five End two
four
where "Start" & "End" are the markers, and "two" and "four" are the words to capitalise.
Agreed?
I think the overall complexity of the pattern warrants breaking the logic up into something more readable : )
ASKER
Me too. I think I'll go with the solution wilcoxon supplied + my minor adjustments.
But I appreciate all your efforts, kaufmed.
SOLUTION
(answer hidden behind site membership)
ASKER
Thank you both.
Update: his solution appears to have disappeared...