Perl RegExp to convert linebreaks to HTML P tags

Here's a doozie for the Perl experts. Please read carefully, because there are some nuances.

I would like to go through a multiline field and replace newlines with <p> tags, so that each line ends up enclosed in its own pair of <p> tags.

Here's a starting regular expression that does the basics:
$field =~ s/
    (.+?)            # the text of one line
    \s*              # trailing whitespace
    (?:\r|\n|$)+     # one or more line breaks, or the end of the string
/<p>$1<\/p>\n/xgis;

However, this fails in the following circumstances:
- if a line already contains <p> tags, we shouldn't wrap it again and end up with nested paragraphs.
- if an existing <p> element contains linebreaks, we should just remove the linebreaks rather than add <p> tags around each line.
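To see the first failure concretely, here is a minimal sketch that runs the starting expression on a two-line string. The sub name wrap_lines_naive and the sample string are made up for illustration; the only change to the regex is $1 in place of \1 in the replacement.

```perl
use strict;
use warnings;

# Hypothetical helper: applies the starting regex from the question.
sub wrap_lines_naive {
    my ($field) = @_;
    $field =~ s/
        (.+?)            # the text of one line
        \s*              # trailing whitespace
        (?:\r|\n|$)+     # one or more line breaks, or the end of the string
    /<p>$1<\/p>\n/xgs;
    return $field;
}

# The first line is already a paragraph; the second is plain text.
print wrap_lines_naive("<p>Already wrapped</p>\nPlain line\n");
```

The first line, which was already a paragraph, ends up wrapped a second time (<p><p>Already wrapped</p></p>), while the plain line is wrapped once as intended.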

Below is some sample text and the correctly formatted text:


Well, for one I am a <b>very</b> diligent person who constantly pursues justice wherever it may be found.
I'm also good at linebreaks, as you can see here.

Finally, this is also on its own line, but shouldn't be be any different from the previous line. We shall see how it goes.
<p>This is its own paragraph.</p> We need to be careful here so we don't create nested paragraphs.
Here's another example. <p>Be sure to avoid the nested paragraphs here.</p>
Here's yet another example. <p>Again, be sure to avoid the nested paragraphs here.</p>
<p>This paragraph
for some reason has whitespace
that we don't need.</p>

<p>Standalone Paragraph</p>

================== converted to:

<p>Well, for one I am a <b>very</b> diligent person who constantly pursues justice wherever it may be found.
I'm also good at linebreaks, as you can see here.</p>
<p>Finally, this is also on its own line, but shouldn't be be any different from the previous line. We shall see how it goes.</p>
<p>This is its own paragraph.</p>
<p>We need to be careful here so we don't create nested paragraphs.</p>
<p>Here's another example.</p>
<p>Be sure to avoid the nested paragraphs here.</p>
<p>Here's yet another example.</p>
<p>Again, be sure to avoid the nested paragraphs here.</p>
<p>This paragraph for some reason has whitespace that we don't need.</p>
<p>Standalone Paragraph</p>

Asked by tomaugerdotcom

Superdave commented:
$field =~ s#(?:<p>(.+?)</p>(?:\r|\n|$)*)|(?:
                  (.+?)
                  \s*
                  (?:\r|\n|$|(?=<p>))+
            )
            #<p>\1\2<\/p>\n#xgis;

That does most of it. It leaves the blank lines at lines 11 and 13 in your test, which you could remove with another regular expression; I don't think it would be possible to do that with one regular expression. And thanks for a good start. It would have been hard for me to do that from scratch.
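As a rough check of the behaviour described here, the sketch below wraps the same substitution in a sub (wrap_unless_wrapped is a made-up name) and runs it on a short string. Two liberties are taken, and both are assumptions rather than part of the answer: brace delimiters, so that comments can sit inside the pattern, and an /e replacement that picks whichever capture group is defined, which avoids "uninitialized value" warnings under use warnings.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the two-branch substitution.
sub wrap_unless_wrapped {
    my ($field) = @_;
    $field =~ s{
        (?: <p> (.+?) </p> (?:\r|\n|$)* )             # an existing paragraph, kept as one unit
        |
        (?: (.+?) \s* (?: \r | \n | $ | (?=<p>) )+ )  # loose text up to a break or the next <p>
    }{
        '<p>' . (defined $1 ? $1 : $2) . "</p>\n"
    }xgsie;
    return $field;
}

# An existing two-line paragraph followed by a plain line.
print wrap_unless_wrapped("<p>Two\nlines</p>\nPlain line\n");
```

The existing paragraph is left as a single <p> block rather than being nested, but the line break inside it survives, so a cleanup pass is still needed.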

tomaugerdotcom (author) commented:
Superdave, you're a fricken genius. I was barking up the wrong tree trying to figure out negative look-ahead assertions that were, well, asserting diddly-squat.

Appreciate the help. Stay tuned - I have a follow-up question.

tomaugerdotcom (author) commented:
For the sake of posterity, I've added the extra newline stripping and commented the regular expression.
$field =~ s/
    (?:                     # EITHER...
        <p>
        (.+?)               # anything inside an existing pair of <p> tags
        <\/p>               # (the tags themselves are matched but not captured)
        (?:\r|\n|$)*        # plus any line breaks that follow
    )
    |                       # OR...
    (?:
        (.+?)               # any other text
        \s*                 # (eating trailing whitespace)
        (?:
            \r|\n           # up to the next line break
            |$              # or the end of the string
            |(?=<p>)        # or the start of the next <p> tag
        )+
    )
/<p>$1$2<\/p>\n/xgis;       # then wrap whichever group matched in <p> tags

# Second pass: turn any remaining line breaks (those not directly after
# a closing </p>) into spaces.
$field =~ s/
    (?<!<\/p>)
    (?:\r|\n)+
/ /xgis;
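The second substitution relies on a negative lookbehind, which is easier to see in isolation. Here is a minimal sketch, assuming the input has already been through the first substitution; join_inner_breaks is a hypothetical name and the sample string is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: replaces line breaks with a space unless they
# come directly after a closing </p>.
sub join_inner_breaks {
    my ($field) = @_;
    $field =~ s{
        (?<! </p> )     # not directly after a closing paragraph tag
        (?:\r|\n)+      # a run of line breaks
    }{ }xgi;
    return $field;
}

# The break inside the first paragraph is joined; the breaks after
# each closing tag are kept.
print join_inner_breaks("<p>Two\nlines</p>\n<p>Plain line</p>\n");
```

One thing to watch: a line break that follows another line break, for example the second one in a blank line after </p>, is not protected by the lookbehind and becomes a space as well.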
