Avatar of huji

ASKER

Regular expression question

Hello,
I tried my best but couldn't find how to achieve this. I've got some HTML files, which contain such pieces of text:

<a href="lob lob">lob lob lob lob</a><br>
lob lob lob lob lob lob lob lob lob lob lob lob lob lob<br>
lob lob lob lob lob lob lob lob<br>
lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob<br>
<br>

I want to convert that piece of code above to such a (better formatted) way:

<p><a href="lob lob">lob lob lob lob</a></p>
<p>lob lob lob lob lob lob lob lob lob lob lob lob lob lob
lob lob lob lob lob lob lob lob
lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob</p>

However, I can't figure out a regexp that handles the newlines correctly. I'm trying to use the find and replace feature of Dreamweaver (which accepts regular expressions), but it doesn't seem to work with \n for new lines.
I don't insist on doing it in Dreamweaver. My second choice is to let PHP or ASP open these files, make the conversions, and save them.

Any help is highly appreciated.
Huji
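
A note on the newline problem: one common cause (an assumption here, since the files themselves are not shown) is that the HTML was saved with Windows line endings, so every line break is \r\n rather than a bare \n. In PHP's PCRE, \R matches either form. A minimal sketch with a made-up sample string:

```php
<?php
// Made-up sample input with Windows (\r\n) line endings.
$str = "one<br>\r\ntwo<br>\r\n";

// "<br>\n" never matches here, because a \r sits between <br> and \n.
$strict = preg_replace('#<br>\n#', "\n", $str);

// \R matches \r\n, \n, or \r, so this works for any line-ending style.
$loose = preg_replace('#<br>\R#', "\n", $str);

var_dump($strict === $str);          // nothing was replaced
var_dump($loose === "one\ntwo\n");   // every <br> plus line break was replaced
```

Whether Dreamweaver's own search engine accepts \R is not something this sketch shows; there, spelling the break out as \r\n or \r?\n may be the safer choice.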
SOLUTION
Avatar of Roonaan
Roonaan
Hi huji, your example doesn't make much sense. For example, what should happen to all the <br> tags?

However, based on that example only, I have done my best:

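<?php
// Sample input from the question.
$str = <<<XXX
<a href="lob lob">lob lob lob lob</a><br>
lob lob lob lob lob lob lob lob lob lob lob lob lob lob<br>
lob lob lob lob lob lob lob lob<br>
lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob<br>
<br>
XXX;

// Wrap the whole text in a single <p>...</p> pair.
$new = preg_replace("#^(.*)$#is","<p>$1</p>",$str);
// Close the paragraph after the link and open a new one.
// The capture group is needed so that $1 puts the </a> back.
$new = preg_replace("#(</a>)#is","$1</p>\n<p>",$new);
// Drop the <br> left directly after the new <p>.
$new = preg_replace("#<p><br>#is","<p>",$new);

echo $new;
?>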


---
Harish
Avatar of huji

ASKER

Well, I'm sorry my example didn't make much sense. I meant to show that I have several paragraphs of text that are not wrapped in <p>...</p> pairs; instead they are lines of text ending in <br>, which is not what I want!

I'll be testing your suggestions right away.
Huji
Well, in that case you may try this:

$new = preg_replace("#^(.*)$#is","<p>$1</p>",$str);
$new = preg_replace("#\n?<br>\s*#is","</p>\n<p>",$new);
$new = preg_replace("#<p>\s*</p>#is","",$new);

instead of the previous 3 preg_replace statements.
Avatar of huji

ASKER

Here is a sample text again:

<a href="lob lob">lob lob lob lob</a><br>
lob lob lob lob lob lob lob lob lob lob lob lob lob lob<br>
lob lob lob lob lob lob lob lob<br>
lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob<br>
<br>

Here is my modification of Roonaan's solution:

</a>.*<br>\r\n(.*<br>\r\n)*<br>

The above successfully selects the whole paragraph (from the </a><br> before it to the <br> after it). Now I need a way (using backreferences) to:

- remove all <br>s from the (.*<br>\r\n)* part.
- add <p> before (.*<br>\r\n)* and </p> after it.

Please advise
You could try extending the preg_replace calls to use the /ism modifiers instead of /i only.

-r-
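
To make the modifier suggestion concrete, here is a small sketch of what /s and /m each change for the "^(.*)$" pattern used in this thread. The two-line sample string is made up for illustration.

```php
<?php
$str = "first\nsecond";

// No modifiers: "." stops at a newline and "^"/"$" anchor only the whole
// string, so "^(.*)$" cannot match a two-line string at all.
$plain = preg_replace('#^(.*)$#', '<p>$1</p>', $str);

// /s: "." also matches newlines, so the whole string becomes one paragraph.
$dotall = preg_replace('#^(.*)$#s', '<p>$1</p>', $str);

// /m: "^" and "$" also match at line boundaries, so each line is wrapped.
$multi = preg_replace('#^(.*)$#m', '<p>$1</p>', $str);

echo $plain, "\n---\n", $dotall, "\n---\n", $multi, "\n";
```

With /s and /m combined, the greedy ".*" still swallows the whole string in one match, so the result is the same as with /s alone.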
Avatar of huji

ASKER

mgh_mgharish, I would prefer a solution that needs only one regexp replace. I'm not sure it is possible, though, since I need a backreference to a pattern that is repeated an unknown number of times.
Huji, my last set of expressions do exactly that.
Is it a constraint to use only one ??
Avatar of huji

ASKER

>> Is it a constraint to use only one ??

It's just that I would still prefer to do the replace in the Dreamweaver environment, and there, running multiple replaces could be a bit of a pain. It is not really a "constraint", just a matter of convenience.

And I agree with you that your three command solution does it perfectly.

Roonaan,
In Dreamweaver I don't need to add /ism. I don't want to use /s, and /im is automatically active in that environment.

Thanks
huji
Avatar of huji

ASKER

Excuse me mgh_mgharish, but your solution has a little problem. Here is its output:

<p><a href="lob lob">lob lob lob lob</a></p>
<p>lob lob lob lob lob lob lob lob lob lob lob lob lob lob</p>
<p>lob lob lob lob lob lob lob lob</p>
<p>lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob</p>


Here is what I want:

<p><a href="lob lob">lob lob lob lob</a></p>
<p>lob lob lob lob lob lob lob lob lob lob lob lob lob lob
lob lob lob lob lob lob lob lob
lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob</p>

Any modifications?
ASKER CERTIFIED SOLUTION
Avatar of huji

ASKER

Working code:

<?php
$str = "<a href=\"lob lob\">lob lob lob lob</a><br>\n";
$str .= "lob lob lob lob lob lob lob lob lob lob lob lob lob lob<br>\n";
$str .= "lob lob lob lob lob lob lob lob<br>\n";
$str .= "lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob<br>\n<br>";
echo $str;
echo "<hr>\n";

$new = preg_replace("#^(.*)$#is","<p>$1</p>",$str);
$new = preg_replace("#\n?<br>\s*#is","</p>\n<p>",$new);
$new = preg_replace("#<p>\s*</p>#is","",$new);
echo $new;
echo "<hr>\n";

$new_r = preg_replace("#</a><br>\n(.*<br>\n)*<br>#is","</a></p>\r\n<p>$1</p>",$str);
echo $new_r;
?>

$new has too many <p>...</p> pairs, as I stated above.
$new_r doesn't have excess <p>...</p> pairs, but the <br>s inside $1 still need to be removed somehow. (I don't know how to replace something inside a backreference. I tried this as well:

$new_r = preg_replace("#</a><br>\n(.*<br>\n)*<br>#is","</a></p>\r\n<p>".str_replace("<br>","","$1")."</p>",$str);


but it didn't work.)
Avatar of huji

ASKER

mgh_mgharish,
Your last code did it correctly! Thank you.
My last question: Isn't there a way to replace something inside a backreference?
> Isn't there a way to replace something inside a backreference?
Not that I know of..
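
For what it's worth, PHP does have a way to post-process a captured group: preg_replace_callback hands each match to a function, which can run str_replace on one group before building the replacement. The str_replace attempt above fails because PHP evaluates it before preg_replace runs, so it only ever sees the literal text "$1". The following is a minimal sketch, assuming the layout from the question (a link line, then text lines ending in <br>, then a lone <br>); the pattern and the shortened sample text are illustrative, not something posted in this thread.

```php
<?php
$str = "<a href=\"lob lob\">lob lob lob lob</a><br>\n"
     . "lob lob lob<br>\n"
     . "lob lob<br>\n"
     . "lob lob lob lob<br>\n<br>";

$new = preg_replace_callback(
    // Group 1: the link line. Group 2: the following "text<br>" lines.
    '#^(.*?</a>)<br>\R((?:.*<br>\R)+)<br>#m',
    function ($m) {
        // Strip the <br>s from the captured block, then drop its last newline.
        $body = rtrim(str_replace('<br>', '', $m[2]));
        return '<p>' . $m[1] . "</p>\n<p>" . $body . '</p>';
    },
    $str
);

echo $new;
```

This only helps on the PHP side; a plain find-and-replace dialog has no equivalent of the callback.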
Avatar of huji

ASKER

Or at least, how can we match such a pattern:

<p>lob lob lob lob lob lob lob lob lob lob lob lob lob lob<br>
lob lob lob lob lob lob lob lob<br>
lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob<br>
</p>

(by matching <p>, <br>s and </p>), and convert it to:

<p>lob lob lob lob lob lob lob lob lob lob lob lob lob lob
lob lob lob lob lob lob lob lob
lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob
</p>

where the number of lines ending in <br> varies between one and ten.
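
One possible single-pattern approach for this case is a lookahead: remove a <br> only when a </p> follows before any new <p> opens. This is a sketch under the assumption that paragraphs are not nested and every <p> is closed; the sample text is shortened, and the trailing "after<br>" line is added here only to show that <br>s outside a paragraph are left alone.

```php
<?php
$html = "<p>lob lob lob<br>\nlob lob<br>\nlob lob lob lob<br>\n</p>\nafter<br>\n";

// Match <br> only if </p> can be reached without passing another "<p".
// The /s modifier lets "." cross line breaks inside the lookahead.
$clean = preg_replace('#<br>(?=(?:(?!<p\b).)*?</p>)#is', '', $html);

echo $clean;
```

Because the lookahead consumes nothing, the number of <br> lines inside the paragraph does not matter.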
Avatar of huji

ASKER

Of course one possible solution is to convert this:

<p>lob lob lob lob lob lob lob lob lob lob lob lob lob lob<br>
lob lob lob lob lob lob lob lob<br>
lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob<br>
</p>

to this:

<p>
lob lob lob lob lob lob lob lob lob lob lob lob lob lob<br>
lob lob lob lob lob lob lob lob<br>
lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob<br>
</p>

then preg_replace("#(.*)<br>#is","$1",.......)

;)
Avatar of huji

ASKER

I will close this question, with these two solutions:

<?php
$str = "<a href=\"lob lob\">lob lob lob lob</a><br>\n";
$str .= "lob lob lob lob lob lob lob lob lob lob lob lob lob lob<br>\n";
$str .= "lob lob lob lob lob lob lob lob<br>\n";
$str .= "lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob lob<br>\n<br>";
echo $str;
echo "<hr>\n";

$new = $str;
$new = preg_replace("#^(.*)$#ims","<p>$1</p>",$new);
$new = preg_replace("#><br>#is","></p>\n<p>",$new);
$new = preg_replace("#<p>\s*#is","<p>",$new);
$new = preg_replace("#<br>#is","",$new);
echo $new;
echo "<hr>\n";

// Note: the replacement must start with </a> (not <a>) to keep the link closed.
$new_r = preg_replace("#</a><br>\n(.*<br>\n)*<br>#is","</a></p>\n<p>\n"."$1"."</p>",$str);
$new_r = preg_replace("#(.*)?<br>#i","$1",$new_r);
echo $new_r;
?>

Unfortunately, neither of them offers a single-step method. However, I like them both!

Thanks for your contribution
Huji
What's your problem? Do you mean it should replace only the <br> tags that are inside the <p> tags?
Avatar of huji

ASKER

mgh_mgharish, I solved it. The last code I posted!
Thanks a lot again for your help.
Huji
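
For the second option mentioned in the question (letting PHP open the files, convert them, and save them), a loop along these lines might do. It is only a sketch: convert_html_files is a hypothetical helper name, the conversion itself is passed in as a function, and it overwrites files in place, so working on a backup copy would be wise.

```php
<?php
// Hypothetical helper: applies $convert to every *.html file in $dir and
// rewrites a file only when its content actually changes.
function convert_html_files($dir, callable $convert)
{
    $changed = 0;
    // glob() can return false on error; treat that as "no files".
    foreach (glob($dir . '/*.html') ?: [] as $path) {
        $old = file_get_contents($path);
        $new = $convert($old);
        if ($new !== $old) {
            file_put_contents($path, $new);
            $changed++;
        }
    }
    return $changed; // number of files rewritten
}
```

Any of the preg_replace sequences from this thread could be wrapped in a function and passed as the second argument, with the directory path as the first.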