PHP strip part of string if exists

peterbrowne asked:
I have built a blog in PHP, but have found that if a user copies text from a Word doc, the text pasted into the blog's editor includes Word code above the text. The Word code starts with <!-- and ends with -->. What I need to do is, if the Word code is present, strip it and keep the text that follows.

The code that needs to be removed looks like:

<!-- /* Font Definitions */ @font-face {font-family:"Cambria Math"; panose-1:2 4 5 3 5 4 6 3 2 4; mso-font-charset:0; mso-generic-font-family:roman; mso-font-pitch:variable; mso-font-signature:-1610611985 1107304683 0 0 159 0;} @font-face {font-family:Calibri; panose-1:2 15 5 2 2 2 4 3 2 4; mso-font-charset:0; mso-generic-font-family:swiss; mso-font-pitch:variable; mso-font-signature:-1610611985 1073750139 0 0 159 0;} /* Style Definitions */ p.MsoNormal, li.MsoNormal, div.MsoNormal {mso-style-unhide:no; mso-style-qformat:yes; mso-style-parent:""; margin:0cm; margin-bottom:.0001pt; mso-pagination:widow-orphan; font-size:12.0pt; font-family:"Calibri","sans-serif"; mso-fareast-font-family:"Times New Roman"; mso-bidi-font-family:"Times New Roman";} .MsoChpDefault {mso-style-type:export-only; mso-default-props:yes; mso-ascii-font-family:Calibri; mso-ascii-theme-font:minor-latin; mso-hansi-font-family:Calibri; mso-hansi-theme-font:minor-latin; mso-bidi-font-family:"Times New Roman"; mso-bidi-theme-font:minor-bidi; mso-fareast-language:EN-US;} .MsoPapDefault {mso-style-type:export-only; margin-bottom:10.0pt; line-height:115%;} @page Section1 {size:595.3pt 841.9pt; margin:72.0pt 72.0pt 72.0pt 72.0pt; mso-header-margin:35.4pt; mso-footer-margin:35.4pt; mso-paper-source:0;} div.Section1 {page:Section1;} -->

Just use str_replace():
// $string holds the text pasted from Word.
// Single quotes are used because the Word block itself contains double quotes.
$word_phrase = '<!-- /* Font Definitions */ @font-face {font-family:"Cambria Math"; panose-1:2 4 5 3 5 4 6 3 2 4; mso-font-charset:0; mso-generic-font-family:roman; mso-font-pitch:variable; mso-font-signature:-1610611985 1107304683 0 0 159 0;} @font-face {font-family:Calibri; panose-1:2 15 5 2 2 2 4 3 2 4; mso-font-charset:0; mso-generic-font-family:swiss; mso-font-pitch:variable; mso-font-signature:-1610611985 1073750139 0 0 159 0;} /* Style Definitions */ p.MsoNormal, li.MsoNormal, div.MsoNormal {mso-style-unhide:no; mso-style-qformat:yes; mso-style-parent:""; margin:0cm; margin-bottom:.0001pt; mso-pagination:widow-orphan; font-size:12.0pt; font-family:"Calibri","sans-serif"; mso-fareast-font-family:"Times New Roman"; mso-bidi-font-family:"Times New Roman";} .MsoChpDefault {mso-style-type:export-only; mso-default-props:yes; mso-ascii-font-family:Calibri; mso-ascii-theme-font:minor-latin; mso-hansi-font-family:Calibri; mso-hansi-theme-font:minor-latin; mso-bidi-font-family:"Times New Roman"; mso-bidi-theme-font:minor-bidi; mso-fareast-language:EN-US;} .MsoPapDefault {mso-style-type:export-only; margin-bottom:10.0pt; line-height:115%;} @page Section1 {size:595.3pt 841.9pt; margin:72.0pt 72.0pt 72.0pt 72.0pt; mso-header-margin:35.4pt; mso-footer-margin:35.4pt; mso-paper-source:0;} div.Section1 {page:Section1;} -->';
$string = str_replace($word_phrase, "", $string);
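
One caveat with this approach: str_replace() only removes an exact, character-for-character match, and Word does not always emit the same style block. A minimal sketch of that limitation follows; the two shortened style blocks are made up for illustration and are not real Word output.

```php
<?php
// Hypothetical, shortened style blocks for illustration only.
$known  = '<!-- p.MsoNormal {margin:0cm;} -->';
$pasteA = '<!-- p.MsoNormal {margin:0cm;} --> Hello';
$pasteB = '<!-- p.MsoNormal {margin:0in;} --> Hello';

// Exact match: the block is removed.
$cleanA = str_replace($known, '', $pasteA);

// One unit differs (0in instead of 0cm): nothing is removed.
$cleanB = str_replace($known, '', $pasteB);

var_dump($cleanA === ' Hello');
var_dump($cleanB === $pasteB);
```

Both var_dump() calls report true, which is why a fixed search string is fragile for this kind of input.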
Shinesh PremrajanEngineering Manager

Commented:
A better option is to use a regular expression to clean up the mess, since str_replace only removes one exact string and probably will not be suitable for the dynamic content that comes in with the text.

$word_phrase = 'patt <!-- /* Font Definitions */ @font-face {font-family:"Cambria Math"; panose-1:2 4 5 3 5 4 6 3 2 4; mso-font-charset:0; mso-generic-font-family:roman; mso-font-pitch:variable; mso-font-signature:-1610611985 1107304683 0 0 159 0;} @font-face {font-family:Calibri; panose-1:2 15 5 2 2 2 4 3 2 4; mso-font-charset:0; mso-generic-font-family:swiss; mso-font-pitch:variable; mso-font-signature:-1610611985 1073750139 0 0 159 0;} /* Style Definitions */ p.MsoNormal, li.MsoNormal, div.MsoNormal {mso-style-unhide:no; mso-style-qformat:yes; mso-style-parent:""; margin:0cm; margin-bottom:.0001pt; mso-pagination:widow-orphan; font-size:12.0pt; font-family:"Calibri","sans-serif"; mso-fareast-font-family:"Times New Roman"; mso-bidi-font-family:"Times New Roman";} .MsoChpDefault {mso-style-type:export-only; mso-default-props:yes; mso-ascii-font-family:Calibri; mso-ascii-theme-font:minor-latin; mso-hansi-font-family:Calibri; mso-hansi-theme-font:minor-latin; mso-bidi-font-family:"Times New Roman"; mso-bidi-theme-font:minor-bidi; mso-fareast-language:EN-US;} .MsoPapDefault {mso-style-type:export-only; margin-bottom:10.0pt; line-height:115%;} @page Section1 {size:595.3pt 841.9pt; margin:72.0pt 72.0pt 72.0pt 72.0pt; mso-header-margin:35.4pt; mso-footer-margin:35.4pt; mso-paper-source:0;} div.Section1 {page:Section1;} --> this is ';

$string=preg_replace("/\<\!\-\-(.*)\-\-\>/i","=",$word_phrase);

Hope this helps
Shinesh PremrajanEngineering Manager

Commented:
Sorry, that was the testing code; this is the correct one:
$string=preg_replace("/\<\!\-\-(.*)\-\-\>/i","",$word_phrase);
Expert of the Quarter 2010
Expert of the Year 2010

Commented:
If you don't care about any comments at all (including, but not limited to, the Word comments <!-- .. -->), then you can use this:


$string="<!-- test here and there //--> some text I want to keep <!-- test here and there //--> <br/>Is it greedy?";
$regex = "#(<!--)(.*)?(-->)#Ue";
$output = preg_replace($regex,"",$string);

Replace $string with your actual text variable.

Author

Commented:
Actually, the Word garbage may be part of the user's posting, so that this code precedes the actual text as typed in Word.  When the user pastes from Word, the intended posting and the code come in together, so I need to strip the code out of the intended message, i.e. everything before '-->'.
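
Taken literally, "everything before '-->'" can be done without a regular expression. The following is only a sketch under that assumption: strip_leading_word_block() is a hypothetical helper name, the sample paste is shortened, and the scalar type declarations need PHP 7 or later.

```php
<?php
// Hypothetical helper: drop everything up to and including the first "-->".
// If no "-->" is present, the text is returned unchanged.
function strip_leading_word_block(string $text): string
{
    $marker = '-->';
    $pos = strpos($text, $marker);
    if ($pos === false) {
        return $text; // no Word block found
    }
    // Keep only what follows the marker, minus leading whitespace.
    return ltrim(substr($text, $pos + strlen($marker)));
}

// Shortened sample paste for illustration.
$paste = '<!-- /* Font Definitions */ p.MsoNormal {margin:0cm;} --> Nullam ligula velit.';
$clean = strip_leading_word_block($paste);
echo $clean, "\n";
```

Note that this removes any text the user typed before the first '-->' as well, so it only fits the case where the Word block always comes first.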
Expert of the Quarter 2010
Expert of the Year 2010

Commented:
The U modifier in the regex makes the match non-greedy (the ? after the group only makes that group optional).
If you have multiple comments, a greedy match causes everything from the start of the first comment to the end of the last comment to disappear, including the text in between.
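
The difference is easy to see on a string with two comments. This sketch uses the lazy quantifier .*? in place of the U modifier (both give a non-greedy match) and leaves out the e modifier, which was removed in PHP 7. The sample text is made up.

```php
<?php
$text = '<!-- one --> keep this <!-- two --> and this';

// Greedy: .* runs to the LAST "-->", so "keep this" is lost as well.
$greedy = preg_replace('#<!--.*-->#s', '', $text);

// Lazy: .*? stops at the FIRST "-->" after each "<!--".
$lazy = preg_replace('#<!--.*?-->#s', '', $text);

var_dump($greedy);
var_dump($lazy);
```

The greedy version leaves only " and this", while the lazy version keeps both pieces of visible text.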

Author

Commented:
Not working...  I've used this:

$comment = mysql_real_escape_string($_POST['addnewcomment']);
$regex = "#(<!--)(.*)?(-->)#Ue";
$output = preg_replace($regex,"",$comment);

I still get:

<!-- /* Font Definitions */ @font-face {font-family:"Cambria Math"; panose-1:2 4 5 3 5 4 6 3 2 4; mso-font-charset:1; mso-generic-font-family:roman; mso-font-format:other; mso-font-pitch:variable; mso-font-signature:0 0 0 0 0 0;} /* Style Definitions */ p.MsoNormal, li.MsoNormal, div.MsoNormal {mso-style-unhide:no; mso-style-qformat:yes; mso-style-parent:""; margin:0in; margin-bottom:.0001pt; mso-pagination:widow-orphan; font-size:12.0pt; font-family:"Times New Roman","serif"; mso-fareast-font-family:"Times New Roman";} .MsoChpDefault {mso-style-type:export-only; mso-default-props:yes; font-size:10.0pt; mso-ansi-font-size:10.0pt; mso-bidi-font-size:10.0pt;} @page Section1 {size:595.3pt 841.9pt; margin:1.0in 1.25in 1.0in 1.25in; mso-header-margin:35.4pt; mso-footer-margin:35.4pt; mso-paper-source:0;} div.Section1 {page:Section1;} --> Nullam ligula velit, ullamcorper eu tempor sed, feugiat vitae orci. Phasellus mi purus, ullamcorper in pellentesque at, imperdiet ac lacus. Praesent ultrices, mauris id euismod sollicitudin, nisl lectus consequat neque, ac tristique sem diam sit amet lorem.
Expert of the Quarter 2010
Expert of the Year 2010
Commented:
Try assigning the preg_replace output back to $comment, if that is the variable used further on in the code:

$comment = preg_replace($regex,"",$comment);

Author

Commented:
No, it still doesn't work:

$comment = mysql_real_escape_string($_POST['addnewcomment']);
$regex = "#(<!--)(.*)?(-->)#Ue";
$comment = preg_replace($regex,"",$comment);

The only other place that $comment is used is:

//add comment to database 'comment' table
if (isset($_POST['submit']))
{
      $query_comment = "INSERT INTO comment (comment_id,comment,comment_date,user_id,page_id)
          VALUES ('','$comment',NOW(),'$user_id','$page_id')";
      $result_comment = mysql_query($query_comment, $connection) or die(mysql_error());
}
Avinash ZalaWeb Expert

Commented:
Try the code below:

Hope this helps,
Addy
<?php
	
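	// Return the text between the first $start marker and the next $end marker,
	// or an empty string if either marker is missing.
	function get_string_between($string, $start, $end)
	{
		$ini = strpos($string, $start);
		if ($ini === false) return "";
		$ini += strlen($start);
		$endPos = strpos($string, $end, $ini);
		if ($endPos === false) return "";
		return substr($string, $ini, $endPos - $ini);
	}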
	
	$str='demofinr is this <!-- /* Font Definitions */ @font-face {font-family:"Cambria Math"; panose-1:2 4 5 3 5 4 6 3 2 4; mso-font-charset:0; mso-generic-font-family:roman; mso-font-pitch:variable; mso-font-signature:-1610611985 1107304683 0 0 159 0;} @font-face {font-family:Calibri; panose-1:2 15 5 2 2 2 4 3 2 4; mso-font-charset:0; mso-generic-font-family:swiss; mso-font-pitch:variable; mso-font-signature:-1610611985 1073750139 0 0 159 0;} /* Style Definitions */ p.MsoNormal, li.MsoNormal, div.MsoNormal {mso-style-unhide:no; mso-style-qformat:yes; mso-style-parent:""; margin:0cm; margin-bottom:.0001pt; mso-pagination:widow-orphan; font-size:12.0pt; font-family:"Calibri","sans-serif"; mso-fareast-font-family:"Times New Roman"; mso-bidi-font-family:"Times New Roman";} .MsoChpDefault {mso-style-type:export-only; mso-default-props:yes; mso-ascii-font-family:Calibri; mso-ascii-theme-font:minor-latin; mso-hansi-font-family:Calibri; mso-hansi-theme-font:minor-latin; mso-bidi-font-family:"Times New Roman"; mso-bidi-theme-font:minor-bidi; mso-fareast-language:EN-US;} .MsoPapDefault {mso-style-type:export-only; margin-bottom:10.0pt; line-height:115%;} @page Section1 {size:595.3pt 841.9pt; margin:72.0pt 72.0pt 72.0pt 72.0pt; mso-header-margin:35.4pt; mso-footer-margin:35.4pt; mso-paper-source:0;} div.Section1 {page:Section1;} --> while gone.';
	
	$between_str= get_string_between($str,'<!--','-->');
	echo str_replace('-->','',str_replace('<!--','',str_replace($between_str,'',$str)));
?>


Expert of the Quarter 2010
Expert of the Year 2010

Commented:
I forgot about the multi-line issue.  By default the dot does not match newline characters; the s option at the end here makes it match them, so preg_replace works on multi-line strings:

$regex = "#(<!--)(.*)?(-->)#Ues";
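
A small sketch of what the s option changes, using a made-up three-line paste. As an assumption for modern PHP, the pattern is written with .*? and without the e modifier (removed in PHP 7).

```php
<?php
// Made-up paste where the comment spans several lines.
$text = "<!--\n p.MsoNormal {margin:0cm;}\n--> Visible text";

// Without s: "." stops at "\n", so the pattern never reaches "-->".
$withoutS = preg_replace('#<!--.*?-->#', '', $text);

// With s: "." also matches "\n", so the whole comment is removed.
$withS = preg_replace('#<!--.*?-->#s', '', $text);

var_dump($withoutS === $text);
var_dump($withS);
```

Without s the string comes back unchanged; with s only " Visible text" is left.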

Author

Commented:
Actually your solution was correct.  I checked what was actually going into the database for these user inputs (comments) and '<!--' was actually going in as '&lt;!--' and '-->' was becoming '--&gt;'.

So, the following works:

$comment = mysql_real_escape_string($_POST['addnewcomment']);
$regex = "#(&lt;!--)(.*)?(--&gt;)#Ue";
$comment = preg_replace($regex,"",$comment);

Thanks for your help and others for your suggestions!!

Cheers,

Peter

Author

Commented:
Mmmm...looks like it's substituting here too:
<p>&lt;!--

and

--&gt;

so:

$comment = mysql_real_escape_string($_POST['addnewcomment']);
$regex = "#(<p>&lt;!--)(.*)?(--&gt;)#Ue";
$comment = preg_replace($regex,"",$comment);
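
Since the editor may send the delimiters either raw or entity-encoded, one option is a single pattern that accepts both forms. This is only a sketch: strip_word_comments() is a hypothetical helper, the sample inputs are shortened, and unlike the pattern above it leaves the <p> tag in place so the markup stays balanced. It is usually simpler to run this on the raw input before escaping it for the database.

```php
<?php
// Hypothetical helper: remove comment blocks whether the delimiters arrive
// raw ("<!--", "-->") or entity-encoded ("&lt;!--", "--&gt;").
function strip_word_comments(string $html): string
{
    // (?:<|&lt;) and (?:>|&gt;) accept either form; .*? is non-greedy;
    // the s modifier lets "." match newlines inside the block.
    $pattern = '#(?:<|&lt;)!--.*?--(?:>|&gt;)#s';
    // preg_replace() returns null on error; fall back to the input then.
    return preg_replace($pattern, '', $html) ?? $html;
}

// Shortened sample inputs for illustration.
$encoded = '<p>&lt;!-- p.MsoNormal {margin:0in;} --&gt; Nullam ligula velit.</p>';
$raw     = '<!-- p.MsoNormal {margin:0in;} --> Nullam ligula velit.';

$cleanEncoded = strip_word_comments($encoded);
$cleanRaw     = strip_word_comments($raw);

echo $cleanEncoded, "\n";
echo $cleanRaw, "\n";
```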
