Remove <p> and </p> pairs if separated by CrLf and/or spaces

bogorman
I have a text file containing markup from HTML pages.

There are a number of cases where there is the following content:

<p>
        </p>

So between the tags there is a CrLf with a varying number of spaces on one or both sides of it.

How can I remove the pair of tags if between them there are the above characters?

Meir Rivkin, Full stack Software Engineer

Commented:
This function removes a <p>...</p> pair (the opening tag, the closing tag, and everything in between) when the content is only newlines and spaces:

Private Function FormatText(str As String) As String
	Dim i2 As Integer = 0
	Dim sb As New StringBuilder()
	While i2 > -1
		Dim i1 = str.IndexOf("<p>", i2)

		If i1 = -1 Then
			sb.Append(str.Substring(i2))
			Exit While
		Else
			sb.Append(str.Substring(i2, i1 - i2))
		End If

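		i2 = str.IndexOf("</p>", i1)
		If i2 = -1 Then
			' No closing tag found: keep the rest of the text as-is
			sb.Append(str.Substring(i1))
			Exit While
		End If

		Dim tok = str.Substring(i1 + 3, i2 - i1 - 3)

		' Drop the pair when only whitespace (spaces, CR, LF) sits between the tags
		If Not String.IsNullOrWhiteSpace(tok) Then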
			sb.Append("<p>" & tok & "</p>")
		End If
		i2 += 4
	End While
	Return sb.ToString()
End Function


wdosanjos (Top Expert 2011)

Commented:
Here is another option using a Regular Expression:
Dim rx As Regex = new Regex("<p>\s+</p>", RegexOptions.Multiline or RegexOptions.IgnoreCase)

theString = rx.Replace(theString, "")


Meir Rivkin, Full stack Software Engineer

Commented:
Does it remove newline characters?
wdosanjos (Top Expert 2011)

Commented:
<p>\s+</p> removes the tags (<p> & </p>) and all spaces/newline chars in between.
wdosanjos (Top Expert 2011)

Commented:
If you want to keep the spaces/newline chars and just remove the tags, use the following expression:
Dim rx As Regex = new Regex("(<p>)(\s+)(</p>)", RegexOptions.Multiline or RegexOptions.IgnoreCase)

theString = rx.Replace(theString, "$2")
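
A minimal sketch of how the "$2" replacement behaves, assuming a made-up sample string (the module name and sample text are illustrative only, not from the thread):

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module KeepWhitespaceDemo
    Sub Main()
        ' Hypothetical sample: a <p> pair holding only spaces and a line break
        Dim sample As String = "before<p>  " & Environment.NewLine & "  </p>after"

        ' $2 is the second capture group, i.e. the whitespace between the tags
        Dim result As String = Regex.Replace(sample, "(<p>)(\s+)(</p>)", "$2", RegexOptions.IgnoreCase)

        Console.WriteLine(result.Contains("<p>"))               ' False: the tags are gone
        Console.WriteLine(result.Contains(Environment.NewLine)) ' True: the line break is kept
    End Sub
End Module
```

Only the first and third groups are discarded here, so the surrounding text keeps its original spacing.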


Meir Rivkin, Full stack Software Engineer

Commented:
Neat

Author

Commented:
Hi,
Thanks, both of you.  
Am first testing sedgwick's solution and will then test wdosanjos's. Will then share/apportion points.
sedgwicks:
Have had to modify the function as my version of vb.net does not support IsNullOrWhiteSpace. Here it is:

Private Function Removeps(ByVal str As String) As String
        Dim i2 As Integer = 0
        Dim sb As New StringBuilder()
        While i2 > -1
            Dim i1 = str.IndexOf("<p>", i2)

            If i1 = -1 Then
                sb.Append(str.Substring(i2))
                Exit While
            Else
                sb.Append(str.Substring(i2, i1 - i2))
            End If

            i2 = str.IndexOf("</p>", i1)
            If i2 = -1 Then
                Exit While
            End If

            Dim tok = str.Substring(i1 + 3, i2 - i1 - 3)

            If Not (String.IsNullOrEmpty(str) OrElse str.Trim().Length = 0) Then
                sb.Append("<p>" & tok & "</p>")
            End If
            i2 += 4
        End While
        Return sb.ToString()
    End Function


I call it with:
 Body = Removeps(Body)

But it does not appear to remove the <p> ... </p> pairs that contain only spaces or CrLfs.

Am I doing something stupid?

Author

Commented:
wdosanjos:

Have tried yours also. Almost works but does not seem to remove <p></p>.  Have used:

Dim rx As Regex = New Regex("<p>\s+</p>", RegexOptions.Multiline Or RegexOptions.IgnoreCase)

        Body = rx.Replace(Body, "")

Here is part of the result:


On Bearing And Forebearing In Married Life|
<p><strong>
        Edward Holloway</strong> FAITH Magazine November-December 2002
      </p>

<p><strong>
      Loving as you love yourself      </strong></p>
<p></p>


I can just add a replace statement but it would be nice if it could handle everything as the coding is so simple and elegant.
wdosanjos (Top Expert 2011)
Commented:
Please try the following expression:
Dim rx As Regex = New Regex("<p>\s*</p>", RegexOptions.Multiline Or RegexOptions.IgnoreCase)
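
To illustrate the difference between the two patterns, here is a small sketch on a made-up sample (the helper name RemoveEmptyParagraphs and the sample text are assumptions for illustration only). \s+ requires at least one whitespace character, while \s* also matches none, which is why <p></p> was left behind earlier:

```vbnet
Imports System
Imports Microsoft.VisualBasic
Imports System.Text.RegularExpressions

Module EmptyParagraphDemo
    ' Hypothetical helper: removes <p>...</p> pairs that hold only whitespace or nothing at all
    Function RemoveEmptyParagraphs(ByVal html As String) As String
        Return Regex.Replace(html, "<p>\s*</p>", String.Empty, RegexOptions.IgnoreCase)
    End Function

    Sub Main()
        Dim sample As String = "<p>text</p>" & vbCrLf & "<p>   " & vbCrLf & "   </p>" & vbCrLf & "<p></p>"

        ' \s+ needs at least one whitespace character, so <p></p> survives
        Console.WriteLine(Regex.Replace(sample, "<p>\s+</p>", String.Empty, RegexOptions.IgnoreCase))

        ' \s* also matches zero characters, so both empty pairs are removed
        Console.WriteLine(RemoveEmptyParagraphs(sample))
    End Sub
End Module
```

Neither pattern touches the line breaks outside the tags, and neither matches <p> tags that carry attributes.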


kaufmed

Commented:
The "Multiline" option is superfluous (it only changes how ^ and $ match, and the pattern uses neither), though it doesn't hurt anything by being there.

Author

Commented:
kaufmed,
Your coding does not seem to work, unless I have made a mistake.  The <p> </p> tags containing spaces and CrLf characters are still there.

wdosanjos,
Your coding seems to work fine.

Can you suggest adjustments, kaufmed, as I would like to assign points?

Brian
Meir Rivkin, Full stack Software Engineer

Commented:
Can you post an example HTML page? I tested my code and it seems to work.
kaufmed

Commented:
"Your coding does not seem to work, unless I have made a mistake."
I didn't post any code here. Are you referring to the previous question?

Author

Commented:
Hi kaufmed,
Really sorry. I meant sedgwick!
Sorry for troubling you, kaufmed
Brian

Author

Commented:
Hi sedgwick,
Herewith the contents of the Body variable before applying your coding to it. You will see the many instances of <p> tags followed by spaces/Crlfs, terminated by </p>
Brian
Body-text.txt

Author

Commented:
Hi sedgwick,
Is it possible for you to post modified coding so I can then assign points?   As you have spent a long time on this, if your coding works I feel I should split the points, otherwise I will assign them all to wdosanjos.
Brian
Meir Rivkin, Full stack Software Engineer

Commented:
Sure, i will
Meir Rivkin, Full stack Software Engineer
Commented:
I ran this code on the file you posted and the result is what you asked for:

    Private Function FormatText(str As String) As String
        Dim i2 As Integer = 0
        Dim sb As New StringBuilder()
        While i2 > -1
            Dim i1 = str.IndexOf("<p>", i2)

            If i1 = -1 Then
                sb.Append(str.Substring(i2))
                Exit While
            Else
                sb.Append(str.Substring(i2, i1 - i2))
            End If

            i2 = str.IndexOf("</p>", i1)
            If i2 = -1 Then
                ' No closing tag found: keep the rest of the text as-is
                sb.Append(str.Substring(i1))
                Exit While
            End If

            Dim tok = str.Substring(i1 + 3, i2 - i1 - 3)
            ' Trim() strips leading/trailing whitespace, including CR and LF
            Dim s = tok.Trim()
            If s.Length > 0 Then
                sb.Append("<p>" & tok & "</p>")
            End If
            i2 += 4
        End While
        Return sb.ToString()
    End Function
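
One thing worth noting about the earlier Removeps attempt: its condition tested str (the whole input) rather than tok (the text between the tags), so it was always true and every pair was kept. A rough sketch of a whitespace check that should work on framework versions without String.IsNullOrWhiteSpace, assuming a hypothetical helper named IsBlank that gets applied to tok:

```vbnet
Imports System

Module BlankCheckDemo
    ' Hypothetical helper: True when s is Nothing, empty, or only whitespace.
    ' Stands in for String.IsNullOrWhiteSpace on older framework versions.
    Function IsBlank(ByVal s As String) As Boolean
        Return String.IsNullOrEmpty(s) OrElse s.Trim().Length = 0
    End Function

    Sub Main()
        ' Made-up content of the kind found between the empty tags
        Dim tok As String = "   " & Environment.NewLine & "      "

        Console.WriteLine(IsBlank(tok))            ' whitespace-only content
        Console.WriteLine(IsBlank(" some text "))  ' real content
    End Sub
End Module
```

Inside the loop the condition would then read `If Not IsBlank(tok) Then`.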


Author

Commented:
Thanks to both of you. Both solutions work well.
Brian
