asked on

ASP.Net/C# - Regex question on HTML tag strip

Hello all. Right now I am stripping out any font tags in a string that stores a HTML string. I also need to deal with spans it looks like because it can have Font-Size, Font-Family etc. I am thinking I might have to strip out the entire span tag. How can I add this to my Regex strip to strip out any Span tags? Also if you can think of any other ways other HTML only not style sheet that font might come up. Here is the current function I have to strip the font:

      public static System.String StripFontHtml(System.String Html)
            {
                  Regex ex = new Regex("</?FONT[^<>]+>");

                  Match RegMatch = ex.Match(Html);

                  while (RegMatch.ToString() != "")
                  {
                        Html = Html.Replace(RegMatch.ToString(), "");
                        RegMatch = RegMatch.NextMatch();
                  }

                  return Html;
            }

The better thing also maybe to just strip out the style="" completly if that is possible. The goal is to not strip all HTML because I want to allow paragraphs and breaks etc. but strip out any font attributes.

<SPAN style="FONT-SIZE: 12pt; FONT-FAMILY: ''''Times New Roman''''"></SPAN>

DarkoLord

Hi, this regex matches the Span tags containing the "style" element... It returns start tag, contents and end tag, so you should be easily able to replace the start and end tags

(<[^>]*?span[^>]*?(?:style)[^>]*>)((?:.*?(?:<[ \r\t]*span[^>]*>?.*?(?:<.*?/.*?span.*?>)?)*)*)(<[^>]*?/[^>]*?span[^>]*?>)

sbornstein2

ASKER

So I can just add that to my function such as:

public static System.String StripFontHtml(System.String Html)
            {
                  Regex ex = new Regex("</?FONT[^<>]+>");

                  Match RegMatch = ex.Match(Html);

                  while (RegMatch.ToString() != "")
                  {
                        Html = Html.Replace(RegMatch.ToString(), "");
                        RegMatch = RegMatch.NextMatch();
                  }

                  Regex ex = new Regex("<[^>]*?span[^>]*?(?:style)[^>]*>)((?:.*?(?:<[ \r\t]*span[^>]*>?.*?(?:<.*?/.*?span.*?>)?)*)*)(<[^>]*?/[^>]*?span[^>]*?>");

                  Match RegMatch = ex.Match(Html);

                  while (RegMatch.ToString() != "")
                  {
                        Html = Html.Replace(RegMatch.ToString(), "");
                        RegMatch = RegMatch.NextMatch();
                  }
                  return Html;
            }

Does this look like it will work?

sbornstein2

ASKER

I am wondering if I can place it all together for better performance?

DarkoLord

No, this one matches the contents also... This one matches start and end tags only:

(<[^>]*?span[^>]*?(?:style)[^>]*>)(?:(?:.*?(?:<[ \r\t]*span[^>]*>?.*?(?:<.*?/.*?span.*?>)?)*)*)(<[^>]*?/[^>]*?span[^>]*?>)

sbornstein2

ASKER

Sorry Dark for the delay. I am confused on what you mean by:
"No, this one matches the contents also... This one matches start and end tags only:"

ASKER CERTIFIED SOLUTION

DarkoLord

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

sbornstein2

ASKER

thanks Dark