Unencode html escape characters

Posted on 2004-11-13
Medium Priority
Last Modified: 2008-01-09
I am using the GoogleAPI and it returns titles as html.  This is an example of a title they would return:

SBA&#39;s Shareware Library - Files For <b>Starting</b> <b>Your</b> <b>Business</b>

I want to convert this to "SBA's Shareware Library - Files for Starting Your Business".  HttpUtility.UrlDecode doesn't seem to be able to convert &#39; to an apostrophe.  That is the most important part of this question: returning a plain-text representation of a string containing HTML escape sequences.  However, if anyone wants to show me an easy way of getting rid of HTML tags while preserving ALL text, I would up the points for the question and give them to you.  Maybe a regex?
Question by:thedude112286

Expert Comment

ID: 12577022
using System.Text.RegularExpressions;

            private void button1_Click(object sender, System.EventArgs e)
            {
                  string InStr = "<b>&#39;&#8212;</B>";
                  // Regex patterns to look for. This list is cut off in the original post.
                  string[] Tokens = new String[] {"<[bB]>","</[bB]>","&#[0]*39;","&#[0]*60;","&#[0]*64;","&#[0]*93;","&#123;","&#125;","&#133;","&#135;","&#146;","&#148;","&#150;","&#153;","&#162;","&#165;","&#169;","&#172;","&#176;","&#178;","&#185;","&#188;","&#190;","&#247;","&#8221;"};
                  // Replacement for each pattern, matched by position. Also cut off in the original post; values past the end of Tokens are never used.
                  string[] ReplaceVals = new String[] {"","","'","<","@","]","{","}","…","‡","’","”","–","™","¢","¥","©","¬","°","²","¹","¼","¾","÷","”",">","[","`","|","~","†","‘","“","•","—","¡","£","¦","«","®"};
                  InStr = Replace(InStr, Tokens, ReplaceVals);
            }

            private string Replace(string InStr, string[] Tokens, string[] ReplaceVals)
            {
                  int i = 0;
                  foreach (string str in Tokens)
                  {
                        // Replace every match of this pattern with the value at the same index.
                        InStr = Regex.Replace(InStr, str, ReplaceVals[i]);
                        i++;
                  }
                  return InStr;
            }

Accepted Solution

der_jth earned 500 total points
ID: 12577981
UrlDecode isn't meant for this. A URL-encoded string looks like this: "foo%E4bar"; as you can see, URL encoding is quite different from the HTML encoding used in the markup. Try HttpUtility.HtmlDecode instead.
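
To make the difference concrete, here is a minimal sketch that runs both decoders over the same input. It uses System.Net.WebUtility rather than the HttpUtility class named above, only because WebUtility needs no reference to System.Web; that swap is an assumption of this sketch. The class and method names (DecodeComparison, AsUrl, AsHtml) are hypothetical and not from the thread.

```csharp
using System;
using System.Net;

public static class DecodeComparison
{
    // URL decoding: turns %XX escapes and '+' into characters.
    // It leaves HTML entities such as &#39; untouched.
    public static string AsUrl(string s)
    {
        return WebUtility.UrlDecode(s);
    }

    // HTML decoding: turns entities such as &#39; and &amp; into characters.
    // It leaves %XX escapes untouched.
    public static string AsHtml(string s)
    {
        return WebUtility.HtmlDecode(s);
    }
}
```

With these helpers, AsUrl("SBA&#39;s") returns the input unchanged, while AsHtml("SBA&#39;s") returns "SBA's".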

For removing all HTML tags, try this: Regex.Replace(string, "<[^>]+>", "")

It's not exactly correct according to the SGML spec, but it works correctly enough 99% of the time and is a few thousand lines shorter than the correct approach (a full-blown SGML parser with some quirk parsing thrown in).
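
Putting the two suggestions together, a rough sketch of the whole cleanup might look like the following. The helper name HtmlCleaner.ToPlainText is made up for illustration, and WebUtility.HtmlDecode stands in for HttpUtility.HtmlDecode so the snippet compiles without a System.Web reference. The regex is the approximate one from the answer, so the same caveat applies: it is not a real parser.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlCleaner
{
    // Hypothetical helper: strips tags, then decodes HTML entities.
    public static string ToPlainText(string html)
    {
        // Remove anything that looks like a tag first, so that escaped
        // markup such as &lt;b&gt; survives as literal text afterwards.
        string withoutTags = Regex.Replace(html, "<[^>]+>", "");

        // Then turn entities such as &#39; and &amp; into characters.
        return WebUtility.HtmlDecode(withoutTags);
    }
}
```

Called on the title from the question, HtmlCleaner.ToPlainText returns "SBA's Shareware Library - Files For Starting Your Business". The order matters: decoding before stripping would turn escaped text like &lt;b&gt; into a real-looking tag, which the regex would then delete.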

Author Comment

ID: 12579047
Thank you very much, it works perfectly!
