  • Status: Solved
  • Priority: Medium
  • Security: Public

parse illegal characters

Hi Experts,
I have the following HTML text as a sample, which I need to parse for illegal characters.
Only the HTML text values should be parsed, not the HTML tags themselves.

For example:
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:mce="mce">
<body>
<p class ="test">afa sf  & as ffaf</p><p class="ss">adssfa fa sfasf <br />"<i>sds</i>"</p>
</body>
</html>


Any help would be appreciated.

Thanks
Asked by: saloj
 
s_chilkury Commented:
Check the following:
http://www.codeproject.com/Articles/57176/Parsing-HTML-Tags-in-Csharp.aspx

Alternatively, you can use the Html Agility Pack, which does the same job.
 
Sudhakar Pulivarthi Commented:
Hi Saloj,

I am attaching code that parses the values present in the HTML data.
Please take a look; it might be useful. I can expand the code further if needed. Which illegal characters are you looking for, and what do you want to do with them?
public static string ParseTheValuesInHTML(string htmlData)
        {
            // Position of the '>' that closes the current tag; -1 ends the loop.
            int tagEnd = htmlData.IndexOf(">");

            // Process until the last occurrence of '>'.
            while (tagEnd >= 0)
            {
                // Start of the next tag, or -1 if there is none.
                int nextTagStart = htmlData.IndexOf("<", tagEnd + 1);

                if (nextTagStart > 0)
                {
                    // Find the text between '>' and the next '<'.
                    string value = htmlData.Substring(tagEnd + 1, nextTagStart - tagEnd - 1);

                    if (value != "")
                    {
                        // The value string is the text between two HTML tags.
                        // Check it here for illegal characters and process it as you want.
                    }
                }

                // Find the end of the next tag.
                tagEnd = htmlData.IndexOf(">", tagEnd + 1);
            }

            return htmlData;
        }
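
As a rough sketch of what the check inside the `if (value != "")` block might look like: the helper below is hypothetical (the names `IllegalCharFinder` and `FindIllegalChars` do not come from this thread), and it assumes that "illegal" means the five XML special characters, which the question has not confirmed yet. It also assumes that `<` never appears inside the text itself.

```csharp
using System;
using System.Collections.Generic;

public static class IllegalCharFinder
{
    // Assumption: "illegal" means the XML special characters.
    private static readonly char[] Illegal = { '&', '<', '>', '"', '\'' };

    // Returns every text value (the text between a '>' and the next '<')
    // that contains at least one of the characters above.
    public static List<string> FindIllegalChars(string htmlData)
    {
        var hits = new List<string>();
        int tagEnd = htmlData.IndexOf('>');

        while (tagEnd >= 0)
        {
            int nextTagStart = htmlData.IndexOf('<', tagEnd + 1);
            if (nextTagStart < 0)
            {
                break; // no more tags, so no more values
            }

            string value = htmlData.Substring(tagEnd + 1, nextTagStart - tagEnd - 1);
            if (value.IndexOfAny(Illegal) >= 0)
            {
                hits.Add(value);
            }

            // Continue from the '>' that closes the next tag.
            tagEnd = htmlData.IndexOf('>', nextTagStart);
        }

        return hits;
    }
}
```

Calling `IllegalCharFinder.FindIllegalChars(htmlData)` on the sample from the question should return the values that need attention, while tag names and attributes are never inspected.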

saloj Author Commented:
Hi EE, I have the following string. When I parse it with the code below, it processes all the data, including the HTML tags, but I only need to process the text values.
Can anybody help me with the code?


string strA = "<html xmlns=\"http://www.w3.org/1999/xhtml\" xmlns:mce=\"mce\"><head><style></style><title>sadfafsfsdf asdf asd sadf saf</title></head><body class=\"hhh\"><p style=\"margin-top: 0pt; margin-right: 0pt; margin-bottom: 0pt; margin-left: 0pt\">content has posted the latest <a href=\"http://www.xyz.com/media/763/Dynacor_Gold_Mines_TSX:_DNG_News_Alert/\" target=\"_blank\">video</a> \"News Alert\" for Dynacor Gold Mines Inc. If the link is unavailable, please visit <a href=\"http://www.xyz.com/\" target=\"_blank\">www.xyz.com</a> and enter \"Dynacor\" in the search box.</p><p class=\"hhh\" style=\"margin-top: 0pt; margin-right: 0pt;  margin-left: 0pt; margin-bottom:0pt;\">According to Metanor&apos;s press release:<br class=\"hhh\" />\"<i class=\"hhh\">S of US$500 (the \"Per Ounce Payments\") and the then prevailing market price per ounce of gold. Sandstorm will (i) US$5 million upon signing of the </i><i class=\"hhh\">agreement, (ii) US$9 million,once Metan</i>.\"</p></body></html>";
ParseTheValuesInXML(strA);

public static string ParseTheValuesInXML(string xmlData)
        {
            int startIndex = 0;

            if (xmlData.Contains(">"))
            {
                // Process until the last occurrence of '>'.
                while (startIndex >= 0)
                {
                    int nextCount = xmlData.IndexOf("<", startIndex + 1);
                    //int lastIndex = nextCount - xmlData.IndexOf(">", startIndex);

                    //if (nextCount > 0 && lastIndex > 0)
                    if (nextCount > 0)
                    {
                        // Find the text between > and <. 
                        string value = xmlData.Substring(xmlData.IndexOf(">", startIndex) + 1, nextCount - xmlData.IndexOf(">", startIndex) - 1);
                        if (value != "")
                        {                            
                            // Replace the text with xml special data.
                            xmlData = xmlData.Replace(value, XmlSpecial(value));
                        }
                    }

                    // Find the next tag.
                    startIndex = xmlData.IndexOf(">", startIndex + 1);
                }
            }
            return xmlData;
        }


    private static string XmlSpecial(string strTxt)
        {
            if (strTxt.Contains("&"))
            {
                // Convert value to all special xml character.
                strTxt = strTxt.Replace("&amp;", "&");
                // Start replacing the specials.
                strTxt = strTxt.Replace("&", "&amp;");
                strTxt = strTxt.Replace("&amp;apos;", "&apos;");
                strTxt = strTxt.Replace("&amp;quot;", "&quot;");
                strTxt = strTxt.Replace("&amp;lt;", "&lt;");
                strTxt = strTxt.Replace("&amp;gt;", "&gt;");
            }
            if (strTxt.Contains("<"))
            {
                strTxt = strTxt.Replace("<", "&lt;");
            }
            if (strTxt.Contains(">"))
            {
                strTxt = strTxt.Replace(">", "&gt;");
            }
            if (strTxt.Contains("\""))
            {
                strTxt = strTxt.Replace("\"", "&quot;");
            }
            if (strTxt.Contains("'"))
            {
                strTxt = strTxt.Replace("'", "&apos;");
            }
            return strTxt;
        }

 
informaniac Commented:
Html values?
 
saloj Author Commented:
Hi Sudhakar,
Thanks for your quick response. I am using the code from your previous solution.
In the following code, when value = "\"",
it replaces every " in the string with &quot;, including the ones inside the tags. How can I restrict it to the value only?

if (value != "")
           {
          htmlData = htmlData.Replace(value, XmlSpecial(value));
         }
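
To make the symptom concrete, here is a minimal reproduction. The sample string and the class name `ReplaceDemo` are made up for illustration. Because `String.Replace` rewrites every occurrence in the whole string, a value of `"` also matches the quotes around the attribute value.

```csharp
using System;

public static class ReplaceDemo
{
    public static string Run()
    {
        string htmlData = "<p class=\"ss\">\"</p>";
        string value = "\"";   // the text between <p ...> and </p>

        // Replaces all three quotes, not just the one between the tags.
        // Result: <p class=&quot;ss&quot;>&quot;</p>
        return htmlData.Replace(value, "&quot;");
    }
}
```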
 
Sudhakar Pulivarthi Commented:
Hi Saloj,

Please replace the Replace statement in the code with this one:

xmlData = xmlData.Replace(">" + value + "<", ">" + XmlSpecial(value) + "<");

This will work!!!
 
Sudhakar Pulivarthi Commented:

To explain: one of the tag values was a single " (quote). The previous code replaced that value with &quot; everywhere in the source string, so every occurrence of " was replaced, including the ones inside the tags.

The fix is to include the boundaries (the > before the value and the < after it) in the search text, so that the replacement only happens where the value sits between two tags.
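
An alternative sketch that avoids `String.Replace` altogether, so a value can never match inside a tag: walk the string once, copy each tag unchanged, and escape only the text between tags. This is not the code from this thread. `HtmlTextEscaper`, `EscapeTextOnly` and `EscapeValue` are hypothetical names, and the sketch assumes that `<` and `>` never appear inside attribute values or in the text itself.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class HtmlTextEscaper
{
    public static string EscapeTextOnly(string html)
    {
        var result = new StringBuilder(html.Length);
        int pos = 0;

        while (pos < html.Length)
        {
            int tagStart = html.IndexOf('<', pos);
            if (tagStart < 0)
            {
                // Trailing text after the last tag.
                result.Append(EscapeValue(html.Substring(pos)));
                break;
            }

            // The text before the tag is escaped...
            result.Append(EscapeValue(html.Substring(pos, tagStart - pos)));

            int tagEnd = html.IndexOf('>', tagStart);
            if (tagEnd < 0)
            {
                // Unterminated tag: copy the rest unchanged.
                result.Append(html.Substring(tagStart));
                break;
            }

            // ...and the tag itself is copied as-is.
            result.Append(html, tagStart, tagEnd - tagStart + 1);
            pos = tagEnd + 1;
        }

        return result.ToString();
    }

    private static string EscapeValue(string text)
    {
        // Escape '&' unless it already starts one of the five XML entities.
        text = Regex.Replace(text, "&(?!(amp|lt|gt|quot|apos);)", "&amp;");
        return text.Replace("\"", "&quot;").Replace("'", "&apos;");
    }
}
```

Because the output is built from positions rather than from a search for the value text, repeated values and values that also occur inside attributes need no special handling.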

Happy Working... take care
 
saloj Author Commented:
Thanks Sudhakar!
You have a great technique!

Happy working!
Take care
 
saloj Author Commented:
EXCELLENT !!!