Solved

parse illegal characters

Posted on 2011-02-16
Last Modified: 2012-05-11
Hi Experts,
I have the following HTML text as a sample, which I need to parse for illegal characters.
The parsing should apply only to the text values, not to the HTML tags.

eg.
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:mce="mce">
<body>
<p class ="test">afa sf  & as ffaf</p><p class="ss">adssfa fa sfasf <br />"<i>sds</i>"</p>
</body>
</html>


Any help would be appreciated.

Thanks
Question by:saloj
10 Comments
 
LVL 9

Expert Comment

by:s_chilkury
ID: 34913824
Check the following:
http://www.codeproject.com/Articles/57176/Parsing-HTML-Tags-in-Csharp.aspx

Also, you can use the Html Agility Pack, which does the same.
 
LVL 8

Expert Comment

by:jimsweb
ID: 34913880
 
LVL 11

Expert Comment

by:Sudhakar Pulivarthi
ID: 34914037
Hi Saloj,

I am attaching code that parses the text values in the HTML data.
Please take a look; it might be useful. I can extend the code further if needed. Which illegal characters are you looking for, and what do you want to do with them?
public static string ParseTheValuesInHTML(string htmlData)
        {
            // Start at the first '>' so that every text run is visited exactly once.
            int startIndex = htmlData.IndexOf(">");

            // Process until there are no more '>' characters.
            while (startIndex >= 0)
            {
                // Position of the next '<' after the current '>'.
                int nextCount = htmlData.IndexOf("<", startIndex + 1);

                if (nextCount > 0)
                {
                    // Extract the text between '>' and '<'.
                    string value = htmlData.Substring(startIndex + 1, nextCount - startIndex - 1);

                    if (value != "")
                    {
                        // value is the text between two HTML tags.
                        // Check it for illegal characters here and process it as needed.
                    }
                }

                // Find the end of the next tag.
                startIndex = htmlData.IndexOf(">", startIndex + 1);
            }

            return htmlData;
        }

 
LVL 2

Author Comment

by:saloj
ID: 34914039
Hi EE, I have the following string. When I parse it with the following code, it processes all the data, including the HTML tags, but I only need to process the text values.
Can anybody help me with the code?


string strA = "<html xmlns=\"http://www.w3.org/1999/xhtml\" xmlns:mce=\"mce\"><head><style></style><title>sadfafsfsdf asdf asd sadf saf</title></head><body class=\"hhh\"><p style=\"margin-top: 0pt; margin-right: 0pt; margin-bottom: 0pt; margin-left: 0pt\">content has posted the latest <a href=\"http://www.xyz.com/media/763/Dynacor_Gold_Mines_TSX:_DNG_News_Alert/\" target=\"_blank\">video</a> \"News Alert\" for Dynacor Gold Mines Inc. If the link is unavailable, please visit <a href=\"http://www.xyz.com/\" target=\"_blank\">www.xyz.com</a> and enter \"Dynacor\" in the search box.</p><p class=\"hhh\" style=\"margin-top: 0pt; margin-right: 0pt;  margin-left: 0pt; margin-bottom:0pt;\">According to Metanor&apos;s press release:<br class=\"hhh\" />\"<i class=\"hhh\">S of US$500 (the \"Per Ounce Payments\") and the then prevailing market price per ounce of gold. Sandstorm will (i) US$5 million upon signing of the </i><i class=\"hhh\">agreement, (ii) US$9 million,once Metan</i>.\"</p></body></html>";
ParseTheValuesInXML(strA);

public static string ParseTheValuesInXML(string xmlData)
        {
            int startIndex = 0;

            if (xmlData.Contains(">"))
            {
                // Process until the last occurrence of '<'.
                while (startIndex >= 0)
                {
                    int nextCount = xmlData.IndexOf("<", startIndex + 1);
                    //int lastIndex = nextCount - xmlData.IndexOf(">", startIndex);

                    //if (nextCount > 0 && lastIndex > 0)
                    if (nextCount > 0)
                    {
                        // Find the text between > and <. 
                        string value = xmlData.Substring(xmlData.IndexOf(">", startIndex) + 1, nextCount - xmlData.IndexOf(">", startIndex) - 1);
                        if (value != "")
                        {                            
                            // Replace the text with xml special data.
                            xmlData = xmlData.Replace(value, XmlSpecial(value));
                        }
                    }

                    // Find the next tag.
                    startIndex = xmlData.IndexOf(">", startIndex + 1);
                }
            }
            return xmlData;
        }


    private static string XmlSpecial(string strTxt)
        {
            if (strTxt.Contains("&"))
            {
                // Convert value to all special xml character.
                strTxt = strTxt.Replace("&amp;", "&");
                // Start replacing the specials.
                strTxt = strTxt.Replace("&", "&amp;");
                strTxt = strTxt.Replace("&amp;apos;", "&apos;");
                strTxt = strTxt.Replace("&amp;quot;", "&quot;");
                strTxt = strTxt.Replace("&amp;lt;", "&lt;");
                strTxt = strTxt.Replace("&amp;gt;", "&gt;");
            }
            if (strTxt.Contains("<"))
            {
                strTxt = strTxt.Replace("<", "&lt;");
            }
            if (strTxt.Contains(">"))
            {
                strTxt = strTxt.Replace(">", "&gt;");
            }
            if (strTxt.Contains("\""))
            {
                strTxt = strTxt.Replace("\"", "&quot;");
            }
            if (strTxt.Contains("'"))
            {
                strTxt = strTxt.Replace("'", "&apos;");
            }
            return strTxt;
        }

 
LVL 20

Expert Comment

by:informaniac
ID: 34914085
Html values?
 
LVL 2

Author Comment

by:saloj
ID: 34914095
Hi Sudhakar,
Thanks for your quick response. I am using your code from the previous solution.
In the following code, when value = "\"",
every " in the string is replaced with &quot;. How can I prevent that?

if (value != "")
           {
          htmlData = htmlData.Replace(value, XmlSpecial(value));
         }
 
LVL 11

Accepted Solution

by:
Sudhakar Pulivarthi earned 500 total points
ID: 34914160
Hi Saloj,

Please replace the Replace statement in the code with the following:

xmlData = xmlData.Replace(">" + value + "<", ">" + XmlSpecial(value) + "<");

This will work!!!
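
The difference between the two Replace calls can be checked on a small input. This is only an illustrative sketch: ReplaceDemo, EscapeQuotes, ReplaceEverywhere and ReplaceBetweenTags are made-up names, and EscapeQuotes is a simplified stand-in for XmlSpecial that handles double quotes only.

```csharp
using System;

public static class ReplaceDemo
{
    // Simplified stand-in for XmlSpecial (assumption): escapes double quotes only.
    public static string EscapeQuotes(string text)
    {
        return text.Replace("\"", "&quot;");
    }

    // Original statement: replaces the value wherever it occurs, including inside tags.
    public static string ReplaceEverywhere(string html, string value)
    {
        return html.Replace(value, EscapeQuotes(value));
    }

    // Suggested statement: replaces the value only where it sits directly between '>' and '<'.
    public static string ReplaceBetweenTags(string html, string value)
    {
        return html.Replace(">" + value + "<", ">" + EscapeQuotes(value) + "<");
    }

    public static void Main()
    {
        string html = "<p class=\"ss\"><br />\"<i>sds</i>\"</p>";
        string value = "\""; // the text found between <br /> and <i>

        Console.WriteLine(ReplaceEverywhere(html, value));
        Console.WriteLine(ReplaceBetweenTags(html, value));
    }
}
```

The first line printed has the quotes of the class attribute turned into &quot; as well, while the second keeps class="ss" intact and escapes only the two quotes in the text.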
 
LVL 11

Expert Comment

by:Sudhakar Pulivarthi
ID: 34914179

Actually, one of the tag values was a " (double quote), so the previous code replaced " with &quot; throughout the source string. This replaced every occurrence of ", including those inside the tags.

The fix is to include the boundaries (> and <) along with the value, so that the replacement happens only where the parsed value sits between tags.

Happy Working... take care
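
An alternative that avoids Replace altogether is to rebuild the string in a single pass and escape only the characters that fall outside tags. The following is a minimal sketch and not the code from this thread: TextEscaper and EscapeTextOnly are hypothetical names. It assumes that every '<' opens a tag and that no attribute value contains '>', it ignores comments, CDATA and script blocks, and it leaves ampersands untouched to keep the example short.

```csharp
using System;
using System.Text;

public static class TextEscaper
{
    // Hypothetical helper: escapes quotes and stray '>' in the text between tags
    // and copies everything inside tags unchanged.
    public static string EscapeTextOnly(string html)
    {
        var result = new StringBuilder(html.Length);
        bool insideTag = false;

        foreach (char c in html)
        {
            if (c == '<')
            {
                // Assumption: every '<' starts a tag.
                insideTag = true;
                result.Append(c);
            }
            else if (insideTag)
            {
                // Copy tag content as-is; a '>' closes the tag.
                if (c == '>')
                {
                    insideTag = false;
                }
                result.Append(c);
            }
            else
            {
                // Outside a tag: escape the special characters.
                switch (c)
                {
                    case '"': result.Append("&quot;"); break;
                    case '\'': result.Append("&apos;"); break;
                    case '>': result.Append("&gt;"); break;
                    default: result.Append(c); break;
                }
            }
        }

        return result.ToString();
    }
}
```

For example, TextEscaper.EscapeTextOnly("<p class=\"a\">\"x\"</p>") returns <p class="a">&quot;x&quot;</p>. Because the input is never searched for the value, identical text inside an attribute cannot be touched by accident. For arbitrary real-world HTML, a proper parser such as the Html Agility Pack mentioned earlier is likely the safer choice.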
 
LVL 2

Author Comment

by:saloj
ID: 34923312
Thanks Sudhakar!
you have great technique!

Happy working!
Take care
 
LVL 2

Author Closing Comment

by:saloj
ID: 34923316
EXCELLENT !!!