Solved

parse illegal characters

Posted on 2011-02-16
Last Modified: 2012-05-11
Hi Experts,
I have the following HTML text as a sample, which I need to parse for illegal characters.
The parsing should apply only to the HTML values (the text between tags), not to the HTML tags themselves.

eg.
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:mce="mce">
<body>
<p class ="test">afa sf  & as ffaf</p><p class="ss">adssfa fa sfasf <br />"<i>sds</i>"</p>
</body>
</html>


Any help would be appreciated.

Thanks
Question by:saloj
10 Comments
 
LVL 9

Expert Comment

by:s_chilkury
ID: 34913824
Check the following:
http://www.codeproject.com/Articles/57176/Parsing-HTML-Tags-in-Csharp.aspx

You can also use the Html Agility Pack, which does the same.
 
LVL 8

Expert Comment

by:jimsweb
ID: 34913880
 
LVL 11

Expert Comment

by:Sudhakar Pulivarthi
ID: 34914037
Hi Saloj,

I am attaching code that parses the values present in the HTML data.
Please take a look; it might be useful, and I can extend it if needed. Which illegal characters are you looking for, and what do you want to do with them?
public static string ParseTheValuesInHTML(string htmlData)
        {
            int startIndex = 0;

            if (htmlData.Contains(">"))
            {
                // Process until the last occurrence of <.
                while (startIndex >= 0)
                {
                    int nextCount = htmlData.IndexOf("<", startIndex + 1);
                    int lastIndex = nextCount - htmlData.IndexOf(">", startIndex);

                    if (nextCount > 0 && lastIndex > 0)
                    {
                        // Find the text between > and <. 
                        string value = htmlData.Substring(htmlData.IndexOf(">", startIndex) + 1, lastIndex - 1);

                        if (value != "")
                        {
                            // The value string is the text found between two HTML tags.
                            // Check it for illegal characters here and process it as needed.
                        }
                    }

                    // Find the next tag.
                    startIndex = htmlData.IndexOf(">", startIndex + 1);
                }
            }

            return htmlData;
        }
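
The empty `if (value != "")` branch above is where the check would go. As a minimal sketch of one way to fill it in, the code below pulls out the text between tags with a regular expression instead of index arithmetic, then flags values that would need escaping. The names `TextValueScanner`, `ExtractValues` and `HasIllegalChars` are hypothetical, and the definition of "illegal" (a raw quote, a raw apostrophe, or an ampersand that does not start an entity) is an assumption based on the characters discussed in this thread.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TextValueScanner
{
    // Matches the text that sits between a closing '>' and the next '<'.
    private static readonly Regex TextBetweenTags = new Regex(">([^<>]+)<");

    // Returns every non-blank text value found between tags, in document order.
    public static List<string> ExtractValues(string html)
    {
        var values = new List<string>();
        foreach (Match m in TextBetweenTags.Matches(html))
        {
            string value = m.Groups[1].Value;
            if (value.Trim().Length > 0)
            {
                values.Add(value);
            }
        }
        return values;
    }

    // Assumption: "illegal" means a raw quote, a raw apostrophe, or an
    // ampersand that is not the start of a named or numeric entity.
    public static bool HasIllegalChars(string value)
    {
        return Regex.IsMatch(value, "[\"']|&(?![A-Za-z]+;|#[0-9]+;|#x[0-9A-Fa-f]+;)");
    }

    public static void Main()
    {
        string html = "<p class =\"test\">afa sf  & as ffaf</p><p class=\"ss\">adssfa fa sfasf <br />\"<i>sds</i>\"</p>";
        foreach (string value in ExtractValues(html))
        {
            Console.WriteLine("[" + value + "] needs escaping: " + HasIllegalChars(value));
        }
    }
}
```

Because the pattern only looks at text between `>` and `<`, attribute values such as `class="ss"` are never inspected. This is still plain text matching, not a real HTML parser, so comments or script blocks containing `<` or `>` would confuse it.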

 
LVL 2

Author Comment

by:saloj
ID: 34914039
Hi EE, I have the following string. When I parse it with the code below, it processes all of the data, including the HTML tags, but I only need to process the values.
Can anybody help me with the code?


string strA = "<html xmlns=\"http://www.w3.org/1999/xhtml\" xmlns:mce=\"mce\"><head><style></style><title>sadfafsfsdf asdf asd sadf saf</title></head><body class=\"hhh\"><p style=\"margin-top: 0pt; margin-right: 0pt; margin-bottom: 0pt; margin-left: 0pt\">content has posted the latest <a href=\"http://www.xyz.com/media/763/Dynacor_Gold_Mines_TSX:_DNG_News_Alert/\" target=\"_blank\">video</a> \"News Alert\" for Dynacor Gold Mines Inc. If the link is unavailable, please visit <a href=\"http://www.xyz.com/\" target=\"_blank\">www.xyz.com</a> and enter \"Dynacor\" in the search box.</p><p class=\"hhh\" style=\"margin-top: 0pt; margin-right: 0pt;  margin-left: 0pt; margin-bottom:0pt;\">According to Metanor&apos;s press release:<br class=\"hhh\" />\"<i class=\"hhh\">S of US$500 (the \"Per Ounce Payments\") and the then prevailing market price per ounce of gold. Sandstorm will (i) US$5 million upon signing of the </i><i class=\"hhh\">agreement, (ii) US$9 million,once Metan</i>.\"</p></body></html>";
ParseTheValuesInXML(strA);

public static string ParseTheValuesInXML(string xmlData)
        {
            int startIndex = 0;

            if (xmlData.Contains(">"))
            {
                // Process until the last occurrence of <.
                while (startIndex >= 0)
                {
                    int nextCount = xmlData.IndexOf("<", startIndex + 1);
                    //int lastIndex = nextCount - xmlData.IndexOf(">", startIndex);

                    //if (nextCount > 0 && lastIndex > 0)
                    if (nextCount > 0)
                    {
                        // Find the text between > and <. 
                        string value = xmlData.Substring(xmlData.IndexOf(">", startIndex) + 1, nextCount - xmlData.IndexOf(">", startIndex) - 1);
                        if (value != "")
                        {                            
                            // Replace the text with its XML-escaped form.
                            xmlData = xmlData.Replace(value, XmlSpecial(value));
                        }
                    }

                    // Find the next tag.
                    startIndex = xmlData.IndexOf(">", startIndex + 1);
                }
            }
            return xmlData;
        }


    private static string XmlSpecial(string strTxt)
        {
            if (strTxt.Contains("&"))
            {
                // Unescape existing &amp; first so ampersands are not escaped twice.
                strTxt = strTxt.Replace("&amp;", "&");
                // Escape every ampersand, then restore the entities that were already escaped.
                strTxt = strTxt.Replace("&", "&amp;");
                strTxt = strTxt.Replace("&amp;apos;", "&apos;");
                strTxt = strTxt.Replace("&amp;quot;", "&quot;");
                strTxt = strTxt.Replace("&amp;lt;", "&lt;");
                strTxt = strTxt.Replace("&amp;gt;", "&gt;");
            }
            if (strTxt.Contains("<"))
            {
                strTxt = strTxt.Replace("<", "&lt;");
            }
            if (strTxt.Contains(">"))
            {
                strTxt = strTxt.Replace(">", "&gt;");
            }
            if (strTxt.Contains("\""))
            {
                strTxt = strTxt.Replace("\"", "&quot;");
            }
            if (strTxt.Contains("'"))
            {
                strTxt = strTxt.Replace("'", "&apos;");
            }
            return strTxt;
        }
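
The order of the replacements in XmlSpecial matters: ampersands are normalised first, and only then are the other characters escaped. The following is a condensed sketch of that same order of operations, with a few sample inputs to show that text which already contains entities is not escaped a second time. The class and method names (`EscapeOrderDemo`, `EscapeValue`) are hypothetical and not part of the code above.

```csharp
using System;

public static class EscapeOrderDemo
{
    public static string EscapeValue(string text)
    {
        // Collapse already-escaped ampersands so they are not escaped twice.
        text = text.Replace("&amp;", "&");
        // Escape every ampersand, including those that start other entities.
        text = text.Replace("&", "&amp;");
        // Restore the four other predefined entities broken by the previous line.
        foreach (string name in new[] { "apos", "quot", "lt", "gt" })
        {
            text = text.Replace("&amp;" + name + ";", "&" + name + ";");
        }
        // Finally escape the remaining raw special characters.
        return text.Replace("<", "&lt;").Replace(">", "&gt;")
                   .Replace("\"", "&quot;").Replace("'", "&apos;");
    }

    public static void Main()
    {
        Console.WriteLine(EscapeValue("afa sf  & as ffaf")); // afa sf  &amp; as ffaf
        Console.WriteLine(EscapeValue("Metanor&apos;s"));    // Metanor&apos;s
        Console.WriteLine(EscapeValue("\""));                // &quot;
    }
}
```

Running the function on its own output returns the same string, which is why it is safe to call it on values that were partly escaped already.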

 
LVL 20

Expert Comment

by:informaniac
ID: 34914085
Html values?
 
LVL 2

Author Comment

by:saloj
ID: 34914095
Hi Sudhakar,
Thanks for your quick response. I am using your code from the previous solution.
With the following code, when value = "\"", it replaces every quote in the string with &quot;, including the quotes inside tags. How can I limit it to the value only?

if (value != "")
{
    htmlData = htmlData.Replace(value, XmlSpecial(value));
}
 
LVL 11

Accepted Solution

by:
Sudhakar Pulivarthi earned 500 total points
ID: 34914160
Hi Saloj,

Please replace the Replace statement in the code with this one:

xmlData = xmlData.Replace(">" + value + "<", ">" + XmlSpecial(value) + "<");

This will work!!!
 
LVL 11

Expert Comment

by:Sudhakar Pulivarthi
ID: 34914179

One of the tag values was " (a quote), so the previous code replaced " with &quot; throughout the source string. That replaced every occurrence of ", including those inside the tags.

The fix includes the boundaries (> and <) along with the value, so that the replacement only happens where the parsed value appears between tags.
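
The difference described above can be seen in a small standalone sketch. The class and method names (`ReplaceScopeDemo`, `ReplaceEverywhere`, `ReplaceBetweenTags`) are made up for this illustration; the two `Replace` calls mirror the original statement and the corrected one.

```csharp
using System;

public static class ReplaceScopeDemo
{
    // Unbounded: every occurrence of value is rewritten, including ones inside tags.
    public static string ReplaceEverywhere(string html, string value, string escaped)
    {
        return html.Replace(value, escaped);
    }

    // Bounded: only occurrences sitting exactly between '>' and '<' are rewritten.
    public static string ReplaceBetweenTags(string html, string value, string escaped)
    {
        return html.Replace(">" + value + "<", ">" + escaped + "<");
    }

    public static void Main()
    {
        string html = "<p class=\"ss\"><br />\"<i>sds</i>\"</p>";

        // Prints: <p class=&quot;ss&quot;><br />&quot;<i>sds</i>&quot;</p>
        Console.WriteLine(ReplaceEverywhere(html, "\"", "&quot;"));

        // Prints: <p class="ss"><br />&quot;<i>sds</i>&quot;</p>
        Console.WriteLine(ReplaceBetweenTags(html, "\"", "&quot;"));
    }
}
```

In the first result the attribute `class="ss"` is damaged, while in the second only the two quotes that appear as text are escaped.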

Happy Working... take care
 
LVL 2

Author Comment

by:saloj
ID: 34923312
Thanks Sudhakar!
you have great technique!

Happy working!
Take care
 
LVL 2

Author Closing Comment

by:saloj
ID: 34923316
EXCELLENT !!!