Solved

parse illegal characters

Posted on 2011-02-16
Last Modified: 2012-05-11
Hi Experts,
I have the following HTML text as a sample, which I need to parse for illegal characters.
The parsing should apply only to the HTML values (the text between tags), not to the HTML tags themselves.

For example:
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:mce="mce">
<body>
<p class ="test">afa sf  & as ffaf</p><p class="ss">adssfa fa sfasf <br />"<i>sds</i>"</p>
</body>
</html>


Any help would be appreciated.

Thanks
Question by:saloj
LVL 9

Expert Comment

by:s_chilkury
Check the following:
http://www.codeproject.com/Articles/57176/Parsing-HTML-Tags-in-Csharp.aspx

Alternatively, you can use the HTML Agility Pack, which does the same thing.
 
LVL 11

Expert Comment

by:Sudhakar Pulivarthi
Hi Saloj,

I am attaching code that parses the values present in the HTML data.
Please take a look; it might be useful, and I can expand the code further if needed. Which illegal characters are you looking for, and what do you want to do with them?
public static string ParseTheValuesInHTML(string htmlData)
        {
            int startIndex = 0;

            if (htmlData.Contains(">"))
            {
                // Process until there are no more '>' characters.
                while (startIndex >= 0)
                {
                    // Index of the next '<' and of the '>' that ends the current tag.
                    int nextCount = htmlData.IndexOf("<", startIndex + 1);
                    int tagEnd = htmlData.IndexOf(">", startIndex);
                    int lastIndex = nextCount - tagEnd;

                    if (nextCount > 0 && lastIndex > 0)
                    {
                        // Find the text between '>' and '<'.
                        string value = htmlData.Substring(tagEnd + 1, lastIndex - 1);

                        if (value != "")
                        {
                            // The value string is the text between two HTML tags.
                            // Check it for illegal characters here and process it as required.
                        }
                    }

                    // Find the end of the next tag.
                    startIndex = htmlData.IndexOf(">", startIndex + 1);
                }
            }

            return htmlData;
        }
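
As a minimal sketch of what the empty `if (value != "")` branch could feed into, the same "text between > and <" scan can collect the values in a list so they can be inspected separately from the tags. The class and method names (`HtmlTextValues`, `Extract`) are hypothetical and not part of the code above. The sketch assumes no `>` or `<` appears inside attribute values, comments, or scripts.

```csharp
using System;
using System.Collections.Generic;

public static class HtmlTextValues
{
    // Hypothetical helper: returns every non-empty run of text that sits
    // between a '>' and the next '<'. Tags themselves are never returned.
    public static List<string> Extract(string html)
    {
        var values = new List<string>();
        int gt = html.IndexOf('>');
        while (gt >= 0)
        {
            int lt = html.IndexOf('<', gt + 1);
            if (lt < 0)
            {
                break; // no further tag, so there is no enclosed value
            }
            if (lt > gt + 1)
            {
                values.Add(html.Substring(gt + 1, lt - gt - 1));
            }
            // Jump to the end of the tag that starts at lt.
            gt = html.IndexOf('>', lt + 1);
        }
        return values;
    }
}
```

Calling `HtmlTextValues.Extract(htmlData)` on the sample from the question would yield the paragraph texts and the two standalone quote characters, which can then be checked for illegal characters one by one.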

 
LVL 2

Author Comment

by:saloj
Hi EE, I have the following string. When I parse it with the code below, it processes all of the data, including the HTML tags, but I only need to process the values.
Can anybody help me with the code?


string strA = "<html xmlns=\"http://www.w3.org/1999/xhtml\" xmlns:mce=\"mce\"><head><style></style><title>sadfafsfsdf asdf asd sadf saf</title></head><body class=\"hhh\"><p style=\"margin-top: 0pt; margin-right: 0pt; margin-bottom: 0pt; margin-left: 0pt\">content has posted the latest <a href=\"http://www.xyz.com/media/763/Dynacor_Gold_Mines_TSX:_DNG_News_Alert/\" target=\"_blank\">video</a> \"News Alert\" for Dynacor Gold Mines Inc. If the link is unavailable, please visit <a href=\"http://www.xyz.com/\" target=\"_blank\">www.xyz.com</a> and enter \"Dynacor\" in the search box.</p><p class=\"hhh\" style=\"margin-top: 0pt; margin-right: 0pt;  margin-left: 0pt; margin-bottom:0pt;\">According to Metanor&apos;s press release:<br class=\"hhh\" />\"<i class=\"hhh\">S of US$500 (the \"Per Ounce Payments\") and the then prevailing market price per ounce of gold. Sandstorm will (i) US$5 million upon signing of the </i><i class=\"hhh\">agreement, (ii) US$9 million,once Metan</i>.\"</p></body></html>";
ParseTheValuesInXML(strA);

public static string ParseTheValuesInXML(string xmlData)
        {
            int startIndex = 0;

            if (xmlData.Contains(">"))
            {
                // Process until there are no more '>' characters.
                while (startIndex >= 0)
                {
                    int nextCount = xmlData.IndexOf("<", startIndex + 1);
                    //int lastIndex = nextCount - xmlData.IndexOf(">", startIndex);

                    //if (nextCount > 0 && lastIndex > 0)
                    if (nextCount > 0)
                    {
                        // Find the text between > and <. 
                        string value = xmlData.Substring(xmlData.IndexOf(">", startIndex) + 1, nextCount - xmlData.IndexOf(">", startIndex) - 1);
                        if (value != "")
                        {                            
                            // Replace the text with xml special data.
                            xmlData = xmlData.Replace(value, XmlSpecial(value));
                        }
                    }

                    // Find the next tag.
                    startIndex = xmlData.IndexOf(">", startIndex + 1);
                }
            }
            return xmlData;
        }


    private static string XmlSpecial(string strTxt)
        {
            if (strTxt.Contains("&"))
            {
                // Convert value to all special xml character.
                strTxt = strTxt.Replace("&amp;", "&");
                // Start replacing the specials.
                strTxt = strTxt.Replace("&", "&amp;");
                strTxt = strTxt.Replace("&amp;apos;", "&apos;");
                strTxt = strTxt.Replace("&amp;quot;", "&quot;");
                strTxt = strTxt.Replace("&amp;lt;", "&lt;");
                strTxt = strTxt.Replace("&amp;gt;", "&gt;");
            }
            if (strTxt.Contains("<"))
            {
                strTxt = strTxt.Replace("<", "&lt;");
            }
            if (strTxt.Contains(">"))
            {
                strTxt = strTxt.Replace(">", "&gt;");
            }
            if (strTxt.Contains("\""))
            {
                strTxt = strTxt.Replace("\"", "&quot;");
            }
            if (strTxt.Contains("'"))
            {
                strTxt = strTxt.Replace("'", "&apos;");
            }
            return strTxt;
        }

 
LVL 20

Expert Comment

by:informaniac
Html values?
 
LVL 2

Author Comment

by:saloj
Hi Sudhakar,
Thanks for your quick response. I am using your code from the previous solution.
In the following code, when value = "\"", it replaces every " in the string with &quot;, including those inside the tags. How can I restrict the replacement to the value?

if (value != "")
{
    htmlData = htmlData.Replace(value, XmlSpecial(value));
}
 
LVL 11

Accepted Solution

by:
Sudhakar Pulivarthi earned 500 total points
Hi Saloj,

Please replace the Replace statement in the code with the following:

xmlData = xmlData.Replace(">" + value + "<", ">" + XmlSpecial(value) + "<");

This will work!!!
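
The difference between the two statements can be sketched with a small comparison. This is an illustration only: `ReplaceDemo`, `Unbounded`, `Bounded`, and `EscapeQuote` are hypothetical names, and `EscapeQuote` stands in for the thread's `XmlSpecial` by handling just the double quote.

```csharp
using System;

public static class ReplaceDemo
{
    // Stand-in for XmlSpecial: escapes only the double quote.
    static string EscapeQuote(string s) => s.Replace("\"", "&quot;");

    // Original statement: rewrites every occurrence of value,
    // including the quotes around attribute values inside tags.
    public static string Unbounded(string html, string value) =>
        html.Replace(value, EscapeQuote(value));

    // Suggested statement: matches the value only where it sits
    // directly between a '>' and a '<'.
    public static string Bounded(string html, string value) =>
        html.Replace(">" + value + "<", ">" + EscapeQuote(value) + "<");
}
```

For the input `<p class="ss">"<i>sds</i>"</p>` and the value `"`, `Unbounded` also rewrites `class="ss"`, while `Bounded` leaves the attribute alone and escapes only the two quotes that are text.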
 
LVL 11

Expert Comment

by:Sudhakar Pulivarthi
One of the tag values was " (a quote), so the previous code replaced " with &quot; throughout the source string. This replaced every occurrence of ", even those inside the tags.

The fix is to include the boundaries (> and <) together with the value, so that the replacement happens only where the parsed value appears between tags.
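
An alternative that avoids `Replace` on the whole string is to walk the input once and escape characters only while outside a tag. The following is a rough sketch under stated assumptions, not the approach used in this thread: `TextNodeEscaper` and its members are hypothetical names, and the sketch assumes that `<` and `>` never occur inside attribute values, comments, or scripts. The five predefined XML entities are left as they are, and any other `&` becomes `&amp;`.

```csharp
using System;
using System.Text;

public static class TextNodeEscaper
{
    // The five predefined XML entities; an '&' that starts one of these is kept.
    static readonly string[] KnownEntities =
        { "&amp;", "&lt;", "&gt;", "&quot;", "&apos;" };

    public static string EscapeTextNodes(string html)
    {
        var result = new StringBuilder(html.Length + 16);
        bool insideTag = false;

        for (int i = 0; i < html.Length; i++)
        {
            char c = html[i];
            if (c == '<') insideTag = true;

            if (insideTag)
            {
                // Tag content is copied verbatim, attribute quotes included.
                result.Append(c);
                if (c == '>') insideTag = false;
            }
            else if (c == '&')
                result.Append(StartsWithEntity(html, i) ? "&" : "&amp;");
            else if (c == '"')
                result.Append("&quot;");
            else if (c == '\'')
                result.Append("&apos;");
            else
                result.Append(c);
        }
        return result.ToString();
    }

    // True when one of the known entities begins at the given index.
    static bool StartsWithEntity(string s, int index)
    {
        foreach (string entity in KnownEntities)
        {
            if (string.CompareOrdinal(s, index, entity, 0, entity.Length) == 0)
                return true;
        }
        return false;
    }
}
```

Because each character is visited exactly once, a value that repeats elsewhere in the document, or that also appears inside a tag, cannot be rewritten by accident.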

Happy Working... take care
 
LVL 2

Author Comment

by:saloj
Thanks Sudhakar!
you have great technique!

Happy working!
Take care
 
LVL 2

Author Closing Comment

by:saloj
EXCELLENT !!!