Reading XML using C# and determining the correct elements.

After receiving great help reading an XML file that contains data sources, I thought it would be simple to follow up by reading the corresponding XML file that contains the programs that read this data. The layout of this XML file is completely different. All I want is the value contained on each "Header Description" line. I have attached the input file. The code I modified below keeps crashing with a bad character message, so I can only assume I am not parsing the file correctly. In this case I believe there are no descendants, so how do I simply extract the value for Header Description? I also see the message "The ' ' character, hexadecimal value 0x20, cannot be included in a name." The name Header Description has a space in it. How do I work around this?

        private void btnExtract_Click(object sender, EventArgs e)
        {
            rc = 0;

            TextWriter tw = new StreamWriter(OutputFilename);

            XDocument xml = XDocument.Load(InputFilename);

            var tables = from tab in xml.Descendants("Header Description")
                         select new
                         {
                             HeaderDescription = tab.Attribute("Header Description") == null ? null : tab.Attribute("Header Description").Value,
                         };

            foreach (var tab in tables)
            {
                //
                /* Console.WriteLine("Physical Name = " + tab.PhysicalName + ", Data Source = " + tab.data_source + ", Name=" + tab.name); */

                if (tab.HeaderDescription != null)
                {
                    rc++;
                    tw.WriteLine(rc.ToString() + "-Physical Name = " + tab.HeaderDescription);
                }
            }

            tw.Close();

            MessageBox.Show("Program extraction complete.");
        }

Programs.zip
rwheeler23 Asked:

Bill Prew Commented:
Looking very quickly, but XML node names cannot contain spaces.  That being said, it doesn't look like yours do.  In the line of your XML file:

  <Header Description="Main Program">

"Header" is the node name, and "Description" is an attribute of the node.

So you would at least want to change to below, haven't tested this to see if it gets you further...

        private void btnExtract_Click(object sender, EventArgs e)
        {
            rc = 0;

            TextWriter tw = new StreamWriter(OutputFilename);

            XDocument xml = XDocument.Load(InputFilename);

            var tables = from tab in xml.Descendants("Header")
                         select new
                         {
                             HeaderDescription = tab.Attribute("Description") == null ? null : tab.Attribute("Description").Value,
                         };

            foreach (var tab in tables)
            {
                //
                /* Console.WriteLine("Physical Name = " + tab.PhysicalName + ", Data Source = " + tab.data_source + ", Name=" + tab.name); */

                if (tab.HeaderDescription != null)
                {
                    rc++;
                    tw.WriteLine(rc.ToString() + "-Physical Name = " + tab.HeaderDescription);
                }
            }

            tw.Close();

            MessageBox.Show("Program extraction complete.");
        }


»bp

rwheeler23 (Author) Commented:
There must be some more junk in this file. Now I get a message about invalid character 0x0C.  I can update the if statement to ignore that.

Bill Prew Commented:
Well, in the file you posted (in the ZIP) I don't see any 0x0C characters in there (form feed)...


»bp

rwheeler23 (Author) Commented:
All I know is the message I get back is

"The hexadecimal value 0x0C is an invalid character. Line 783819"
Is there a way in my if statement to exclude anything that contains something other than [0-9] and [A-Z]?

        private void btnExtract_Click(object sender, EventArgs e)
        {
            rc = 0;

            TextWriter tw = new StreamWriter(OutputFilename);

            XDocument xml = XDocument.Load(InputFilename);

            var tables = from tab in xml.Descendants("Header")
                         select new
                         {
                             HeaderDescription = tab.Attribute("Description") == null ? null : tab.Attribute("Description").Value
                         };

            foreach (var tab in tables)
            {
                //
                /* Console.WriteLine("Physical Name = " + tab.PhysicalName + ", Data Source = " + tab.data_source + ", Name=" + tab.name); */

                // Skip values that contain a form feed (0x0C); IndexOf returns -1 when it is absent.
                if (tab.HeaderDescription != null && tab.HeaderDescription.IndexOf((char)0x0C) < 0)
                {
                    try
                    {
                        rc++;
                        tw.WriteLine(rc.ToString() + "-Physical Name = " + tab.HeaderDescription);
                    }
                    catch (Exception ex)
                    {
                        string eMsg = "002: Error - Writing to output file: " + ex.Message;
                        if (Model.StackTraceWanted) eMsg += "\n" + ex.StackTrace;
                        MessageBox.Show(eMsg);
                    }

                }
            }

            tw.Close();

            MessageBox.Show("Program extraction complete.");
        }
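
On the [0-9] and [A-Z] question, here is a minimal sketch of such a filter. The helper name IsPlainAlphanumeric is hypothetical and not part of the code above. Two caveats: a description such as "Main Program" would be rejected because of its space and lowercase letters, and the "Line 783819" in the error suggests the exception is raised while XDocument.Load parses the file, before this loop ever runs, so a filter at this point probably cannot suppress it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DescriptionFilter
{
    // Hypothetical helper: true only when the value is non-empty and consists
    // entirely of the characters 0-9 and A-Z.
    public static bool IsPlainAlphanumeric(string value)
    {
        return value != null && Regex.IsMatch(value, @"\A[0-9A-Z]+\z");
    }

    public static void Main()
    {
        // Made-up sample values; the last one contains a form feed (0x0C).
        foreach (string s in new[] { "PROG01", "Main Program", "BAD\fVALUE" })
        {
            Console.WriteLine(IsPlainAlphanumeric(s) + " <- " + s.Replace("\f", "\\f"));
        }
    }
}
```

If spaces and lowercase letters should be allowed, the character class would need widening, for example to [0-9A-Za-z ].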

rwheeler23 (Author) Commented:
Could it be that this file is simply too large? It is 55MB. I just tried viewing it using the XML Handler and it comes back saying the file is too large. I tried reading a smaller file and there was no issue.

Bill Prew Commented:
Oh, duh, I was looking for that value as is in the bytes of your file.  I see now that you have this syntax twice in your file:

<Text id="19" valUnicode="&#xC;"/>


I suspect that is the problem.  Not sure how to tell you to get around that though.


»bp

rwheeler23 (Author) Commented:
Let me ask the vendor why that is there.

rwheeler23 (Author) Commented:
If necessary I could pre-parse it with a text field parser (TFP) and strip away these lines. I know the TFP works, so I would first strip away these lines and then proceed to read the XML file.

Bill Prew Commented:
Yes, removing them before loading as XML crossed my mind as well.


»bp

slightwv (Netminder) Commented:
>>Let me ask the vendor why that is there.

You can ask but it is valid XML so it shouldn't matter.

As with your previous related questions:  Provide a small test case that shows the issue.  We don't need the 55 Meg XML file or even the 5 Meg one from your previous question.  A 50 line XML file and code that reproduces the error is all we need.

rwheeler23 (Author) Commented:
It is definitely the two lines that contain
<Text id="19" valUnicode="&#xC;"/>
that cause the program to crash. If I remove these two lines my program completely parses the file.
All I used was Notepad to replace these strings with blanks. Is there something in C# I can call that would parse this entire file and replace this offending string?
Programs.zip
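
One way to do the Notepad step in code is to clean the text before handing it to XDocument. The sketch below is only an illustration: XmlScrubber and StripInvalidCharRefs are hypothetical names, and the sample XML is made up apart from the Text line quoted above. It removes numeric character references (such as &#xC;) that point at characters XML 1.0 does not allow, and leaves every other reference alone.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;
using System.Xml;
using System.Xml.Linq;

public static class XmlScrubber
{
    // Matches numeric character references such as &#xC; (hex) or &#12; (decimal).
    private static readonly Regex CharRef =
        new Regex(@"&#(?:x([0-9A-Fa-f]+)|([0-9]+));");

    // Hypothetical helper: drops references to characters that XML 1.0 forbids.
    public static string StripInvalidCharRefs(string xmlText)
    {
        return CharRef.Replace(xmlText, m =>
        {
            bool isHex = m.Groups[1].Success;
            string digits = isHex ? m.Groups[1].Value : m.Groups[2].Value;
            int code;
            bool parsed = int.TryParse(
                digits,
                isHex ? NumberStyles.AllowHexSpecifier : NumberStyles.None,
                CultureInfo.InvariantCulture,
                out code);

            // Leave anything this sketch cannot judge (overflow, above U+FFFF) untouched.
            if (!parsed || code < 0 || code > 0xFFFF) return m.Value;

            return XmlConvert.IsXmlChar((char)code) ? m.Value : string.Empty;
        });
    }

    public static void Main()
    {
        // Made-up sample modelled on the line quoted in this thread.
        string raw = "<Task><Header Description=\"Main Program\"/>" +
                     "<Text id=\"19\" valUnicode=\"&#xC;\"/></Task>";

        XDocument doc = XDocument.Parse(StripInvalidCharRefs(raw));
        foreach (XElement header in doc.Descendants("Header"))
        {
            Console.WriteLine((string)header.Attribute("Description"));
        }
    }
}
```

For the real file, the input string could come from File.ReadAllText(InputFilename). A 55MB file should fit in memory on most machines, though that is an assumption about the environment.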

slightwv (Netminder) Commented:
I stand corrected.  I should have dug deeper than just the syntax and looked in the actual spec.

The XML 1.0 spec doesn't allow control characters:
https://en.wikipedia.org/wiki/Valid_characters_in_XML

So, the XML technically isn't valid.

If you have a way to strip them before processing, then go for it.  That would be easier than trying to hack something else together.
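
If stripping is not convenient, another option that may be worth trying is to keep LINQ to XML but load through an XmlReader whose settings have CheckCharacters turned off. The .NET documentation describes that setting as disabling character checking for character entity references, which is the form the form feed takes here (&#xC;). The sketch below uses a made-up sample and a hypothetical helper name, LoadLenient. It has not been run against the 55MB file.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class LenientLoader
{
    // Hypothetical helper: loads XML without validating numeric character references.
    public static XDocument LoadLenient(TextReader source)
    {
        var settings = new XmlReaderSettings { CheckCharacters = false };
        using (XmlReader reader = XmlReader.Create(source, settings))
        {
            return XDocument.Load(reader);
        }
    }

    public static void Main()
    {
        // Made-up sample modelled on the line quoted in this thread.
        string sample = "<Task><Header Description=\"Main Program\"/>" +
                        "<Text id=\"19\" valUnicode=\"&#xC;\"/></Task>";

        XDocument doc = LoadLenient(new StringReader(sample));
        foreach (XElement header in doc.Descendants("Header"))
        {
            Console.WriteLine((string)header.Attribute("Description"));
        }
    }
}
```

For a file on disk, XmlReader.Create also accepts a path in place of the TextReader. Note that the loaded document still contains the form feed, so writing it back out may fail unless the writer is configured to skip character checks as well.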

rwheeler23 (Author) Commented:
The vendor did not want to talk about this so I will strip away these characters.

slightwv (Netminder) Commented:
>>The vendor did not want to talk about this

lol.... if they produce invalid XML, no wonder.

rwheeler23 (Author) Commented:
I asked someone on the inside what was going on and the reply was:

"The parser runs another function to convert element/attribute content to an xml valid string.
Linq won't help you here. You need an xml parser with the ability to convert the content as part of the parsing operation."

What does this mean and how would I do this?

Bill Prew Commented:
Have you tried parsing it using XmlDocument and XPath rather than LINQ?


»bp

rwheeler23 (Author) Commented:
I need to educate myself on the differences between the two. There is also something else I need to figure out how to do and that is how to determine which node I am on. What I mean by this is I have discovered that under each Header Description there could be another header description and then another under that and so on and so on. In looking at the source xml file I do not see how to differentiate one node from another.
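
One way to tell nested headers apart, assuming the sub-tasks really are Header elements nested inside other Header elements (the real file may nest them under a different element), is to count how many Header ancestors each one has. The sketch below uses a made-up sample and a hypothetical helper named Describe. Depth 0 would be the program itself and deeper levels its sub-tasks.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class HeaderOutline
{
    // Hypothetical helper: one line per Header, indented two spaces per nesting level.
    public static List<string> Describe(XDocument doc)
    {
        var lines = new List<string>();
        foreach (XElement header in doc.Descendants("Header"))
        {
            // Number of enclosing Header elements = nesting depth.
            int depth = header.Ancestors("Header").Count();
            string description = (string)header.Attribute("Description") ?? "(none)";
            lines.Add(new string(' ', depth * 2) + description);
        }
        return lines;
    }

    public static void Main()
    {
        // Made-up sample; element nesting in the real file is an assumption.
        XDocument doc = XDocument.Parse(
            "<Programs>" +
              "<Header Description=\"Main Program\">" +
                "<Header Description=\"Sub Task A\">" +
                  "<Header Description=\"Sub Task A1\"/>" +
                "</Header>" +
                "<Header Description=\"Sub Task B\"/>" +
              "</Header>" +
            "</Programs>");

        foreach (string line in Describe(doc))
        {
            Console.WriteLine(line);
        }
    }
}
```

To keep only the program names, the loop could skip every header whose depth is greater than 0.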

slightwv (Netminder) Commented:
I would love to have a chat with your inside person.  Unless I'm reading the XML 1.0 spec wrong, an encoded form feed isn't valid.

But like proper html tags, some parsers seem to be more forgiving than others.

That said, and borrowing some code from the other question, XmlDocument does not appear to care.

I'm not sure exactly what you want to extract but here is a stub that loops through ALL the PropertyList nodes and pulls the models.

I ran it against your 50 Meg file with no problems.

using System;
using System.Collections.Generic;
using System.Xml;

public class Program
{
	public static void Main()
	{
	 	XmlDocument xml = new XmlDocument();
	 	xml.Load("sample.xml");

	 	XmlNodeList xnList = xml.SelectNodes("//PropertyList");
		Console.WriteLine("Nodes found: " + xnList.Count);

		if (xnList != null)
			foreach (XmlNode test in xnList)
			{
				if (test.Attributes != null)
					foreach (XmlAttribute prop in test.Attributes)
					{
						switch (prop.Name)
						{
							case "model":
								Console.WriteLine("Model: " + prop.Value);
								break;
						}
					}
			}
	}
}


>>I do not see how to differentiate one node from another.

The XPath to get there?

rwheeler23 (Author) Commented:
This is proprietary software written by a company outside of the US. What I am trying to do is dump out the data and program dictionaries. As you can see from the xml files 95% of what is in the files is useless for what I want. I want to extract the table and program names. Table names works fine but the program names have these invalid characters plus a program can consist of tasks and n number of sub tasks. The goal is to extract only the program names and not the code within each program. When I get back on Tuesday I will apply your code snippet.

slightwv (Netminder) Commented:
>>want to extract the table and program names

I will need expected results from the 55 Meg XML file you provided.

>>When I get back on Tuesday I will apply your code snippet.

As long as you have Windows at home, there is no real need to wait.  You likely have everything you need to compile and test the code I provided.

I gave you the steps in the previous question.

rwheeler23 (Author) Commented:
Thanks