rwheeler23 asked:

Reading XML using C# and determining the correct elements.

After receiving great help reading an XML file that contains data sources, I thought it would be simple to follow up by reading the corresponding XML file that contains the programs that read this data. The layout of this XML file is completely different. All I want is the value contained on each "Header Description" line. I have attached the input file. The code I modified below keeps crashing with a bad character message. I can only assume I am not parsing the file correctly. In this case I believe there are no descendants, so how do I simply extract the value for Header Description? I also see the message "The ' ' character, hexadecimal value 0x20, cannot be included in a name". The name Header Description has a space in it. How do I work around this?

        private void btnExtract_Click(object sender, EventArgs e)
        {
            rc = 0;

            TextWriter tw = new StreamWriter(OutputFilename);

            XDocument xml = XDocument.Load(InputFilename);

            var tables = from tab in xml.Descendants("Header Description")
                         select new
                         {
                             HeaderDescription = tab.Attribute("Header Description") == null ? null : tab.Attribute("Header Description").Value,
                         };

            foreach (var tab in tables)
            {
                //
                /* Console.WriteLine("Physical Name = " + tab.PhysicalName + ", Data Source = " + tab.data_source + ", Name=" + tab.name); */

                if (tab.HeaderDescription != null)
                {
                    rc++;
                    tw.WriteLine(rc.ToString() + "-Physical Name = " + tab.HeaderDescription);
                }
            }

            tw.Close();

            MessageBox.Show("Program extraction complete.");
        }


Programs.zip
Bill Prew

Looking very quickly, but XML node names cannot contain spaces.  That being said, it doesn't look like yours do.  In the line of your XML file:

  <Header Description="Main Program">


"Header" is the node name, and "Description" is an attribute of the node.

So you would at least want to change to below, haven't tested this to see if it gets you further...

        private void btnExtract_Click(object sender, EventArgs e)
        {
            rc = 0;

            TextWriter tw = new StreamWriter(OutputFilename);

            XDocument xml = XDocument.Load(InputFilename);

            var tables = from tab in xml.Descendants("Header")
                         select new
                         {
                             HeaderDescription = tab.Attribute("Description") == null ? null : tab.Attribute("Description").Value,
                         };

            foreach (var tab in tables)
            {
                //
                /* Console.WriteLine("Physical Name = " + tab.PhysicalName + ", Data Source = " + tab.data_source + ", Name=" + tab.name); */

                if (tab.HeaderDescription != null)
                {
                    rc++;
                    tw.WriteLine(rc.ToString() + "-Physical Name = " + tab.HeaderDescription);
                }
            }

            tw.Close();

            MessageBox.Show("Program extraction complete.");
        }



»bp
rwheeler23 (ASKER)

There must be some more junk in this file. Now I get a message about invalid character 0x0C.  I can update the if statement to ignore that.
Well, in the file you posted (in the ZIP) I don't see any 0x0C characters in there (form feed)...


»bp
All I know is the message I get back is

"The hexadecimal value 0x0C is an invalid character. Line 783819"
Is there a way in my if statement to exclude anything that contains something other than [0-9] and [A-Z]?

        private void btnExtract_Click(object sender, EventArgs e)
        {
            rc = 0;

            TextWriter tw = new StreamWriter(OutputFilename);

            XDocument xml = XDocument.Load(InputFilename);

            var tables = from tab in xml.Descendants("Header")
                         select new
                         {
                             HeaderDescription = tab.Attribute("Description") == null ? null : tab.Attribute("Description").Value
                         };

            foreach (var tab in tables)
            {
                //
                /* Console.WriteLine("Physical Name = " + tab.PhysicalName + ", Data Source = " + tab.data_source + ", Name=" + tab.name); */

                if (tab.HeaderDescription != null && tab.HeaderDescription.IndexOf(System.Convert.ToChar(0x0C)) < 0)
                {
                    try
                    {
                        rc++;
                        tw.WriteLine(rc.ToString() + "-Physical Name = " + tab.HeaderDescription);
                    }
                    catch (Exception ex)
                    {
                        string eMsg = "002: Error - Writing to output file: " + ex.Message;
                        if (Model.StackTraceWanted) eMsg += "\n" + ex.StackTrace;
                        MessageBox.Show(eMsg);
                    }

                }
            }

            tw.Close();

            MessageBox.Show("Program extraction complete.");
        }
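
A minimal sketch of the "only [0-9] and [A-Z]" test asked about above, assuming the check is applied to each attribute value after the document has loaded. The class and method names are hypothetical. Note that `XDocument.Load` parses the whole file before the query runs, so a filter like this cannot prevent an exception raised during loading. Taken literally, the pattern also rejects lower-case letters and spaces, so a value such as "Main Program" would be excluded.

```csharp
using System;
using System.Text.RegularExpressions;

static class DescriptionFilter
{
    // \A and \z anchor the whole string, so a trailing newline is not accepted.
    private static readonly Regex DigitsAndUpperCase = new Regex(@"\A[0-9A-Z]+\z");

    // True only when the value is non-null and consists solely of 0-9 and A-Z.
    public static bool IsPlainName(string value)
    {
        return value != null && DigitsAndUpperCase.IsMatch(value);
    }

    static void Main()
    {
        Console.WriteLine(IsPlainName("MAIN01"));       // True
        Console.WriteLine(IsPlainName("Main Program")); // False (lower case and a space)
        Console.WriteLine(IsPlainName("ABC\f"));        // False (contains form feed, 0x0C)
    }
}
```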
Could it be that this file is simply too large? It is 55MB. I just tried viewing it using the XML Handler and it comes back saying the file is too large. I tried reading a smaller file and there was no issue.
Oh, duh, I was looking for that value as is in the bytes of your file.  I see now that you have this syntax twice in your file:

<Text id="19" valUnicode="&#xC;"/>



I suspect that is the problem.  Not sure how to tell you to get around that though.


»bp
Let me ask the vendor why that is there.
If necessary I could pre-parse it with a text field parser(tfp) and strip away these lines. I know the tfp parser works so first I strip away these lines and then proceed reading the xml file.
Yes, removing them before loading as XML crossed my mind as well.


»bp
>>Let me ask the vendor why that is there.

You can ask but it is valid XML so it shouldn't matter.

As with your previous related questions:  Provide a small test case that shows the issue.  We don't need the 55 Meg XML file or even the 5 Meg one from your previous question.  A 50 line XML file and code that reproduces the error is all we need.
It is definitely the two lines that contain
<Text id="19" valUnicode="&#xC;"/>
that cause the program to crash. If I remove these two lines my program completely parses the file.
All I used was Notepad to replace these strings with blanks. Is there something in C# I can call that would parse this entire file and replace this offending string?
Programs.zip
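
One way to do the Notepad replacement in code is to read the file as text, drop the offending character references, and parse the result. This is a sketch only: the `Root` wrapper element and the class and method names are assumptions for illustration, and only the `Header` and `Text` lines are taken from the thread. It removes any numeric character reference that falls outside the character range XML 1.0 allows, rather than hard-coding `&#xC;`.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;
using System.Xml.Linq;

static class XmlSanitizer
{
    // Numeric character references: hexadecimal (&#xC;) or decimal (&#12;).
    private static readonly Regex CharRef =
        new Regex(@"&#(?:x([0-9A-Fa-f]+)|([0-9]+));");

    // Character range permitted by the XML 1.0 "Char" production.
    private static bool IsXmlChar(int c)
    {
        return c == 0x9 || c == 0xA || c == 0xD ||
               (c >= 0x20 && c <= 0xD7FF) ||
               (c >= 0xE000 && c <= 0xFFFD) ||
               (c >= 0x10000 && c <= 0x10FFFF);
    }

    // Removes references to characters XML 1.0 forbids; keeps all others untouched.
    public static string StripInvalidCharRefs(string xmlText)
    {
        return CharRef.Replace(xmlText, m =>
        {
            int code;
            bool parsed;
            if (m.Groups[1].Success)
                parsed = int.TryParse(m.Groups[1].Value, NumberStyles.HexNumber,
                                      CultureInfo.InvariantCulture, out code);
            else
                parsed = int.TryParse(m.Groups[2].Value, NumberStyles.Integer,
                                      CultureInfo.InvariantCulture, out code);
            return parsed && IsXmlChar(code) ? m.Value : string.Empty;
        });
    }

    static void Main()
    {
        // Hypothetical fragment; for a real file use File.ReadAllText(InputFilename).
        string raw = "<Root><Header Description=\"Main Program\">" +
                     "<Text id=\"19\" valUnicode=\"&#xC;\"/></Header></Root>";

        XDocument doc = XDocument.Parse(StripInvalidCharRefs(raw));
        foreach (XElement header in doc.Descendants("Header"))
            Console.WriteLine((string)header.Attribute("Description")); // Main Program
    }
}
```

Reading a 55MB file with `File.ReadAllText` holds the whole text in memory at once, which is usually acceptable on a desktop machine but is worth keeping in mind.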
I stand corrected.  I should have dug deeper than just the syntax and looked in the actual spec.

The XML 1.0 spec doesn't allow control characters:
https://en.wikipedia.org/wiki/Valid_characters_in_XML

So, the XML technically isn't valid.

If you have a way to strip them before processing, then go for it.  That would be easier than trying to hack something else together.
The vendor did not want to talk about this so I will strip away these characters.
>>The vendor did not want to talk about this

lol.... if they produce invalid XML, no wonder.
I asked someone on the inside what was going on and the reply was:

"The parser runs another function to convert element/attribute content to an xml valid string.
Linq won't help you here. You need an xml parser with the ability to convert the content as part of the parsing operation."

What does this mean and how would I do this?
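
One possible reading of that reply is that the parser has to be told to tolerate such content while it reads. In .NET, `XmlReaderSettings.CheckCharacters = false` turns off character checking for character entity references, and `XDocument.Load` accepts an `XmlReader`, so LINQ to XML can still be used afterwards. A hedged sketch follows; the `Root` wrapper and the class name are assumptions, and whether this matches what the vendor's parser does is not known.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Linq;

static class LenientLoader
{
    // Loads XML while skipping the range check on character references such as &#xC;.
    public static XDocument Load(TextReader input)
    {
        var settings = new XmlReaderSettings { CheckCharacters = false };
        using (XmlReader reader = XmlReader.Create(input, settings))
        {
            return XDocument.Load(reader);
        }
    }

    static void Main()
    {
        // Hypothetical fragment built from the two lines quoted in the thread.
        string raw = "<Root><Header Description=\"Main Program\">" +
                     "<Text id=\"19\" valUnicode=\"&#xC;\"/></Header></Root>";

        XDocument doc = Load(new StringReader(raw));
        foreach (XElement header in doc.Descendants("Header"))
            Console.WriteLine((string)header.Attribute("Description"));
    }
}
```

For a file on disk, `XmlReader.Create(InputFilename, settings)` can be used in place of the `StringReader`. The `valUnicode` attribute should then hold the raw form feed character, so any code that later writes the document back out may need to handle it.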
Have you tried parsing it using XmlDocument and XPath rather than LINQ?


»bp
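
For comparison, a small sketch of the `XmlDocument` plus XPath approach, assuming a well-formed fragment. The `Root` wrapper and the nested "Sub Task" header are made up for illustration. The sketch only shows the query style and makes no claim about how `XmlDocument` treats the `&#xC;` reference.

```csharp
using System;
using System.Xml;

static class XPathSketch
{
    static void Main()
    {
        var doc = new XmlDocument();
        // Hypothetical fragment; a real file would be loaded with doc.Load(InputFilename).
        doc.LoadXml("<Root><Header Description=\"Main Program\">" +
                    "<Header Description=\"Sub Task\"/></Header></Root>");

        // Select the Description attribute of every Header element at any depth.
        XmlNodeList descriptions = doc.SelectNodes("//Header/@Description");
        foreach (XmlNode attribute in descriptions)
            Console.WriteLine(attribute.Value);
    }
}
```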
I need to educate myself on the differences between the two. There is also something else I need to figure out, and that is how to determine which node I am on. I have discovered that under each Header Description there could be another Header Description, then another under that, and so on. Looking at the source XML file, I do not see how to differentiate one node from another.
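
If the nested headers are XML descendants of their parent `Header` element (an assumption; the sample structure below is invented), LINQ to XML can report how deep each one sits by counting its `Header` ancestors. A depth of 0 would then mark a top-level program, and anything deeper a task or sub task.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class HeaderDepth
{
    static void Main()
    {
        // Hypothetical structure: programs containing tasks and sub tasks.
        XDocument doc = XDocument.Parse(
            "<Root>" +
              "<Header Description=\"Main Program\">" +
                "<Header Description=\"Task A\">" +
                  "<Header Description=\"Sub Task A1\"/>" +
                "</Header>" +
              "</Header>" +
              "<Header Description=\"Second Program\"/>" +
            "</Root>");

        foreach (XElement header in doc.Descendants("Header"))
        {
            // Number of enclosing Header elements; 0 means top level.
            int depth = header.Ancestors("Header").Count();

            // Descriptions from the outermost Header down to this one.
            string path = string.Join(" / ",
                header.AncestorsAndSelf("Header")
                      .Reverse()
                      .Select(h => (string)h.Attribute("Description")));

            Console.WriteLine(depth + ": " + path);
        }
        // Prints:
        // 0: Main Program
        // 1: Main Program / Task A
        // 2: Main Program / Task A / Sub Task A1
        // 0: Second Program
    }
}
```

To keep only program names, the loop could skip any element whose depth is greater than 0.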
ASKER CERTIFIED SOLUTION
slightwv (Netminder)

This solution is only available to members.
This is proprietary software written by a company outside of the US. What I am trying to do is dump out the data and program dictionaries. As you can see from the xml files 95% of what is in the files is useless for what I want. I want to extract the table and program names. Table names works fine but the program names have these invalid characters plus a program can consist of tasks and n number of sub tasks. The goal is to extract only the program names and not the code within each program. When I get back on Tuesday I will apply your code snippet.
>>want to extract the table and program names

I will need expected results from the 55 Meg XML file you provided.

>>When I get back on Tuesday I will apply your code snippet.

As long as you have windows at home, there is no real need to wait.  You likely have everything you need to compile and test the code I provided.

I gave you the steps in the previous question.
Thanks