Link to home
Start Free TrialLog in
Avatar of dkpepsict
dkpepsict

asked on

Parse a complex xml document vb.net

I am trying to parse a bit more complex xml document than usual in my vb.net application, and I just don’t know how to get out the data of the xml which I need to insert into my SQL database.
I just need to extract the data and have it in a DataTable or something I can work with so that I can insert the data into my sql database.
Here is an example of the xml document:
<?xml version="1.0" encoding="UTF-8"?>
<fs_response status="success" timestamp="2012-02-07 22:53:43">
  <headings>
    <heading for="result_status">Result Status</heading>
    <heading for="date_start">Start Time</heading>
    <heading for="date_finish">Finish Time</heading>
    <heading for="date_update">Update Time</heading>
    <heading for="user_ip">User's IP Address</heading>
    <heading for="user_browser">User's Browser</heading>
    <heading for="user_os">User's Operating System</heading>
    <heading for="user_referrer">User's Referring URL</heading>
    <heading for="payment_status">Payment Status</heading>
    <heading for="payment_amount">Payment Amount Paid</heading>
    <heading for="4">Firstname</heading>
    <heading for="5">Lastname</heading>
    <heading for="6">Address</heading>
    <heading for="8">City</heading>
    <heading for="10">ZipCode</heading>
  </headings>
  <results>
    <result id="6284420">
      <metas>
        <meta id="result_status">Complete</meta>
        <meta id="date_start">2012-02-04 10:35:00</meta>
        <meta id="date_finish">2012-02-04 10:44:00</meta>
        <meta id="date_update">2012-02-04 10:44:00</meta>
        <meta id="user_ip">8.8.8.8</meta>
        <meta id="user_browser">Firefox</meta>
        <meta id="user_os">Mac</meta>
        <meta id="user_referrer">http://urladdress</meta>
        <meta id="payment_amount">0.00</meta>
      </metas>
      <items>
        <item id="4" index="4" type="text">
          <value>John</value>
        </item>
        <item id="5" index="5" type="text">
          <value>Doe</value>
        </item>
        <item id="6" index="6" type="text">
          <value>Somestreet 1</value>
        </item>
        <item id="8" index="7" type="text">
          <value>Longville</value>
        </item>
        <item id="10" index="8" type="text">
          <value>6000</value>
        </item>
      </items>
    </result>
    <result id="6282110">
      <metas>
        <meta id="result_status">Complete</meta>
        <meta id="date_start">2012-02-06 19:18:17</meta>
        <meta id="date_finish">2012-02-06 19:28:27</meta>
        <meta id="date_update">2012-02-06 19:28:27</meta>
        <meta id="user_ip">8.8.8.8</meta>
        <meta id="user_browser">Chrome</meta>
        <meta id="user_os">Windows</meta>
        <meta id="user_referrer">http://simeurl</meta>
        <meta id="payment_amount">0.00</meta>
      </metas>
      <items>
        <item id="4" index="4" type="text">
          <value>Jane</value>
        </item>
        <item id="5" index="5" type="text">
          <value>Doe</value>
        </item>
        <item id="6" index="6" type="text">
          <value>Somestreet 2</value>
        </item>
        <item id="8" index="7" type="text">
          <value>Smallville</value>
        </item>
        <item id="10" index="8" type="text">
          <value>7000</value>
        </item>
      </items>
    </result>
  </results>
</fs_response>

Open in new window



This is just a small sample of how the structure is, I have a lot more <heading> tags with generic numbers in the “for” attribute.
For each <result> in <results> I need the ID attribute content of the <result> tag and all meta data and all of the content of <item><value>, where I need to know which heading it relates to.
So at the end I should end up with two tables in my SQL database.
One that contains the <meta> data and the id attribute of the result, and one table with the items content with columns that corresponding to the generic <heading> tags and also the id attribute of the <result>.
The heading will be a constant, so it will not grow than what I have at the moment.

So how can I approach parsing this xml document?
I hope somebody can give me some hints as I have no idea where to start from, where the end result is that has that my SQL database has those two tables filled with data from that xml document.
Avatar of Nasir Razzaq
Nasir Razzaq
Flag of United Kingdom of Great Britain and Northern Ireland image

Have you tried using the ReadXML method of datatable? Does it parse the XML correctly?
Avatar of dkpepsict
dkpepsict

ASKER

No that one I didn't know about, I thought that the easiest way was to parse with linq to xml, but ReadXML seems a better choice for this xml document.

Would I need to loop the meta data and items data separately?
I am not sure what structure datatable would get. Try it. You may have to use dataset.readxml as this file may result in multiple tables.
Okay, now I come maybe with a bit of a stupid question, but the xml document I get from a REST API, how can I pass the xml document into dataset.xmlread from a streamreader without needing to save the document to the disk first?
For your heading use:
Dim HDS = From x In xmlString.Root.Descendants("headings").Descendants("heading") Select x.Value

Open in new window

where xmlString is your xml document (parse.xelement)
ReadXML has many overloads

User generated image
ASKER CERTIFIED SOLUTION
Avatar of nepaluz
nepaluz
Flag of United Kingdom of Great Britain and Northern Ireland image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Thanks a lot for the help nepaluz!!

You did LOT more than what I asked for. (you did everything for me :-) )
You did LOT more than what I asked for. (you did everything for me :-) )
You are welcome however that may be a good thing, but it can also be a bad thing!
Though the code is commented, it is not entirely explained, and I hope you understand HOW the code works, if not, just drop a line here on what you need an explanation for.
I agree with you.

I understand how it works, would have probably taken me some time to make something similar my self hehe..

I have thought a new problem, I tried to run it on the full sized xml document, and I get an error on the line:
Array.ForEach(nHDS.ToArray, Sub(x) HDataTable.Columns.Add(New DataColumn With {.ColumnName = x.hValue, .AllowDBNull = True}))

It seems that some of the headings have the same value, and that it can't create more columns with the same name.
I get this error:
A column named 'Licensenumber' already belongs to this DataTable.

Is there a way to fix this?
Not sure if this will work for you (though it works for the file you gave). Try it and see:
Dim nHDS = (From w In xmlString.Root.Descendants("headings").Descendants("heading") Where IsNumeric(w.Attribute("for").Value) Select
           New With {.hName = w.Attribute("for").Value, .hValue = w.Value}).Distinct

Open in new window

Or you could encapsulate the For Each statement in a try ... Catch, e.g
replace
Array.ForEach(nHDS.ToArray, Sub(x) HDataTable.Columns.Add(New DataColumn With {.ColumnName = x.hValue, .AllowDBNull = True}))

Open in new window

with
For Each d In nHDS
    Try
         HDataTable.Columns.Add(New DataColumn With {.ColumnName = d.hValue, .AllowDBNull = True})
    Catch ex As Exception
        'this will trap duplicate columns and still continue running
    End Try
Next

Open in new window

Yeah, as I said, the xml I posted is a small part, the original file has more headings.
But no that didn't help.

These are the headings of the original document and if you notice some of them have the same value:
<heading for="result_status">Result Status</heading>
<heading for="date_start">Start Time</heading>
<heading for="date_finish">Finish Time</heading>
<heading for="date_update">Update Time</heading>
<heading for="user_ip">User's IP Address</heading>
<heading for="user_browser">User's Browser</heading>
<heading for="user_os">User's Operating System</heading>
<heading for="user_referrer">User's Referring URL</heading>
<heading for="payment_status">Payment Status</heading>
<heading for="payment_amount">Payment Amount Paid</heading>
<heading for="4">Fornavn</heading>
<heading for="5">Efternavn</heading>
<heading for="6">Adresse</heading>
<heading for="8">By</heading>
<heading for="10">Postnummer</heading>
<heading for="11">Primært telefonnummer</heading>
<heading for="37">Andet telefonnummer</heading>
<heading for="12">e-mail adresse</heading>
<heading for="21">Registreringsnummer</heading>
<heading for="23">Stelnummer</heading>
<heading for="22">Første registreringsdato</heading>
<heading for="24">Bilmærke</heading>
<heading for="25">Model</heading>
<heading for="28">Farve</heading>
<heading for="29">Forsikringsselskab</heading>
<heading for="43">Er køretøjet kaskoforsikret</heading>
<heading for="70">Kilometerstand</heading>
<heading for="17">Jeg har læst betingelserne for det bestilte abonnement.</heading>
<heading for="39">Hvor har du autohjælp idag</heading>
<heading for="44">Skal Redningskorps Danmark opsige Deres nuværende aftale</heading>
<heading for="69">Eventuelt nuværende abonnementsnummer</heading>
<heading for="30">Bemærkninger til min bestilling</heading>
<heading for="42">Rabatkode</heading>
<heading for="45">Ønsker De også et trailerabonnement til almindelig trailer (550 kg totalvægt) til 98 kroner</heading>
<heading for="60">Registreringsnummer</heading>
<heading for="61">Stelnummer</heading>
<heading for="62">Første registreringsdato</heading>
<heading for="63">Mærke</heading>
<heading for="64">Totalvægt op til</heading>
<heading for="65">Forsikringsselskab</heading>
<heading for="66">Er traileren kaskoforsikret</heading>
<heading for="67">Trailertype</heading>

Open in new window

The try catch didn't help.
Even if there are some duplicates, they are still used as the "for" attribute has another value.

So with try catch if fails when it tries to add a row to the duplicate column that didn't get created.
OK, here is what I have:
There are three duplicates, and if this is the final list of headers, then we can redesign the code to handle these.
User generated imageWhat you need is to assign the two numbers that appear in the for attributes of these to the same column name. Are you happy to do that?
it works fine if I use the for attribute.

So changing hValue to hName made it work.

This is fine as it's even easier for me to add the columns in the columns of my sql tables.

Thanks a lot for the help!!!
The reason I had steered away from using the attribute values that are numeric as column headers was that they were not descriptive, however, that you are happy to use them is the perfect fix.
Ok, now it's starting to get a bit embarrassing but I have a new problem now.
My code is only giving me the 100 results, but I know that there are 172 items.

This is my current code:
Imports System.IO
Imports System.Net

Public Class Form1

    Dim request As HttpWebRequest
    Dim response As HttpWebResponse = Nothing
    Dim reader As StreamReader
    Dim xmlUrl As String = "https://urltoRESTAPI"

    Private Sub Button1_Click(sender As System.Object, e As System.EventArgs) Handles Button1.Click
        Try
            ' Create the web request  
            request = DirectCast(WebRequest.Create(xmlUrl), HttpWebRequest)

            ' Get response  
            response = DirectCast(request.GetResponse(), HttpWebResponse)

            ' Get the response stream into a reader  
            reader = New StreamReader(response.GetResponseStream())

            Dim xmlString As XDocument
            xmlString = XDocument.Load(reader)


            Dim HDS = From x In xmlString.Root.Descendants("headings").Descendants("heading")
                    Where Not IsNumeric(x.Attribute("for").Value) Select x.Attribute("for").Value
            Dim nHDS = From w In xmlString.Root.Descendants("headings").Descendants("heading") Where IsNumeric(w.Attribute("for").Value) Select
                       New With {.hName = w.Attribute("for").Value, .hValue = w.Value}

            'declare the datatable
            Dim HDataTable As DataTable = New DataTable

            'first add the transaction ID column
            HDataTable.Columns.Add(New DataColumn With {.ColumnName = "TranID", .AllowDBNull = False, .Unique = True})

            'then add all the normal columns, i.e no numeric attributes
            Array.ForEach(HDS.ToArray, Sub(x) HDataTable.Columns.Add(New DataColumn With {.ColumnName = x, .AllowDBNull = True}))

            'add the columns with numeric attributes
            Array.ForEach(nHDS.ToArray, Sub(x) HDataTable.Columns.Add(New DataColumn With {.ColumnName = x.hName, .AllowDBNull = True}))

            'here filter the results from the xml
            Dim Rests = From x In xmlString.Root.Descendants("results").Descendants("result") Select x
            'iterate through the results' sets
            For Each x In Rests
                'this is the transaction ID
                Dim mID = x.Attribute("id").Value
                'get the values of metas
                Dim Metas = (From q In x.Descendants("metas").Descendants("meta")
                             Select New With {.mName = q.Attribute("id").Value, .mValue = q.Value})
                'declare the datarow and start populating it
                Dim tCol As DataRow = HDataTable.NewRow()
                tCol.Item("TranID") = mID
                'this withthe metas
                Array.ForEach(Metas.ToArray, Sub(y) tCol.Item(y.mName) = y.mValue)
                'here try and resolve the headers with numbered attributes
                Dim Items = (From q In x.Descendants("items").Descendants("item")
                             Select New With {.mID = q.Attribute("id").Value, .mValue = q.Descendants("value").Value})
                For Each u In Items
                    Dim j = (From c In nHDS Where c.hName = u.mID Select c).FirstOrDefault
                    tCol.Item(j.hName) = u.mValue
                Next
                'finally add the row to the table
                HDataTable.Rows.Add(tCol)
                Label1.Text = HDataTable.Rows.Count
            Next
            'I added this to demonstrate, you'll need to add a datagridview to your form to use it
            Me.DataGridView1.DataSource = HDataTable
        Finally
            If Not response Is Nothing Then response.Close()
        End Try
    End Sub
End Class

Open in new window

Its goingto be hard to trace that without your xml file! Any chance of providing it?
Inserta Catch in you try and see if you get any errors first though.
I understand, but the xml document has a lot of confidential data, so I can't post it in here.

I don't get any errors in the catch statement.
OK I have tried to whittle down the code (removing the lines for the numeric ids plus the comments etc) and have this left:
Dim HDS = (From x In xmlString.Root.Descendants("headings").Descendants("heading") Select x.Attribute("for").Value).ToList
Dim HDataTable As DataTable = New DataTable
HDataTable.Columns.Add(New DataColumn With {.ColumnName = "TranID", .AllowDBNull = False, .Unique = True})
Array.ForEach(HDS.ToArray, Sub(x) HDataTable.Columns.Add(New DataColumn With {.ColumnName = x, .AllowDBNull = True}))
Dim Rests = (From x In xmlString.Root.Descendants("results").Descendants("result") Select x).ToList
For Each x In Rests
    Dim mID = x.Attribute("id").Value
    Dim Metas = (From q In x.Descendants("metas").Descendants("meta") Select New With {.mName = q.Attribute("id").Value, .mValue = q.Value}).ToArray
    Dim tCol As DataRow = HDataTable.NewRow()
    tCol.Item("TranID") = mID
    Array.ForEach(Metas, Sub(y) tCol.Item(y.mName) = y.mValue)
    Dim Items = (From q In x.Descendants("items").Descendants("item") Select New With {.mID = q.Attribute("id").Value, .mValue = q.Descendants("value").Value}).ToArray
    Array.ForEach(Items, Sub(y) tCol.Item(y.mID) = y.mValue)
    HDataTable.Rows.Add(tCol)
Next
Me.DataGridView1.DataSource = HDataTable

Open in new window

Notable is I have removed the dynamicity in the code by applying ToList and ToArray methods to all LINQ queries. That should remove any hidden errors that may be causing your issues. Let me know what you yield.
Okay this is really strange.

HDataTable.Rows.Count still gives me 100.

I even counted the datagrid manually to be sure.
instert this at the begining of your code and place a breakpoint after it to see how many results' items there are.
Dim CountResults = (From x In xmlString.Root.Descendants("results").Descendants("result") Select x).Count

Open in new window

Here is what I have:
User generated image
Yeah just had the same idea to count the results.

there is the root of the problem.

the REST API only gives me 100 results.

But I know that there are 172 results, and also in the bottom of the xml this is also mentioned:
<total_results>
172</total_results></fs_response>
okay I see in the documentation now that the API gives me a maximum output of 100 results.
How can I then make sure that I get all of the data?

This is the documentation for the service that I am using:
https://fs23.formsite.com/documentation/api/
Very well. I am not aware of any limitation in the number of records that LINQ returns, and therefore without the actual XML file to look at, and if the count also says there are 100 records, then I have to believe the count and take what is written inthe file with a pinch of salt!

Having said that, Isuppose what is left to say is I have taken you as far as practicable, unless something changes 9or comes to light).
Yeah, I think I found that the solution for the last obstacle is.

As I said the REST API is returning max. 100 results per request, so I need to use the fs_page parameter to fetch all xml pages.
Honestly, I did not read the API documentation (would be digressing from the question), and thus would not be in a position to comment on that, however, it is good news that you have found the solution to the API bottleneck. My code suggestion will handle either, just give it an xml!
yeah, this part is simple, thanks a lot for your help!