chrispaton asked:

ASP.NET C#: parsed XML import gets slower

Hi,

I have an XML import function that gets slower the more records it imports. The source file is very large, so I split the import into a loop that processes 500 rows at a time to keep the page from timing out. When the function starts it takes 1-2 seconds per page (500 records), but after it has been through 1,000 pages it takes about 1-2 minutes per page; the full import is about 3,000 pages.
I want to speed up the import function.

I have added the code I am using below.

Thanks


Chris
public static void ImportXML(int nStep)
    {
 
        _nNrRowsProccessed = 0;
        long nNrRowsToProcess = 0;
        try
        {
            nNrRowsToProcess = Convert.ToInt64(ConfigurationManager.AppSettings["step_rows"]);
        }
        catch
        {
            nNrRowsToProcess = 500;
        }
 
        long nStart = (nStep-1) * nNrRowsToProcess;
 
        // starting from nStart+1 TO nNrRowsToProcess+nStart, insert all vacancies into MC.
 
        XmlTextReader xmlreader = new XmlTextReader(ConfigurationManager.AppSettings["xmlFile"]);
 
        string strAgencyPropertyID = null;
        string strArrivalTime;
        char strAvailability;
 
        _allRowsProcessed = true;
 
        try
        {
            MasterCalendar_DB.OpenMySQLConnection();
 
            while (xmlreader.Read())
            {
                if (xmlreader.NodeType == XmlNodeType.Element)
                {
                    if (xmlreader.Name == "HouseCode")
                    {
                        xmlreader.Read(); // read the text
                        strAgencyPropertyID = xmlreader.Value;
                    }
 
                    if (xmlreader.Name == "ArrivalDate")
                    {
                        xmlreader.Read(); // read the text
                        strArrivalTime = xmlreader.Value;
                        Int64 nDate = 0;
                        if (strArrivalTime == "")
                        {
                            strAvailability = 'N';
                        }
                        else
                        {
                            strAvailability = 'Y';
                            DateTime dt = DateTime.Parse(strArrivalTime);
                            nDate = (Int64)Util.ConvertToUnixTimestamp(dt);
                        }
                        if (strAgencyPropertyID != null)
                        {
                            _nNrRowsProccessed++;
                            if (_nNrRowsProccessed <= nStart)
                            {
                                continue; // skip all vacancies until we reach the starting position
                            }
                            MasterCalendar_DB.Insert(strAgencyPropertyID, nDate, strAvailability);
 
                            if (_nNrRowsProccessed >= nStart + nNrRowsToProcess)
                            {
                                // this page's quota has been processed; more rows remain in the file
                                _allRowsProcessed = false;
                                break;
                            }
                        }
                    }
                }
            }
        }
        catch (Exception)
        {
            throw; // rethrow without resetting the stack trace
        }
        finally
        {
            MasterCalendar_DB.CloseMySQLConnection();
            xmlreader.Close();
        }
    }


MsShadow replied:

Have you tried making short DB connections?

MasterCalendar_DB.OpenMySQLConnection();
MasterCalendar_DB.Insert(strAgencyPropertyID, nDate, strAvailability);
MasterCalendar_DB.CloseMySQLConnection();

instead of keeping the DB connection open the whole time?
chrispaton (Asker) replied:

I get an error after the seventh page loop:
Can't connect to MySQL server on 'localhost' (10048)
It looks like it is trying to open and close too many connections in a short space of time.
Is the import holding the data in a cache between reading the XML file and inserting it, and do I need to clear that cache between imports?
abel replied:

Can you try the following:

  1. Run the loop without actually updating the database (comment out the insert).
  2. Run the loop with the database updates, but without parsing the XML (i.e. just loop 3000 times).
  3. In either case, run the loop without continuously opening the connection; keeping it open the whole time is usually much faster.

A rough timing sketch for separating the two costs follows below.
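One way to separate the two costs for tests 1 and 2 is two stopwatches: one running while the XML is being read, one around the inserts. This is an illustrative sketch only, not the questioner's code; the real MasterCalendar_DB.Insert call is only referenced in a comment.

using System.Diagnostics;
using System.Xml;

// Hypothetical instrumentation: times the XML reading separately from the
// database inserts so the slow half can be identified.
public static class ImportTiming
{
    public static void TimeOnePass(string xmlPath, bool doInserts)
    {
        Stopwatch swXml = new Stopwatch();
        Stopwatch swDb = new Stopwatch();

        using (XmlReader reader = XmlReader.Create(xmlPath))
        {
            swXml.Start();
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "ArrivalDate")
                {
                    reader.Read();                 // move onto the text node, as in ImportXML
                    string arrivalDate = reader.Value;
                    swXml.Stop();

                    if (doInserts)
                    {
                        swDb.Start();
                        // MasterCalendar_DB.Insert(...);  // the real insert from the question
                        swDb.Stop();
                    }
                    swXml.Start();
                }
            }
            swXml.Stop();
        }

        Debug.WriteLine("XML read: " + swXml.ElapsedMilliseconds +
                        " ms, DB inserts: " + swDb.ElapsedMilliseconds + " ms");
    }
}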
When you run the code, is MySQL on the same machine? A good way to speed up database updates is to use a batch update instead: collect the updates in memory and then pass them to the database in one push.
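For reference, a batch update in this context usually means one connection, one transaction and many parameterized inserts followed by a single commit (or one multi-row INSERT statement). Below is a minimal sketch with MySQL Connector/NET; the table and column names are made up, since the schema behind MasterCalendar_DB.Insert isn't shown in the question.

using MySql.Data.MySqlClient;

public static class CalendarBatch
{
    // Hypothetical batch insert: one open connection, one transaction, repeated
    // parameterized ExecuteNonQuery calls, then a single Commit.
    // Table and column names are placeholders, not from the question.
    public static void InsertBatch(string connectionString,
                                   string[] propertyIds, long[] dates, char[] availability)
    {
        using (MySqlConnection conn = new MySqlConnection(connectionString))
        {
            conn.Open();
            using (MySqlTransaction tx = conn.BeginTransaction())
            using (MySqlCommand cmd = new MySqlCommand(
                "INSERT INTO calendar (property_id, arrival_date, availability) " +
                "VALUES (?id, ?date, ?avail)", conn, tx))
            {
                cmd.Parameters.Add("?id", MySqlDbType.VarChar);
                cmd.Parameters.Add("?date", MySqlDbType.Int64);
                cmd.Parameters.Add("?avail", MySqlDbType.VarChar);

                for (int i = 0; i < propertyIds.Length; i++)
                {
                    cmd.Parameters["?id"].Value = propertyIds[i];
                    cmd.Parameters["?date"].Value = dates[i];
                    cmd.Parameters["?avail"].Value = availability[i].ToString();
                    cmd.ExecuteNonQuery();   // cheap while the transaction is still open
                }

                tx.Commit();                 // all rows are committed in one go
            }
        }
    }
}

On a web page this would still have to be chunked (for example, one commit per 500-row page) to avoid the timeout problem the question describes, rather than holding all 3-4 million rows in memory.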

-- Abel --
chrispaton replied:

I have run the loop without inserting and it behaves the same as the original, i.e. it gets slower, and it throws "Can't connect to MySQL server" after the seventh page if I add the open/close inside the loop. With the open and close inside the loop, the individual pages are much slower overall because the connection has to be opened and closed on every iteration.

The code runs on the same machine as MySQL (Windows Web Server 2003).

The code did run as a batch update originally, but the import can be 3-4 million records and the source XML file is about 900 MB, so the IE page timed out or the import process crashed before the update completed. The code was changed to loop through a fixed number of records at a time, which cured that problem.
abel replied:

Please DO NOT open/close inside the loop. As you found out, that only gives trouble. Just open before the loop.

It seems to me that the XML is somehow slowing down the code execution, but since you are going through the XML sequentially, I don't see (at the moment) why it slows down so much. Running the code with only the loop through the XML, without actually writing anything to the database (or opening connections), is crucial for narrowing down the cause.

How large is the XML, if I may ask (in terms of bytes, elements and nesting)?
chrispaton replied:

Here is an example of the XML:

<Houses>
  <House>
    <HouseCode>XX-12345-02</HouseCode>
    <PlanningRecords>
      <Planning>
        <Period>1w</Period>
        <ArrivalDate>2008-06-21</ArrivalDate>
        <ArrivalTimeFrom>17:00:00</ArrivalTimeFrom>
        <ArrivalTimeUntil>20:00:00</ArrivalTimeUntil>
        <DepartureDate>2008-06-28</DepartureDate>
        <DepartureTime>12:00:00</DepartureTime>
        <BookingOnRequest>Yes</BookingOnRequest>
        <RentPrice>1578</RentPrice>
        <RentPriceExclDiscount>1578</RentPriceExclDiscount>
        <CS>GjbuoxyjlknciamlyaaclcfkgfozCkjR</CS>
      </Planning>
      <Planning>
        <Period>1w</Period>
        <ArrivalDate>2008-06-28</ArrivalDate>
        <ArrivalTimeFrom>17:00:00</ArrivalTimeFrom>
        <ArrivalTimeUntil>20:00:00</ArrivalTimeUntil>
        <DepartureDate>2008-07-05</DepartureDate>
        <DepartureTime>12:00:00</DepartureTime>
        <BookingOnRequest>Yes</BookingOnRequest>
        <RentPrice>1964</RentPrice>
        <RentPriceExclDiscount>1964</RentPriceExclDiscount>
        <CS>XiaKpcixhnkjdwKWQlhjlatldjkfadaE</CS>
      </Planning>
abel replied:

That looks quite flat. Is that one record? And do you have 3,000 of those? Though that is a lot, it should not be a problem for XmlReader; I've seen worse. Could it be that you are hitting a memory-swapping problem, i.e. that your app is running short of available GC memory?

You can do a Debug.WriteLine(GC.GetTotalMemory(false)) to find out how fast the memory increases during the loop.
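For instance, a small helper like the following could be called inside ImportXML's while loop with the current row counter. This is illustrative only; the 500-row interval simply matches the page size from the question.

using System;
using System.Diagnostics;

public static class MemoryTrace
{
    // Hypothetical helper: call as MemoryTrace.Log(_nNrRowsProccessed) inside the
    // import loop to see whether the managed heap grows while the import slows down.
    public static void Log(long rowsProcessed)
    {
        if (rowsProcessed % 500 == 0)   // 500 = the page size used in the question
        {
            Debug.WriteLine("Rows: " + rowsProcessed +
                            ", managed heap: " + (GC.GetTotalMemory(false) / 1024) + " KB");
        }
    }
}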
chrispaton replied:

That is just a sample of one record. There are about 13,000 house codes, each with between 200 and 300 arrival dates in the full XML file; the file is quite simple, there is just a lot of it. The function reads through the file and imports it, and it is just very strange that it gets noticeably slower the further it gets through the import, so it would help if I could speed things up. I am running a test to see whether changing the page size makes any difference, and then I will check the GC memory.
 
abel replied:

It is very common for parsing large XML files to be a problem. Since the method you are using is the DOM model (even though you read sequentially, you are effectively blending two methods, SAX and DOM, and on every next XmlReader.Read the DOM grows), the memory will increase.

The DOM model, very roughly, takes on average 4 to 5 times the number of characters in your XML data, times two (if the data is UTF-8 and you load it in .NET, it is stored internally as UTF-16 strings, which doubles the storage needed).

Now that you have given a glimpse of how huge your data is (13,000 house codes times roughly 250 arrival dates, assuming you mean whole Planning blocks), we can estimate how large it really is (you haven't told us yet...). A Planning block seems to be about 500 characters, and the overhead of a house code is negligible against 250 planning blocks, so keeping it simple, my guess is that you have about:

   13,000 x 250 x 500 x 2 x 1.1 = 3,575,000,000 bytes

On disk this is probably half of that (assuming UTF-8 or an ASCII/ANSI variant), which is still about 1.8 GB. In memory, if loaded in full, it may take up 3.5 GB x 4 = 14 GB.

Now, I don't know what kind of system you have, but if my rough calculations are even nearly correct, there is no reasonable system that can run through that amount of data. Even if the whole data set were 100 times smaller, you would have loading problems.

In short: we will have to find a different way of dealing with this data. XML + DOM is just very bad for large XML sets (anything over 10 MB is large in my book).
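To make the DOM-versus-streaming distinction concrete, here is an illustrative contrast (not the questioner's code; the element name comes from the sample above and the file path is a placeholder): the DOM approach materializes the whole tree in memory, while a reader only ever holds the current node.

using System.Xml;

public static class XmlLoadingComparison
{
    // DOM: the entire file is parsed into an in-memory tree first, costing
    // several times the file size once expanded to UTF-16 strings and nodes.
    public static void LoadAsDom(string path)
    {
        XmlDocument doc = new XmlDocument();
        doc.Load(path);                         // memory grows with the document size
    }

    // Streaming: only the current node is held, so memory stays flat
    // no matter how large the file is.
    public static void LoadAsStream(string path)
    {
        using (XmlReader reader = XmlReader.Create(path))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "ArrivalDate")
                {
                    reader.Read();              // move onto the text node
                    string arrivalDate = reader.Value;
                    // process one value at a time, then let it go out of scope
                }
            }
        }
    }
}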

-- Abel --
My apologies. You are using XmlTextReader and apparently I wasn't up to speed with the technique: it is a forward-only parser of the XML data and does not build a cached tree, which is presumably why you chose it in the first place.

Still it would be interesting to find out what the memory does.

The reason for your lower speed must lie somewhere else. I can try to mimic your situation by creating a very large XML file, but I would still love to know how large your data really is. Can you follow up on that?

-- Abel --
SOLUTION (from abel)
[Solution text is only available to Experts Exchange members.]

ASKER CERTIFIED SOLUTION
[Solution text is only available to Experts Exchange members.]
No problem about the delay; research questions about improvements are often postponed in favor of functionality requests.

In general, rereading some of the posts: when speed is an issue, XmlTextReader is your friend. If you are getting slower performance, it will not be because of the XmlTextReader; it will more likely be due to the resources (CPU or memory) taken by intermediate variables or partial XDocuments.

I won't object to the closing; it seems fair in my opinion. If you need professional assistance with your problem, I can work remotely; you can check my contact details. You can also ask a new question on this forum, of course.

-- Abel --