asked on

Help parsing a large XML file

I need a script written to take a large xml file and extract information into a csv or directly to an MS SQL Database. Preferably in powershell.

Attached is the a 'sample' of the XML file. The actual file size is ~800 mb. I have a Python script that plods through it on smaller file sizes (~14mb), but will take an amazing amount of time for the large file. Really even a script to preprocess the file and take out all the line items with ServicePointChannelID with the 400 range would be nice, then my python script would work good.

Attached is a sample xml file, a sample output file, and the python script that creates it.

Ideally it would be nice to go over to powershell with this from Python.
test2.xml
dataexport.csv
dataexport.py

ste5an

Why not parsing it directly in SQL Server?

Otherwise, when using PowerShell, then it means that you should use XmlReader on a stream to avoid memory issues. As your XML is pretty straight forward and you only need a simple parsing strategy, this would be a viable way.

Scott Abraham

ASKER

Why not parse it directly in SQL? Lack of skill level to do so.

Sharath S

Do you finally want the data in SQL table? or CSV would be fine?

pepr

The used xml.etree.ElementTree is a very nice tool. The problem is that its .parse() method loads the XML into memory.

If you want to process really big XML files, you may be interested in the xml.etree.ElementTree.XMLPullParser class (https://docs.python.org/3/library/xml.etree.elementtree.html#pull-api-for-non-blocking-parsing).

I did not check, but the standard module should be at least similar to lxml for that. Have a look at examples of parsing at https://lxml.de/parsing.html

Scott Abraham

ASKER

Sharath, The data will be going into an SQL table. Right now I take it and put it into a CSV and then into the SQL. I am going to take a look this morning at putting it directly in.

pepr, looking those up.

aikimark

1. I'm having trouble inferring the filtering and transformations required to create your csv file from your xml file. Are these two (posted) files a representative sample of your data problem?

take out all the line items with ServicePointChannelID with the 400 range would be nice

2. Do you want the csv file to contain just 400-499 data or do you want the 400-499 data removed from the process?

ASKER CERTIFIED SOLUTION

pepr

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

aikimark

One thing you might consider is buffering the XML you feed to the XML-parsing code. You grab several (tens of) thousands of bytes at a time, prepending any extra characters from the prior read. You find the last terminating tag (</Channel>) and save any trailing characters to prepend the next read.

You will probably sandwich the text within <Channels> and </Channels> tags and pass to the XML-parsing routine.

Since your XML file is most likely residing on an NTFS hard drive, I'd recommend buffer sizes that are multiples of 4096 (i.e. 8192, 20480, 40960, 204800, 409600, etc).

pepr

@aikimark, @Scott: the XML buffering is what the above code does. Notice the line 10 where the block_size is set, and the line 18 and 60 where the file is read by the chunks of that size. The parser actually does glue the chunks for you, and builds the elements to the structure only when it has everything to do that. This is the purpose of the 'end' action (it signals that the end tag of an element was found), and if that is the wanted 'Channel', it is processed.

The XMLPullParser still builds the tree in memory (good for cases if you want to build it when the data comes through socket, and you do not want or cannot wait until the data stream is closed). For processing really big data, using this approach is not good enough... unless you throw away the elements that were already processed. In other words, you modify the tree during the building. This is done on the line 57 by channel.clear(). Actually, it should be done also for each other finished element that is not processed -- if there may be a lot of them before the Channels.

Scott Abraham

ASKER

Thanks pepr for the solution.

pepr

@Scott Abraham: You are welcome. ;)