Solved

Large scale XML to DB transformation

Posted on 2011-03-12
732 Views
Last Modified: 2012-08-14
I am trying to import a large collection of XML files into a database. The collection contains 1400+ files, each averaging 300MB of data. Each file contains a group of 1000+ XML documents concatenated within it, i.e. multiple instances of the following header:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE abc-new_doc SYSTEM "abc-08-23.dtd" [ ]>
<abc-new_doc  lang="EN" dtd-version="v4.2 2006-08-23" file="20110125.XML" status="PRODUCTION" id="abc-new_doc " country="US" date-reported="20110110">


The first challenge is to deal with the grouped XML documents. The current plan is to write a VB program that parses all 1400+ files and removes the per-document identifier lines (the <?xml and <!DOCTYPE declarations). The program will also add a <group> tag at the beginning of each file and the corresponding </group> tag at the end. Hopefully this will produce an XML-conforming structure with only one top-level tag.

The second challenge is to find a robust data transformation tool that can import 300MB XML files and insert them into some database. I have read good things about Talend and Pentaho, but I am not familiar with either. I want to make sure the tool I select can be used from a command line (I will have to automate the import of 1400 files) and can handle 300MB XML documents. (The other approach is to split the 300MB grouped XML files into smaller subfiles.)

The third challenge is the choice of a database. The considerations here are cost, the ability to handle a large volume of data, and, optionally, analytic capability for analyzing the data. I am currently debating between MySQL and SQL Server; Oracle DB seems to be more expensive than SQL Server and I am not sure I need it.

This is a challenging project (at least for me). I will award the points to the expert who can guide me through the implementation.

Thank you.


 
Question by:cyber-33
12 Comments
 
LVL 1

Expert Comment

by:robbert_nl
ID: 35123157
1st challenge:

VB or VB.NET should do the job. Both come with objects for parsing XML, and there are additional libraries available online that you can add. The biggest challenge, however, will be finding a parser that can work through such big XML files without running out of memory.
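
For illustration, once a file is valid XML (e.g. after the <group> wrapping you describe), the built-in XmlReader can stream through it node by node, so the whole 300MB never has to sit in memory at once. A minimal sketch on .NET 4.0 (the path and the processing are placeholders; the element name comes from your sample):

Imports System.Xml

Module XmlStreamSketch

    Sub Main()
        Dim settings As New XmlReaderSettings()
        settings.DtdProcessing = DtdProcessing.Ignore 'skip any DOCTYPE instead of resolving a missing DTD

        Using reader As XmlReader = XmlReader.Create("C:\data\group01.xml", settings)
            While reader.Read()
                'react only to the start tag of each grouped document
                If reader.NodeType = XmlNodeType.Element AndAlso reader.Name = "abc-new_doc" Then
                    Console.WriteLine(reader.GetAttribute("file")) 'placeholder processing
                End If
            End While
        End Using
    End Sub

End Module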

2nd challenge:

I don't know Talend and Pentaho, so I cannot comment on those.

If you are going to write a VB program that already parses your XML documents to remove certain tags, why not have the same program insert the data into a database? That saves you the time of selecting a data transformation tool and getting familiar with it.
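
For instance, if you end up on SQL Server, the insert side could stay as simple as the sketch below (the table and column names are invented; MySQL's Connector/Net offers a very similar API):

Imports System.Data.SqlClient

Module InsertSketch

    'Sketch: write one parsed document's values to a hypothetical Docs table.
    Sub InsertDoc(ByVal conn As SqlConnection, ByVal country As String, ByVal dateReported As String)
        Dim sql As String = "INSERT INTO Docs (Country, DateReported) VALUES (@country, @dateReported)"
        Using cmd As New SqlCommand(sql, conn)
            'parameterized command, so odd characters in the data cause no trouble
            cmd.Parameters.AddWithValue("@country", country)
            cmd.Parameters.AddWithValue("@dateReported", dateReported)
            cmd.ExecuteNonQuery()
        End Using
    End Sub

End Module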

3rd challenge:

I see no reason why you can't use MySQL. It is capable of handling very large volumes too, if you use the XtraDB storage engine, which you can find in Percona Server or MariaDB.
 
LVL 37

Expert Comment

by:ValentinoV
ID: 35126957
If you choose SQL Server, you will get a data transformation service for free: Integration Services (SSIS). It comes with an IDE (BIDS - Business Intelligence Development Studio) and can be used to build "packages" to import XML files.

Here's a small "how to" on setting up such a flow: http://www.ehow.com/how_7565252_import-xml-files-relational-database.html

That is just an explanation of how to import one file. You should set up a Control Flow that loops over your files and imports them one by one.

Packages can be scheduled using the SQL Server Agent, so the import process can be automated.
 

Author Comment

by:cyber-33
ID: 35138797
Robbert and ValentinoV - thank you for your attention to this project. I will take a look at the MS SQL Server article - it does look interesting.

I am concerned that, because I don't deal with regular XML files and have to split them apart using VB, no transformation package will help me out of the box. I could use a transformation package after the files are separated, but I expect that to generate about 5M files. Storing such a large number of files in a single folder structure may not be a good idea (and I am not sure that Windows even allows it).

The only option I see is, for each separated XML file, to (roughly sketched after this list):
(1) Store it on the hard drive
(2) Generate the DB insert SQL for that file
(3) Execute the SQL
(4) Remove the file and repeat from (1)
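
Roughly, I picture the loop like the sketch below (BuildInsertSql is the hypothetical piece for step (2) that I don't have yet):

Imports System.IO
Imports System.Collections.Generic
Imports System.Data.SqlClient

Module LoopSketch

    'Hypothetical helper for step (2): map one XML file to INSERT statements.
    Function BuildInsertSql(ByVal xmlFile As String) As String
        Return "SELECT 1" 'TODO placeholder; the real version is the tricky part
    End Function

    Sub ProcessAll(ByVal docs As List(Of String), ByVal conn As SqlConnection)
        Dim tempFile As String = "C:\work\current.xml"
        For Each doc As String In docs
            File.WriteAllText(tempFile, doc)             '(1) store it on the hard drive
            Dim sql As String = BuildInsertSql(tempFile) '(2) generate the DB insert SQL
            Dim cmd As New SqlCommand(sql, conn)
            cmd.ExecuteNonQuery()                        '(3) execute the SQL
            File.Delete(tempFile)                        '(4) remove the file and repeat
        Next
    End Sub

End Module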

The "generate DB insert SQL" step (2) can be tricky because the structure of my XML files varies slightly and I would like to preserve it.

So step (2) is the next challenge :)

One approach is to do the processing before database insertion. For that, I found an interesting utility, Advanced XML Converter (http://www.xml-converter.com). It can generate SQL insert code (including table creation) for XML files using a command-line interface. I am concerned, though, about what will happen if I have to run this utility 5M times.


Another approach would be to create a table with 2 columns, ID and XML, and populate the XML column with the entire file. Then, after the data is in the database, there might be some tools built into the engine that can process the XML column and separate it into individual tables.
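
From what I have read so far, SQL Server at least appears to have this built in: the xml data type has nodes()/value() methods that can break one XML column out into rows and columns. A rough sketch of what I mean, with invented table names:

Imports System.Data.SqlClient

Module ShredSketch

    'Sketch: shred a hypothetical RawDocs(ID, DocXml) staging table into a Docs table.
    Sub ShredColumn(ByVal conn As SqlConnection)
        Dim sql As String = _
            "INSERT INTO Docs (Country, DateReported) " & _
            "SELECT x.value('@country', 'varchar(2)'), " & _
            "       x.value('@date-reported', 'varchar(8)') " & _
            "FROM RawDocs CROSS APPLY DocXml.nodes('/abc-new_doc') AS t(x);"
        Dim cmd As New SqlCommand(sql, conn)
        cmd.ExecuteNonQuery() 'a single INSERT...SELECT shreds every stored document
    End Sub

End Module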

I suspect the second approach will be easier to implement.

Please let me know if you have any thoughts about it, or know a specific way to take an XML column and separate the data into a normalized structure.

Thank you.
 

Author Comment

by:cyber-33
ID: 35139016
Also, I am having trouble separating the combined files into individual XML files. The .NET XmlReader object chokes every time it encounters a new XML declaration within the same file. That probably means that I cannot use the XmlReader object and should parse the file line by line as regular text.

Agreed?
 
LVL 37

Expert Comment

by:ValentinoV
ID: 35139125
Wow, 5 million files?? Geez, that's more than I've encountered so far...

As for importing complex XML structures using SSIS, I have to say that the XML Source component does have its limitations: it cannot handle complex structures. What I usually do is create XSLT files that extract the data out of the XML files and save it to new files in CSV format.

Applying an XSLT to an XML file can be done through the "XML Task", one of the components in the Control Flow.  More info here: http://msdn.microsoft.com/en-us/library/ms141055.aspx

I'm not sure about the maximum size that this component is able to process, so that may need to be tested first.

There's some interesting info here regarding the limitations of that XML Task: http://connect.microsoft.com/SQLServer/feedback/details/483678/ssis-xml-task-xslt-operation-scalability-issue#details

Taking that info into account: maybe XSLT is the way to go to get your files processed. You'll just need to find a good XSLT processor that can handle large files smoothly (possibly a custom .NET app?).
In the case of a custom app: if you build a command-line app that takes parameters, you can use SSIS to orchestrate everything (loop over files, fetch, process through the custom app, archive, import the CSVs into the database, ...).
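
For what it's worth, the .NET class for this would be XslCompiledTransform; a bare-bones sketch (the file names are just examples):

Imports System.Xml.Xsl

Module XsltSketch

    Sub Main()
        Dim xslt As New XslCompiledTransform()
        xslt.Load("C:\xslt\extract-to-csv.xslt") 'your stylesheet that emits CSV

        'Careful: Transform materializes the input document in memory,
        'so the large-file concern mentioned above applies here as well.
        xslt.Transform("C:\data\input.xml", "C:\data\output.csv")
    End Sub

End Module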
 
LVL 37

Expert Comment

by:ValentinoV
ID: 35139168
"Also, I am having trouble separating the combined files into individual XML files. The .NET XmlReader object chokes every time it encounters a new XML declaration within the same file. That probably means that I cannot use the XmlReader object and should parse the file line by line as regular text."

As far as I'm aware, that will always be a problem requiring a custom app of some sort. A valid XML file should contain "an XML doc" only once :-)

 

Author Comment

by:cyber-33
ID: 35139835
I am not really an expert in VB.NET. Can you suggest good code to parse a file that contains multiple instances of header lines like the ones below?

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE abc-new_doc SYSTEM "abc-08-23.dtd" [ ]>
<abc-new_doc  lang="EN" dtd-version="v4.2 2006-08-23" file="20110125.XML" status="PRODUCTION" id="abc-new_doc " country="US" date-reported="20110110">


 
LVL 37

Expert Comment

by:ValentinoV
ID: 35158664
The following code snippet should do the trick. (code tested in a VB.NET "Windows Forms Application" in Visual Studio 2008)

It assumes that the third line of each document contains a file="YYYYMMDD.XML" attribute, just like in your example.  Further assumptions are mentioned as comments in the code.

It takes your input file consisting of multiple XML documents and creates separate XML files.  The name of each output file is the one mentioned in its file attribute.

Don't hesitate to let me know if something here is not clear.
Imports System.IO
Imports System.Text

Public Class Form1

    Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click

        Dim fileLocation As String = "C:\test\dotnet\VB_FileSplit\" 'folder containing input.txt; output files go here too
        Dim myStreamReader As New StreamReader(fileLocation & "input.txt")

        Dim line As String
        Dim filename As String
        Dim filenameStart As Integer
        Dim myStreamWriter As StreamWriter = Nothing 'created once the first XML declaration is found
        Dim sb As StringBuilder

        'process input file line by line
        While Not myStreamReader.EndOfStream
            line = myStreamReader.ReadLine()
            'assumes that each XML document starts with the line below
            If line.Contains("<?xml version=""1.0"" encoding=""UTF-8""?>") Then
                If Not IsNothing(myStreamWriter) Then
                    myStreamWriter.Close()
                End If

                sb = New StringBuilder() 'temporarily store lines in StringBuilder
                sb.AppendLine(line)
                line = myStreamReader.ReadLine() 'read DOCTYPE line
                sb.AppendLine(line)
                line = myStreamReader.ReadLine() 'read line containing file name
                filenameStart = line.IndexOf("file=""") + 6
                filename = line.Substring(filenameStart, 12) 'assumes that the file name is always 12 characters long

                myStreamWriter = New StreamWriter(fileLocation & filename)
                myStreamWriter.Write(sb) 'write the previously buffered lines to the new output file
            End If
            myStreamWriter.WriteLine(line)
        End While

        myStreamWriter.Close()
        myStreamReader.Close()

    End Sub
End Class

 

Author Comment

by:cyber-33
ID: 35169389
ValentinoV,
Sorry for the delayed response. The code is very helpful and it works. THANK YOU. I tweaked it to remove the 12-character file name assumption like this:

                line = myStreamReader.ReadLine() 'read line containing file name
                filenameStart = line.IndexOf("file=""") + 6
                filenameEnd = line.IndexOf(".XML") + 4
                filename = line.Substring(filenameStart, filenameEnd - filenameStart)

The issue now is that I have 1000s of files to process, so the function you specified needs to be adjusted to take the input file name as a command-line argument.

1. Do you know an easy way to do this in VB?
2. Some XMLs reference DTD files, which I don't have. Given that I am planning to import the separated files into the database, I am not sure I even need the DTD files. So I am thinking about tweaking the function further to remove the DTD lines in the process of file separation. Any thoughts?

 

Author Comment

by:cyber-33
ID: 35169418
Another issue I have is that I need to implement some good try-catch-and-log functionality. I am not a VB expert, so if you could suggest a good way to log the errors as I am parsing my thousands of files, I would really appreciate it.

After these 3 modifications (command line, DTD removal, logging), I will award all 500 points to you and will start another question for any further developments on this project. Again, thank you for your help.
 
LVL 37

Accepted Solution

by:ValentinoV (earned 500 total points)
ID: 35182065
Have a look at the attached code.  All three questions have been addressed.  It has been tested in a Console Application, which gives us support for command-line parameters.

The application expects two parameters: first the input file (with full path), then the output folder (should be an existing folder).  More comments in the code.

I've put all the code in a try/catch block.  Whenever an error occurs, it gets logged to a file (which is hard-coded, as you'll see), but the application itself will not fail.  So if it is executed through an SSIS "Execute Process" task, you won't notice any issues there; the only way to know is to check the log file itself.

Another option is to just let the exception pass through to SSIS (remove the try/catch) and use an event handler there to log the error, for instance to an errorlog table.

Now, as for the remaining question, the removal of the DTD lines: that's an easy one.  Just comment out the line that writes the DOCTYPE line to the output file (see the comment in the code).  It assumes that the DOCTYPE line is always the second line of each XML doc.

Hope this helps?

Regards,
Valentino.

Imports System.IO
Imports System.Text

Module Module1

    Sub Main()

        Try
            Dim commandLineArgs = Environment.GetCommandLineArgs()
            Dim inputFile = commandLineArgs(1) 'full path + input file name
            Dim outputFolder = commandLineArgs(2) 'assumes that outputFolder ends in a backslash

            ProcessFile(inputFile, outputFolder)

        Catch ex As Exception
            Dim errorStreamWriter As New StreamWriter("c:\test\dotnet\vb_filesplit\error.txt", True) 'True = append, so errors from earlier runs are kept
            errorStreamWriter.WriteLine(ex.ToString())
            errorStreamWriter.Close()
        End Try

    End Sub

    Private Sub ProcessFile(ByVal inputFile As String, ByVal outputFolder As String)

        Dim myStreamReader As New StreamReader(inputFile)

        Dim line As String
        Dim filename As String
        Dim filenameStart As Integer
        Dim filenameEnd As Integer
        Dim myStreamWriter As StreamWriter = Nothing 'created once the first XML declaration is found
        Dim sb As StringBuilder

        'process input file line by line
        While Not myStreamReader.EndOfStream
            line = myStreamReader.ReadLine()
            'assumes that each XML document starts with the line below
            If line.Contains("<?xml version=""1.0"" encoding=""UTF-8""?>") Then
                If Not IsNothing(myStreamWriter) Then
                    myStreamWriter.Close()
                End If

                sb = New StringBuilder() 'temporarily store lines in StringBuilder
                sb.AppendLine(line)
                line = myStreamReader.ReadLine() 'read DOCTYPE line
                sb.AppendLine(line) 'REMOVE THIS LINE to remove the DOCTYPE line from the output

                line = myStreamReader.ReadLine() 'read line containing file name
                filenameStart = line.IndexOf("file=""") + 6
                filenameEnd = line.IndexOf(".XML") + 4
                filename = line.Substring(filenameStart, filenameEnd - filenameStart)

                myStreamWriter = New StreamWriter(outputFolder & filename)
                myStreamWriter.Write(sb) 'write the previously buffered lines to the new output file
            End If
            myStreamWriter.WriteLine(line)
        End While

        myStreamWriter.Close()
        myStreamReader.Close()
    End Sub

End Module

 

Author Comment

by:cyber-33
ID: 35189913
ValentinoV,
Thank you very much for your help on this.

Cyber-33.