Solved

Large scale XML to DB transformation

Posted on 2011-03-12
12
738 Views
Last Modified: 2012-08-14
I am trying to import a large collection of XML files into a database. The collection contains 1400+ files, each averaging 300MB of data. Each file contains a group of 1000+ XML documents within it - multiple instances of the following tags:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE abc-new_doc SYSTEM "abc-08-23.dtd" [ ]>
<abc-new_doc  lang="EN" dtd-version="v4.2 2006-08-23" file="20110125.XML" status="PRODUCTION" id="abc-new_doc " country="US" date-reported="20110110">


The first challenge is to deal with the grouped XML documents. The current plan is to write a VB program that will parse all 1400+ files and remove the subdocument identifier tags (<?xml and <!DOCTYPE). The program will also add a <group> tag to the beginning of each file and the corresponding </group> tag to the end. Hopefully, this will create an XML-conforming structure with only one first-level tag.
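That preprocessing step - dropping the per-document <?xml and <!DOCTYPE lines and wrapping everything in a single root - could be sketched as follows (shown in Python for brevity; the thread later implements it in VB.NET). It assumes the declarations sit on lines of their own, as in the sample; file names are illustrative:

```python
def wrap_group(in_path, out_path):
    """Strip per-document <?xml ...?> and <!DOCTYPE ...> lines from a
    concatenated file and wrap the remainder in a single <group> root."""
    with open(in_path, "r", encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        dst.write('<?xml version="1.0" encoding="UTF-8"?>\n<group>\n')
        for line in src:
            stripped = line.lstrip()
            # drop every embedded XML declaration and DOCTYPE line
            if stripped.startswith("<?xml") or stripped.startswith("<!DOCTYPE"):
                continue
            dst.write(line)
        dst.write("</group>\n")
```

Note this only produces well-formed XML if the declarations never appear mid-line; otherwise a real parser-based pass would be needed.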

The second challenge is to find a robust data transformation tool that can import 300MB XML files and insert them into some database. I read good things about Talend and Pentaho - I am not familiar with either of them. I want to make sure that the tool I select can be used from a command line (I would have to automate the importation of 1400 files) and can also handle 300MB XML docs. (The other approach is to split the 300MB grouped XML docs into smaller subfiles.)

The third challenge is the choice of a database. The considerations here are cost, the ability to handle a large volume of data, and, optionally, analytic capability for analyzing the data. I am currently debating between MySQL and SQL Server. Oracle DB seems to be more expensive than SQL Server and I am not sure I need it.

This is a challenging project (at least for me). I will award the points to the expert who can guide me through the implementation.

Thank you.


Question by:cyber-33
12 Comments
 
LVL 1

Expert Comment

by:robbert_nl
ID: 35123157
1st challenge:

VB or VB.NET should do the job. Both come with objects to parse XML documents, and there are more available online that you can add. The biggest challenge, however, will be finding the right object that can parse such big XML files without running out of memory.
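For files this size, a streaming (pull-style) parser keeps memory flat because it never builds the whole document tree. A minimal sketch in Python using xml.etree.ElementTree.iterparse (the .NET counterpart would be XmlReader); the tag name "rec" is an assumption for illustration:

```python
import xml.etree.ElementTree as ET

def stream_records(path, tag):
    """Yield the attributes of each matching element from a large XML file,
    clearing each element after use so memory stays flat."""
    for event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == tag:
            yield elem.attrib.copy()
            elem.clear()  # release the subtree we just processed
```

Because it yields one record at a time, a 300MB file can be processed without ever holding more than a single record in memory.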

2nd challenge:

I don't know Talend and Pentaho, so I cannot comment on those.

If you are going to make a VB program that already parses your XML documents to remove certain XML tags, why not make a VB program that parses these documents and inserts the data into a database? That saves you the time of selecting a data transformation tool and getting familiar with it.

3rd challenge:

I see no reason why you can't use MySQL. It is capable of handling very large volumes too if you use the XtraDB storage engine, which you can find in Percona Server or MariaDB.
 
LVL 37

Expert Comment

by:ValentinoV
ID: 35126957
If you choose SQL Server, you'll get a data transformation service for free: Integration Services (SSIS). It comes with an IDE (BIDS - Business Intelligence Development Studio) and can be used to build "packages" to import XML files.

Here's a small "how to" on how you'd set up such a flow: http://www.ehow.com/how_7565252_import-xml-files-relational-database.html

That is just an explanation of how to import one file. You should set up a Control Flow that loops over your files and imports them.

Packages can be scheduled using the SQL Server Agent, so the import process can be automated.
 

Author Comment

by:cyber-33
ID: 35138797
Robert and Valentinov - thank you for your attention to this project. I will take a look at the MS SQL Server article - it does look interesting.

I am concerned that because I don't deal with regular XML files, and I have to separate them using VB, no transformation package would help me. I could use a transformation package after the files are separated, but I expect to generate about 5M files. Storing such a large number of files in a single folder structure may not be a good idea (and I am not sure that Windows even allows it).

The only option I see is, for each separated XML file, to
(1) Store it on a hard drive
(2) Generate a DB INSERT statement for that file
(3) Execute the SQL
(4) Remove the file and repeat from (1)

The generate-INSERT step (2) can be tricky because the structure of my XML files varies slightly from file to file, and I would like to preserve it.

So step (2) is the next challenge :)

One approach is to do the processing before database insertion. For that, I found an interesting utility, Advanced XML Converter (http://www.xml-converter.com). It can generate SQL insert code (including table creation) for XML files using a command line interface. I am concerned, though, about how it would hold up if I use this utility 5M times.


Another approach would be to create a table with two columns, ID and XML, and populate the XML column with the entire file. Then, after the data is in the database, there might be tools built into the engine that can process the XML column and separate it into individual tables.

I suspect the second approach will be easier to implement.

Please let me know if you have any thoughts about this, or if you know a specific way to take an XML column and separate the data into a normalized structure.

Thank you.
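The second approach above - stage the raw XML in a two-column table, then shred it into normalized tables - can be sketched as follows. This is a minimal stand-in using Python with SQLite (in SQL Server, the XML data type and its nodes()/value() methods play this role); all table, column, and tag names here are illustrative:

```python
import sqlite3
import xml.etree.ElementTree as ET

def shred(db):
    """Load raw XML into a staging table, then split it into a normalized
    table -- a stand-in for SQL Server's nodes()/value() shredding."""
    db.execute("CREATE TABLE staging (id INTEGER PRIMARY KEY, xml TEXT)")
    db.execute("CREATE TABLE docs (doc_id TEXT, country TEXT)")
    # second approach: the whole document goes in as one XML value first...
    db.execute("INSERT INTO staging (xml) VALUES (?)",
               ('<group><doc id="a" country="US"/>'
                '<doc id="b" country="DE"/></group>',))
    # ...then gets shredded into columns afterwards
    for (raw,) in db.execute("SELECT xml FROM staging"):
        for doc in ET.fromstring(raw).iter("doc"):
            db.execute("INSERT INTO docs VALUES (?, ?)",
                       (doc.get("id"), doc.get("country")))
```

In SQL Server the same shredding can be done entirely inside the engine with a query over the XML column, so no intermediate files are needed.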

Author Comment

by:cyber-33
ID: 35139016
Also, I am having trouble separating the combination files into separate XML files. The .NET XmlReader object chokes every time it encounters a new XML declaration within the same file. That probably means I cannot use XmlReader and should parse the file line by line as regular text.

Agreed?
 
LVL 37

Expert Comment

by:ValentinoV
ID: 35139125
Wow, 5 million files??  Djeez, that's more than I've encountered so far...

As for importing complex XML structures using SSIS, I have to say that the XML Source component does have its limitations: it cannot handle complex structures. What I usually do is create XSLT files to extract data out of the XML files and save it to new files in CSV format.

Applying an XSLT to an XML file can be done through the "XML Task", one of the components in the Control Flow.  More info here: http://msdn.microsoft.com/en-us/library/ms141055.aspx

I'm not sure about the maximum size that this component is able to process, so that may need to be tested first.

There's some interesting info here regarding the limitations of that XML Task: http://connect.microsoft.com/SQLServer/feedback/details/483678/ssis-xml-task-xslt-operation-scalability-issue#details

Taking that info into account: maybe XSLT is the way to go to get your files processed. You'll just need to find a good XSLT processor that can handle large files smoothly (possibly a custom .NET app?).
In the case of a custom app, if you build a command-line app with parameters it will be possible to use SSIS to orchestrate everything (loop over files, fetch, process through the custom app, archive, import CSVs into the database, ...).
 
LVL 37

Expert Comment

by:ValentinoV
ID: 35139168
Also, I am having trouble separating the combination files into separate XML files. The .NET XmlReader object chokes every time it encounters a new XML declaration within the same file. That probably means I cannot use XmlReader and should parse the file line by line as regular text.

As far as I'm aware, that will always be a problem that requires a custom app of some sort. A valid XML file should contain "an XML doc" only once :-)
 

Author Comment

by:cyber-33
ID: 35139835
I am not really an expert in VB.NET. Can you suggest good code to parse a file that contains multiple tag groups like the ones below?

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE abc-new_doc SYSTEM "abc-08-23.dtd" [ ]>
<abc-new_doc  lang="EN" dtd-version="v4.2 2006-08-23" file="20110125.XML" status="PRODUCTION" id="abc-new_doc " country="US" date-reported="20110110">


 
LVL 37

Expert Comment

by:ValentinoV
ID: 35158664
The following code snippet should do the trick. (code tested in a VB.NET "Windows Forms Application" in Visual Studio 2008)

It assumes that the third line of each document contains a file="YYYYMMDD.XML" attribute, just like in your example.  Further assumptions are mentioned as comments in the code.

It takes your input file consisting of multiple XML documents and creates separate XML files.  The name of the output files is the one mentioned in the file attribute.

Don't hesitate to let me know if something here is not clear.
Imports System.IO
Imports System.Text

Public Class Form1

    Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click

        Dim fileLocation As String = New String("C:\test\dotnet\VB_FileSplit\")
        Dim myStreamReader As New StreamReader(fileLocation & "input.txt")

        Dim line As String
        Dim filename As String
        Dim filenameStart As Integer
        Dim myStreamWriter As StreamWriter
        Dim sb As StringBuilder

        'process input file line by line
        While Not myStreamReader.EndOfStream
            line = myStreamReader.ReadLine()
            'assumes that each XML document starts with the line below
            If line.Contains("<?xml version=""1.0"" encoding=""UTF-8""?>") Then
                If Not IsNothing(myStreamWriter) Then
                    myStreamWriter.Close()
                End If

                sb = New StringBuilder() 'temporarily store lines in StringBuilder
                sb.AppendLine(line)
                line = myStreamReader.ReadLine() 'read DOCTYPE line
                sb.AppendLine(line)
                line = myStreamReader.ReadLine() 'read line containing file name
                filenameStart = line.IndexOf("file=""") + 6
                filename = line.Substring(filenameStart, 12) 'assumes that the file name is always 12 characters long

                myStreamWriter = New StreamWriter(fileLocation & filename)
                myStreamWriter.Write(sb) 'write previously read lines to new output file
            End If
            myStreamWriter.WriteLine(line)
        End While

        myStreamWriter.Close()
        myStreamReader.Close()

    End Sub
End Class

 

Author Comment

by:cyber-33
ID: 35169389
ValentinoV,
Sorry for the delayed response. The code is very helpful and it works. THANK YOU. I tweaked it to remove the 12-character file name assumption like this:

                line = myStreamReader.ReadLine() 'read line containing file name
                filenameStart = line.IndexOf("file=""") + 6
                filenameEnd = line.IndexOf(".XML") + 4
                filename = line.Substring(filenameStart, filenameEnd - filenameStart)

The issue now is that I have 1000s of files to process, so the function you specified needs to be adjusted to take the input file name as a command line argument.

1. Do you know an easy way to do that in VB?
2. Some XMLs reference DTD files, which I don't have. Given that I am planning to import the separated files into the database, I am not sure I even need the DTD files. So I am thinking about tweaking the function further to remove the DTD lines in the process of file separation. Any thoughts?

 

Author Comment

by:cyber-33
ID: 35169418
Another issue I have is that I need to implement some good try/catch/log functionality. I am not a VB expert, so if you could suggest a good way to log errors as I am parsing my thousands of files, I would really appreciate it.

After these 3 modifications (command line, DTD removal, logging), I will award all 500 points to you and will start another question for any further developments on this project. Again, thank you for your help.
 
LVL 37

Accepted Solution

by:
ValentinoV earned 500 total points
ID: 35182065
Have a look at the attached code.  All three questions have been addressed.  It has been tested in a Console application so that we've got support for command-line parameters.

The application expects two parameters: first the input file (with full path), then the output folder (should be an existing folder).  More comments in the code.

I've put all the code in a try/catch block.  Whenever an error occurs it will get logged to a file (whose path is hard-coded, as you'll see), but the application itself will not fail.  So if it's executed through an SSIS "Execute Process" task, you won't notice any issues; the only way to know is to check the file itself.

Another option is to just let the exception pass to SSIS (remove try/catch) and use an event handler there to log the error to an errorlog table for instance.

Now, as for the remaining question, the removal of the DTD lines, that's an easy one.  Just comment out the line that writes the DOCTYPE lines to the output file (see comment in code).  It assumes that the DOCTYPE line is always the second line in the XML doc.

Hope this helps?

Regards,
Valentino.

Imports System.IO
Imports System.Text

Module Module1

    Sub Main()

        Try
            Dim commandLineArgs = Environment.GetCommandLineArgs()
            Dim inputFile = commandLineArgs(1) 'full path + input file name
            Dim outputFolder = commandLineArgs(2) 'assumes that outputFolder ends in a backslash

            ProcessFile(inputFile, outputFolder)

        Catch ex As Exception
            Dim errorStreamWriter As New StreamWriter("c:\test\dotnet\vb_filesplit\error.txt", True) 'True = append mode, so earlier errors are kept
            errorStreamWriter.WriteLine(ex.ToString())
            errorStreamWriter.Close()
        End Try

    End Sub

    Private Sub ProcessFile(ByVal inputFile As String, ByVal outputFolder As String)

        Dim myStreamReader As New StreamReader(inputFile)

        Dim line As String
        Dim filename As String
        Dim filenameStart As Integer
        Dim filenameEnd As Integer
        Dim myStreamWriter As StreamWriter
        Dim sb As StringBuilder

        'process input file line by line
        While Not myStreamReader.EndOfStream
            line = myStreamReader.ReadLine()
            'assumes that each XML document starts with the line below
            If line.Contains("<?xml version=""1.0"" encoding=""UTF-8""?>") Then
                If Not IsNothing(myStreamWriter) Then
                    myStreamWriter.Close()
                End If

                sb = New StringBuilder() 'temporarily store lines in StringBuilder
                sb.AppendLine(line)
                line = myStreamReader.ReadLine() 'read DOCTYPE line
                sb.AppendLine(line) 'REMOVE THIS LINE to remove the DOCTYPE line from the output

                line = myStreamReader.ReadLine() 'read line containing file name
                filenameStart = line.IndexOf("file=""") + 6
                filenameEnd = line.IndexOf(".XML") + 4
                filename = line.Substring(filenameStart, filenameEnd - filenameStart)

                myStreamWriter = New StreamWriter(outputFolder & filename)
                myStreamWriter.Write(sb) 'write previously read lines to new output file
            End If
            myStreamWriter.WriteLine(line)
        End While

        myStreamWriter.Close()
        myStreamReader.Close()
    End Sub

End Module

 

Author Comment

by:cyber-33
ID: 35189913
ValentinoV,
Thank you very much for your help on this.

Cyber-33.