I am trying to import a large collection of XML files into a database. The collection contains 1400+ files, each averaging about 300MB of data. Each file contains a group of 1000+ XML documents, i.e., multiple instances of the following tags:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE abc-new_doc SYSTEM "abc-08-23.dtd" [ ]>
<abc-new_doc lang="EN" dtd-version="v4.2 2006-08-23" file="20110125.XML" status="PRODUCTION" id="abc-new_doc " country="US" date-reported="20110110">
The first challenge is to deal with the grouped XML documents. The current plan is to write a VB program that will parse all 1400+ files and remove the per-document identifier tags (the <?xml declaration and the <!DOCTYPE line). The program will also add a <group> tag to the beginning of each file and the corresponding </group> tag to the end. Hopefully, this will create an XML-conforming structure with only one first-level (root) tag.
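For what it is worth, below is a minimal sketch of that preprocessing step. It is written in Python rather than VB (the same logic carries over), the input/output directory names are placeholders, and it assumes each <?xml declaration and <!DOCTYPE line sits on its own line, as in the sample above.

import glob
import os

SRC_DIR = "xml_in"    # placeholder: directory holding the original 1400+ files
DST_DIR = "xml_out"   # placeholder: directory for the wrapped output files

os.makedirs(DST_DIR, exist_ok=True)

for path in glob.glob(os.path.join(SRC_DIR, "*.XML")):
    out_path = os.path.join(DST_DIR, os.path.basename(path))
    with open(path, "r", encoding="utf-8", errors="replace") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        # one declaration and one root element for the whole file
        dst.write('<?xml version="1.0" encoding="UTF-8"?>\n<group>\n')
        for line in src:
            lead = line.lstrip().lower()
            # drop the per-document prolog lines so the file keeps a single root
            if lead.startswith("<?xml") or lead.startswith("<!doctype"):
                continue
            dst.write(line)
        dst.write("</group>\n")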
The second challenge is to find a robust data transformation tool that can import 300MB XML files and insert them into a database. I read good things about Talend and Pentaho, but I am not familiar with either of them. I want to make sure that the tool I select can be run from the command line (I would have to automate the import of 1400+ files) and can handle 300MB XML documents. (The other approach is to split the 300MB grouped XML files into smaller subfiles.)
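If splitting does become necessary, a streaming parser lets you do it without holding 300MB in memory. Here is a rough sketch using Python's xml.etree.ElementTree.iterparse, assuming the files have already been wrapped in <group> as above, that each inner document's root tag is abc-new_doc, and that the documents do not depend on entities defined in the external DTD; the file names are placeholders.

import xml.etree.ElementTree as ET

def write_chunk(prefix, n, docs):
    # write one smaller, well-formed <group> file containing a batch of documents
    with open(f"{prefix}_{n:04d}.xml", "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n<group>\n')
        f.writelines(docs)
        f.write("</group>\n")

def split_group(group_file, out_prefix, docs_per_chunk=100):
    # stream the wrapped file and emit chunks of docs_per_chunk documents
    chunk, chunk_no = [], 0
    for event, elem in ET.iterparse(group_file, events=("end",)):
        if elem.tag == "abc-new_doc":
            chunk.append(ET.tostring(elem, encoding="unicode"))
            elem.clear()  # release the parsed document to keep memory use flat
            if len(chunk) >= docs_per_chunk:
                write_chunk(out_prefix, chunk_no, chunk)
                chunk, chunk_no = [], chunk_no + 1
    if chunk:
        write_chunk(out_prefix, chunk_no, chunk)

split_group("20110125_wrapped.XML", "20110125_part")  # placeholder file names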
The third challenge is the choice of a database. The considerations here are cost, the ability to handle a large volume of data, and, optionally, analytic capability for analyzing the data. I am currently debating between MySQL and SQL Server. Oracle DB seems to be more expensive than SQL Server, and I am not sure I need it.
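Whichever database I settle on, I expect the loading step to look much the same. Below is a rough sketch of a parameterized insert using the mysql-connector-python driver; the docs table, its columns, and the connection details are all hypothetical and would need to match the real schema.

import xml.etree.ElementTree as ET
import mysql.connector  # assumes the mysql-connector-python package is installed

conn = mysql.connector.connect(
    host="localhost", user="loader", password="secret", database="xml_import"  # placeholders
)
cur = conn.cursor()

def load_document(elem):
    # insert a few attributes of one abc-new_doc element into a hypothetical docs table
    cur.execute(
        "INSERT INTO docs (country, status, date_reported, source_file) "
        "VALUES (%s, %s, %s, %s)",
        (elem.get("country"), elem.get("status"),
         elem.get("date-reported"), elem.get("file")),
    )

# example: stream one wrapped group file and load each document it contains
for event, elem in ET.iterparse("20110125_wrapped.XML", events=("end",)):
    if elem.tag == "abc-new_doc":
        load_document(elem)
        elem.clear()

conn.commit()
cur.close()
conn.close()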
This is a challenging project (at least for me). I will award the points to the expert who can guide me through the implementation.