Create xml files from sql server tables

I'm trying to create xml files from large record sets (2 million+ records). The xml files need to have a header/detail relationship (i.e., there is an Invoice header that is stored in the Invoices table and one or more Invoice Line Items for the invoice that are stored in the InvoiceLineItems table).  I've created a stored proc that returns xml using the following query:


-- Tag 1: invoice header rows
SELECT 1 AS Tag,
NULL AS Parent,
InvoiceULID AS [SD_Invoice!1!InvoiceULID!element],
InvoiceDate AS [SD_Invoice!1!InvoiceDate!element],
NULL AS [SD_InvoiceLineItems!2!InvoiceLineItemULID!element],
NULL AS [SD_InvoiceLineItems!2!InvoiceULID!element],
NULL AS [SD_InvoiceLineItems!2!InvoiceLineNumber!element],
NULL AS [SD_InvoiceLineItems!2!PurchaseOrderDate!element],
NULL AS [SD_InvoiceLineItems!2!PurchaseOrderULID!element]
FROM Invoices
UNION ALL
-- Tag 2: line item rows, joined to their parent invoice
SELECT 2,
1,
i.InvoiceULID,
NULL,
li.InvoiceLineItemULID,
li.InvoiceULID,
li.InvoiceLineNumber,
li.PurchaseOrderDate,
li.PurchaseOrderULID
FROM InvoicesLineItems li
INNER JOIN Invoices i ON li.RecordID = i.RecordID
ORDER BY [SD_Invoice!1!InvoiceULID!element], [SD_InvoiceLineItems!2!InvoiceLineItemULID!element]
FOR XML EXPLICIT

I want to save the resulting xml into an xml document.  I execute the stored proc from within an ActiveX script in a DTS package using the following code:



'Note:  values of all global variables were set in previous dts steps
Function Main()
    Dim cmd, sql, dom, xmlFileName, tagFile, tagFileName, fileType, _
        timeStamp, sourceID, fso, rootTag, encoding

    'Create DOMDocument from "Microsoft XML Core Services" API
    Set dom = CreateObject("Msxml2.DOMDocument.4.0")

    'Create ADO Command object
    Set cmd = CreateObject("ADODB.Command")
   
    cmd.ActiveConnection = DTSGlobalVariables("SQL_Connection").Value
   
    timeStamp = DTSGlobalVariables("Timestamp")
    rootTag = DTSGlobalVariables("RootTag")      
     
    'Execute XML Template
    sql = "<" & rootTag & " timestamp='" & timeStamp & _
          "' xmlns:sql=""urn:schemas-microsoft-com:xml-sql"">" & _
          "<sql:query>EXEC pr_CreateXML 'Invoices' </sql:query></" & rootTag & ">"
   
 
    cmd.CommandText = sql
   
    'never timeout
    cmd.CommandTimeout = 0  
   
    'GUID of the SQL XML dialect (DBGUID_MSSQLXML) used by SQL Server 2000
    '(Dialect identifies what form of XML the Recordset is stored in)
    cmd.Dialect = "{5D531CB2-E6E8-11D2-B252-00C04F681B71}"
   
    'set output stream to DOM
    cmd.Properties("Output Stream") = dom
   
    'the results of a command execution should be returned as a stream (adExecuteStream = 0x400)
    cmd.Execute , , 1024

    'create xml file
    xmlFileName = DTSGlobalVariables("FileName") & ".xml"
    dom.Save xmlFileName
   
   
    Main = DTSTaskExecResult_Success
End Function


The stored proc and dts package work fine when the tables accessed by the stored proc are small, but with large record sets the package takes hours to create the xml document.  For really large record sets, the script will fail and I will get an error message stating that there is not enough storage space to complete the operation.  I have 200 GB of free space on the drive, so I wouldn't think disk space would cause the problem.

Is there a way to speed up the xml file creation process, or a different way all together to create my xml documents?

Note: I've already looked at the execution plan of the query in the stored proc.  The tables are indexed appropriately.
 
volkingCommented:
hmmmmm ... I had a similar problem (1.8 million records). Eventually, I gave up the elegant solution and took a brute-force approach. Create a table to be used as an accumulator. The accumulator table has two columns. An identity auto-number and a varchar(1000). I started a cursor walking the 1.8 million records, pumping single node records into the accumulator table. When finished I simply did a select from the accumulator table in identity order.

Not elegant, but it works, and total processing time was about 10 minutes.
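Roughly, the walk looks like this (a sketch only, untested; the accumulator and cursor names are made up, the element names are taken from your query, and I'm only showing the header nodes; the line items get pumped in the same way):

CREATE TABLE XmlAccumulator (
    LineID int IDENTITY(1,1) PRIMARY KEY,
    XmlFragment varchar(1000) NOT NULL
)

DECLARE @InvoiceULID varchar(50), @InvoiceDate datetime

DECLARE inv_cursor CURSOR FAST_FORWARD FOR
SELECT InvoiceULID, InvoiceDate FROM Invoices ORDER BY InvoiceULID

OPEN inv_cursor
FETCH NEXT FROM inv_cursor INTO @InvoiceULID, @InvoiceDate

WHILE @@FETCH_STATUS = 0
BEGIN
    -- build each node as a plain string and pump it into the accumulator
    INSERT INTO XmlAccumulator (XmlFragment)
    VALUES ('<SD_Invoice><InvoiceULID>' + @InvoiceULID + '</InvoiceULID>' +
            '<InvoiceDate>' + CONVERT(varchar(10), @InvoiceDate, 101) +
            '</InvoiceDate></SD_Invoice>')

    FETCH NEXT FROM inv_cursor INTO @InvoiceULID, @InvoiceDate
END

CLOSE inv_cursor
DEALLOCATE inv_cursor

-- selecting in identity order returns the document top to bottom
SELECT XmlFragment FROM XmlAccumulator ORDER BY LineID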

good luck
 
AdvITAuthor Commented:
I've tried several variations of the SELECT statement.  I initially stored all the records flattened out in a single Invoice table, but I wasn't getting the correct header/detail relationship in my xml document.  Will your suggestion result in the correct structure in the xml document?

I need the resulting document to look like this:
<Invoices>
<SD_Invoice>
    <InvoiceULID>abc</InvoiceULID>
    <InvoiceDate>01/01/2000</InvoiceDate>
    <SD_InvoiceLineItems>
        <InvoiceLineItemULID>cde</InvoiceLineItemULID>
         ...
    </SD_InvoiceLineItems>
</SD_Invoice>
</Invoices>

 
ShogunWadeCommented:
"but with large record sets the package takes hours to create the xml document.  For really large record sets, "   how large is large and how large is very large ?
 
AdvITAuthor Commented:
Sorry, by large I mean around 1 million records, by really large I mean over 2 million records.  The counts reflect the number of records in the InvoiceLineItems table.  The Invoice table had 600,000 records for every test scenario.
 
ShogunWadeCommented:
wow, and you want all this in a single XML document?

something to bear in mind: SQL 2000 wasn't really designed for XML; XML was grafted on to it because MS needed to keep up with Oracle.  I understand Yukon is far better at it.  Anyway, as such it is not as optimal as it could be (to say the least).

I know I probably don't know enough about the purpose of this task, but I would think that realistically the production of an XML document containing over 2 million records is somewhat overambitious for SQL Server.
 
AdvITAuthor Commented:
Ideally I would like to contain all code to create the xml documents in a single dts package which would reside on the server and be scheduled to execute weekly.  Are there any ways to accomplish this considering the size of the data that I am working with?
 
SoftEng007Commented:
I don't know if this would work, but since your proc outputs xml, couldn't you just pass the stream to an FSO TextStream and write it directly to disk?
Loading that large a document into the DOM (the DOM loads all nodes in memory) is probably what's killing you,
and the only thing I see you doing with the DOM is saving it to disk.
 
Anthony PerkinsCommented:
SoftEng007,

You make a very good point.  No doubt loading the Xml document in the MSXML DOM is a large contributing factor in the lousy performance.  In fact, I would go one step further and say there is no need to use even the FileSystemObject; all you need is the ADO Stream object.  On closer scrutiny, I do not see any need for the Dialect property either (although I suspect I know where you got it from); it is not required for outputting Xml.  In other words, your function could be as simple as this (untested):

Function Main()
    'ADO constants are not predefined inside a DTS ActiveX script,
    'so declare the ones we use
    Const adCmdText = 1
    Const adExecuteStream = 1024
    Const adSaveCreateOverWrite = 2

    Dim cmd
    Dim stm

    Set stm = CreateObject("ADODB.Stream")
    stm.Open

    Set cmd = CreateObject("ADODB.Command")
    With cmd
          .ActiveConnection = DTSGlobalVariables("SQL_Connection").Value
          .CommandType = adCmdText
          .CommandText = "EXEC pr_CreateXML 'Invoices'"
          .CommandTimeout = 120         ' Adjust this appropriately
          .Properties("Output Stream").Value = stm
          .Execute , , adExecuteStream
    End With
    Set cmd = Nothing

    'write the stream straight to disk (overwrites the file if it exists)
    stm.SaveToFile "YourFileNameGoesHere", adSaveCreateOverWrite
    stm.Close
    Set stm = Nothing

    Main = DTSTaskExecResult_Success
End Function
 
Anthony PerkinsCommented:
What you need to do is run the stored procedure in Query Analyzer and determine if it returns fast enough; if it does, then try the changes SoftEng007 and I have suggested.  If it is taking way too long even in Query Analyzer, then you need to rethink your plan ...

But your first priority is narrowing down the bottleneck: stored procedure or VBScript.
 
AdvITAuthor Commented:
SoftEng007 and acperkins,

I tried your suggestions (abandoning the dom and saving the stream directly to disk), and I am getting the same error: 'Not enough storage space to complete this operation'.  I assume that storage space refers to memory, since I have plenty of free disk space for the xml file that is to be created.  Since my query is performing a union on two large record sets (the Invoices table has 300,000 records and the InvoiceLineItems table has 2,700,000 records), then performing an ORDER BY on the results, is it that the entire result set is loaded into memory to perform the ordering, and that is what is causing the problem?  I ran the stored proc in Query Analyzer on a smaller set of data (50,000/400,000 records) and, while it took 1 hour to execute, it did complete.  Running the stored proc on the real data (300,000/2,700,000) the procedure did not complete after 9 hours of execution (I finally had to kill it).  Do you think that this is just too much data on which to perform this type of operation?
 
SoftEng007Commented:
>>Since my query is performing a union on two large record sets (the Invoices table has 300,000 records and the InvoiceLineItems table has 2,700,000 records)<<

if you mean a join instead of a union, the most records returned should be 2,700,000.
not too much for SQL Server (assuming a left join, not a cross join)

maybe indexes could help with your join time.
post your table structures and query and we can help refine the process.
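for example, something along these lines on the join key might help (just a guess without seeing your structures; the index names are made up):

CREATE NONCLUSTERED INDEX IX_InvoicesLineItems_RecordID
ON InvoicesLineItems (RecordID)

CREATE NONCLUSTERED INDEX IX_Invoices_RecordID
ON Invoices (RecordID)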
 
Anthony PerkinsCommented:
SoftEng007,

>>if you mean a join instead of a union<<
It is a union.  Most non-trivial Xml documents generated with FOR XML EXPLICIT require a UNION clause.

>>I ran the stored proc in Query Analyzer on a smaller set of data  (50,000/400,000 records) and, while it took 1 hour to execute, it did complete.  Running the stored proc on the real data (300,000/2,700,000) the procedure did not complete after 9 hours of execution (I finally had to kill it). <<
Then the answer is the stored procedure not VBScript is the bottleneck.

I suspect the solution is to create the Xml documents in parts and then append them to the file using the FileSystemObject.
 
Anthony PerkinsCommented:
>>I am getting the same error: 'Not enough storage space to complete this operation'.  I assume that storage space refers to memory, since I have plenty of free disk space for the xml file that is to be created. <<
I suspect this is in reference to the temporary space (tempdb) required by SQL Server to run queries.

>>Running the stored proc on the real data (300,000/2,700,000) <<
As I mentioned previously, you are going to have to execute this query in parts.  I would start with 10k invoices at a time and append them to an Xml file.
 
AdvITAuthor Commented:
The application that consumes these xml files uses the XmlTextReader, which has a 2 GB limitation on the files that it can read.  With this in mind, possibly the best solution would be to create the xml in parts, but rather than appending to the file, create new files for maybe every 100,000 header records.  The logic of the application allows several xml files of the same type to exist, as long as the timestamp attribute on each file is identical.

Possibly I can use a cursor in the stored proc that grabs 100,000 header records (Invoice table) at a time.  I can change the stored proc to accept a range parameter (on the table's identity column) and put the code that creates the file in a loop.  Is this the best/only solution left?
 
Anthony PerkinsCommented:
>>Is this the best/only solution left?<<
IMHO, yes.  Although I trust you do not mean "a cursor" in the T-SQL sense, but rather a subset of rows using an appropriate WHERE clause or INNER JOIN.  Using a CURSOR would only make it slower.

Also, you need to focus on speeding up creating the Xml in SQL Query Analyzer; when you have that at an acceptable speed, then and only then move on to the second part.
 
SoftEng007Commented:
acperkins:
>>It is a union.  Most non-trivial Xml documents generated with FOR XML EXPLICIT require a UNION clause.<<
yeah, missed that completely. I was thinking about a base query.

AdvIT:
your next option is to move off of SQL and build a .NET app that can read the header rows and loop through them to build the detail nodes for each one using a StreamWriter or the DOM.
 
AdvITAuthor Commented:
I already wrote a C# app that creates the xml files.  The app will even split files that are larger than 2GB so that the application that uses the files can actually read them.  The problem is that the decision was made to have all the code to create the xml files contained in DTS package(s) that can be scheduled on a predetermined basis.  The requirements are that no external components are used, i.e. write everything in t-sql and vbscript.  I am currently experimenting with using the identity field to grab 100,000 records at a time.  Something along these lines:

declare @min_ident int
declare @max_ident int
declare @rows int

set @min_ident = 0
 
while 1 = 1
begin
    -- see if any records exist
    select @rows = count(*)
    from
    (
    select top 1 identcol
    from invoices
    where identcol > @min_ident
    order by identcol
    ) a

    -- no records to use for xml file, exit
    if @rows = 0
    begin
        return
    end
   

    -- get greatest identity column value returned
    select top 1 @max_ident = identcol
    from
    (
    select top 100000 identcol
    from invoices
    where identcol > @min_ident
    order by identcol
    ) a
    order by a.identcol desc

   -- ***************************************************
   -- call stored proc to create xml files here
   -- pass @min_ident and @max_ident as parameters to stored proc
   -- ***************************************************
   
    set @min_ident = @max_ident
end
go


******** changes to stored proc
SELECT 1 AS Tag,
NULL AS Parent,
InvoiceULID AS [SD_Invoice!1!InvoiceULID!element],
InvoiceDate AS [SD_Invoice!1!InvoiceDate!element],
NULL AS [SD_InvoiceLineItems!2!InvoiceLineItemULID!element],  
NULL AS [SD_InvoiceLineItems!2!InvoiceULID!element],
NULL AS [SD_InvoiceLineItems!2!InvoiceLineNumber!element],
NULL AS [SD_InvoiceLineItems!2!PurchaseOrderDate!element],
NULL AS [SD_InvoiceLineItems!2!PurchaseOrderULID!element]
FROM Invoices
WHERE identcol between  @min_ident  and @max_ident                              -- add this  **************
UNION ALL
SELECT  2,
1,  
i.InvoiceULID,
NULL,
li.InvoiceLineItemULID,
li.InvoiceULID,
li.InvoiceLineNumber,
li.PurchaseOrderDate,
li.PurchaseOrderULID
FROM InvoicesLineItems li
INNER JOIN Invoices i ON li.RecordID = i.RecordID
WHERE i.identcol between @min_ident and @max_ident                                                 -- add this  **********
ORDER BY [SD_Invoice!1!InvoiceULID!element], [SD_InvoiceLineItems!2!InvoiceLineItemULID!element]  
FOR XML EXPLICIT



Any thoughts on this approach?



   

 
Anthony PerkinsCommented:
>>with using the identity field to grab 100,000 records at a time.<<
I would reduce that even further to 10K rows.

Another approach would be to either modify your Invoices table and add a flag (an XmlProcessed column), or create another table (an XmlProcessed table) that records processed invoices.  You could then just do a SELECT TOP 10000 (or whatever) of the unprocessed invoices.  When one batch is done, update the flag in the Invoices table or add the invoices to the XmlProcessed table.

Trust that makes sense.
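Roughly like this (untested; assumes a bit column named XmlProcessed has been added to Invoices and defaulted to 0, and the object names are only for illustration):

-- grab the next batch of unprocessed invoices
SELECT TOP 10000 InvoiceULID
INTO #Batch
FROM Invoices
WHERE XmlProcessed = 0
ORDER BY InvoiceULID

-- ... generate the Xml file for the invoices in #Batch here ...

-- then flag the batch as done so the next pass skips it
UPDATE i
SET XmlProcessed = 1
FROM Invoices i
INNER JOIN #Batch b ON i.InvoiceULID = b.InvoiceULID

DROP TABLE #Batch

Repeat until the SELECT returns no rows.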
 
AdvITAuthor Commented:
I got it to work by grabbing n rows at a time and creating several xml files.  The process is pretty fast compared to previous attempts.  I was able to create seven 600 MB files in about 20 minutes.

Thanks for the help.