Learn how to a build a cloud-first strategyRegister Now

x
?
Solved

Handling very large xls in c#

Posted on 2011-04-27
23
Medium Priority
?
3,480 Views
Last Modified: 2013-11-17
Hi there,
I need to process some very large .xls files from C#. The particular method's purpose is quite simple, it has to take the .xls and convert all the worksheets into separate .csv files. However, it throws an OutOfMemory exception when I try to convert the range into an object array.
The files are usually about 250 MB plus.
Any ideas on how this can be done please?

Thanks,
Leo
public static string CreateCsvFromXls(string xlsPath)
        {
            try
            {
                if (!File.Exists(xlsPath))
                    throw (new Exception("File Not found"));

                int ix = xlsPath.LastIndexOfAny(new char[] { '\\' });
                string filePath = xlsPath.Substring(0, ix);
                string fileName = xlsPath.Substring(ix + 1, xlsPath.Length - ix-1);

                //Assign each object in the Excel Object model to it's own object or the GC won't work properly
                _Application excelApp = new Microsoft.Office.Interop.Excel.ApplicationClass();
                _Workbook wkb = excelApp.Workbooks.Open(xlsPath, 2, true, Type.Missing, Type.Missing, Type.Missing,
                                                        Type.Missing, Type.Missing, Type.Missing, Type.Missing,
                                                        Type.Missing, Type.Missing, false, Type.Missing, Type.Missing);

                object[,] xlsArray;
                string line;
                string saveFile;                
                foreach (_Worksheet wks in wkb.Worksheets)
                {
                    saveFile = fileName.Replace(".xlsx", ".csv");
                    saveFile = saveFile.Replace(".xls", ".csv");
                    saveFile = saveFile.Replace(".csv", "_" + wks.Name + ".csv");
                    xlsArray = (object[,])wks.UsedRange.get_Value(XlRangeValueDataType.xlRangeValueDefault);
                    if (xlsArray.Length > 0)
                    {
                        using (StreamWriter sw = File.CreateText(filePath + "\\" +saveFile))
                        {
                            for (long i = 0; i < xlsArray.GetUpperBound(0); i++)
                            {
                                line = "";
                                for (long j = 0; j < xlsArray.GetUpperBound(1); j++)
                                {
                                    line = line + xlsArray[i, j] + ",";
                                }
                                sw.WriteLine(line.Substring(0, line.Length - 1));
                            }
                        }
                    }
                }
                wkb.Close(false, Missing.Value, Missing.Value);                
                return fileName;
            }
            catch (Exception e)
            {
                return e.Message;
            }

        }

Open in new window

0
Comment
Question by:lm78
  • 9
  • 6
  • 5
  • +2
23 Comments
 
LVL 33

Assisted Solution

by:Todd Gerbert
Todd Gerbert earned 400 total points
ID: 35475909
You could try using OleDb to read the XLS files, and re-writing them to CSV's manually using a System.IO.FileStream.  Should work more reliably than using COM with Excel (and run quicker), but might require a few more steps on your part.  I think one or both of these links is VB.Net, but the methods/concepts still apply to C# (just imagine every line of code ends with a semi-colon to convert from VB.Net to C# ;) ).

http://vb.net-informations.com/excel-2007/vb.net_excel_oledb.htm
http://support.microsoft.com/kb/316934
0
 
LVL 45

Accepted Solution

by:
AndyAinscow earned 600 total points
ID: 35475957
The .net memory management, by design, will assign memory from specific 'heaps' of memory.  This menas that you might have plenty of memory available but still not be able to use it.
IF this is the cause of your problem then it is by design of .net - you need to recode or use a classical language (such as C++) where this limitation isn't built in.
0
 
LVL 33

Expert Comment

by:Todd Gerbert
ID: 35476053
.Net should happily continue to allocate memory to you up to a maximum of 2GB (or possibly 3GB if the approprite boot entry has been set) on a 32-bit process, and 8TB (terabytes) on a 64-bit process.

Instead of sucking the entire contents of the XLS, read a line, convert it, write it, read the next line, etc.   The Excel COM model also probably exposes a method that will allow you to use Excel's built-in export to CSV.
0
Concerto Cloud for Software Providers & ISVs

Can Concerto Cloud Services help you focus on evolving your application offerings, while delivering the best cloud experience to your customers? From DevOps to revenue models and customer support, the answer is yes!

Learn how Concerto can help you.

 

Author Comment

by:lm78
ID: 35476078
I thought of that - reading line by line, but thought that would be extremely slow and so wanted to avoid it. I couldn't find the excel built in converter to csv though? Tried doing SaveAs but threw an exception!
0
 
LVL 33

Expert Comment

by:Todd Gerbert
ID: 35476124
The Excel COM model also probably exposes a method that will allow you to use Excel's built-in export to CSV.

But, I still recommend avoiding COM to Office applications - I believe using OleDb to read the XLS files will work well for you in this instance.
0
 

Author Comment

by:lm78
ID: 35476156
Hmm,ok, will try that. Bit hard up on time right now.
0
 
LVL 45

Expert Comment

by:AndyAinscow
ID: 35476202
@tgerbert - AFAIK no, .NET uses four separate, and differently sized, heaps of memory for different objects.  It is possible to run out of memory in a .net app despite there being more than enough memory available for a classic program to do the same memory assigment.
0
 
LVL 33

Expert Comment

by:Todd Gerbert
ID: 35476499
Heaps aren't limited in size (though they may start out with different sizes) except by the available address space - wich is 2 to the 32nd on 32-bit processes, and 2 to the 64th on 64-bit processes, and Windows reserves half for it's own use, thus the 2GB and 8TB limits for 32- and 64-bit processes, respectively.  The stack is limited and will much more quickly run out memory (I think it's 1MB in .Net), but space to store reference types occurs on the heap (though the pointer to that object might be on the stack), and in this case might even be allocated by Excel's process.

Of course the .Net runtime is going to consume some of that available memory, so you won't quite get the full 2GB.

You can run this code successfully as a 64-bit process, only the top loop completes as a 32-bit process.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

class Program
{
	static void Main(string[] args)
	{
		List<byte[]> byteArrays = new List<byte[]>();
		for (int i = 0; i < 1200000; i++)
			byteArrays.Add(new byte[1024]);

		Console.WriteLine("1.2GB allocated.");

		byteArrays = new List<byte[]>();
		for (int i = 0; i < 4194304; i++)
			byteArrays.Add(new byte[1024]);
		Console.WriteLine("4GB allocated.");

		Console.ReadKey();
	}
}

Open in new window

0
 
LVL 33

Expert Comment

by:Todd Gerbert
ID: 35476519
Rather, I should say heaps can be limited in size - but I don't think that's the case with .Net (or if they are it's very close to maximum addressable amount for a process).
0
 
LVL 45

Expert Comment

by:AndyAinscow
ID: 35476947
This is some years ago, based on a couple of questions here at EE and some articles - the Q.s seemed to back the articles up:  For performance reasons the total memory is split into 4 differently sized heaps which do not overlap.  Maybe there has been changes in the meanwhile - I hope so.
0
 
LVL 45

Expert Comment

by:AndyAinscow
ID: 35477545
0
 
LVL 8

Expert Comment

by:cubaman_24
ID: 35482301
Hello:
Wich version of excel are we talking about? If it's office 2007, you could decompress the excel file, get the underlying xml and use an xslt template to transform the data in the desired csv output. This would be much faster and less resource consuming than working with excel interop. Let me know if you want more details.
0
 

Author Comment

by:lm78
ID: 35483822
@cubaman: That sounds like an interesting option - give me more details pls. It's 2007, yes.
0
 
LVL 33

Expert Comment

by:Todd Gerbert
ID: 35483838
@lm78

That isn't necessary.  The OleDb provider does all that work for you - see my first post (http:#a35475909).

0
 

Author Comment

by:lm78
ID: 35483876
haha, am being lazy, i was half way through trying out the ADO solution!
0
 
LVL 33

Expert Comment

by:Todd Gerbert
ID: 35483997
@AndyAinscow

Interesting link - what he's describing there is managed heaps (which I believe are really just a logical organization, all living on the same process heap), and heap fragmentation, which happens in any programming language (actually, since lower-level languages like C and C++ don't have the benefit of a Garbage Collector they tend to be more prone), not necessarily a limit nor a limit of .Net.

Although it did bring to mind that lm78 might be able to circumvent the OutOfMemory condition by running a garbage collection at the bottom of the "foreach (_Worksheet wks in wkb.Worksheets)" loop, and/or allocating a single statically sized array and re-using it.

@lm78
If you find this comment helpful you can credit AndyAinscow's comment above (http:#a35477545) since he pointed out the issue.
0
 

Author Comment

by:lm78
ID: 35484047
I thought of doing a GC.Collect at the end of the foreach worksheet - but not sure if that will help as when i open the workbook, it loads the whole workbook to memory?
Or does it load it all on workbook open?
0
 
LVL 8

Expert Comment

by:cubaman_24
ID: 35484053
Hello lm78:
As you may know, the new office format is just a zip file with a bunch of xml, images and so on inside. Take any excel 2007, change the extension to ".zip" and open it. You'll find a fixed directory structure with xml files inside. In directory named \xl\worksheets\ you'll find several xml files, one for every sheet of your document. Their names also match the book sheet names.
It's a trivial task to write an xslt transform to convert the data to csv file. No excel installation, no interop, no PIAS. And really fast and efficient. :-)
Hope this helps. Come back if in doubt. Best regards.
0
 
LVL 33

Expert Comment

by:Todd Gerbert
ID: 35484131
>> I thought of doing a GC.Collect at the end of the foreach worksheet - but not sure if that will help as when i open the workbook, it loads the whole workbook to memory? Or does it load it all on workbook open

Well, your foreach loop is allocating a new object[,] xlsArray array with each iteration, so each successive load of a worksheet doesn't necessarily free the memory used by the previous. That's my thinking, anyway. ;)
0
 
LVL 45

Expert Comment

by:AndyAinscow
ID: 35484153
@tgerbert
But isn't a managed object using memory from a managed heap?

I was alerted to this some time ago with a particular question - something like why do I get an out of memory exception one the 2nd or 3rd time I go through a loop.  From what I could tell everything was done correctly, objects going out of scope, GC being run .... but it would still crash with an out of memory but never the first time through the loop.  Everything pointed to this being the cause.

Anyway - it might not be the cause of the problem here, just a possibility one has to consider.
0
 

Author Comment

by:lm78
ID: 35484160
True. let me try that - have to leave now but will get back after the bank holiday and close this thread then...thanks so much for all this, has been totally enlightening! :-)
0
 
LVL 33

Expert Comment

by:Todd Gerbert
ID: 35484533
@AndyAinscow

Yes, but a managed heap doesn't necessarily correspond to a "real" OS-level heap, and doesn't represent a memory limit.  I seem to recall reading that a single managed heap might even consist of more than one non-contiguous regions of memory allocated out of the process's default heap (hard to find documentation to that level of detail  though).
0
 

Expert Comment

by:zhshqzyc
ID: 35484694
I think that you may use VBA instead of C# to convert xls files to csv.
using VBA
0

Featured Post

How to Use the Help Bell

Need to boost the visibility of your question for solutions? Use the Experts Exchange Help Bell to confirm priority levels and contact subject-matter experts for question attention.  Check out this how-to article for more information.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Introduction: Dialogs (2) modeless dialog and a worker thread.  Handling data shared between threads.  Recursive functions. Continuing from the tenth article about sudoku.   Last article we worked with a modal dialog to help maintain informat…
Ready to get certified? Check out some courses that help you prepare for third-party exams.
This video will show you how to get GIT to work in Eclipse.   It will walk you through how to install the EGit plugin in eclipse and how to checkout an existing repository.
Is your data getting by on basic protection measures? In today’s climate of debilitating malware and ransomware—like WannaCry—that may not be enough. You need to establish more than basics, like a recovery plan that protects both data and endpoints.…
Suggested Courses
Course of the Month21 days, 3 hours left to enroll

810 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question