Handling very large xls in c#

lm78 asked · Medium Priority · 4,161 Views · Last Modified: 2013-11-17
Hi there,
I need to process some very large .xls files from C#. The method's purpose is simple: take the .xls and convert each worksheet into a separate .csv file. However, it throws an OutOfMemoryException when I try to convert the range into an object array.
The files are usually 250 MB or more.
Any ideas on how this can be done, please?

Thanks,
Leo
public static string CreateCsvFromXls(string xlsPath)
        {
            try
            {
                if (!File.Exists(xlsPath))
                    throw new FileNotFoundException("File not found", xlsPath);

                string filePath = Path.GetDirectoryName(xlsPath);
                string fileName = Path.GetFileName(xlsPath);

                //Assign each object in the Excel object model to its own variable or the GC won't work properly
                _Application excelApp = new Microsoft.Office.Interop.Excel.ApplicationClass();
                _Workbook wkb = excelApp.Workbooks.Open(xlsPath, 2, true, Type.Missing, Type.Missing, Type.Missing,
                                                        Type.Missing, Type.Missing, Type.Missing, Type.Missing,
                                                        Type.Missing, Type.Missing, false, Type.Missing, Type.Missing);

                object[,] xlsArray;
                string line;
                string saveFile;                
                foreach (_Worksheet wks in wkb.Worksheets)
                {
                    saveFile = fileName.Replace(".xlsx", ".csv");
                    saveFile = saveFile.Replace(".xls", ".csv");
                    saveFile = saveFile.Replace(".csv", "_" + wks.Name + ".csv");
                    // UsedRange.get_Value returns null for an empty sheet (and a scalar,
                    // not an array, for a single cell), so guard before casting
                    xlsArray = wks.UsedRange.get_Value(XlRangeValueDataType.xlRangeValueDefault) as object[,];
                    if (xlsArray != null && xlsArray.Length > 0)
                    {
                        using (StreamWriter sw = File.CreateText(filePath + "\\" +saveFile))
                        {
                            // Excel returns a 1-based array, and GetUpperBound is inclusive,
                            // so iterate from the lower bound through the upper bound
                            for (long i = xlsArray.GetLowerBound(0); i <= xlsArray.GetUpperBound(0); i++)
                            {
                                line = "";
                                for (long j = xlsArray.GetLowerBound(1); j <= xlsArray.GetUpperBound(1); j++)
                                {
                                    line = line + xlsArray[i, j] + ",";
                                }
                                sw.WriteLine(line.Substring(0, line.Length - 1));
                            }
                        }
                    }
                }
                wkb.Close(false, Missing.Value, Missing.Value);
                excelApp.Quit();
                return fileName;
            }
            catch (Exception e)
            {
                return e.Message;
            }

        }


Todd Gerbert, Senior Engineer
CERTIFIED EXPERT
Top Expert 2010

Commented:
.NET should happily continue to allocate memory to you up to a maximum of 2GB (or possibly 3GB if the appropriate boot entry has been set) on a 32-bit process, and 8TB (terabytes) on a 64-bit process.

Instead of sucking the entire contents of the XLS, read a line, convert it, write it, read the next line, etc.   The Excel COM model also probably exposes a method that will allow you to use Excel's built-in export to CSV.
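Excel's built-in export can be reached through the interop Worksheet.SaveAs call with the xlCSV format, so Excel does the writing itself and nothing large crosses into managed memory. A rough sketch only (untested; the paths are placeholders, and it assumes the Office interop assemblies are referenced and C# 4's optional COM arguments):

```csharp
// Sketch: let Excel export each worksheet to CSV itself, instead of
// pulling cell values into a managed object[,] array.
using System;
using Microsoft.Office.Interop.Excel;

class XlsToCsvViaSaveAs
{
    static void Main()
    {
        var excelApp = new Application();
        Workbook wkb = excelApp.Workbooks.Open(@"C:\data\big.xls", ReadOnly: true);
        try
        {
            foreach (Worksheet wks in wkb.Worksheets)
            {
                // xlCSV writes the sheet's cells as comma-separated text;
                // Excel streams this internally, so .NET memory use stays small.
                wks.SaveAs(@"C:\data\" + wks.Name + ".csv", XlFileFormat.xlCSV);
            }
        }
        finally
        {
            // Close without saving so the SaveAs calls don't re-prompt,
            // then shut the Excel process down.
            wkb.Close(false);
            excelApp.Quit();
        }
    }
}
```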

Author

Commented:
I thought of that - reading line by line - but thought it would be extremely slow, so I wanted to avoid it. I couldn't find the Excel built-in converter to CSV, though. I tried SaveAs, but it threw an exception!
Todd Gerbert, Senior Engineer
CERTIFIED EXPERT
Top Expert 2010

Commented:
>> The Excel COM model also probably exposes a method that will allow you to use Excel's built-in export to CSV.

But, I still recommend avoiding COM to Office applications - I believe using OleDb to read the XLS files will work well for you in this instance.
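As a sketch of that OleDb route (the connection string and sheet name are assumptions - the ACE provider and the actual sheet names would have to be confirmed on the target machine), rows can be streamed one at a time so memory use stays flat:

```csharp
// Sketch: read a worksheet through OleDb and write CSV row by row,
// never holding more than one row in memory. File path and sheet
// name are placeholders.
using System;
using System.Data.OleDb;
using System.IO;

class XlsToCsvViaOleDb
{
    static void Main()
    {
        string connStr = @"Provider=Microsoft.ACE.OLEDB.12.0;" +
                         @"Data Source=C:\data\big.xlsx;" +
                         "Extended Properties=\"Excel 12.0;HDR=No;IMEX=1\"";

        using (var conn = new OleDbConnection(connStr))
        {
            conn.Open();
            // Each worksheet appears as a "table" named after the sheet, with a $ suffix.
            using (var cmd = new OleDbCommand("SELECT * FROM [Sheet1$]", conn))
            using (OleDbDataReader reader = cmd.ExecuteReader())
            using (StreamWriter sw = File.CreateText(@"C:\data\Sheet1.csv"))
            {
                while (reader.Read())  // one row at a time
                {
                    string[] fields = new string[reader.FieldCount];
                    for (int i = 0; i < reader.FieldCount; i++)
                        fields[i] = Convert.ToString(reader.GetValue(i));
                    sw.WriteLine(string.Join(",", fields));
                }
            }
        }
    }
}
```

Rather than hard-coding [Sheet1$], the available sheet names can be discovered at runtime via conn.GetOleDbSchemaTable(OleDbSchemaGuid.Tables, null).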

Author

Commented:
Hmm, OK, will try that. A bit hard up for time right now.
Andy Ainscow, Freelance programmer / Consultant
CERTIFIED EXPERT

Commented:
@tgerbert - AFAIK no, .NET uses four separate, and differently sized, heaps of memory for different objects.  It is possible to run out of memory in a .NET app despite there being more than enough memory available for a classic program to do the same memory assignment.
Todd Gerbert, Senior Engineer
CERTIFIED EXPERT
Top Expert 2010

Commented:
Heaps aren't limited in size (though they may start out with different sizes) except by the available address space - which is 2 to the 32nd on 32-bit processes and 2 to the 64th on 64-bit processes, and Windows reserves half for its own use, thus the 2GB and 8TB limits for 32- and 64-bit processes, respectively.  The stack is limited and will run out of memory much more quickly (I think it's 1MB in .NET), but storage for reference types is allocated on the heap (though the pointer to the object might be on the stack), and in this case might even be allocated by Excel's process.

Of course the .Net runtime is going to consume some of that available memory, so you won't quite get the full 2GB.

You can run this code successfully as a 64-bit process, only the top loop completes as a 32-bit process.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

class Program
{
	static void Main(string[] args)
	{
		// First loop: 1,200,000 x 1KB arrays, roughly 1.2GB - fits within
		// a 32-bit process's ~2GB address space
		List<byte[]> byteArrays = new List<byte[]>();
		for (int i = 0; i < 1200000; i++)
			byteArrays.Add(new byte[1024]);

		Console.WriteLine("1.2GB allocated.");

		// Second loop: 4,194,304 x 1KB arrays = 4GB - only succeeds in
		// a 64-bit process
		byteArrays = new List<byte[]>();
		for (int i = 0; i < 4194304; i++)
			byteArrays.Add(new byte[1024]);
		Console.WriteLine("4GB allocated.");

		Console.ReadKey();
	}
}


Todd Gerbert, Senior Engineer
CERTIFIED EXPERT
Top Expert 2010

Commented:
Rather, I should say heaps can be limited in size - but I don't think that's the case with .Net (or if they are it's very close to maximum addressable amount for a process).
Andy Ainscow, Freelance programmer / Consultant
CERTIFIED EXPERT

Commented:
This is from some years ago, based on a couple of questions here at EE and some articles - the questions seemed to back the articles up: for performance reasons the total memory is split into 4 differently sized heaps which do not overlap.  Maybe there have been changes in the meanwhile - I hope so.
cubaman

Commented:
Hello:
Which version of Excel are we talking about? If it's Office 2007, you could decompress the Excel file, get the underlying XML, and use an XSLT template to transform the data into the desired CSV output. This would be much faster and less resource-consuming than working with Excel interop. Let me know if you want more details.

Author

Commented:
@cubaman: That sounds like an interesting option - give me more details pls. It's 2007, yes.
Todd Gerbert, Senior Engineer
CERTIFIED EXPERT
Top Expert 2010

Commented:
@lm78

That isn't necessary.  The OleDb provider does all that work for you - see my first post (http:#a35475909).

Author

Commented:
haha, am being lazy, i was half way through trying out the ADO solution!
Todd Gerbert, Senior Engineer
CERTIFIED EXPERT
Top Expert 2010

Commented:
@AndyAinscow

Interesting link - what he's describing there is managed heaps (which I believe are really just a logical organization, all living on the same process heap), and heap fragmentation, which happens in any programming language (actually, since lower-level languages like C and C++ don't have the benefit of a Garbage Collector they tend to be more prone), not necessarily a limit nor a limit of .Net.

Although it did bring to mind that lm78 might be able to circumvent the OutOfMemory condition by running a garbage collection at the bottom of the "foreach (_Worksheet wks in wkb.Worksheets)" loop, and/or allocating a single statically sized array and re-using it.
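That suggestion might look like the following inside the existing loop (a sketch only - whether it actually avoids the OutOfMemory condition depends on where Excel itself holds the data):

```csharp
// Sketch: release the per-worksheet allocations at the bottom of each
// iteration so the previous sheet's array can be reclaimed before the
// next one is loaded.
foreach (_Worksheet wks in wkb.Worksheets)
{
    object[,] xlsArray =
        (object[,])wks.UsedRange.get_Value(XlRangeValueDataType.xlRangeValueDefault);

    // ... write the CSV exactly as before ...

    xlsArray = null;                                               // drop the managed copy
    System.Runtime.InteropServices.Marshal.ReleaseComObject(wks);  // release the COM wrapper too
    GC.Collect();                                                  // reclaim before the next sheet
    GC.WaitForPendingFinalizers();
}
```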

@lm78
If you find this comment helpful you can credit AndyAinscow's comment above (http:#a35477545) since he pointed out the issue.

Author

Commented:
I thought of doing a GC.Collect at the end of the foreach worksheet loop - but I'm not sure that will help, since opening the workbook may load the whole thing into memory. Does it load it all on Workbook.Open?
Hello lm78:
As you may know, the new Office format is just a zip file with a bunch of XML, images and so on inside. Take any Excel 2007 file, change the extension to ".zip" and open it. You'll find a fixed directory structure with XML files inside. In the directory named \xl\worksheets\ you'll find several XML files, one for every sheet of your document (the mapping from part names like sheet1.xml to the workbook's sheet names is kept in \xl\workbook.xml).
It's a trivial task to write an XSLT transform to convert the data to a CSV file. No Excel installation, no interop, no PIAs. And really fast and efficient. :-)
Hope this helps. Come back if in doubt. Best regards.
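A minimal sketch of this approach (file path hypothetical; it prints raw cell values and does not yet resolve shared strings, which a full converter would need to do via xl/sharedStrings.xml, since string cells store only an index into that part):

```csharp
// Sketch: open the .xlsx as an OPC package and stream each worksheet's
// XML with XmlReader - no Excel, no interop. Requires a reference to
// WindowsBase.dll for System.IO.Packaging.
using System;
using System.IO;
using System.IO.Packaging;
using System.Xml;

class XlsxWorksheetDump
{
    static void Main()
    {
        using (Package pkg = Package.Open(@"C:\data\big.xlsx",
                                          FileMode.Open, FileAccess.Read))
        {
            foreach (PackagePart part in pkg.GetParts())
            {
                // Worksheets live under /xl/worksheets/sheet1.xml, sheet2.xml, ...
                if (!part.Uri.ToString().StartsWith("/xl/worksheets/"))
                    continue;

                // XmlReader streams the XML, so even a huge sheet stays cheap.
                using (XmlReader xr = XmlReader.Create(part.GetStream()))
                {
                    while (xr.Read())
                    {
                        if (xr.NodeType == XmlNodeType.Element && xr.LocalName == "v")
                            Console.Write(xr.ReadElementContentAsString() + ",");
                        else if (xr.NodeType == XmlNodeType.EndElement && xr.LocalName == "row")
                            Console.WriteLine();
                    }
                }
            }
        }
    }
}
```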
Todd Gerbert, Senior Engineer
CERTIFIED EXPERT
Top Expert 2010

Commented:
>> I thought of doing a GC.Collect at the end of the foreach worksheet - but not sure if that will help as when i open the workbook, it loads the whole workbook to memory? Or does it load it all on workbook open

Well, your foreach loop is allocating a new object[,] xlsArray array with each iteration, so each successive load of a worksheet doesn't necessarily free the memory used by the previous. That's my thinking, anyway. ;)
Andy Ainscow, Freelance programmer / Consultant
CERTIFIED EXPERT

Commented:
@tgerbert
But isn't a managed object using memory from a managed heap?

I was alerted to this some time ago by a particular question - something like "why do I get an out of memory exception on the 2nd or 3rd time through a loop?".  From what I could tell everything was done correctly - objects going out of scope, GC being run - but it would still crash with an out of memory error, though never the first time through the loop.  Everything pointed to this being the cause.

Anyway - it might not be the cause of the problem here, just a possibility one has to consider.

Author

Commented:
True. let me try that - have to leave now but will get back after the bank holiday and close this thread then...thanks so much for all this, has been totally enlightening! :-)
Todd Gerbert, Senior Engineer
CERTIFIED EXPERT
Top Expert 2010

Commented:
@AndyAinscow

Yes, but a managed heap doesn't necessarily correspond to a "real" OS-level heap, and doesn't represent a memory limit.  I seem to recall reading that a single managed heap might even consist of more than one non-contiguous region of memory allocated out of the process's default heap (it's hard to find documentation at that level of detail, though).

Commented:
I think you could use VBA instead of C# to convert the .xls files to .csv.