Random Access File/ArrayList Replacement

I have written an application that does stratified statistical sampling for auditing sales and use tax files. In VB6 I was able to use random access files to store each stratum's data for random sampling; it was relatively quick and easy. When I converted to Visual Basic .NET I had to find another way to do it, so I used an ArrayList of ArrayLists: each stratum had an ArrayList containing the invoice records to be audited. That was also relatively easy to do. The problem is that the ArrayList lives on the heap, so eventually you run out of memory. I also used a BinaryFormatter to store the stratum records on disk, to be read back in as necessary. Everything worked great until a user wanted to audit an invoice file of over 32 million records. We can read the file and create the ArrayList of ArrayLists, but the program hangs during binary formatting. Clearly not the best way to go.
I would like suggestions on the best way to rewrite this section of code with the following constraints:
1. It must be backward compatible: samples must be identical for the same starting seed. That should not be a problem, since the random number generator generates the record number to be selected in the array.
2. The first method required a structure with fixed-format records. The second one did not, so I relaxed the requirement. Now I have to support both.
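Constraint 1 boils down to: the same seed must always produce the same sequence of record indices. A minimal sketch of that idea (in Java purely for illustration; the original code is VB6/VB.NET, and the class and method names here are hypothetical):

```java
import java.util.LinkedHashSet;
import java.util.Random;
import java.util.Set;

public class SeededSampler {
    // Draw k distinct record indices from [0, n) using a fixed seed.
    // Because the generator is seeded, the same seed always yields the
    // same sample -- the backward-compatibility requirement.
    // (Rejection of duplicates is slow when k approaches n.)
    static Set<Integer> sample(long seed, int n, int k) {
        Random rng = new Random(seed);
        Set<Integer> picked = new LinkedHashSet<>();
        while (picked.size() < k) {
            picked.add(rng.nextInt(n)); // duplicate draws are simply retried
        }
        return picked;
    }
}
```

The key point is that the sampler only needs record *numbers*; it does not care how the records are stored, which is what makes the storage layer replaceable without breaking old samples.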

I have looked at as many resources as I can find, and the best solution seems to be http://www.experts-exchange.com/Programming/Languages/.NET/Visual_Basic.NET/Q_21812439.html, but I would like suggestions on how to proceed. I'm not sure a database approach is suitable in this scenario, although I think the referenced solution suggested it for large datasets. The ideal solution would let me write the data to disk and retrieve it at will without having to count bytes.
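One standard way to get "retrieve at will without counting bytes" even with variable-length records is to keep an index of byte offsets alongside the data: each record is written with a length prefix, and the index remembers where it starts. A minimal sketch (Java for illustration only; the original is VB.NET, and all names here are hypothetical):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class IndexedRecordFile implements AutoCloseable {
    // Variable-length records: an index of byte offsets lets you seek
    // straight to record i instead of counting bytes from the start.
    private final RandomAccessFile data;
    private final List<Long> offsets = new ArrayList<>();

    IndexedRecordFile(String path) throws IOException {
        this.data = new RandomAccessFile(path, "rw");
    }

    void append(String record) throws IOException {
        byte[] bytes = record.getBytes(StandardCharsets.UTF_8);
        long pos = data.length();
        data.seek(pos);
        data.writeInt(bytes.length); // 4-byte length prefix
        data.write(bytes);
        offsets.add(pos);            // remember where this record started
    }

    String get(int recordNo) throws IOException {
        data.seek(offsets.get(recordNo));
        byte[] bytes = new byte[data.readInt()];
        data.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    @Override public void close() throws IOException { data.close(); }
}
```

The index itself is small (8 bytes per record) and could be persisted to a companion file so the data can be reopened later.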

Thanks in advance for suggestions. I can provide code if necessary, but I think theory and best practices ideas are what I need now.


Miguel Oz (Software Engineer) commented:
If you cannot control the hardware, then you cannot keep the file in memory.
Two alternatives:
A1: Load the file in chunks and save your intermediate results to temporary files. This is more complex, since you need to apply your rules per chunk while making sure you never reach the memory limits.
A2: Load your files into a database (SQL Server or MySQL) and query the results the way you need them. Again, you may need to retrieve results in chunks to avoid running out of memory.

Note: I did not list "go back to random access" as an option because you mention you cannot use it right now, but consider whether going back to that solution might be less painful than the alternatives above.
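Alternative A1 above amounts to bounding memory by the chunk size rather than the file size. A minimal sketch of the pattern (Java for illustration; the thread's code is VB.NET, and the names here are hypothetical):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class ChunkedReader {
    // Read a large text file chunkLines lines at a time, handing each
    // chunk to `step`. Memory use is bounded by one chunk, not the file.
    static void processInChunks(Path file, int chunkLines, Consumer<List<String>> step)
            throws IOException {
        try (BufferedReader in = Files.newBufferedReader(file)) {
            List<String> chunk = new ArrayList<>(chunkLines);
            String line;
            while ((line = in.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() == chunkLines) {
                    step.accept(chunk);
                    chunk = new ArrayList<>(chunkLines); // start a fresh chunk
                }
            }
            if (!chunk.isEmpty()) step.accept(chunk); // trailing partial chunk
        }
    }
}
```

The complexity Miguel mentions comes from `step`: any computation that spans chunk boundaries (here, per-stratum totals) has to carry state between calls or write intermediate results to temporary files.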
Gerald Connolly commented:
So how big is your data structure going to be for 32 million records? Or for 100 million or bigger?
rkulp (Author) commented:
For 32 million it is about 3.2 GB, so for 100 million it would be about 9.5 GB. I'm hoping it is simply a matter of memory, since many computers now have 16 GB or more.

Miguel Oz (Software Engineer) commented:
Do you really need all 32 million records in memory? Even if you can find a big x64 server, you will run out of memory at some point.

Some suggestions in the meantime:
1) Use a generic List<List<string>> instead of an ArrayList of ArrayLists (or, better, a two-dimensional string array). Generics take less memory and avoid boxing/unboxing operations.
2) I have had to deal with big files before; in my case we loaded the file in chunks (e.g., 400K lines at a time). First you read and process 400K lines, then the next 400K section, and so on until you finish the file.
3) Another possibility is to split the invoice file into multiple files, so you process, say, 8 reasonably sized files instead of one big file.
rkulp (Author) commented:

The problem I have is that the stratification is by invoice amount, and the invoices are presented as the taxpayer stores them, usually not sorted by amount. I have to separate them into strata as they are read. With random access files this was a snap, and the strata were not kept in memory. With the loss of random access I went with what I could figure out how to do. In retrospect, that was probably not the best way to go, but I am stuck with it for now. When I make a change, I have to make it backward compatible, so I have to be careful. I will look at List<List<string>>. The items are not strings now, but that should not be a big problem. Anyway, as soon as the entire file is stratified, I have to calculate sample sizes and draw random samples from the strata. Drawing the sample is where random access is used. After all the strata samples are drawn, they are combined into one larger sample and sorted into the same order as the original invoice file. Normally the sample is exported to an Excel spreadsheet for the auditor.
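Since the strata are determined by invoice amount as records are read, one way to avoid holding anything in memory is to keep one open output file per stratum and route each record to its file on the fly, exactly as the old random-access approach did. A sketch of that routing (Java for illustration; the original is VB, and the class name, file naming, and boundary scheme are all assumptions):

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class StratumWriter implements AutoCloseable {
    // One open writer per stratum; each record is appended as it is
    // read, so the whole invoice file never has to fit in memory.
    private final BufferedWriter[] writers;
    private final double[] upperBounds; // stratum i covers amounts < upperBounds[i]

    StratumWriter(Path dir, double[] upperBounds) throws IOException {
        this.upperBounds = upperBounds;
        this.writers = new BufferedWriter[upperBounds.length];
        for (int i = 0; i < writers.length; i++) {
            writers[i] = Files.newBufferedWriter(dir.resolve("stratum" + i + ".txt"));
        }
    }

    int stratumOf(double amount) {
        for (int i = 0; i < upperBounds.length; i++) {
            if (amount < upperBounds[i]) return i;
        }
        return upperBounds.length - 1; // top stratum catches everything else
    }

    void write(double amount, String record) throws IOException {
        BufferedWriter w = writers[stratumOf(amount)];
        w.write(record);
        w.newLine();
    }

    @Override public void close() throws IOException {
        for (BufferedWriter w : writers) w.close();
    }
}
```

After one pass over the invoice file, each stratum sits in its own file with a known record count, and sample indices can be drawn per stratum without loading the rest.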
Gerald Connolly commented:
So storing the files on an SSD will not solve your speed problem?
rkulp (Author) commented:
I can't control what equipment the user has. An SSD would be faster than a hard drive, but speed is not the issue here. I expect I will have to go with the method suggested in the post I referenced. I was hoping to avoid counting bytes to find the starting point of a record.
rkulp (Author) commented:
Thanks for your comments. I ended up going back to random access using FileGet and FilePut. However, it is very slow compared to what I was doing, so I added an option letting the user choose the method, with a warning that the new method (random access) should be used only when the older method fails.
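The fixed-length-record scheme behind VB's FileGet/FilePut avoids byte counting because record n always starts at byte n × recordLength. A minimal sketch of the same idea (Java's RandomAccessFile used for illustration; record length, padding, and encoding choices here are assumptions, not the thread's actual code):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class FixedRecordFile implements AutoCloseable {
    // Fixed-length records: record n starts at byte n * recordLen, so
    // locating a record is a single multiplication plus a seek.
    private final RandomAccessFile file;
    private final int recordLen;

    FixedRecordFile(String path, int recordLen) throws IOException {
        this.file = new RandomAccessFile(path, "rw");
        this.recordLen = recordLen;
    }

    void put(long recordNo, String data) throws IOException {
        // Pad (or truncate) to exactly recordLen bytes before writing.
        byte[] buf = Arrays.copyOf(data.getBytes(StandardCharsets.US_ASCII), recordLen);
        file.seek(recordNo * recordLen);
        file.write(buf);
    }

    String get(long recordNo) throws IOException {
        byte[] buf = new byte[recordLen];
        file.seek(recordNo * recordLen);
        file.readFully(buf);
        return new String(buf, StandardCharsets.US_ASCII).trim(); // strip padding
    }

    @Override public void close() throws IOException { file.close(); }
}
```

The trade-off the thread lands on is visible here: every read is a disk seek rather than a heap access, which is why the random-access fallback is slower than the in-memory ArrayList approach but never exhausts memory.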
Question has a verified solution.
