rkulp (United States of America)

asked on

Random Access File/ArrayList Replacement

I have written an application that does stratified statistical sampling for auditing sales and use tax files. In VB6 I was able to use random access files to store each stratum's data for random sampling, which was relatively quick and easy. When I converted to Visual Basic .NET I had to find another way to do it, so I used an ArrayList of ArrayLists: each stratum had an ArrayList containing the invoice records to be audited. That was also relatively easy to do. The only problem is that the ArrayList lives on the heap, so eventually you run out of memory. I also used a binary formatter to store the stratum records on disk to be read in as necessary. Everything worked great until a user wanted to audit an invoice file of over 32 million records. We can read the file and create the ArrayList of ArrayLists, but the program hangs on binary formatting. Clearly not the best way to go.
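In rough outline, the current approach looks like this; a simplified sketch with illustrative names, not the actual application code:

' Simplified sketch of the current approach (names and fields are illustrative).
' Each stratum is an ArrayList of invoice records; the whole nested structure
' is serialized in one shot, which is the step that hangs at 32 million records.
Imports System.IO
Imports System.Collections
Imports System.Runtime.Serialization.Formatters.Binary

Module CurrentApproachSketch
    Sub SaveStrata(strata As ArrayList, path As String)
        Dim bf As New BinaryFormatter()
        Using fs As New FileStream(path, FileMode.Create)
            bf.Serialize(fs, strata)
        End Using
    End Sub

    Function LoadStrata(path As String) As ArrayList
        Dim bf As New BinaryFormatter()
        Using fs As New FileStream(path, FileMode.Open)
            Return CType(bf.Deserialize(fs), ArrayList)
        End Using
    End Function
End Module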
I would like suggestions on the best way to rewrite this section of code with the following constraints:
1. It must be backward compatible as samples must be the same for the same starting seed. That should not be a problem as the random number generator is generating the record number to be selected in the array.
2. The first method required a structure with fixed-format records. The second one did not, so I relaxed the requirement. Now I have to live with both.

I have looked at as many resources as I can find, and the best solution seems to be the one in https://www.experts-exchange.com/questions/21812439/Random-Access-Files.html but I would like suggestions on how to proceed. I'm not sure whether a database approach is suitable in this scenario, although I think it was suggested in the referenced solution for large datasets. The best solution would be one where I can write the data to disk and retrieve it at will without having to count bytes, etc.

Thanks in advance for suggestions. I can provide code if necessary, but I think theory and best practices ideas are what I need now.
Gerald Connolly (Australia)

So how big is your data structure going to be for 32 million records? Or for 100 million or bigger?
rkulp (ASKER)
For 32 million records it is about 3.2 GB, so for 100 million it would be about 9.5 GB. I'm hoping it is simply a matter of memory, since many computers now have 16 GB or more.
Do you really need the 32 million records in memory? Even if you find a big x64 server, you will run out of memory at some point.

Some suggestions in the meantime:
1) Use List<List<string>> (a generic list) instead of an ArrayList of ArrayLists, or better, a string[,]. It takes less memory and you avoid boxing/unboxing operations.
2) I have had to deal with big files before; in my case we loaded the file in chunks (e.g. 400K lines at a time). The first pass you read 400K lines and process them in your program, the next pass you read the next 400K section, and so on until you finish the file (see the sketch after this list).
3) Another possibility is to split the invoice file into multiple files, so you process, say, 8 reasonably sized files instead of one big file.
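For item 2, something along these lines is what I mean; a rough sketch only, assuming one invoice per line, with ProcessChunk as a placeholder for your own stratification code:

' Rough sketch of chunked reading; one invoice per line is assumed.
Imports System.IO
Imports System.Collections.Generic

Module ChunkedReadSketch
    Sub ProcessInChunks(path As String)
        Const ChunkSize As Integer = 400000
        Dim chunk As New List(Of String)(ChunkSize)
        Using reader As New StreamReader(path)
            Dim line As String = reader.ReadLine()
            While line IsNot Nothing
                chunk.Add(line)
                If chunk.Count = ChunkSize Then
                    ProcessChunk(chunk)   ' stratify/accumulate, then free the memory
                    chunk.Clear()
                End If
                line = reader.ReadLine()
            End While
            If chunk.Count > 0 Then ProcessChunk(chunk)
        End Using
    End Sub

    Sub ProcessChunk(lines As List(Of String))
        ' Placeholder: parse each invoice line and route it to its stratum here.
    End Sub
End Module

This way only one chunk is ever resident in memory; the per-stratum output can be written to disk as you go.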
rkulp (ASKER)
mas_oz2003,

The problem I have is that the stratification is by invoice amount, and the invoices are presented as the taxpayer stores them, usually not by amount. I have to separate them into strata as they are read. Using random access files this was a snap, and the records were not kept in memory. With the loss of random access I went with what I could figure out how to do. In retrospect, that was probably not the best way to go, but I am stuck with it for now. When I make a change, I have to make it backward compatible, so I have to be careful. I will look at List<List<string>>. The items are not strings now, but that should not be a big problem.

Anyway, as soon as the entire file is stratified, I have to calculate sample sizes and draw random samples from the strata. Drawing the sample is where random access is used. After all the strata samples are drawn, they are combined into one larger sample and sorted into the same order as the original invoice file. Normally the sample is exported to an Excel spreadsheet for the auditor.
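For what it's worth, the sampling side boils down to drawing record numbers from a seeded generator, roughly like this (a simplified illustration, not the production code; sampling without replacement is assumed here):

' Simplified sketch: with the same seed, the same record numbers come out,
' so whatever storage replaces the ArrayLists only has to return record N on demand.
Imports System.Collections.Generic

Module SamplingSketch
    Function DrawSampleIndexes(seed As Integer, stratumSize As Integer, sampleSize As Integer) As List(Of Integer)
        Dim rng As New Random(seed)
        Dim picked As New HashSet(Of Integer)
        Dim sample As New List(Of Integer)
        While sample.Count < sampleSize          ' assumes sampleSize <= stratumSize
            Dim recordNumber As Integer = rng.Next(0, stratumSize)
            If picked.Add(recordNumber) Then sample.Add(recordNumber)
        End While
        Return sample
    End Function
End Module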
So storing the files on an SSD will not solve your speed problem?
rkulp (ASKER)
I can't control what equipment the user has. An SSD would be faster than a hard drive. However, speed is not the issue here. I expect I will have to go with the method suggested in my referenced post. I was hoping to avoid counting bytes to find the starting point of a record.
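If I do end up going that route, I gather the byte counting reduces to one multiplication once the records have a fixed length; a minimal sketch, assuming 100-byte records (the size and the raw Byte() payload are just for illustration):

' Minimal sketch: fixed-length records addressed by record number.
' RecordLength = 100 is an assumed size; each record must be padded/encoded
' to exactly that many bytes.
Imports System.IO

Module FixedRecordSketch
    Const RecordLength As Integer = 100

    Sub WriteRecord(fs As FileStream, recordNumber As Long, record As Byte())
        fs.Seek(recordNumber * RecordLength, SeekOrigin.Begin)
        fs.Write(record, 0, RecordLength)
    End Sub

    Function ReadRecord(fs As FileStream, recordNumber As Long) As Byte()
        Dim buffer(RecordLength - 1) As Byte
        fs.Seek(recordNumber * RecordLength, SeekOrigin.Begin)
        fs.Read(buffer, 0, RecordLength)   ' assumes the full record comes back in one read
        Return buffer
    End Function
End Module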
ASKER CERTIFIED SOLUTION
Miguel Oz (Australia)

rkulp (ASKER)
Thanks for your comments. I ended up going back to random access using FileGet and FilePut. However, it is very slow compared to what I was doing, so I let the user choose between the two methods, with a warning that the new method (random access) should be used only when the older method fails.
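For anyone who finds this later, the random access path follows the standard VB.NET FileGet/FilePut pattern; a trimmed sketch (the field layout here is illustrative, not my actual record structure):

' Trimmed sketch of random access with FileGet/FilePut.
' VBFixedString keeps the string field a fixed length so Len(rec) is the record size.
Imports Microsoft.VisualBasic

Module RandomAccessSketch
    Structure InvoiceRecord
        <VBFixedString(20)> Public InvoiceNumber As String
        Public Amount As Double
        Public Stratum As Integer
    End Structure

    Sub Demo(path As String)
        Dim rec As InvoiceRecord
        Dim fileNum As Integer = FreeFile()
        FileOpen(fileNum, path, OpenMode.Random, , , Len(rec))

        rec.InvoiceNumber = "INV-000123"
        rec.Amount = 199.95
        rec.Stratum = 3
        FilePut(fileNum, rec, 42)   ' write record number 42

        FileGet(fileNum, rec, 42)   ' read it back by record number
        FileClose(fileNum)
    End Sub
End Module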