Solved

Random Access File/ArrayList Replacement

Posted on 2013-07-01
8
491 Views
Last Modified: 2013-07-06
I have written an application that does stratified statistical sampling for auditing sales and use tax files. In VB6 I was able to use random access files to store each stratum's data for random sampling. It was relatively quick and easy. When I converted to Visual Basic,Net I had to find another way to do it. What I did was use an ArrayList of ArrayLists. Each stratum had an ArrayList containing the invoice records to be audited. It was also relatively easy to do. The only problem occurs because the ArrayList is on the heap and eventually you run out of memory. Also, I used a binary formatter to store the stratum records on disk to be read in as necessary. Everything worked great until a user wanted to do an audit of an invoice file of over 32 million records. We can read the file and create the ArrayList of ArrayLists but the program hangs on binary formatting. Clearly not the best way to go.
I would like suggestions on the best way to rewrite this section of code with the following constraints:
1. It must be backward compatible as samples must be the same for the same starting seed. That should not be a problem as the random number generator is generating the record number to be selected in the array.
2. The first method required a structure with fixed format records. The second one did not so I relaxed the requirement. Now I have to live with both.

I have looked at as many resources as I can find and the best solution seems to be in http://www.experts-exchange.com/Programming/Languages/.NET/Visual_Basic.NET/Q_21812439.html  but I would like suggestions on how to proceed. I'm not sure if a database approach is suitable in this scenario although I think it was suggested in the referenced solution when datasets are large.  The best solution would be one where I can write the data to disk and retrieve it at will without having to count bytes, etc.

Thanks in advance for suggestions. I can provide code if necessary, but I think theory and best practices ideas are what I need now.
0
Comment
Question by:rkulp
  • 4
  • 2
  • 2
8 Comments
 
LVL 16

Expert Comment

by:Gerald Connolly
ID: 39292770
So how big is your data structure going to be for 32 million records? Or for 100 million or bigger?
0
 
LVL 1

Author Comment

by:rkulp
ID: 39292780
For 32 million it is about 3.2GB so for 100 million it would be about 9.5GB. I'm hoping it is simply a matter of memory since many computers now have 16GB or more.
0
 
LVL 35

Expert Comment

by:Miguel Oz
ID: 39295044
Do you really need the 32 million records in memory? Even if you can find big x64 server you will run out of memory at a certain stage.

Some suggestions in the mean time:
1) Use List<List<string>> (generic list) instead of ArrayList of ArrayLists  or better string[,]. It takes less memory and you avoid boxing/unboxing operations.
2) I have to deal with big files before, in my case we load chunks of the file at the time (e.g 400K lines at the time). The first time you read 400k and process it in your program, the next time the next 400K section and so on until you finish the file.
3) Another possibility to split the invoice file in multiple files you can process say 8 reasonable sized files instead of one big file.
0
 
LVL 1

Author Comment

by:rkulp
ID: 39295281
mas_oz2003,

The problem I have is that the stratification is by invoice amount and the invoices are presented as the taxpayer stores them, usually not by amount. I have to separate them into strata as they are read. Using random access files this was a snap and they were not kept in memory. With the loss of random access I went with what I could figure out how to do. In retrospect, that was probably not the best way to go but I am stuck with it for now. When I make a change, I have to make it backward compatible so I have to be careful. I will look at List<List<string>> . The items are not strings now but that should not be a big problem. Anyway, as soon as the entire file is stratified, I have to calculate sample sizes and draw random samples from the strata. Drawing the sample is where random access is used. After all the strata samples are drawn, they are combined into one larger sample and sorted into the same order as the original invoice file. Normally the sample is exported to an Excel spreadsheet for the auditor.
0
DevOps Toolchain Recommendations

Read this Gartner Research Note and discover how your IT organization can automate and optimize DevOps processes using a toolchain architecture.

 
LVL 16

Expert Comment

by:Gerald Connolly
ID: 39295631
So storing the files on an SSD will not solve your speed problem?
0
 
LVL 1

Author Comment

by:rkulp
ID: 39296448
I can't control what equipment the user has. An SSD would be faster than a hard drive. However, speed is not the issue here. I expect I will have to go with the method suggested in my referenced post. I was hoping to avoid counting bytes to find the starting point of a record.
0
 
LVL 35

Accepted Solution

by:
Miguel Oz earned 500 total points
ID: 39298628
If you cannot control the hardware then you can not keep the file in memory.
Two alternatives:
A1: Load in chunks and save your intermidiate results to a temp files. Very complex as you need to check your rules and you are not reaching memory limits.
A2: Load your files to a database (SQL server or MY SQL) and query the results in the way you need it. Again you may need to retreive results in chunks to avoid out of memory.

Note: I did not put "go back to random access" as an option because you mention you cannot use it right now but consider if going back to that solution may be less painful than alternatives above.
0
 
LVL 1

Author Closing Comment

by:rkulp
ID: 39303971
Thanks for your comments. I ended up going back to random access using FileGet and FilePut. However, it is very slow compared to what I was doing so I put in a choice for the user to choose the method with a warning that using the new method (random access) should be done only when the older method fails.
0

Featured Post

DevOps Toolchain Recommendations

Read this Gartner Research Note and discover how your IT organization can automate and optimize DevOps processes using a toolchain architecture.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

The ECB site provides FX rates for major currencies since its inception in 1999 in the form of an XML feed. The files have the following format (reducted for brevity) (CODE) There are three files available HERE (http://www.ecb.europa.eu/stats/exch…
A long time ago (May 2011), I have written an article showing you how to create a DLL using Visual Studio 2005 to be hosted in SQL Server 2005. That was valid at that time and it is still valid if you are still using these versions. You can still re…
Internet Business Fax to Email Made Easy - With eFax Corporate (http://www.enterprise.efax.com), you'll receive a dedicated online fax number, which is used the same way as a typical analog fax number. You'll receive secure faxes in your email, fr…
Video by: Mark
This lesson goes over how to construct ordered and unordered lists and how to create hyperlinks.

895 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

17 Experts available now in Live!

Get 1:1 Help Now