Solved

Random Access File/ArrayList Replacement

Posted on 2013-07-01
8
490 Views
Last Modified: 2013-07-06
I have written an application that does stratified statistical sampling for auditing sales and use tax files. In VB6 I was able to use random access files to store each stratum's data for random sampling. It was relatively quick and easy. When I converted to Visual Basic,Net I had to find another way to do it. What I did was use an ArrayList of ArrayLists. Each stratum had an ArrayList containing the invoice records to be audited. It was also relatively easy to do. The only problem occurs because the ArrayList is on the heap and eventually you run out of memory. Also, I used a binary formatter to store the stratum records on disk to be read in as necessary. Everything worked great until a user wanted to do an audit of an invoice file of over 32 million records. We can read the file and create the ArrayList of ArrayLists but the program hangs on binary formatting. Clearly not the best way to go.
I would like suggestions on the best way to rewrite this section of code with the following constraints:
1. It must be backward compatible as samples must be the same for the same starting seed. That should not be a problem as the random number generator is generating the record number to be selected in the array.
2. The first method required a structure with fixed format records. The second one did not so I relaxed the requirement. Now I have to live with both.

I have looked at as many resources as I can find and the best solution seems to be in http://www.experts-exchange.com/Programming/Languages/.NET/Visual_Basic.NET/Q_21812439.html  but I would like suggestions on how to proceed. I'm not sure if a database approach is suitable in this scenario although I think it was suggested in the referenced solution when datasets are large.  The best solution would be one where I can write the data to disk and retrieve it at will without having to count bytes, etc.

Thanks in advance for suggestions. I can provide code if necessary, but I think theory and best practices ideas are what I need now.
0
Comment
Question by:rkulp
  • 4
  • 2
  • 2
8 Comments
 
LVL 16

Expert Comment

by:Gerald Connolly
Comment Utility
So how big is your data structure going to be for 32 million records? Or for 100 million or bigger?
0
 
LVL 1

Author Comment

by:rkulp
Comment Utility
For 32 million it is about 3.2GB so for 100 million it would be about 9.5GB. I'm hoping it is simply a matter of memory since many computers now have 16GB or more.
0
 
LVL 35

Expert Comment

by:Miguel Oz
Comment Utility
Do you really need the 32 million records in memory? Even if you can find big x64 server you will run out of memory at a certain stage.

Some suggestions in the mean time:
1) Use List<List<string>> (generic list) instead of ArrayList of ArrayLists  or better string[,]. It takes less memory and you avoid boxing/unboxing operations.
2) I have to deal with big files before, in my case we load chunks of the file at the time (e.g 400K lines at the time). The first time you read 400k and process it in your program, the next time the next 400K section and so on until you finish the file.
3) Another possibility to split the invoice file in multiple files you can process say 8 reasonable sized files instead of one big file.
0
 
LVL 1

Author Comment

by:rkulp
Comment Utility
mas_oz2003,

The problem I have is that the stratification is by invoice amount and the invoices are presented as the taxpayer stores them, usually not by amount. I have to separate them into strata as they are read. Using random access files this was a snap and they were not kept in memory. With the loss of random access I went with what I could figure out how to do. In retrospect, that was probably not the best way to go but I am stuck with it for now. When I make a change, I have to make it backward compatible so I have to be careful. I will look at List<List<string>> . The items are not strings now but that should not be a big problem. Anyway, as soon as the entire file is stratified, I have to calculate sample sizes and draw random samples from the strata. Drawing the sample is where random access is used. After all the strata samples are drawn, they are combined into one larger sample and sorted into the same order as the original invoice file. Normally the sample is exported to an Excel spreadsheet for the auditor.
0
How your wiki can always stay up-to-date

Quip doubles as a “living” wiki and a project management tool that evolves with your organization. As you finish projects in Quip, the work remains, easily accessible to all team members, new and old.
- Increase transparency
- Onboard new hires faster
- Access from mobile/offline

 
LVL 16

Expert Comment

by:Gerald Connolly
Comment Utility
So storing the files on an SSD will not solve your speed problem?
0
 
LVL 1

Author Comment

by:rkulp
Comment Utility
I can't control what equipment the user has. An SSD would be faster than a hard drive. However, speed is not the issue here. I expect I will have to go with the method suggested in my referenced post. I was hoping to avoid counting bytes to find the starting point of a record.
0
 
LVL 35

Accepted Solution

by:
Miguel Oz earned 500 total points
Comment Utility
If you cannot control the hardware then you can not keep the file in memory.
Two alternatives:
A1: Load in chunks and save your intermidiate results to a temp files. Very complex as you need to check your rules and you are not reaching memory limits.
A2: Load your files to a database (SQL server or MY SQL) and query the results in the way you need it. Again you may need to retreive results in chunks to avoid out of memory.

Note: I did not put "go back to random access" as an option because you mention you cannot use it right now but consider if going back to that solution may be less painful than alternatives above.
0
 
LVL 1

Author Closing Comment

by:rkulp
Comment Utility
Thanks for your comments. I ended up going back to random access using FileGet and FilePut. However, it is very slow compared to what I was doing so I put in a choice for the user to choose the method with a warning that using the new method (random access) should be done only when the older method fails.
0

Featured Post

How to run any project with ease

Manage projects of all sizes how you want. Great for personal to-do lists, project milestones, team priorities and launch plans.
- Combine task lists, docs, spreadsheets, and chat in one
- View and edit from mobile/offline
- Cut down on emails

Join & Write a Comment

Recently while returning home from work my wife (another .NET developer) was murmuring something. On further poking she said that she has been assigned a task where she has to serialize and deserialize objects and she is afraid of serialization. Wha…
More often than not, we developers are confronted with a need: a need to make some kind of magic happen via code. Whether it is for a client, for the boss, or for our own personal projects, the need must be satisfied. Most of the time, the Framework…
Sending a Secure fax is easy with eFax Corporate (http://www.enterprise.efax.com). First, Just open a new email message.  In the To field, type your recipient's fax number @efaxsend.com. You can even send a secure international fax — just include t…
This demo shows you how to set up the containerized NetScaler CPX with NetScaler Management and Analytics System in a non-routable Mesos/Marathon environment for use with Micro-Services applications.

763 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

10 Experts available now in Live!

Get 1:1 Help Now