Solved

Random Access File/ArrayList Replacement

Posted on 2013-07-01
Medium Priority
512 Views
Last Modified: 2013-07-06
I have written an application that does stratified statistical sampling for auditing sales and use tax files. In VB6 I was able to use random access files to store each stratum's data for random sampling; it was relatively quick and easy. When I converted to Visual Basic .NET I had to find another way to do it, so I used an ArrayList of ArrayLists: each stratum had an ArrayList containing the invoice records to be audited. That was also relatively easy to do. The only problem is that the ArrayList lives on the heap, so eventually you run out of memory. I also used a binary formatter to store the stratum records on disk, to be read back in as necessary. Everything worked great until a user wanted to audit an invoice file of over 32 million records. We can read the file and create the ArrayList of ArrayLists, but the program hangs on binary formatting. Clearly not the best way to go.
I would like suggestions on the best way to rewrite this section of code with the following constraints:
1. It must be backward compatible: samples must be the same for the same starting seed. That should not be a problem, as the random number generator generates the record number to be selected in the array.
2. The first method required a structure with fixed-format records. The second one did not, so I relaxed the requirement. Now I have to live with both.

I have looked at as many resources as I can find, and the best solution seems to be in http://www.experts-exchange.com/Programming/Languages/.NET/Visual_Basic.NET/Q_21812439.html, but I would like suggestions on how to proceed. I'm not sure whether a database approach is suitable in this scenario, although I think the referenced solution suggested it when datasets are large. The best solution would be one where I can write the data to disk and retrieve it at will without having to count bytes, etc.

Thanks in advance for suggestions. I can provide code if necessary, but I think theory and best practices ideas are what I need now.
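For reference, here is a minimal sketch of the pattern described above: an ArrayList of per-stratum ArrayLists serialized in one shot with BinaryFormatter. All names are illustrative, not the actual code; the single Serialize call over the whole nested structure is the step that hangs at 32 million records.

Imports System.Collections
Imports System.IO
Imports System.Runtime.Serialization.Formatters.Binary

Module StratumStore
    ' One inner ArrayList per stratum; each holds that stratum's invoice records.
    Public Strata As New ArrayList()

    Public Sub SaveStrata(ByVal path As String)
        ' BinaryFormatter serializes the whole nested structure in one call,
        ' which becomes the bottleneck with tens of millions of records.
        Using fs As New FileStream(path, FileMode.Create)
            Dim bf As New BinaryFormatter()
            bf.Serialize(fs, Strata)
        End Using
    End Sub

    Public Function LoadStrata(ByVal path As String) As ArrayList
        Using fs As New FileStream(path, FileMode.Open)
            Dim bf As New BinaryFormatter()
            Return CType(bf.Deserialize(fs), ArrayList)
        End Using
    End Function
End Module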
Question by:rkulp
8 Comments
 
LVL 17

Expert Comment

by:Gerald Connolly
ID: 39292770
So how big is your data structure going to be for 32 million records? Or for 100 million or bigger?
 
LVL 1

Author Comment

by:rkulp
ID: 39292780
For 32 million records it is about 3.2 GB, so for 100 million it would be about 9.5 GB. I'm hoping it is simply a matter of memory, since many computers now have 16 GB or more.
 
LVL 36

Expert Comment

by:Miguel Oz
ID: 39295044
Do you really need all 32 million records in memory? Even if you can find a big x64 server, you will run out of memory at some point.

Some suggestions in the meantime:
1) Use List<List<string>> (a generic list) instead of an ArrayList of ArrayLists, or better still string[,]. It takes less memory and you avoid boxing/unboxing operations.
2) I have had to deal with big files before; in my case we loaded the file in chunks (e.g. 400K lines at a time). The first time you read 400K lines and process them in your program, the next time the next 400K section, and so on until you finish the file (see the sketch below).
3) Another possibility is to split the invoice file into multiple files, so you can process, say, 8 reasonably sized files instead of one big file.
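A rough sketch of the chunked reading in point 2, assuming a plain-text invoice file; the chunk size and the processing step are placeholders:

Imports System.Collections.Generic
Imports System.IO

Module ChunkedReader
    Const ChunkSize As Integer = 400000 ' lines per chunk, per the suggestion above

    Public Sub ProcessInChunks(ByVal path As String)
        Dim chunk As New List(Of String)(ChunkSize)
        Using reader As New StreamReader(path)
            Dim line As String = reader.ReadLine()
            While line IsNot Nothing
                chunk.Add(line)
                If chunk.Count = ChunkSize Then
                    ProcessChunk(chunk) ' stratify/accumulate, then discard
                    chunk.Clear()
                End If
                line = reader.ReadLine()
            End While
        End Using
        If chunk.Count > 0 Then ProcessChunk(chunk) ' remainder
    End Sub

    Private Sub ProcessChunk(ByVal lines As List(Of String))
        ' Placeholder: parse each invoice line and route it to its stratum here.
    End Sub
End Module

Only one chunk is alive at a time, so memory use stays flat regardless of file size.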
 
LVL 1

Author Comment

by:rkulp
ID: 39295281
mas_oz2003,

The problem I have is that the stratification is by invoice amount, and the invoices are presented as the taxpayer stores them, usually not ordered by amount. I have to separate them into strata as they are read. Using random access files this was a snap, and they were not kept in memory. With the loss of random access I went with what I could figure out how to do; in retrospect that was probably not the best way to go, but I am stuck with it for now. When I make a change I have to make it backward compatible, so I have to be careful. I will look at List<List<string>>; the items are not strings now, but that should not be a big problem. Anyway, as soon as the entire file is stratified, I have to calculate sample sizes and draw random samples from the strata. Drawing the sample is where random access is used. After all the strata samples are drawn, they are combined into one larger sample and sorted into the same order as the original invoice file. Normally the sample is exported to an Excel spreadsheet for the auditor.
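Since the strata are built as records are read, one memory-friendly possibility is to append each record to a per-stratum file during that first pass, so only one record is ever in memory. A sketch, assuming a comma-separated file with the invoice amount in a hypothetical third field and stratum boundaries given as ascending upper bounds:

Imports System.IO

Module StreamingStratifier
    ' Route each invoice to a per-stratum file as it is read, so the
    ' full invoice file never needs to be held in memory.
    Public Sub StratifyToDisk(ByVal invoicePath As String, ByVal upperBounds As Decimal())
        ' N boundaries define N+1 strata (the last one is open-ended).
        Dim writers(upperBounds.Length) As StreamWriter
        For i As Integer = 0 To upperBounds.Length
            writers(i) = New StreamWriter("stratum" & i & ".txt")
        Next
        Using reader As New StreamReader(invoicePath)
            Dim line As String = reader.ReadLine()
            While line IsNot Nothing
                ' Hypothetical layout: amount is the third comma-separated field.
                Dim amount As Decimal = Decimal.Parse(line.Split(","c)(2))
                writers(FindStratum(amount, upperBounds)).WriteLine(line)
                line = reader.ReadLine()
            End While
        End Using
        For Each w As StreamWriter In writers
            w.Close()
        Next
    End Sub

    Private Function FindStratum(ByVal amount As Decimal, ByVal upperBounds As Decimal()) As Integer
        For i As Integer = 0 To upperBounds.Length - 1
            If amount <= upperBounds(i) Then Return i
        Next
        Return upperBounds.Length ' top, open-ended stratum
    End Function
End Module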
 
LVL 17

Expert Comment

by:Gerald Connolly
ID: 39295631
So storing the files on an SSD will not solve your speed problem?
 
LVL 1

Author Comment

by:rkulp
ID: 39296448
I can't control what equipment the user has. An SSD would be faster than a hard drive. However, speed is not the issue here. I expect I will have to go with the method suggested in my referenced post. I was hoping to avoid counting bytes to find the starting point of a record.
 
LVL 36

Accepted Solution

by:
Miguel Oz earned 2000 total points
ID: 39298628
If you cannot control the hardware, then you cannot keep the file in memory.
Two alternatives:
A1: Load in chunks and save your intermediate results to temporary files. This is fairly complex, as you need to check both your rules and that you are not hitting memory limits.
A2: Load your files into a database (SQL Server or MySQL) and query the results the way you need them. Again, you may need to retrieve results in chunks to avoid running out of memory.

Note: I did not list "go back to random access" as an option because you mention you cannot use it right now, but consider whether going back to that solution may be less painful than the alternatives above.
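For alternative A2, a rough sketch, assuming SQL Server and that the load step assigned each record a sequential RecNum within its stratum (table and column names are placeholders). A sampled record can then be fetched by its record number directly, with no byte counting:

Imports System.Data.SqlClient

Module DbSample
    ' Fetch one invoice by its position within a stratum.
    ' Assumes conn is already open and the Invoices table was populated
    ' with (Stratum, RecNum, InvoiceData) during the load step.
    Public Function GetRecord(ByVal conn As SqlConnection, _
                              ByVal stratum As Integer, _
                              ByVal recNum As Integer) As String
        Dim sql As String = _
            "SELECT InvoiceData FROM Invoices WHERE Stratum = @s AND RecNum = @r"
        Using cmd As New SqlCommand(sql, conn)
            cmd.Parameters.AddWithValue("@s", stratum)
            cmd.Parameters.AddWithValue("@r", recNum)
            Return CStr(cmd.ExecuteScalar())
        End Using
    End Function
End Module

Because the sampler still addresses records by ordinal within a stratum, the same starting seed would select the same records, which preserves the backward-compatibility constraint.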
 
LVL 1

Author Closing Comment

by:rkulp
ID: 39303971
Thanks for your comments. I ended up going back to random access using FileGet and FilePut. However, it is very slow compared to what I was doing, so I gave the user a choice of method, with a warning that the new method (random access) should be used only when the older method fails.
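For anyone taking the same route, a minimal sketch of random access in VB.NET with FileOpen/FilePut/FileGet; the record layout is illustrative, and fixed-length fields (VBFixedString) are required so every record occupies the same number of bytes:

Imports Microsoft.VisualBasic

Module RandomAccessDemo
    ' Random access needs fixed-length records, hence VBFixedString on strings.
    Public Structure InvoiceRecord
        <VBFixedString(12)> Public InvoiceNo As String
        Public Amount As Decimal
    End Structure

    Public Sub Demo()
        Dim rec As InvoiceRecord
        Dim f As Integer = FreeFile()
        FileOpen(f, "stratum0.dat", OpenMode.Random, , , Len(rec))
        rec.InvoiceNo = "INV-000042"
        rec.Amount = 123.45D
        FilePut(f, rec, 500) ' write record number 500 directly
        FileGet(f, rec, 500) ' read it back by record number, no byte counting
        FileClose(f)
    End Sub
End Module

FilePut/FileGet address records by ordinal, which is what makes seed-compatible sampling straightforward, at the cost of the per-record disk I/O noted above.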
