glinkot

asked on

C# app crashes when trying to load a large DataTable

Hi all!

I am trying to load an *enormous* data file into an array or some other structure so I can manipulate it (100 million rows, though each row is only about 7 bytes in total).  I have found that Windows XP freezes completely (I need to hold down the power button to turn the machine off!) when the process reaches a certain size in memory.  I've had some success loading the data into 3 parallel arrays (an Int16, an Int32 and a byte for each record, e.g. 111, 22222, 3) without it crashing.  However, if I try to load the same data into a DataTable, XP crashes.

I've looked at various options but am no expert.  There seem to be things like weak references, B-trees etc. that all sound too complicated.  All I really wanted was for the damn thing to be held in virtual memory and looked after transparently by the system!  I've set the virtual memory to be system managed, but the load doesn't seem to max that out.  I found something on this site about a 1.2 GB limit per process, though I don't think I'm pushing that limit with this exercise.

So, essentially I want my set of data to be available in a DataTable, ready for me to sort/filter/manipulate using the DataTable's methods.  Why doesn't Windows use virtual memory rather than crashing?  What happened to the idea of 'managed code' that is meant to do the memory management for me?

My code is below.  If you set maxrecordstoread to 1000, it's fine of course.  If you set it to the larger number (101 million!) it crashes after 5-6 million rows (note that I'm writing a log file as a 'black box').  I know it's a lot of records, but a database is too slow for my purposes (I'm trying to analyse trends in this data).

Many thanks guys!

        private void button3_Click(object sender, EventArgs e)
        {
            // Input file lines look like these, if you are masochistic and want to generate a dummy set!:
            //1,1248029,3
            //1,1842128,4
            //1,2238063,3
            //1,1503895,4
            //1,2207774,5
            //1,2590061,3

            DataTable dt = new DataTable();
            dt.Columns.Add("movieid", typeof(Int16));
            dt.Columns.Add("user", typeof(Int32));
            dt.Columns.Add("rating", typeof(byte));

            Int64 counter = 0;
            Int64 maxrecordstoread = 110000000;
            //Int64 maxrecordstoread = 1000;

            using (StreamReader sr = new StreamReader(File.OpenRead(@"d:\joinedout.txt")))
            using (StreamWriter sw = new StreamWriter(@"d:\log.txt", true))   // 'black box' log of how far the load got
            {
                while (!sr.EndOfStream && counter < maxrecordstoread)
                {
                    counter++;
                    string aLine = sr.ReadLine();
                    string[] fields = aLine.Split(',');

                    // one DataRow per line: movieid, user, rating
                    dt.Rows.Add(new Object[] { Convert.ToInt16(fields[0]), Convert.ToInt32(fields[1]), Convert.ToByte(fields[2]) });
                    sw.WriteLine(counter);
                }
            }

            label1.Text = "Loaded";
        }
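
For reference, the three-parallel-array version I mentioned (the one that did not freeze the machine) looks roughly like this - just a sketch from memory, and the capacity of 110 million is only an assumed upper bound on the record count:

            // Sketch of the 3-parallel-array load that worked: about 7 bytes per record.
            // The capacity is an assumed upper bound on the number of records.
            const int capacity = 110000000;
            Int16[] movieIds = new Int16[capacity];
            Int32[] userIds = new Int32[capacity];
            byte[] ratings = new byte[capacity];

            int n = 0;
            using (StreamReader sr = new StreamReader(File.OpenRead(@"d:\joinedout.txt")))
            {
                string line;
                while ((line = sr.ReadLine()) != null && n < capacity)
                {
                    string[] f = line.Split(',');
                    movieIds[n] = Convert.ToInt16(f[0]);
                    userIds[n] = Convert.ToInt32(f[1]);
                    ratings[n] = Convert.ToByte(f[2]);
                    n++;
                }
            }
            // n is the number of records actually loaded.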
ozymandias

I think you are going to have a job here.
Even if you can load all the data, how are you going to analyse it without indexes? And if you build indexes, they will take up some space too, and take time to build.

Wouldn't you be better off leaving all this stuff in the database and building an OLAP cube over the top of it to do your analysis?
You only have a maximum of three dimensions to work with, so once the cube was built it would be easy.
glinkot

ASKER

Hi ozymandias, the task (which is definitely a job, as you say!) is probably to do a factor analysis and then a cluster analysis on the stored data, which probably means about 7 or so floating-point fields in a second table with 17,000 rows (I am analysing movie preferences).  I'm not sure how well a cube would lend itself to analysis like clustering, though?  I haven't looked too much at the whole OLAP implementation thing, although I have a fair idea of what cubes are.

Even if this isn't the right way, a key question is still why XP just freezes rather than giving me a message that it's going to try to increase virtual memory (or some other more graceful response!)
Gautham Janardhan
Can you import this file (@"d:\joinedout.txt") into an SQL database and try reading it from there? You would get more performance, I think; there must be a limit to what a txt file can manage.
glinkot

ASKER

Hi gauthampj, I have tried querying the same data from SQL and it crashes in the same way.  Thanks for your efforts though, mate!
What I meant was not loading the bulk as a whole; we could page it.

While inserting into the database, you could insert into a table that has an autonumber column, and then retrieve the data in pages:

first rows 1 - 10,000,
then the next 10,001 - 20,000, and so on.
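
Just to illustrate (a rough, untested sketch; the table name, column names and connection string are only placeholders), paging through a table that has an autonumber/identity column could look like this:

// requires: using System.Data; using System.Data.SqlClient;
const int pageSize = 10000;
string connStr = "Data Source=.;Initial Catalog=netflix;Integrated Security=True"; // placeholder

using (SqlConnection conn = new SqlConnection(connStr))
{
    conn.Open();
    for (int start = 1; ; start += pageSize)
    {
        DataTable page = new DataTable();
        using (SqlCommand cmd = new SqlCommand(
            "SELECT movieid, [user], rating FROM ratings WHERE id BETWEEN @start AND @end", conn))
        {
            cmd.Parameters.AddWithValue("@start", start);
            cmd.Parameters.AddWithValue("@end", start + pageSize - 1);
            new SqlDataAdapter(cmd).Fill(page);
        }

        if (page.Rows.Count == 0)
            break;      // no more rows (assumes the identity column has no large gaps)

        // work on this 10,000-row page, then let it go out of scope
    }
}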
A process running on 32-bit Windows XP is limited to 2 GB of memory - virtual or not.

In your code, where you "bulk" load the DataTable, be sure to wrap the section where many rows are added with BeginLoadData() and EndLoadData().
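
For example, something like this (a sketch based on your loop above):

// Sketch: suspend notifications, index maintenance and constraint checking during the bulk add.
dt.BeginLoadData();
try
{
    while (!sr.EndOfStream && counter < maxrecordstoread)
    {
        counter++;
        string[] fields = sr.ReadLine().Split(',');
        dt.Rows.Add(Convert.ToInt16(fields[0]), Convert.ToInt32(fields[1]), Convert.ToByte(fields[2]));
    }
}
finally
{
    dt.EndLoadData();   // turn notifications, indexes and constraints back on
}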

Once your CPU-intensive process dives into paged memory, things get much slower.

Instrument your code to see what is happening with memory while you are loading the data.

            Process p = Process.GetCurrentProcess();
            long peakPagedMem = p.PeakPagedMemorySize64;
            long peakVirtualMem = p.PeakVirtualMemorySize64;
            long peakWorkingSet = p.PeakWorkingSet64;
            long maxWorkingSet = p.MaxWorkingSet;




Should have been:
IntPtr maxWorkingSet = p.MaxWorkingSet;
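
For example (just a sketch), you could sample the counters every 100,000 rows inside your load loop and write them to your log:

            // requires: using System.Diagnostics;
            if (counter % 100000 == 0)
            {
                Process p = Process.GetCurrentProcess();
                sw.WriteLine("{0} rows: working set {1:N0}, private bytes {2:N0}, virtual {3:N0}",
                    counter, p.WorkingSet64, p.PrivateMemorySize64, p.VirtualMemorySize64);
            }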
If you build an OLAP cube using SQL Server Analysis Services you can then do a cluster analysis using the data mining model wizard.
I would think about giving it a go.
glinkot

ASKER

Thanks everyone for your help, much appreciated!  I guess it's looking like neither B-trees nor weak references are the right area to examine.  Notes below:

*mjmarlow*

Great suggestions, a couple of quick questions:
a) BeginLoadData and EndLoadData "do not exist in the current context" for me (and don't appear in IntelliSense).  In the Object Browser I can see them under System.Data.DataTable, but even typing all of that explicitly does not help!  I've already put 'using' directives in for System.Data, but underneath Data there is no 'DataTable' namespace to add.

b) All those Process calls - would I capture those every few hundred records?  What would I do with the info once obtained?

*ozymandias*

I've only got SQL Server 2005 Express on my machine right now, and the full version wanted a server OS and told me to go jump!  I might give that a whirl if I can get the IT guys at work to spare me an instance of it... :)

*gauthamp*

The idea of doing it in chunks is something I did give thought to, but unfortunately I need the whole dataset accessible at once.


You need BIG RAM for your computer more than anything else in this case!
ASKER CERTIFIED SOLUTION
ozymandias

SOLUTION

SOLUTION
glinkot

ASKER

Thank you everyone for your fantastic comments.  The fact is, it's no easy issue.  The idea of 'managed code' obviously still has a way to go.

Sinclair - are you saying there's a rollback buffer for things like DataTable loads?  Is there any way to disable it in code?  How do you 'commit' in code?  For example, in the code above, could I do something every 10,000 rows to commit the buffered lines into the DataTable?

Thanks once again
Yes.
I am not a SQL Server expert but, in general, all real databases have the concept of a "transaction". All the DB operations you do happen inside a transaction; no one else can see the results but you. At the end, you can commit the transaction, persisting the changes to the DB, or roll back the transaction, undoing all the changes. You can't persist half the changes in a transaction and discard the others; this keeps the database consistent ( http://www.google.com/search?client=opera&rls=en&q=acid+compliance&sourceid=opera&ie=utf-8&oe=utf-8 )

So, all the DB operations you perform get stored in some temporary buffer, which gets committed to disk when you commit the transaction. Well, if you have gigabytes and gigabytes of data, this buffer will overflow. Thus, the common pattern to use is something like this:

using(OracleConnection conn = new OracleConnection(myLoginCredentials)) {
  conn.Open();
  OracleTransaction trans = conn.BeginTransaction();
  int count = 0;
  foreach(SomeRecord record in dataFile) {
    // insert the record into the DB (e.g. execute an INSERT via an OracleCommand)
    ++count;
    if(count >= BATCH_SIZE) {
        trans.Commit();
        trans = conn.BeginTransaction();   // start the next batch
        count = 0;
    }
  }
  trans.Commit();
  conn.Close();
}

So, you commit the transaction every BATCH_SIZE operations to flush the rollback segment; the assumption here is that all the records in your data file are independent, so that the DB won't get stuck in an inconsistent state if you commit only half of them.

The above is Oracle code, but there are equivalent classes for SQL Server, MySQL, and most other databases; unfortunately, I've only used Oracle, so I can't tell you the exact syntax, sorry. If you check the methods on the Connection and Transaction classes, you should find the exact APIs.

Technically, you could commit every single record you insert, and not bother with BATCH_SIZE at all; but, in practice, opening a transaction is a relatively expensive process, so you don't want to do it too often.
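
If it helps, my guess at the SQL Server equivalent is something like this (untested sketch; the connection string, the "ratings" table and BATCH_SIZE are placeholders):

// requires: using System.IO; using System.Data.SqlClient;
const int BATCH_SIZE = 10000;
using (SqlConnection conn = new SqlConnection(connectionString))
{
    conn.Open();
    SqlTransaction trans = conn.BeginTransaction();
    int count = 0;

    using (StreamReader sr = new StreamReader(@"d:\joinedout.txt"))
    {
        string line;
        while ((line = sr.ReadLine()) != null)
        {
            string[] f = line.Split(',');
            using (SqlCommand cmd = new SqlCommand(
                "INSERT INTO ratings (movieid, [user], rating) VALUES (@m, @u, @r)", conn, trans))
            {
                cmd.Parameters.AddWithValue("@m", Convert.ToInt16(f[0]));
                cmd.Parameters.AddWithValue("@u", Convert.ToInt32(f[1]));
                cmd.Parameters.AddWithValue("@r", Convert.ToByte(f[2]));
                cmd.ExecuteNonQuery();
            }

            if (++count >= BATCH_SIZE)
            {
                trans.Commit();                  // flush this batch
                trans = conn.BeginTransaction(); // start the next one
                count = 0;
            }
        }
    }

    trans.Commit();   // commit whatever is left over
}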
I should add that, currently, I'm loading about 3 GB of data into the DB using the above method, so managed code does work for me :-)
glinkot

ASKER

Thanks Sinclair.  I've managed to get all the records into the DB table, but the C# notion of a 'DataTable' is more of a RAM-dwelling entity, I think, and that's the one that crashes.  I had thought you were talking about that one, as opposed to an actual DB table.

Thanks everyone for your help, it's been really interesting actually.  The key takeaway for me is that I should either use Analysis Services, use a DB instead, or buy lots of RAM (and possibly a 64-bit PC)!

Cheers guys, you are geniuses (genii?)
I'm a genie ! :-)

Anyway, yeah, you probably want a real DB. I think C# has bindings for MySQL; I'm not sure about PostgreSQL. Oracle and SQL Server are obviously supported, but they're not free.
If the memory-resident DataSet has an equivalent of a rollback segment, the only equivalent to a commit I can think of would be the AcceptChanges() method.
In the case of inserted rows, which is all you are doing (I think), that would simply mean resetting the RowState of those rows.
I don't think it makes any difference in memory terms what the RowState is set to for each row.
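
If you want to try it anyway, it would just be something like this inside your load loop (sketch only; as above, I doubt it will save much memory):

dt.Rows.Add(Convert.ToInt16(fields[0]), Convert.ToInt32(fields[1]), Convert.ToByte(fields[2]));

// every 10,000 rows, mark the pending Added rows as Unchanged
if (dt.Rows.Count % 10000 == 0)
{
    dt.AcceptChanges();
}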
glinkot

ASKER

Yeah, I'm using SQL Server Express, but word is that DBs in general are too slow for what I want to do.  For interest's sake, the exercise I'm attempting is at www.netflixprize.com.  It's a $1M prize for developing a new algorithm to predict people's movie preferences.  The sample of preference data is over 100 million records, and if you're going to analyse them within a reasonable period you need them pretty much in RAM, I think.

I stand no chance of winning any cash (there are universities and numerous eggheads already leading the field), but the idea intrigues me and I know logically what I want my ingenious algorithm to do.  However, my programming skills are not up to par with my idea-generating skills!  It's a bit of fun.

There is a Perl data manipulation library called PDL that is apparently suited to this sort of thing - but I'm learning C# right now and don't feel like starting from scratch with another language!

If anyone is interested in having a stab at it with me, let me know!  :)
glinkot

ASKER

Thanks ozy, I might try it for good measure (and buy more ram) :)
Well, if you were really writing an application for Netflix, then you'd most likely be using a database, because that is (again, most likely) how Netflix stores their data to begin with. As I said earlier, if performance is an issue, I'd write the code as a stored procedure.

Keep in mind that, even though Netflix has 100 million rows, they also have millions of dollars. This means they have a whole server farm powering their clustered database, optimized for rapid access, so it should be reasonably fast.

I don't think keeping all the records in RAM is a reasonable proposition for your specific task, because you don't have a million dollars, which means you probably have 2-4 GB of RAM in total. If the size of your data is more than that, you'll end up swapping to disk anyway, which is kinda like using a DB, only a lot slower.
glinkot

ASKER

Yep, true and true.  They aren't too concerned about whether the system is production quality, though; really what they want is the algorithm (in whatever form) and they will rewrite it.  In fact they don't have any restrictions on how long the algorithm takes - months is OK in theory, and they admit even their own takes days right now.  The only performance requirement from my own perspective is that each run takes a reasonable period, so I can find out whether a change to the code has helped!  If it takes 2 weeks to run and you realise you left something out - disaster!  :)
Well, can you isolate a small portion of the data to work with? Maybe 100k or so randomly chosen rows?
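
Something like this would do it (a sketch only; it keeps roughly 1 line in 1,000, i.e. about 100k of your ~100 million rows):

// requires: using System.Collections.Generic; using System.IO;
Random rng = new Random();
List<string> sample = new List<string>();

using (StreamReader sr = new StreamReader(@"d:\joinedout.txt"))
{
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        if (rng.Next(1000) == 0)
            sample.Add(line);
    }
}
// 'sample' now holds a random ~0.1% subset you could load into the DataTable.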
glinkot

ASKER

Yes, I quite possibly can, but the method of choosing the rows will probably be more structured - more like the 'tree' structure used in analysing chess moves.  The data is quite sparse in terms of relationships between users and movies (i.e. if I have seen these 3 movies and want to predict my rating for some other one, I need to find samples of other people who have seen similar movies and calculate based on them).  If you dilute those relationships by a factor of 10, you don't end up with many people/ratings to calculate with!

It's a great mental puzzle.