Solved

C# App crashes when trying to load large datatable

Posted on 2006-11-11
2,042 Views
Last Modified: 2012-08-14
Hi all!

I am trying to load an *enormous* datafile into an array or some other structure to manipulate it (100 million rows, though each row is only about 7 bytes in total).  I have found that Windows XP freezes totally (I need to hold down the power button to turn it off!) when it gets to a certain size in memory.  I've had some success loading the data into 3 arrays (an Int16, an Int32 and a byte for each record, eg 111, 22222, 3) without it crashing.  However, if I try to load the same data into a datatable, XP crashes.

I've looked at various options but am no expert.  There seem to be things like weak references, B-trees etc that all sound too complicated.  All I really wanted was for the damn thing to be held in virtual memory and looked after transparently by the system!  I've set the virtual memory to be system managed, but it doesn't seem to max that out.  I found something on this site about a 1.2 GB limitation per process, though I don't think I'm pushing that limit with this exercise.

So, essentially I want my set of data to be available in a datatable, ready for me to sort/filter/manipulate using the datatable's methods.  Why doesn't Windows use virtual memory rather than crashing?  What happened to the idea of 'managed code' that is meant to do the memory management for me?

My code is below...  If you set maxrecordstoread to 1000, it's fine of course.  If you set it to the larger number (101 million!) it crashes after 5-6 million rows (see that I'm writing a log file as a 'black box').  I know it's a lot of records, but a DB is too slow for my purposes (I'm trying to analyse trends in this data).

Many thanks guys!

        private void button3_Click(object sender, EventArgs e)
        {
            // Input file data lines look like these if you are masochistic and want to generate a dummy set!:
            //1,1248029,3
            //1,1842128,4
            //1,2238063,3
            //1,1503895,4
            //1,2207774,5
            //1,2590061,3

            DataTable dt = new DataTable();
            dt.Columns.Add("movieid", typeof(Int16));
            dt.Columns.Add("user", typeof(Int32));
            dt.Columns.Add("rating", typeof(byte));

            StreamReader sr = new StreamReader(File.OpenRead(@"d://joinedout.txt"));
            string aLine;
            Int64 counter = 0;
            Int64 maxrecordstoread = 110000000;
            //Int64 maxrecordstoread = 1000;
            string[] fields;

            StreamWriter sw = new StreamWriter("d://log.txt", true);

            while (!sr.EndOfStream && counter < maxrecordstoread)
            {
                counter++;
                aLine = sr.ReadLine();
                fields = aLine.Split(',');

                dt.Rows.Add((new Object[] { Convert.ToInt16(fields[0]), Convert.ToInt32(fields[1]), Convert.ToByte(fields[2]) }));
                sw.Write(counter.ToString() + "\r\n");
            }

            label1.Text = "Loaded";

        }
Question by:glinkot

25 Comments
 
LVL 15

Expert Comment

by:ozymandias
ID: 17923374
I think you are going to have a job on your hands here.
Even if you can load all the data, how are you going to analyse it without indexes? And if you build indexes, they will take up some space too, and take time to build.

Wouldn't you be better off leaving all this stuff in the database and building an OLAP cube over the top of it to do your analysis?
You only have a maximum of three dimensions to work with; once the cube was built it would be easy.
 

Author Comment

by:glinkot
ID: 17923397
Hi ozymandias, the task (which is definitely a job as you say!) is probably to do a factor analysis and then cluster analysis on the stored data, which probably means about 7 or so floating point fields in a second table with 17000 rows (I am analysing movie preferences).  I'm not sure how well a cube would lend itself to analysis like clusters etc?  I haven't looked too much at the whole OLAP implementation thing though I have a fair idea of what they are etc.

Even if it isn't the right way, a key question is still why XP just freezes rather than giving me the message that it's going to try to increase Virtual Memory (or some other more graceful response!)
 
LVL 29

Expert Comment

by:Gautham Janardhan
ID: 17924171
Can you import this file (@"d://joinedout.txt") into an SQL database and try reading it from there? You would get more performance, I think. There must be a limit to what a txt file can manage.
 

Author Comment

by:glinkot
ID: 17924195
Hi gauthampj, I have tried querying the same from SQL and it crashes in the same way.  Thanks for your efforts though mate!
 
LVL 29

Expert Comment

by:Gautham Janardhan
ID: 17924198
What I meant was not reading the bulk as a whole - we could page it.

While inserting into the database, you could insert into a table that has an autonumber column, and then retrieve the data in ranges: first rows 1-10,000, then the next 10,001-20,000, and so on.
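A rough sketch of that paging pattern, assuming the rows were loaded into a SQL Server table with an autonumber (identity) column - the table and column names here are invented for illustration:

```csharp
using System;
using System.Data;
using System.Data.SqlClient;

class PagedReader
{
    const int PageSize = 10000;

    static void ReadInPages(string connectionString)
    {
        using (SqlConnection conn = new SqlConnection(connectionString))
        {
            conn.Open();
            long lastId = 0;
            while (true)
            {
                // Fetch the next range of rows by the autonumber column.
                SqlCommand cmd = new SqlCommand(
                    "SELECT id, movieid, [user], rating FROM ratings " +
                    "WHERE id > @lastId AND id <= @lastId + @pageSize", conn);
                cmd.Parameters.AddWithValue("@lastId", lastId);
                cmd.Parameters.AddWithValue("@pageSize", PageSize);

                DataTable page = new DataTable();
                new SqlDataAdapter(cmd).Fill(page);
                if (page.Rows.Count == 0) break;

                // ... analyse this page, then let it go ...
                lastId += PageSize;
            }
        }
    }
}
```

Each pass pulls only 10,000 rows into the DataTable, so memory stays flat; the previous page becomes garbage before the next one is fetched.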
 
LVL 7

Expert Comment

by:mjmarlow
ID: 17924332
A process running on 32-bit Windows XP is limited to 2 GB of user address space by default - virtual memory or not.

In your code, where you bulk-load the datatable, be sure to wrap the section that adds the rows with BeginLoadData() / EndLoadData().

Once your CPU-intensive process dives into paged memory, things get much slower.

Instrument your code to see what is happening with memory while you are loading the data.

            Process p = Process.GetCurrentProcess();
            long peakPagedMem = p.PeakPagedMemorySize64;
            long peakVirtualMem = p.PeakVirtualMemorySize64;
            long peakWorkingSet = p.PeakWorkingSet64;
            long maxWorkingSet = p.MaxWorkingSet;
 
LVL 7

Expert Comment

by:mjmarlow
ID: 17924363
Should have been:
IntPtr maxWorkingSet = p.MaxWorkingSet;
 
LVL 15

Expert Comment

by:ozymandias
ID: 17924508
If you build an OLAP cube using SQL Server Analysis Services you can then do a cluster analysis using the data mining model wizard.
I would give it a go.
 

Author Comment

by:glinkot
ID: 17927025
Thanks everyone for your help, much appreciated!  I guess it's looking like none of those BTREE or weak references are the right area to examine.  Notes below:

*mjmarlow*

Great suggestions, a couple of quick q's-
a) The BeginLoadData and EndLoadData 'don't exist in the current context' for me (or appear in intellisense).  In the object browser I see them under the System.Data.DataTable namespace, but even typing all that stuff explicitly does not help!  I've already put 'using' statements for Data, but underneath Data there is no 'datatable' section.

b) All those process commands - Would I capture those every few hundred records?  What would I do with the info once obtained?

*ozymandias*

I've only got SQL express 2k5 on my machine right now, and the full version wanted a server OS and told me to go jump!  I might give that a whirl if I can get the IT guys at work to spare me an instance of it... :)

*gauthamp*

The idea of doing it in chunks is something I did give thought to, but unfortunately I need the whole dataset accessible at once.


 
LVL 4

Expert Comment

by:boy8964
ID: 17928083
You need BIG RAM in your computer more than anything else in this case!
 
LVL 15

Accepted Solution

by:
ozymandias earned 200 total points
ID: 17928414
I have a virtual machine with Windows 2003 Server and SQL Server Enterprise with Reporting, Analysis Services etc installed.
It works a treat.

If you don't want to go the SQL/OLAP route, you could try working with the data by streaming it instead of trying to load it all at once.
I.e. stream through the records computing your uniques, counts, averages etc, but discarding each record as you go.

Then, once you have some stats to look at the data through, you can start streaming it again but only loading those rows that you need for each analysis.
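For instance, a single streaming pass over the file from the question could compute per-movie counts and averages without ever holding more than one line in memory (a sketch; only the file format and path are taken from the question):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class StreamStats
{
    static void Main()
    {
        // One pass: sum and count per movie; each record is discarded after use.
        // With ~17,000 movies these dictionaries stay tiny.
        Dictionary<short, long> counts = new Dictionary<short, long>();
        Dictionary<short, long> sums = new Dictionary<short, long>();

        using (StreamReader sr = new StreamReader(@"d://joinedout.txt"))
        {
            string line;
            while ((line = sr.ReadLine()) != null)
            {
                string[] f = line.Split(',');
                short movie = Convert.ToInt16(f[0]);
                byte rating = Convert.ToByte(f[2]);

                if (!counts.ContainsKey(movie)) { counts[movie] = 0; sums[movie] = 0; }
                counts[movie]++;
                sums[movie] += rating;
            }
        }

        foreach (KeyValuePair<short, long> kv in counts)
            Console.WriteLine("movie {0}: {1} ratings, avg {2:F2}",
                kv.Key, kv.Value, (double)sums[kv.Key] / kv.Value);
    }
}
```

The same shape works for any aggregate you can compute incrementally; only statistics, never the raw rows, live in memory.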
 
LVL 7

Assisted Solution

by:mjmarlow
mjmarlow earned 200 total points
ID: 17928938
For example:

            StreamWriter sw = new StreamWriter("d://log.txt", true);
            StreamWriter swMemLog = new StreamWriter("d://memlog.txt", true);
            dt.BeginLoadData();
            Process p = Process.GetCurrentProcess();
            IntPtr maxWorkingSet = p.MaxWorkingSet;
            // Write to log file.
            swMemLog.WriteLine("Process:{0} MaxWorkingSet:{1} Start:{2}",
                p.Id,
                maxWorkingSet,
                DateTime.Now);

            while (!sr.EndOfStream && counter < maxrecordstoread)
            {
                counter++;
                aLine = sr.ReadLine();
                fields = aLine.Split(',');

                dt.Rows.Add(new Object[] { Convert.ToInt16(fields[0]), Convert.ToInt32(fields[1]), Convert.ToByte(fields[2]) });
                sw.Write(counter.ToString() + "\r\n");

                // log memory status every 10000 reads
                if (counter % 10000 == 0)
                {
                    long peakPagedMem = p.PeakPagedMemorySize64;
                    long peakVirtualMem = p.PeakVirtualMemorySize64;
                    long peakWorkingSet = p.PeakWorkingSet64;

                    // write to memory status log file...
                    swMemLog.WriteLine("{0},{1},{2}",
                        peakPagedMem,
                        peakVirtualMem,
                        peakWorkingSet);
                }
            }
            dt.EndLoadData();
            swMemLog.Close();
 
LVL 2

Assisted Solution

by:Sinclair
Sinclair earned 100 total points
ID: 17934047
This is probably not what you wanted to hear, but could you write this code as a stored procedure in the DB, as opposed to trying to fit all the data in memory ? The reason I'm asking is that the DB already implements all these indexes and B-Trees that you'd end up implementing manually, so you may save yourself some trouble.

I think your data-loading code is crashing because you're exhausting the size of the rollback segment (which is, apparently, held entirely in memory... go figure). I'd suggest committing the transaction every 10000 or so operations.
 

Author Comment

by:glinkot
ID: 17934473
Thank you everyone for your fantastic comments.  The fact is, it's no easy issue.  The idea of 'managed code' obviously still has a way to go.

Sinclair - are you saying there's a rollback buffer for things like datatable loads?  Is there any way to disable it in code?  How do you 'commit' in code? For example, in the code above could I do something every 10000 rows to commit the buffered lines into the datatable?

Thanks once again
 
LVL 2

Expert Comment

by:Sinclair
ID: 17934603
Yes.
 I am not a SQLServer expert, but, in general, all real databases have the concept of a "transaction". All DB operations you do happen inside a transaction; no one else can see the results but you. At the end, you can commit the transaction, persisting the changes to the DB, or rollback the transaction, undoing all changes. You can't persist half the changes in the transaction and discard some others; this keeps the database consistent ( http://www.google.com/search?client=opera&rls=en&q=acid+compliance&sourceid=opera&ie=utf-8&oe=utf-8 )

So, all the DB operations you perform get stored in some temporary buffer, which gets committed to disk when you commit the transaction. Well, if you have gigabytes and gigabytes of data, this buffer will overflow. Thus, the common pattern to use is something like this:

using (OracleConnection conn = new OracleConnection(myLoginCredentials)) {
  conn.Open();
  OracleTransaction trans = conn.BeginTransaction();
  int count = 0;
  foreach (SomeRecord record in dataFile) {
    // insert the record into the DB
    ++count;
    if (count >= BATCH_SIZE) {
      trans.Commit();
      trans = conn.BeginTransaction();
      count = 0;
    }
  }
  trans.Commit();
  conn.Close();
}

So, you commit the transaction every BATCH_SIZE operations to flush the rollback segment; the assumption here is that all the records in your data file are independent, so that the DB won't get stuck in an inconsistent state if you commit only half of them.

The above was Oracle code, but Microsoft has similar code for SQLServer, MySql, and most other databases; unfortunately, I've only used Oracle, so I can't tell you the exact syntax, sorry; if you check the methods on the Connection and Transaction classes, you should find the exact APIs.

Technically, you could commit every single record you insert, and not bother with BATCH_SIZE at all; but, in practice, opening a transaction is a relatively expensive process, so you don't want to do it too often.
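For SQL Server specifically, the same batching pattern would look roughly like this with SqlConnection/SqlTransaction (a sketch only; the table and column names are invented, and the record format is assumed to match the question's file):

```csharp
using System;
using System.Data.SqlClient;

class BatchInsert
{
    const int BatchSize = 10000;

    static void Load(string connectionString, string[] lines)
    {
        using (SqlConnection conn = new SqlConnection(connectionString))
        {
            conn.Open();
            SqlTransaction trans = conn.BeginTransaction();
            int count = 0;
            foreach (string line in lines)
            {
                string[] f = line.Split(',');
                // Commands enlisted in a transaction must be given that transaction.
                SqlCommand cmd = new SqlCommand(
                    "INSERT INTO ratings (movieid, [user], rating) VALUES (@m, @u, @r)",
                    conn, trans);
                cmd.Parameters.AddWithValue("@m", Convert.ToInt16(f[0]));
                cmd.Parameters.AddWithValue("@u", Convert.ToInt32(f[1]));
                cmd.Parameters.AddWithValue("@r", Convert.ToByte(f[2]));
                cmd.ExecuteNonQuery();

                if (++count >= BatchSize)
                {
                    trans.Commit();                  // flush the batch
                    trans = conn.BeginTransaction(); // start the next one
                    count = 0;
                }
            }
            trans.Commit();
        }
    }
}
```

Building one SqlCommand per row keeps the sketch simple; preparing the command once and only swapping parameter values would be noticeably faster for 100 million inserts.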
 
LVL 2

Expert Comment

by:Sinclair
ID: 17934611
I should add that, currently, I'm loading about 3Gb of data into the DB using the above method, so managed code does work for me :-)
 

Author Comment

by:glinkot
ID: 17934631
Thanks Sinclair.  I've managed to get all the records into the DB table, but the C# notion of a 'datatable' is more of a RAM dwelling entity I think, and that's the one that crashes.  I was thinking you were talking about that one, as opposed to an actual DB table.

Thanks everyone for your help, it's been really interesting actually.  Key takeout for me is that I should either use analysis services, use a DB instead, or buy lots of ram (and possibly a 64 bit PC)!

Cheers guys, you are geniuses (genii?)
 
LVL 2

Expert Comment

by:Sinclair
ID: 17934659
I'm a genie ! :-)

Anyway, yeah, you probably want a real DB. I think C# has bindings for MySQL; I'm not sure about PostgreSQL. Oracle and SQLServer are obviously supported, but they're not free.
 
LVL 15

Expert Comment

by:ozymandias
ID: 17934668
If the memory-resident dataset has an equivalent of a rollback segment, the only equivalent I can think of to a commit would be the AcceptChanges() method.
In the case of inserted rows, which is all you are doing (I think), that would simply mean resetting the RowState of those rows.
I don't think it makes any difference in memory-size terms what the RowState is set to for each row.
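In code that would just mean calling AcceptChanges() every so often during the load - a minimal sketch (whether it actually reduces memory is doubtful, as said):

```csharp
using System;
using System.Data;

class AcceptChangesSketch
{
    static void Main()
    {
        DataTable dt = new DataTable();
        dt.Columns.Add("movieid", typeof(Int16));
        dt.Columns.Add("user", typeof(Int32));
        dt.Columns.Add("rating", typeof(byte));

        for (int i = 0; i < 100000; i++)
        {
            dt.Rows.Add(new Object[] { (Int16)1, i, (byte)3 });

            // The closest thing to a "commit": mark pending rows as Unchanged.
            if (i % 10000 == 0)
                dt.AcceptChanges();
        }

        Console.WriteLine(dt.Rows.Count);   // prints 100000
    }
}
```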
 

Author Comment

by:glinkot
ID: 17934698
Yeah I'm using SQL Server express but word is that DB's in general are too slow for what I want to do.  For interest sake the exercise I'm trying is at www.netflixprize.com.  It's a $1M prize for developing a new algorithm to predict people's movie preferences.  The sample of preference data etc is over 100 million records, and if you're going to analyse them within a reasonable period you need them pretty much in RAM I think.  

I stand no chance of winning any cash (there are universities and numerous eggheads already leading the field), but the idea intrigues me and I know logically what I want my ingenious algorithm to do.  However my programming skills are not up to par with idea generating skills!  It's a bit of fun.

There is a perl data manipulation language called PDL that is apparently suitable for such things - but I'm learning C# right now and don't feel like starting from scratch with another one!

If anyone is interested in having a stab at it with me, let me know!  :)
 

Author Comment

by:glinkot
ID: 17934704
Thanks ozy, I might try it for good measure (and buy more ram) :)
 
LVL 2

Expert Comment

by:Sinclair
ID: 17934776
Well, if you were really writing an application for netflix, then you'd most likely be using a database, because this is (again, most likely) how netflix stores their data to begin with. As I said earlier, if performance is an issue, then I'd write my code as a stored procedure.

Keep in mind that, even though netflix has 100e6 rows, they also have millions of dollars. This means that they have a whole server farm powering their clustered database, optimized for rapid access, so it should be reasonably fast.

I don't think that keeping all the records in RAM is a reasonable proposition for your specific task, because you don't have a million dollars, which means that you probably have 2-4 Gb of RAM total. If the size of your data is more than that, you'll end up swapping to disk anyway, which is kinda like using a DB, only a lot slower.
 

Author Comment

by:glinkot
ID: 17934961
Yep, true and true.  They aren't too concerned about whether the system is production quality though, really what they want is the algorithm (in whatever form) and they will rewrite it.  In fact they don't have any restrictions around how long the algorithm takes - months is ok in theory, and they admit even their own takes days right now.  The only performance requirement from my own perspective though is that it takes a reasonable period to find out whether each change to the code has helped!  If it takes 2 weeks to run and you realise you left something out - disaster!  :)
 
LVL 2

Expert Comment

by:Sinclair
ID: 17935072
Well, can you isolate a small portion of the data to work with ? Maybe 100k or so randomly chosen rows ?
 

Author Comment

by:glinkot
ID: 17935162
Yes, I quite possibly can, but the method of choosing the rows is probably going to be more structured - more like the 'tree' structure used in analysing chess moves - because the data is quite sparse in terms of relationships between users and movies (ie if I have seen these 3 movies and want to predict my rating of another one, I need to find samples of others who have seen similar movies and calculate based on that).  If you dilute those relationships by a factor of 10, you don't end up with many people/ratings to calculate with!

It's a great mental puzzle.
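The neighbour-based idea described there, reduced to its simplest toy form: find other users who rated the target movie, weight each one by how many movies they share with you, and take a weighted average. Purely illustrative - this is not the real algorithm, and the similarity measure is deliberately crude:

```csharp
using System;
using System.Collections.Generic;

class NeighbourSketch
{
    // ratings[userId] maps movieid -> rating for that user.
    static double Predict(Dictionary<int, Dictionary<short, byte>> ratings,
                          int user, short movie)
    {
        double weightedSum = 0, weightTotal = 0;

        foreach (KeyValuePair<int, Dictionary<short, byte>> other in ratings)
        {
            if (other.Key == user) continue;

            byte theirRating;
            if (!other.Value.TryGetValue(movie, out theirRating)) continue;

            // Crude similarity weight: how many movies both users have rated.
            int overlap = 0;
            foreach (short m in ratings[user].Keys)
                if (other.Value.ContainsKey(m)) overlap++;
            if (overlap == 0) continue;

            weightedSum += overlap * theirRating;
            weightTotal += overlap;
        }

        // Fall back to a neutral rating when no neighbours exist.
        return weightTotal > 0 ? weightedSum / weightTotal : 3.0;
    }
}
```

Replacing the overlap count with a proper correlation (e.g. Pearson) over the commonly-rated movies is where the sparsity problem described above starts to bite.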
