coopey247 asked:

StreamReader slow on large files

I'm using a StreamReader to parse through RADIUS logs and extract info for billing. The logs' stop events come in various forms and I've managed to work out the logic to grab what I need, no problem. When I run it on a small file (1 MB) it runs very fast, processing over 25 events per second. However, when I run it on a month's worth of data (30 MB), it is extremely slow, processing 3 events per second. This leads me to believe the bottleneck is not in my logic to process the files, but rather a problem with StreamReader on large files.

I don't think it can be threaded because I need to move line by line, grabbing what I need. Would it be better to use FileStream, which from my research supposedly runs slower than StreamReader? Or is there a way to consistently break a large file into smaller pieces for processing? Or am I completely wrong and there is a better way to do this?

Thanks.

(Coming soon, www.ITBrainPower.com, earn computer gear for answering questions)
DaniPro

It's difficult to help you without more details, but if you have enough memory you can try the BufferedStream class to reduce the number of accesses to the OS.

Example:

string fileName = @"c:\miofile.txt";
System.IO.FileInfo fi = new System.IO.FileInfo(fileName);
System.IO.FileStream fs = new System.IO.FileStream(fileName, System.IO.FileMode.Open);
System.IO.BufferedStream bs = new System.IO.BufferedStream(fs);

// Read the whole file into memory in one buffered pass
byte[] bytes = new byte[fi.Length];
bs.Read(bytes, 0, (int)fi.Length);
string s1 = System.Text.Encoding.ASCII.GetString(bytes);
bs.Close();
_TAD_

OK... first, StreamReader reads a stream. FileStream is a stream. Typically you use a StreamReader to read from a FileStream, a MemoryStream, or any other stream.

Reading line by line is always going to be very slow. I would suggest reading 5K to 20K into a byte array (a buffer) and then parsing the buffer to find where your lines end; a rough sketch is below.
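
Something along these lines (the path and the 16K buffer size are just placeholders; the parse comment marks where your existing logic would go):

using System;
using System.IO;
using System.Text;

class BufferedParse
{
    static void Main()
    {
        FileStream fs = new FileStream(@"C:\AVOC\032904detail", FileMode.Open);
        byte[] buffer = new byte[16384];
        string carry = "";       // partial line left over from the previous chunk
        int bytesRead;

        while ((bytesRead = fs.Read(buffer, 0, buffer.Length)) > 0)
        {
            string chunk = carry + Encoding.ASCII.GetString(buffer, 0, bytesRead);
            int pos;
            while ((pos = chunk.IndexOf('\n')) != -1)
            {
                string line = chunk.Substring(0, pos).TrimEnd('\r');
                // ...parse the line here, same logic as before...
                chunk = chunk.Substring(pos + 1);
            }
            carry = chunk;       // incomplete last line waits for the next chunk
        }
        fs.Close();
    }
}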


With performance issues like this it would be very helpful if you could post the code in question; we may find something wrong that is completely unrelated to what you are focusing on.


coopey247 (Asker)

The handling of the StreamReader is very simple.


string fileName = "C:\\AVOC\\032904detail";
StreamReader sr =  File.OpenText(fileName);

while ((line=sr.ReadLine())!=null)
    {
     //this determines where a unique event starts
     if ((line.StartsWith("Sun")) || line.StartsWith("Mon")  || (line.StatsWith...... )
       {
        //A bunch of processing, most of which looks like this, basically moving line by line
       //looking for the value I want.  I wanted to use regex for this, but the stop events
       //all differ in the # of lines and values of these lines

        while (statusLine.IndexOf("Acct-Status-Type")==-1)
               {                                                statusLine = sr.ReadLine();
      }

       //I then get all of the values and pass them to an SQL stored procedure
       sqlDataAdapter1.InsertCommand.Parameters.Add("@UserID", SqlDbType.VarChar, 50).Value = user;
      try
       {
       this.sqlDataAdapter1.InsertCommand.ExecuteNonQuery();
       }
     catch (Exception er)
   }  //Close the If
  }  //Close the while
sr.close;


Here is an example of an event in the file I'm reading:

Wed Jan 28 09:19:26 2004
      User-Name = "userid"
      NAS-Identifier = 1.1.1.1
      NAS-Port = 10205
      NAS-Port-Type = Sync
      Acct-Status-Type = Stop
      Acct-Delay-Time = 0
      Acct-Session-Id = "344998566"
      Acct-Authentic = RADIUS
      Acct-Session-Time = 244
      Acct-Input-Octets = 3988
      Acct-Output-Octets = 27952
      Acct-Input-Packets = 197
      Acct-Output-Packets = 211
      Ascend-Disconnect-Cause = 45
      Ascend-Connect-Progress = 60
      Ascend-Xmit-Rate = 64000
      Ascend-Data-Rate = 64000
      Ascend-PreSession-Time = 8
      Ascend-Pre-Input-Octets = 252
      Ascend-Pre-Output-Octets = 227
      Ascend-Pre-Input-Packets = 11
      Ascend-Pre-Output-Packets = 11
      Ascend-First-Dest = 1.1.1.1
      Ascend-Multilink-ID = 572
      Ascend-Num-In-Multilink = 0
      Caller-Id = "5555555555"
      Client-Port-DNIS = "68414"
      Framed-Protocol = MPP
      Framed-Address = 1.1.1.1

Hope this helps
It's a bit tricky to tell, but you have this line inside a loop:

 sqlDataAdapter1.InsertCommand.Parameters.Add("@UserID", SqlDbType.VarChar, 50).Value = user;

where you are adding a new parameter every time you find a user, instead of just setting the value of the existing parameter? You may have clipped the code that sorts all this out. See the sketch below.
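
Something like this, as a sketch (names taken from your snippet):

// Create the parameter once, before the read loop...
SqlParameter userParam =
    sqlDataAdapter1.InsertCommand.Parameters.Add("@UserID", SqlDbType.VarChar, 50);

// ...then inside the loop just set its value for each event:
userParam.Value = user;
sqlDataAdapter1.InsertCommand.ExecuteNonQuery();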
Okay, well I moved the definition of the parameters outside of the loop and now just set the value inside the loop. The problem definitely lies within the SQL stored procedure. When I take the SQL call out it runs great. But with the SQL insert command it runs fast at first, then when it gets to about 2000 records it just crashes.

I tried threading the SQL insert command and it runs well for a little while, but then starts throwing all kinds of errors, again around 2000 records. I'm 99% sure that I'm not taxing the server. I'm not sure where to go from here.
You should use an SqlConnection from only one thread. You may also get a problem when you open a transaction and insert many records without committing them, since the rollback segment will overflow.
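
One way to avoid that is to commit in batches; a sketch (the connection/command names and the batch size of 500 are placeholders):

SqlTransaction tx = connection.BeginTransaction();
insertCommand.Transaction = tx;
int pending = 0;

// ...inside the event loop, after the parameter values are set:
insertCommand.ExecuteNonQuery();
if (++pending == 500)
{
    tx.Commit();                          // flush this batch
    tx = connection.BeginTransaction();   // start the next one
    insertCommand.Transaction = tx;
    pending = 0;
}

// ...after the loop, commit whatever is left:
tx.Commit();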



Also, one thing to consider is that constructs like cursors are very slow. If at all possible you may want to batch your processes together: perhaps load a dataset in memory and then transfer the whole dataset over to the SQL server, as in the sketch below.
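
Roughly like this (sketch only; the column list is abbreviated and it assumes the adapter's InsertCommand is already configured):

// Buffer parsed events in a DataTable and send them in one Update() call
// instead of one round-trip per event.
DataTable events = new DataTable("tbl_AVOC_Details");
events.Columns.Add("UserID", typeof(string));
events.Columns.Add("SessionID", typeof(long));
// ...add the remaining columns...

// inside the parsing loop:
DataRow row = events.NewRow();
row["UserID"] = user;
row["SessionID"] = sessionId;
events.Rows.Add(row);

// after the whole file has been parsed:
sqlDataAdapter1.Update(events);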
For your original question: this code will make it faster and you can still keep all your functionality. The difference for large files is enormous:

StreamReader sr =  new StreamReader (new BufferedStream (new FileStream (fileName, FileMode.Open)));

For the SQL Server problem: can you post the stored procedure code so that we can review it?
Stored Procedure:
I had to do the IF NOT EXISTS because the detail files get merged from 4 servers and there are duplicate records.


CREATE PROCEDURE [ASICUSER].[AVOC_UpdateDetails]
(@UserID varchar(50), @Weekday varchar(50), @Month varchar(50),
@MonthNum int, @Day int, @Year int, @Time varchar(50),
@Type varchar(50), @SessionTime varchar(50),
@SessionID bigint, @NASPort bigint, @CallerID bigint) AS

IF NOT EXISTS (select UserID, Weekday, _Month, MonthNum,  _Day,
_Year, _Time, Type, SessionTime, SessionID, NASPort, CallerID
from asic.asicuser.tbl_AVOC_Details
where Weekday = @Weekday and
_Month = @Month and MonthNum = @MonthNum and _Day = @Day and
_Year = @Year and _Time = @Time and
Type = @Type and SessionTime = @SessionTime and
SessionID = @SessionID and NASPort = @NASPort
and CallerID = @CallerID)

Insert Into asic.asicuser.tbl_AVOC_Details
(UserID, Weekday, _Month, MonthNum,  _Day, _Year, _Time, Type, SessionTime,
SessionID, NASPort, CallerID)
Values (@UserID, @Weekday, @Month, @MonthNum, @Day, @Year, @Time, @Type,
@SessionTime, @SessionID, @NASPort, @CallerID)
GO
OK, this seems fine. Do you use transactions?
Nope, that's it. I think populating the dataset and then merging like TAD said might be the way to go. Any arguments against that, Avenger?
SOLUTION by TheAvenger
[solution text available to members only]
Can you define the IF NOT EXISTS with fewer checks, i.e. just check SessionID and maybe time (preferably indexed columns)? Or drop the IF NOT EXISTS, just insert the info, and let a unique index handle the duplication, as in the sketch below.
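
For example (just a sketch; the column pair is an assumption, use whatever really identifies an event):

-- Let a unique index reject duplicates instead of checking first
CREATE UNIQUE INDEX IX_AVOC_Details_Event
ON asic.asicuser.tbl_AVOC_Details (SessionID, _Time)

The proc then shrinks to a plain INSERT; a duplicate raises a duplicate-key error (2601), which the C# code can catch and ignore.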
Merging datasets is quite slow and gets extremely slow with lots of data.
I think coopey247 meant updating the database by "merge".
I did mean update; however, populating a dataset still slows down the process significantly. I think my next (and maybe last) option may be to write an XML file on the fly and then dump that file up to the SQL box when it finishes.

Thanks for all the help folks, very much appreciated.
ASKER CERTIFIED SOLUTION
[solution text available to members only]
Why are you checking all the fields in IF NOT EXISTS? If you have a key you only need to check the key.
Yup, it's around 40K inserts a month. The reason I have to do an IF NOT EXISTS is that these files are pulled from several RADIUS servers, some of which I believe are mirrored, because the merged file I get has at least two identical entries for every event (i.e. Start and Stop events). The key is an identity column that I have in there for later processing.

Thanks.
Maybe you could speed it up by passing the file to the stored procedure and doing all the processing in the stored procedure, as sketched below. However, it would be hard to debug and monitor.
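
One possible shape for that (sketch only; the staging table and file path are placeholders): bulk-load the raw log lines, then parse them server-side.

-- Staging table holds one raw log line per row
CREATE TABLE asic.asicuser.tbl_AVOC_Staging (RawLine varchar(1000))

BULK INSERT asic.asicuser.tbl_AVOC_Staging
FROM 'C:\AVOC\032904detail'
WITH (ROWTERMINATOR = '\n')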
Keen eye on the stored procedure. I moved the table structure around a little, did away with the identity column, and made a primary key out of two columns. The IF NOT EXISTS still works, but runs much faster since I'm only scanning two columns instead of 10. And it's now fast enough to be acceptable.

You guys (maybe gals) are great, thanks for the help.

Now I've got to figure out how to divvy up the points...
Now the lookup is done via an index (similar to a hashtable lookup) instead of a full table scan.
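
For reference, the change coopey247 describes would look something like this (sketch; the two key columns are an assumption):

-- Composite primary key replaces the old identity column
ALTER TABLE asic.asicuser.tbl_AVOC_Details
ADD CONSTRAINT PK_AVOC_Details PRIMARY KEY (SessionID, _Time)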