Solved

StreamReader slow on large files

Posted on 2004-04-20
22
2,985 Views
Last Modified: 2012-08-13
I'm using a StreamReader to parse through RADIUS logs and extract info for billing.  The log's stop events come in various forms and I've managed to work out the logic to grab what I need, no problem.  When I run it on a small file (1 MB) it runs very fast, processing over 25 events per second.  However, when I run it on a month's worth of data (30 MB), it is extremely slow, processing 3 events per second.  This leads me to believe it is not a bottleneck in my logic to process the files, but rather a problem with StreamReader on large files.

I don't think it can be threaded because I need to move line by line, grabbing what I need.  Would it be better to use FileStream, which I've researched and which supposedly runs slower than StreamReader?  Or is there any way to consistently break up a large file into smaller pieces for processing?  Or am I completely wrong and there is still a better way to do this?

Thanks.

(Coming soon, www.ITBrainPower.com, earn computer gear for answering questions)
Question by:coopey247
22 Comments
 
LVL 6

Expert Comment

by:DaniPro
ID: 10868793
It is very difficult to help you without more details, but if you have enough memory you can try the BufferedStream class to reduce the number of accesses to the OS.

ex.:
string fileName = @"c:\miofile.txt";
System.IO.FileInfo fi = new System.IO.FileInfo(fileName);
System.IO.FileStream fs = new System.IO.FileStream(fileName, System.IO.FileMode.Open);
System.IO.BufferedStream bs = new System.IO.BufferedStream(fs);

// Read the whole file into memory in one pass, then decode it as ASCII.
byte[] bytes = new byte[fi.Length];
bs.Read(bytes, 0, (int)fi.Length);
string s1 = System.Text.Encoding.ASCII.GetString(bytes);
bs.Close();
 
LVL 22

Expert Comment

by:_TAD_
ID: 10868824


OK... First, a StreamReader reads a stream, and a FileStream is a stream.  Typically you use a StreamReader to read from a FileStream, a MemoryStream, or any other Stream.

Reading line by line is always going to be very slow.  I would suggest reading 5K to 20K at a time into a byte array (a buffer) and then parsing the buffer to find where each line ends, along the lines of the sketch below.
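
A minimal sketch of that buffered approach, assuming ASCII log content; the file path and the ProcessLine handler are placeholders, not part of the original post:

using System;
using System.IO;
using System.Text;

class ChunkedLogReader
{
    static void Main()
    {
        string fileName = @"C:\AVOC\032904detail";   // hypothetical path

        using (FileStream fs = new FileStream(fileName, FileMode.Open, FileAccess.Read))
        {
            byte[] buffer = new byte[16 * 1024];        // 16K chunk, per the 5K-20K suggestion
            StringBuilder carry = new StringBuilder();  // partial line left over from the previous chunk
            int read;

            while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
            {
                // ASCII is assumed, so a character never straddles two chunks.
                carry.Append(Encoding.ASCII.GetString(buffer, 0, read));
                string text = carry.ToString();

                // Hand off every complete line; keep the trailing fragment for the next read.
                int start = 0, nl;
                while ((nl = text.IndexOf('\n', start)) != -1)
                {
                    ProcessLine(text.Substring(start, nl - start).TrimEnd('\r'));
                    start = nl + 1;
                }
                carry.Length = 0;
                carry.Append(text.Substring(start));
            }

            if (carry.Length > 0)
                ProcessLine(carry.ToString());          // last line without a trailing newline
        }
    }

    static void ProcessLine(string line)
    {
        // The RADIUS event parsing would go here.
    }
}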


With performance issues like this it would be very helpful if you could post the code in question; we may find something wrong that is completely unrelated to what you are focusing on.


 
LVL 2

Author Comment

by:coopey247
ID: 10869087
The handling of the StreamReader is very simple.


string fileName = "C:\\AVOC\\032904detail";
StreamReader sr = File.OpenText(fileName);
string line;

while ((line = sr.ReadLine()) != null)
{
    // This determines where a unique event starts.
    if (line.StartsWith("Sun") || line.StartsWith("Mon") || line.StartsWith(...))
    {
        // A bunch of processing, most of which looks like this: basically moving
        // line by line looking for the value I want.  I wanted to use regex for
        // this, but the stop events all differ in the number of lines and the
        // values of those lines.
        string statusLine = line;
        while (statusLine.IndexOf("Acct-Status-Type") == -1)
        {
            statusLine = sr.ReadLine();
        }

        // I then get all of the values and pass them to an SQL stored procedure.
        sqlDataAdapter1.InsertCommand.Parameters.Add("@UserID", SqlDbType.VarChar, 50).Value = user;
        try
        {
            this.sqlDataAdapter1.InsertCommand.ExecuteNonQuery();
        }
        catch (Exception er)
        {
        }
    }  // Close the if
}  // Close the while
sr.Close();


Here is an example of an event in the file I'm reading

Wed Jan 28 09:19:26 2004
      User-Name = "userid"
      NAS-Identifier = 1.1.1.1
      NAS-Port = 10205
      NAS-Port-Type = Sync
      Acct-Status-Type = Stop
      Acct-Delay-Time = 0
      Acct-Session-Id = "344998566"
      Acct-Authentic = RADIUS
      Acct-Session-Time = 244
      Acct-Input-Octets = 3988
      Acct-Output-Octets = 27952
      Acct-Input-Packets = 197
      Acct-Output-Packets = 211
      Ascend-Disconnect-Cause = 45
      Ascend-Connect-Progress = 60
      Ascend-Xmit-Rate = 64000
      Ascend-Data-Rate = 64000
      Ascend-PreSession-Time = 8
      Ascend-Pre-Input-Octets = 252
      Ascend-Pre-Output-Octets = 227
      Ascend-Pre-Input-Packets = 11
      Ascend-Pre-Output-Packets = 11
      Ascend-First-Dest = 1.1.1.1
      Ascend-Multilink-ID = 572
      Ascend-Num-In-Multilink = 0
      Caller-Id = "5555555555"
      Client-Port-DNIS = "68414"
      Framed-Protocol = MPP
      Framed-Address = 1.1.1.1

Hope this helps
 
LVL 5

Expert Comment

by:AdrianJMartin
ID: 10869511
It's a bit tricky to tell, but... you have this line inside a loop:

 sqlDataAdapter1.InsertCommand.Parameters.Add("@UserID", SqlDbType.VarChar, 50).Value = user;

where you are adding a new parameter every time you find a user, instead of just setting the value of the existing parameter. You may have clipped the code that sorts all this...
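
A minimal sketch of that fix, reusing the names from the posted snippet:

// Declare the parameter once, before the read loop:
SqlParameter userParam =
    sqlDataAdapter1.InsertCommand.Parameters.Add("@UserID", SqlDbType.VarChar, 50);

// ...then inside the loop, only assign its value for each record:
userParam.Value = user;
sqlDataAdapter1.InsertCommand.ExecuteNonQuery();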
 
LVL 2

Author Comment

by:coopey247
ID: 10871533
Okay, well I moved the definition of the parameters outside of the loop and now just set the value inside the loop.  The problem definitely lies within the SQL stored procedure.  When I take the SQL call out it runs great.  But with the SQL insert command it runs fast at first, then when it gets to about 2,000 records it just crashes.

I tried threading the SQL insert command and it runs well for a little while, but then starts going crazy with all kinds of errors, again around 2,000 records.  I'm 99% sure that I'm not taxing the server.  I'm not sure where to go from here...
 
LVL 10

Expert Comment

by:ptmcomp
ID: 10873032
You should use an SqlConnection from only one thread. You may also get a problem when you open a transaction and insert many records without committing them, since the rollback segment will overflow.
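
If you do wrap the inserts in a transaction, a hedged sketch of committing in batches so nothing accumulates unbounded (the connection, command, and batch size of 500 are all assumptions):

SqlTransaction tx = connection.BeginTransaction();
insertCommand.Transaction = tx;
int pending = 0;

// ... for each parsed event, after setting the parameter values ...
insertCommand.ExecuteNonQuery();
if (++pending >= 500)              // commit every 500 rows (arbitrary batch size)
{
    tx.Commit();
    tx = connection.BeginTransaction();
    insertCommand.Transaction = tx;
    pending = 0;
}

// ... after the loop, commit whatever remains ...
tx.Commit();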
 
LVL 22

Expert Comment

by:_TAD_
ID: 10873849



Also, one thing to consider is that things like cursors are very slow.  If at all possible, you may want to try to batch your processes together: perhaps load a DataSet in memory and then transfer the whole DataSet over to the SQL server.
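
A rough sketch of that batching idea, with the column list abbreviated and the adapter's parameter-to-column bindings assumed:

// Accumulate parsed events in a DataTable instead of inserting one at a time.
DataTable details = new DataTable("tbl_AVOC_Details");
details.Columns.Add("UserID", typeof(string));
details.Columns.Add("SessionID", typeof(long));
// ... remaining columns ...

// For each parsed event (user and sessionId come from the log parsing):
DataRow row = details.NewRow();
row["UserID"] = user;
row["SessionID"] = sessionId;
details.Rows.Add(row);

// After parsing, one Update call pushes every new row through the InsertCommand.
// Each parameter must be bound to a source column for this to work.
sqlDataAdapter1.Update(details);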
 
LVL 20

Expert Comment

by:TheAvenger
ID: 10875773
For your original question: this code will make it faster and you can still keep all your functionality. The difference for large files is enormous:

StreamReader sr = new StreamReader(new BufferedStream(new FileStream(fileName, FileMode.Open)));

For the SQL Server problem: can you post the stored procedure code so that we can review it?
 
LVL 2

Author Comment

by:coopey247
ID: 10877132
Stored Procedure:
I had to do the IF NOT EXISTS because the detail files get merged from 4 servers and there are duplicate records.


CREATE PROCEDURE [ASICUSER].[AVOC_UpdateDetails]
(@UserID varchar(50), @Weekday varchar(50), @Month varchar(50),
@MonthNum int, @Day int, @Year int, @Time varchar(50),
@Type varchar(50), @SessionTime varchar(50),
@SessionID bigint, @NASPort bigint, @CallerID bigint) AS

IF NOT EXISTS (select UserID, Weekday, _Month, MonthNum,  _Day,
_Year, _Time, Type, SessionTime, SessionID, NASPort, CallerID
from asic.asicuser.tbl_AVOC_Details
where Weekday = @Weekday and
_Month = @Month and MonthNum = @MonthNum and _Day = @Day and
_Year = @Year and _Time = @Time and
Type = @Type and SessionTime = @SessionTime and
SessionID = @SessionID and NASPort = @NASPort
and CallerID = @CallerID)

Insert Into asic.asicuser.tbl_AVOC_Details
(UserID, Weekday, _Month, MonthNum,  _Day, _Year, _Time, Type, SessionTime,
SessionID, NASPort, CallerID)
Values (@UserID, @Weekday, @Month, @MonthNum, @Day, @Year, @Time, @Type,
@SessionTime, @SessionID, @NASPort, @CallerID)
GO
 
LVL 20

Expert Comment

by:TheAvenger
ID: 10877140
OK, this seems fine. Do you use transactions?
 
LVL 2

Author Comment

by:coopey247
ID: 10877195
Nope, that's it.  I think populating the dataset and then merging like TAD said might be the way to go. Any arguments against that, Avenger?
 
LVL 20

Assisted Solution

by:TheAvenger
TheAvenger earned 100 total points
ID: 10877211
You can give it a try. This will make your life easier (it's simpler for development), but actually the dataset does the same thing: it calls an insert for every new record.
 
LVL 5

Expert Comment

by:AdrianJMartin
ID: 10877261
Can you define the IF NOT EXISTS with fewer checks, i.e. just check SessionID and maybe the time (preferably indexed columns)? Or drop the IF NOT EXISTS, just insert the info, and let a unique index handle the duplication?
 
LVL 10

Expert Comment

by:ptmcomp
ID: 10877846
Merging datasets is quite slow and gets extremely slow with lots of data.
 
LVL 20

Expert Comment

by:TheAvenger
ID: 10877927
I think coopey247 meant updating the database by "merge".
 
LVL 2

Author Comment

by:coopey247
ID: 10878655
I did mean update; however, populating a dataset still slows down the process significantly.  I think my next (and maybe last) option may be to write an XML file on the fly and then dump that file up to the SQL box when it finishes.

Thanks for all the help folks, very much appreciated.
 
LVL 5

Accepted Solution

by:AdrianJMartin
AdrianJMartin earned 300 total points
ID: 10879006
How many inserts are you trying to do? I'm guessing from your data example around 30k-40k inserts, which will take a while...

The IF NOT EXISTS will cause a table scan every time the SP is called, and each call takes a little longer as the table grows. It all adds up... which is the behaviour you are seeing.

If the table is set up with a unique index (on a single column or on multiple columns) then the duplicate insert will fail; ignore the error and carry on.
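
A hedged client-side sketch of "ignore it and carry on", assuming SQL Server's duplicate-key error numbers (2601 for a duplicate row in a unique index, 2627 for a unique constraint violation):

try
{
    sqlDataAdapter1.InsertCommand.ExecuteNonQuery();
}
catch (SqlException ex)
{
    // Swallow only duplicate-key violations; let anything else surface.
    if (ex.Number != 2601 && ex.Number != 2627)
        throw;
}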
 
LVL 10

Expert Comment

by:ptmcomp
ID: 10880117
Why are you checking all the fields in IF NOT EXISTS? If you have a key you only need to check the key.
 
LVL 2

Author Comment

by:coopey247
ID: 10881439
Yup, it's around 40K inserts a month.  The reason I have to do an IF NOT EXISTS is because these files are pulled from several RADIUS servers, some of which I believe are mirrored, because the merged file I get has at least two identical entries for every event (i.e. Start and Stop events).  The key is an identity column that I have in there for later processing.

Thanks.
 
LVL 10

Expert Comment

by:ptmcomp
ID: 10881569
Maybe you could speed it up by passing the file to the stored procedure and doing all the processing in the stored procedure. However, it would be hard to debug and monitor.
 
LVL 2

Author Comment

by:coopey247
ID: 10881804
Keen eye on the stored procedure. I moved the table structure around a little, did away with the identity column, and made a primary key out of two columns.  The IF NOT EXISTS still works, but runs much faster since I'm only checking two columns instead of 10.  And it's now fast enough to be acceptable.
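
For reference, a hypothetical sketch of that schema change; the thread never names the two key columns, so SessionID and _Time are pure assumptions:

using System.Data.SqlClient;

class SchemaChangeSketch
{
    // Applies the composite primary key described above.
    // The real key columns were not named; SessionID and _Time are guesses.
    static void Apply(SqlConnection conn)
    {
        string ddl = "ALTER TABLE asic.asicuser.tbl_AVOC_Details " +
                     "ADD CONSTRAINT PK_AVOC_Details PRIMARY KEY (SessionID, _Time)";
        SqlCommand cmd = new SqlCommand(ddl, conn);
        cmd.ExecuteNonQuery();
    }
}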

You guys (maybe gals) are great, thanks for the help.

Now I've got to figure out how to divvy up the points...
 
LVL 10

Expert Comment

by:ptmcomp
ID: 10882475
Now the lookup is made via an index (similar to a Hashtable) instead of a full table scan.
