Best Way to Parse a Text File...

Posted on 2004-11-01
Last Modified: 2010-04-23
Hello everyone,  
  I'm wondering what would be the BEST way to parse a file.  I've come up with 3 ideas.

1) Read the file line by line and parse each line individually (INSTR or regular expressions).
    This just seems like it should be slow.

2) Read in the whole file, and use INSTR or regular expressions to parse for data.
   Faster than #1, but I would think #3 would be the fastest.

3) Read in the whole file, and use regular expressions to find, store, and then remove each match from the file,
    saving a temp file as the routine goes, so the next parse would be even faster since there's not
    as much log to search through.

I'm looking for suggestions, ideas,  and/or code for the fastest and most reliable way to parse a file.

The project that I'm designing would need to parse MULTIPLE kinds of items in the file, save the
parsed data into a database, and return the unused (i.e. unknown) line items.
Question by: pogowolf
    LVL 27

    Expert Comment

    #3 is the best way...
    Do you have a sample of the input data for the regular expression code?

    Author Comment

    Sure do, here you are:

    --Log START
    [10-26-04 18_16_02] Opening chat log...
    [10/26/04 18:16:55] Message of the day: Istarian Auctions have begun!  Join the 'auction' chat channel if you wish to see reports of who has won each plot.  Good luck to all the bidders!
    [10/26/04 18:18:21] Duxx casts Raise Strength II.
    [10/26/04 18:18:22] Duxx casts Enhance Health II.
    [10/26/04 18:18:24] Duxx casts Enhance Armor II.
    [10/26/04 18:18:25] Duxx casts Gift of Speed II.
    [10/26/04 18:18:27] Duxx casts Gift of Health II.
    [10/26/04 18:18:28] Duxx casts Raise Strength II.
    [10/26/04 18:20:49] You begin casting Gift of Speed II on yourself.
    [10/26/04 18:20:52] You cast Gift of Speed II on yourself.
    [10/26/04 18:20:53] You begin casting Swift Feet II on yourself.
    [10/26/04 18:20:57] You cast Swift Feet II on yourself.
    [10/26/04 18:21:32] Ice Golem hits you with Icy Spray I for 44 damage.
    [10/26/04 18:21:34] Ice Golem hits you with Hurl Chunk III for 252 damage.
    [10/26/04 18:21:49] [MarketPlace: Kimbala] so whos looking for this jacquess dude
    [10/26/04 18:22:07] [MarketPlace: Sscortha] anyone check east blight?
    [10/26/04 18:22:20] [MarketPlace: Jerret] where is he
    [10/26/04 18:22:46] [MarketPlace: Kimbala] what?
    [10/26/04 18:22:46] [MarketPlace: PersonalJustic] [<!--LI 9630196 884448>Grey Necrofly Wing<!--/LI>] sweet
    [10/26/04 18:22:49] [MarketPlace: Avispa] I bet he has something to do with abandon island - that island is full of resoiurces
    [10/26/04 18:23:06] [MarketPlace: Maguai] great, about the greys.. where are the browns?!
    [10/26/04 18:23:32] [MarketPlace: Ivy] Ok huge exploit available now... how do they do it?
    [10/26/04 18:23:33] Stinging Cold III has faded.
    [10/26/04 18:23:39] [MarketPlace: Kimbala] what?
    [10/26/04 18:23:41] [MarketPlace: Ivy] How does AE always mess up something simple
    [10/26/04 18:23:46] [MarketPlace: Avispa] yup - heading over there now to scout it again
    [10/26/04 18:23:49] [MarketPlace: T`rekannor] so ne thing new?
    [10/26/04 18:23:57] [MarketPlace: Kimbala] ivy what are you talking about
    [10/26/04 18:23:58] Swift Feet II has faded.
    [10/26/04 18:24:00] You use Sprint.
    --Log END

    As you can see, each kind of line item follows a pattern, and those patterns are what I would like to turn into regular expressions to pull the data out.

    For example in this line from the log:
    [10/26/04 18:23:57] [MarketPlace: Kimbala] ivy what are you talking about

    I would need to pull the date/time stamp, the channel name (MarketPlace),
    the username (Kimbala), and the message (ivy what are you talking about)

    but for this line from the log:
    [10/26/04 18:21:32] Ice Golem hits you with Icy Spray I for 44 damage.

    I would also need to pull the date/time stamp, plus the name of the monster (Ice Golem), the ability (Icy Spray I), and the damage (44).
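
    A minimal sketch of what those two patterns might look like, using named groups so each field can be read back by name. The patterns, the group names (stamp, channel, user, message, monster, ability, damage) and the module name are assumptions written against the sample log above, not tested against a full log:

```vbnet
Imports System
Imports System.Globalization
Imports System.Text.RegularExpressions

Module LogPatternSketch
    ' Hypothetical pattern for "[stamp] [Channel: User] message" lines.
    Private ReadOnly ChatLine As New Regex( _
        "^\[(?<stamp>[^\]]+)\] \[(?<channel>[^:\]]+): (?<user>[^\]]+)\] (?<message>.*)$")

    ' Hypothetical pattern for "[stamp] Monster hits you with Ability for N damage." lines.
    Private ReadOnly DamageLine As New Regex( _
        "^\[(?<stamp>[^\]]+)\] (?<monster>.+?) hits you with (?<ability>.+?) for (?<damage>\d+) damage\.$")

    Sub Main()
        Dim chat As Match = ChatLine.Match( _
            "[10/26/04 18:23:57] [MarketPlace: Kimbala] ivy what are you talking about")
        If chat.Success Then
            ' The stamp format is assumed from the sample: month/day/2-digit year, 24-hour time.
            Dim stamp As DateTime = DateTime.ParseExact(chat.Groups("stamp").Value, _
                "MM/dd/yy HH:mm:ss", CultureInfo.InvariantCulture)
            Console.WriteLine(stamp.ToString("s"))
            Console.WriteLine(chat.Groups("channel").Value)
            Console.WriteLine(chat.Groups("user").Value)
            Console.WriteLine(chat.Groups("message").Value)
        End If

        Dim hit As Match = DamageLine.Match( _
            "[10/26/04 18:21:32] Ice Golem hits you with Icy Spray I for 44 damage.")
        If hit.Success Then
            Dim damage As Integer = Integer.Parse(hit.Groups("damage").Value)
            Console.WriteLine(hit.Groups("monster").Value)
            Console.WriteLine(hit.Groups("ability").Value)
            Console.WriteLine(damage)
        End If
    End Sub
End Module
```

    Both patterns expect one line at a time with the line break already stripped. The first line of the log ([10-26-04 18_16_02]) uses a different stamp format and would need its own handling.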
    LVL 27

    Expert Comment

    Here is a small sample of how you can use the Regular Expressions .....

        ' At the top of the file:
        Imports System.IO
        Imports System.Text
        Imports System.Text.RegularExpressions

        ' Reads the whole file, parses it, and returns the parsed text.
        Public Function StreamReaderReadCharFile(ByVal sFileName As String) As String
            Dim myStreamReader As StreamReader = Nothing
            Dim sData As String = String.Empty

            Try
                ' Create a StreamReader using a Shared (static) File class.
                myStreamReader = File.OpenText(sFileName)

                sData = ParseBlocks(myStreamReader.ReadToEnd())
                TextBox1.Text = sData        '<----- I used TextBox1 as an output to view the result
            Catch exc As Exception
                MsgBox("File could not be opened or read." + vbCrLf + _
                    "Please verify that the filename is correct, " + _
                    "and that you have read permissions for the desired " + _
                    "directory." + vbCrLf + vbCrLf + "Exception: " + exc.Message)
            Finally
                ' Close the object if it has been created.
                If Not myStreamReader Is Nothing Then
                    myStreamReader.Close()
                End If
            End Try

            Return sData
        End Function

        ' Returns every bracketed block found in the input, numbered, one per line.
        Public Function ParseBlocks(ByVal input As String) As String
            'Regular Expression:  \[[\w\s:/-]+\]
            ' "/" and "-" are included so that stamps such as [10/26/04 18:23:57] match.
            Dim pattern As String = "\[[\w\s:/-]+\]"
            Dim rx As Regex = New Regex(pattern, RegexOptions.Multiline)
            Dim sData As New StringBuilder()
            Dim rowCount As Integer = 1

            For Each m As Match In rx.Matches(input)
                sData.Append(rowCount.ToString()).Append(": ").Append(m.Value).Append(vbCrLf)
                rowCount += 1
            Next
            Return sData.ToString()
        End Function

    Author Comment

    Well, that would be part of the issue.  I could modify the ParseBlocks function to accept the pattern as input.. hmm.. How could I return the Matches object so that the main code could work with the data?

    Also, this example doesn't take into account the need to delete a matched line...  Though I guess it might work to
    stream out the non-matched lines, and then reload the file for the next 'round' of searches...
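
    One possible shape for this, as a sketch only: Regex.Matches already returns a MatchCollection that the calling code can loop over, and Regex.Replace with an empty replacement returns the text with the matched lines cut out, so no temp file is strictly needed. The function names FindMatches and RemoveMatches, the sample text and the damage pattern are all assumptions for illustration:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module MatchAndRemoveSketch
    ' Returns all matches so the caller can read the groups itself.
    Function FindMatches(ByVal input As String, ByVal pattern As String) As MatchCollection
        Dim rx As New Regex(pattern, RegexOptions.Multiline)
        Return rx.Matches(input)
    End Function

    ' Returns the input with every match cut out, ready for the next pattern.
    Function RemoveMatches(ByVal input As String, ByVal pattern As String) As String
        Dim rx As New Regex(pattern, RegexOptions.Multiline)
        Return rx.Replace(input, String.Empty)
    End Function

    Sub Main()
        Dim logText As String = _
            "[10/26/04 18:23:58] Swift Feet II has faded." & vbCrLf & _
            "[10/26/04 18:21:32] Ice Golem hits you with Icy Spray I for 44 damage." & vbCrLf & _
            "[10/26/04 18:24:00] You use Sprint." & vbCrLf

        ' Hypothetical pattern. The trailing \r?(\n|\z) consumes the line break
        ' as well, so removing a match does not leave an empty line behind.
        Dim damagePattern As String = _
            "^\[(?<stamp>[^\]]+)\] (?<monster>.+?) hits you with " & _
            "(?<ability>.+?) for (?<damage>\d+) damage\.\r?(\n|\z)"

        For Each m As Match In FindMatches(logText, damagePattern)
            Console.WriteLine(m.Groups("monster").Value & " / " & _
                m.Groups("ability").Value & " / " & m.Groups("damage").Value)
        Next

        ' Whatever is left after every known pattern has been removed
        ' is the list of unknown line items.
        Dim remaining As String = RemoveMatches(logText, damagePattern)
        Console.Write(remaining)
    End Sub
End Module
```

    One thing to watch with whole-file matching: with RegexOptions.Multiline, $ matches just before the line feed, so in a file with CR/LF line endings the carriage return still sits in front of it. That is why the sketch matches \r? explicitly instead of ending the pattern with $.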

    LVL 12

    Expert Comment

    [Not an answer so much as a few extra tips.]

    If you have huge files, it's quite possible that #1 (or something like it) would be faster.  Loading huge files into memory, especially if you have multiple copies of it (the Temp file) can take a lot of time depending on the amount of resources on your system.  If allocating the memory causes other memory to be swapped out to disk, you'll be hurtin'.

    It wouldn't be too hard to write several implementations, and actually test to see which is faster for your specific cases.
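
    A rough sketch of such a comparison, assuming .NET 2.0 or later for the Stopwatch class and a log file path passed as the first command-line argument. The stamp pattern and the two Count functions are placeholders standing in for the real parsing work:

```vbnet
Imports System
Imports System.Diagnostics
Imports System.IO
Imports System.Text.RegularExpressions

Module TimingSketch
    ' Placeholder pattern: counts lines that start with a bracketed stamp.
    Private ReadOnly StampPattern As New Regex("^\[[^\]]+\]", RegexOptions.Multiline)

    ' Idea #1: read and test one line at a time.
    Function CountLineByLine(ByVal fileName As String) As Integer
        Dim count As Integer = 0
        Dim reader As StreamReader = File.OpenText(fileName)
        Try
            Dim logLine As String = reader.ReadLine()
            Do While Not logLine Is Nothing
                If StampPattern.IsMatch(logLine) Then count += 1
                logLine = reader.ReadLine()
            Loop
        Finally
            reader.Close()
        End Try
        Return count
    End Function

    ' Idea #2: read the whole file, then run the pattern once over all of it.
    Function CountWholeFile(ByVal fileName As String) As Integer
        Dim reader As StreamReader = File.OpenText(fileName)
        Try
            Return StampPattern.Matches(reader.ReadToEnd()).Count
        Finally
            reader.Close()
        End Try
    End Function

    Sub Main(ByVal args As String())
        Dim watch As Stopwatch = Stopwatch.StartNew()
        Dim byLine As Integer = CountLineByLine(args(0))
        Console.WriteLine("line by line: " & byLine & " matches, " & _
            watch.ElapsedMilliseconds & " ms")

        watch = Stopwatch.StartNew()
        Dim whole As Integer = CountWholeFile(args(0))
        Console.WriteLine("whole file:   " & whole & " matches, " & _
            watch.ElapsedMilliseconds & " ms")
    End Sub
End Module
```

    The first run also pays for reading the file from disk before it is cached, so it is worth running each variant several times and on realistic file sizes before drawing conclusions.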

    P.S. For more speed, look into compiling your Regex patterns, but only if you can reuse a single pattern hundreds of times.  A _very_ brief summary is here: (Search that page for "compile".)
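
    In .NET, compiling a pattern is normally done by passing RegexOptions.Compiled when the Regex is created. A small sketch, assuming a hypothetical "has faded" pattern and a CountFaded helper that are not part of the original code:

```vbnet
Imports System.Text.RegularExpressions

Module CompiledPatternSketch
    ' Built once and reused for every line. RegexOptions.Compiled costs more
    ' at construction time in exchange for faster matching afterwards.
    Private ReadOnly FadedLine As New Regex( _
        "^\[(?<stamp>[^\]]+)\] (?<effect>.+) has faded\.$", RegexOptions.Compiled)

    ' Counts the lines that report an effect fading.
    Function CountFaded(ByVal logLines As String()) As Integer
        Dim count As Integer = 0
        For Each logLine As String In logLines
            If FadedLine.IsMatch(logLine) Then count += 1
        Next
        Return count
    End Function
End Module
```

    The important part is that the Regex object lives outside the loop. Creating a compiled Regex inside the loop would pay the compilation cost on every pass and end up slower than the uncompiled version.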
    LVL 12

    Accepted Solution

    What I meant by "#1 (or something like it)" is that it would be much better to read large blocks (256K, maybe even 1M) at a time, rather than one line at a time.  This increases the complexity, though, because you need to handle the lines that span block boundaries.
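
    A sketch of one way to handle the boundary problem: keep whatever follows the last line break in each block and put it in front of the next block. ReadInBlocks and the LineHandler delegate are hypothetical names, and the block size is left to the caller:

```vbnet
Imports System
Imports System.IO

Module BlockReaderSketch
    ' Hypothetical callback type: invoked once per complete line.
    Delegate Sub LineHandler(ByVal logLine As String)

    Sub ReadInBlocks(ByVal reader As TextReader, ByVal blockSize As Integer, _
                     ByVal handler As LineHandler)
        Dim buffer(blockSize - 1) As Char
        Dim carry As String = String.Empty   ' partial line left over from the previous block
        Dim charsRead As Integer = reader.Read(buffer, 0, buffer.Length)

        Do While charsRead > 0
            Dim chunk As String = carry & New String(buffer, 0, charsRead)
            Dim lastBreak As Integer = chunk.LastIndexOf(ControlChars.Lf)

            If lastBreak < 0 Then
                ' No complete line yet; keep everything for the next block.
                carry = chunk
            Else
                ' Hand over every complete line, minus a trailing carriage return.
                For Each logLine As String In chunk.Substring(0, lastBreak).Split(ControlChars.Lf)
                    handler(logLine.TrimEnd(ControlChars.Cr))
                Next
                ' The text after the last line break belongs to the next block.
                carry = chunk.Substring(lastBreak + 1)
            End If

            charsRead = reader.Read(buffer, 0, buffer.Length)
        Loop

        ' A final line without a trailing line break is still a line.
        If carry.Length > 0 Then handler(carry.TrimEnd(ControlChars.Cr))
    End Sub
End Module
```

    It could be called with something like ReadInBlocks(File.OpenText(sFileName), 262144, AddressOf ParseLine), where ParseLine is whatever routine applies the patterns to a single line.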

    Author Comment

    Good point, FarSight.
      I'll take a look into that.  I am worried about the amount of memory it's going to take, even if the average size of the log is only about 150k.

    Thanks for the link, I'll take a look at the site!
