kdwood

asked on
Best Way to Parse A Huge Text File

Greetings Experts,

I have a huge CSV file that has around 765,000 rows of part numbers with their respective inventory levels.  Unfortunately, my distributor cannot supply another format or even a slimmed-down version of this file.   With that in mind, I am parsing through this huge file to get the current stock levels for 4,000 items.  While it is working, IT IS SLOW!!!

What would be the most efficient approach to parsing this file keeping in mind that I only need the stock levels for 4,000 items?

I am currently doing something like this:


Thank you in advance,

Keith


' Requires Imports Microsoft.VisualBasic.FileIO (TextFieldParser),
' Imports System.IO (Path) and Imports System.Data (DataTable).
Public Function ReadFile(ByVal fileName As String) As DataTable

        ' Initialize the return value.
        Dim table As DataTable = Nothing

        Using parser As New TextFieldParser(fileName)
 
            ' Setup the comma-delimited file parser.
            parser.TextFieldType = FieldType.Delimited
            parser.Delimiters = New String() {","}
            parser.HasFieldsEnclosedInQuotes = True
 
            While Not parser.EndOfData
 
                Try
 
                    ' Read the comma-delimited text as fields into a string array.
 
                    Dim input As String() = parser.ReadFields()
 
                    If table Is Nothing Then
                        table = CreateTable(Path.GetFileName(fileName), input)
                    End If
 
                    AddRow(table, input)
 
                Catch ex As MalformedLineException
 
                    ' Ignore invalid lines.
 
                End Try
 
            End While
 
        End Using
 
        Return table
 
    End Function
 
Private Function CreateTable(ByVal name As String, ByVal input As String()) As DataTable
 
        Dim table As New DataTable(name)
      
            ' Add the table columns
 
            With table
 
                .Columns.Add("ActionDef")
                .Columns.Add("VendorPartNo")
                .Columns.Add("MFGNAME")
                .Columns.Add("Desc1")
                .Columns.Add("Desc2")
                .Columns.Add("RetailPrice")
                .Columns.Add("MFGPartNo")
                .Columns.Add("Weight")
                .Columns.Add("UPC")
                .Columns.Add("LENGTH")
                .Columns.Add("WIDTH")
                .Columns.Add("HEIGHT")
                .Columns.Add("CHGFLAG")
                .Columns.Add("QLTCOST")
                .Columns.Add("SPPRICE")
                .Columns.Add("STOCKYN")
                .Columns.Add("STATUS")
                .Columns.Add("CPUCODE")
                .Columns.Add("PRIMCAT")
                .Columns.Add("SUBCAT")
                .Columns.Add("NA")
                .Columns.Add("NEWSTOCK")
                .Columns.Add("REBATE")
                .Columns.Add("SUBPART")
 
            End With              
 
        Return table

    End Function

    ' AddRow is referenced above but not shown in the original post; this
    ' minimal version assumes each parsed line maps straight onto a row.
    Private Sub AddRow(ByVal table As DataTable, ByVal input As String())
        table.Rows.Add(input)
    End Sub

SOLUTION
gangwisch
ASKER CERTIFIED SOLUTION
Bob Learned (United States)
kdwood

ASKER

Gangwisch,

I will give that a try right now.  I'm guessing I could then set an index on the PartNumber field, which should speed the queries up.

Thanks for the quick reply,

Keith
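
A rough sketch of the indexed lookup Keith describes here, assuming the rows end up in the DataTable built by the ReadFile function from the question; the file path and the part number "ABC-1234" are placeholders:

        Dim table As DataTable = ReadFile("C:\temp\inventory.csv")

        ' A primary key on VendorPartNo lets Rows.Find() do a keyed lookup
        ' instead of scanning all 765,000 rows for each of the 4,000 items.
        table.PrimaryKey = New DataColumn() {table.Columns("VendorPartNo")}

        Dim row As DataRow = table.Rows.Find("ABC-1234")
        If row IsNot Nothing Then
            Console.WriteLine(row("NEWSTOCK"))
        End If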
kdwood

ASKER

Thanks Bob,

I will try your suggestion as well.  


Regards,

Keith
Bob, are you thinking of putting 765,000 rows in a DataTable?  

I didn't use the name FastestCsvReader on purpose ;)

Bob
gotcha...  :)
You would want to read the values into a structure that doesn't slow down as new elements are added, but that you can still query - a sketch of one option is below.

Bob
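
A minimal sketch of that idea (not Bob's actual code): keep the 4,000 wanted part numbers in a HashSet for O(1) membership tests and collect the matches in a Dictionary. The field positions (VendorPartNo at index 1, NEWSTOCK at index 21) are assumed from the column list in the question.

Imports Microsoft.VisualBasic.FileIO

Public Function ReadStockLevels(ByVal fileName As String, _
                                ByVal wantedParts As HashSet(Of String)) As Dictionary(Of String, String)

    Dim stock As New Dictionary(Of String, String)(wantedParts.Count)

    Using parser As New TextFieldParser(fileName)
        parser.TextFieldType = FieldType.Delimited
        parser.Delimiters = New String() {","}
        parser.HasFieldsEnclosedInQuotes = True

        While Not parser.EndOfData
            Try
                Dim fields As String() = parser.ReadFields()
                ' Keep only rows whose part number is one of the wanted 4,000.
                If fields.Length > 21 AndAlso wantedParts.Contains(fields(1)) Then
                    stock(fields(1)) = fields(21)
                    ' Stop early once every wanted part has been found.
                    If stock.Count = wantedParts.Count Then Exit While
                End If
            Catch ex As MalformedLineException
                ' Ignore invalid lines, as the original code does.
            End Try
        End While
    End Using

    Return stock
End Function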
I would try just querying the file:

        Dim path As String = "C:\temp\data.csv"

        ' The Jet text driver exposes each file in the folder as a table;
        ' the "." in the file name must become "#" in the table name.
        Dim SQL As String = "SELECT *" & _
                            " FROM " & IO.Path.GetFileName(path).Replace(".", "#") & _
                            " WHERE Col5 > 0"

        ' The Data Source is the folder that holds the file, not the file itself.
        Dim cnString As String = "Provider=Microsoft.Jet.OLEDB.4.0;" & _
            "Data Source='" & IO.Path.GetDirectoryName(path) & _
            "';Extended Properties=" & Chr(34) & _
            "Text;HDR=Yes;FMT=CSVDelimited" & Chr(34) & ";"

        Dim adapter As New OleDb.OleDbDataAdapter(SQL, cnString)

        Dim table As New DataTable()
        adapter.Fill(table)

        'View the contents of the file in your grid
        Me.DataGridView1.DataSource = table



Here's the contents of the file named "data.csv":

Col1,Col2,Col3,Col4,Col5
1,2,3,4,5
6,7,8,9,0
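
An aside the thread doesn't cover: the Jet text driver guesses each column's type by sampling the first rows, so whether "Col5 > 0" compares numbers or strings depends on that guess. A schema.ini file in the same folder as the CSV pins the types down explicitly; a minimal one for this data.csv might look like:

[data.csv]
Format=CSVDelimited
ColNameHeader=True
Col1=Col1 Long
Col2=Col2 Long
Col3=Col3 Long
Col4=Col4 Long
Col5=Col5 Long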
SOLUTION
Nice work Sancler!  That's cool.

SOLUTION
Yes.  But the 0.17 is for not actually reading anything.  Amended as below, I'm getting ~5.8 seconds.

Roger
        Dim records As Integer = 0
        Dim watch As New Stopwatch()
        watch.Start()
        Dim SQL As String = "SELECT *" & _
                            " FROM " & IO.Path.GetFileName(filename).Replace(".", "#") & _
                            " WHERE Col5 LIKE '10%'"
 
        Dim cnString As String = "Provider=Microsoft.Jet.OLEDB.4.0;" & _
            "Data Source='" & IO.Path.GetDirectoryName(filename) & _
            "';Extended Properties=" & Chr(34) & _
            "Text;HDR=Yes;FMT=CSVDelimited" & Chr(34) & ";"
 
 
        Dim cmd As New OleDb.OleDbCommand(SQL, New OleDb.OleDbConnection(cnString))
 
        cmd.Connection.Open()
        Dim reader As OleDb.OleDbDataReader = cmd.ExecuteReader()
 
        ' Loop through the reader, accessing columns by index; this is
        ' where the rows are actually read across.
        Do While reader.Read
            records += 1
        Loop
 
        reader.Close()
        cmd.Connection.Close()
 
        ' Stop timing before formatting the label text.
        watch.Stop()

        Label4.Text = watch.Elapsed.TotalSeconds.ToString & " for " & records & " records"


PS.  It's my (UK) bed-time.  Any more, I'll pick it up in the morning ;-)

Roger
kdwood

ASKER

Thank you all for the great support.  I was able to combine your solutions to get the best result.  

Best regards,

Keith

Sancler ~ That was EXACTLY the point!  

After 1.48 seconds (best time), you have a DataTable loaded with the records you want...  But you STILL have to iterate through them to process them, and you didn't calculate the time that would take.

However, after 0.1790938 seconds, you have a DataReader loaded with the records you want...
Much, much faster!

Great comments though!

Cheerio

VBRocks - we'll have to agree to differ ;-)

Although "you have a DataReader loaded with the records you want" you don't have the data: that's still in the database.  Before you can do _anything_ with it, you'll still need to read each record over.

To try to take account of your point about having to iterate through in order to process, I've rejigged my modified version of Bob's method to count the records in the same way as the DataReader code does.  It's below.  I'm still getting similar figures.

But, in reality, the figures - whichever they are - hardly qualify as "SLOW".  And the questioner seems to be happy.

Roger
Imports System.IO
Public Class Form1
 
    Private filename As String = "C:\Test\Large.csv"
 
    Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
        Dim records As Integer = 0
        Dim watch As New Stopwatch()
        watch.Start()
        Dim dt1 As New DataTable
        For i As Integer = 1 To 5
            Dim dc As New DataColumn("Col" & i.ToString, GetType(String))
            dt1.Columns.Add(dc)
        Next
 
        ' Note: String.Split does not honour quoted fields the way
        ' TextFieldParser does, but it is fine for this simple test file.
        Using reader As New StreamReader(filename)
            While Not reader.EndOfStream()
                Dim input As String = reader.ReadLine()
                Dim fields As String() = input.Split(","c)
                If fields.Length > 4 AndAlso fields(4).StartsWith("10") Then
                    dt1.Rows.Add(fields)
                End If
            End While
        End Using
 
        ' Iterate the rows purely to mirror the DataReader counting loop below.
        For Each dr As DataRow In dt1.Rows
            records += 1
        Next
 
        ' Stop timing before formatting the label text.
        watch.Stop()

        Label1.Text = watch.Elapsed.TotalSeconds.ToString & " for " & records & " records"
    End Sub
 
    Private Sub Button2_Click(ByVal sender As Object, ByVal e As System.EventArgs) Handles Button2.Click
 
        Dim records As Integer = 0
        Dim watch As New Stopwatch()
        watch.Start()
        Dim SQL As String = "SELECT *" & _
                            " FROM " & IO.Path.GetFileName(filename).Replace(".", "#") & _
                            " WHERE Col5 LIKE '10%'"
 
        Dim cnString As String = "Provider=Microsoft.Jet.OLEDB.4.0;" & _
            "Data Source='" & IO.Path.GetDirectoryName(filename) & _
            "';Extended Properties=" & Chr(34) & _
            "Text;HDR=Yes;FMT=CSVDelimited" & Chr(34) & ";"
 
 
        Dim cmd As New OleDb.OleDbCommand(SQL, New OleDb.OleDbConnection(cnString))
 
        cmd.Connection.Open()
        Dim reader As OleDb.OleDbDataReader = cmd.ExecuteReader()
 
        ' Loop through the reader, accessing columns by index; this is
        ' where the rows are actually read across.
        Do While reader.Read
            records += 1
        Loop
 
        reader.Close()
        cmd.Connection.Close()
 
        ' Stop timing before formatting the label text.
        watch.Stop()

        Label2.Text = watch.Elapsed.TotalSeconds.ToString & " for " & records & " records"
    End Sub
 
 
End Class


Yes, the DataReader is the best way to go.  And if you eventually move the data into SQL Server, a stored procedure can make the query faster still, since its execution plan is compiled and cached - a sketch is below.
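
A minimal sketch of that suggestion, assuming the inventory has already been loaded into a SQL Server table and a stored procedure named GetStockLevel exists; the connection string, procedure name and part number are all placeholders:

Imports System.Data
Imports System.Data.SqlClient

Module StockLookup

    Sub Main()
        Using cn As New SqlConnection("Server=.;Database=Inventory;Integrated Security=True")
            ' CommandType.StoredProcedure runs the named procedure directly.
            Using cmd As New SqlCommand("GetStockLevel", cn)
                cmd.CommandType = CommandType.StoredProcedure
                cmd.Parameters.AddWithValue("@PartNo", "ABC-1234")

                cn.Open()
                Using reader As SqlDataReader = cmd.ExecuteReader()
                    While reader.Read()
                        Console.WriteLine(reader("NEWSTOCK"))
                    End While
                End Using
            End Using
        End Using
    End Sub

End Module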