kdwood
asked on
Best Way to Parse A Huge Text File
Greetings Experts,
I have a huge csv file that has around 765,000 rows of part numbers with their respective inventory levels. Unfortunately, my distributor cannot supply another format or even a slimmed down version of this file. With that in mind, I am parsing through this huge file to get the current stock levels for 4,000 items. While it is working, IT IS SLOW!!!
What would be the most efficient approach to parsing this file keeping in mind that I only need the stock levels for 4,000 items?
I am currently doing something like this:
Thank you in advance,
Keith
Public Function ReadFile(ByVal fileName As String) As DataTable
' Initialize the return values
Dim list As New List(Of String())
Dim table As DataTable = Nothing
Dim myCT As Integer = 0
Using parser As New TextFieldParser(fileName)
' Setup the comma-delimited file parser.
parser.TextFieldType = FieldType.Delimited
parser.Delimiters = New String() {","}
parser.HasFieldsEnclosedInQuotes = True
While Not parser.EndOfData
Try
' Read the comma-delimited text as fields into a string array.
Dim input As String() = parser.ReadFields()
If table Is Nothing Then
table = CreateTable(Path.GetFileName(fileName), input)
End If
AddRow(table, input)
Catch ex As MalformedLineException
' Ignore invalid lines.
End Try
End While
End Using
Return table
End Function
Private Function CreateTable(ByVal name As String, ByVal input As String()) As DataTable
Dim table As New DataTable(name)
' Add the table columns
With table
.Columns.Add("ActionDef")
.Columns.Add("VendorPartNo")
.Columns.Add("MFGNAME")
.Columns.Add("Desc1")
.Columns.Add("Desc2")
.Columns.Add("RetailPrice")
.Columns.Add("MFGPartNo")
.Columns.Add("Weight")
.Columns.Add("UPC")
.Columns.Add("LENGTH")
.Columns.Add("WIDTH")
.Columns.Add("HEIGHT")
.Columns.Add("CHGFLAG")
.Columns.Add("QLTCOST")
.Columns.Add("SPPRICE")
.Columns.Add("STOCKYN")
.Columns.Add("STATUS")
.Columns.Add("CPUCODE")
.Columns.Add("PRIMCAT")
.Columns.Add("SUBCAT")
.Columns.Add("NA")
.Columns.Add("NEWSTOCK")
.Columns.Add("REBATE")
.Columns.Add("SUBPART")
End With
Return table
End Function
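Since only 4,000 of the ~765,000 rows are needed, one option is to skip the DataTable entirely and filter while streaming. Below is a minimal sketch of that idea, assuming the wanted part numbers are available as strings, and that VendorPartNo and NEWSTOCK sit at field indexes 1 and 21 respectively (matching the column list above); the names ReadStockLevels, wantedParts, and stockLevels are illustrative, not from the original post:

```vb
Imports Microsoft.VisualBasic.FileIO

Public Function ReadStockLevels(ByVal fileName As String,
                                ByVal partNumbers As IEnumerable(Of String)) As Dictionary(Of String, String)
    ' HashSet gives O(1) membership tests instead of scanning a 4,000-item list per row.
    Dim wantedParts As New HashSet(Of String)(partNumbers, StringComparer.OrdinalIgnoreCase)
    Dim stockLevels As New Dictionary(Of String, String)

    Using parser As New TextFieldParser(fileName)
        parser.TextFieldType = FieldType.Delimited
        parser.Delimiters = New String() {","}
        parser.HasFieldsEnclosedInQuotes = True
        While Not parser.EndOfData
            Try
                Dim fields As String() = parser.ReadFields()
                ' Assumes VendorPartNo is field 1 and NEWSTOCK is field 21 (see the column list above).
                If fields.Length > 21 AndAlso wantedParts.Contains(fields(1)) Then
                    stockLevels(fields(1)) = fields(21)
                    ' Optional early exit once every wanted part has been seen.
                    If stockLevels.Count = wantedParts.Count Then Exit While
                End If
            Catch ex As MalformedLineException
                ' Ignore invalid lines.
            End Try
        End While
    End Using
    Return stockLevels
End Function
```

Only the few thousand matching rows are ever materialized; the other ~761,000 are read and discarded in a single pass.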
ASKER
Thanks Bob,
I will try your suggestion as well.
Regards,
Keith
Bob, are you thinking of putting 765,000 rows in a DataTable?
I didn't use the name FastestCsvReader on purpose ;)
Bob
gotcha... :)
You would want to read values into a structure that didn't suffer from slow-down when adding new elements, but you could still query.
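For instance (a sketch of the idea only, with illustrative names), a Dictionary keyed on the part number has near-constant insert and lookup cost, unlike a DataTable, which carries change-tracking and indexing overhead on every Rows.Add:

```vb
' Pre-sizing the capacity avoids repeated internal resizing while loading.
Dim rowsByPart As New Dictionary(Of String, String())(800000, StringComparer.OrdinalIgnoreCase)

' While reading each record (fields(1) assumed to be the part number):
' rowsByPart(fields(1)) = fields

' Later, query a single part in O(1):
' Dim row As String() = rowsByPart("ABC123")
```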
Bob
I would try just querying the file:
Dim path As String = "C:\temp\data.csv"
Dim SQL As String = "SELECT *" & _
    " FROM " & IO.Path.GetFileName(path).Replace(".", "#") & _
    " WHERE Col5 > 0"
Dim cnString As String = "Provider=Microsoft.Jet.OLEDB.4.0;" & _
    "Data Source='" & IO.Path.GetDirectoryName(path) & _
    "';Extended Properties=" & Chr(34) & _
    "Text;HDR=Yes;FMT=CSVDelimited" & Chr(34) & ";"
Dim adapter As New OleDb.OleDbDataAdapter(SQL, cnString)
Dim table As New DataTable()
adapter.Fill(table)
'View the contents of the file in your grid
Me.DataGridView1.DataSource = table
Here's the contents of the file named "data.csv":
Col1,Col2,Col3,Col4,Col5
1,2,3,4,5
6,7,8,9,0
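One caveat with the Jet text driver: without a schema.ini file in the same folder as the CSV, column types are guessed by sampling the first rows, so a comparison like Col5 > 0 may silently behave as a string comparison. A sketch of a schema.ini for this sample file (the section name must match the file name, and the type names follow the Jet text IISAM conventions):

```ini
[data.csv]
Format=CSVDelimited
ColNameHeader=True
Col1=Col1 Long
Col2=Col2 Long
Col3=Col3 Long
Col4=Col4 Long
Col5=Col5 Long
```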
Nice work Sancler! That's cool.
Yes. But the 0.17 is for not actually reading anything. Amended as below, I'm getting ~5.8.
Roger
Dim records As Integer = 0
Dim watch As New Stopwatch()
watch.Start()
Dim SQL As String = "SELECT *" & _
" FROM " & IO.Path.GetFileName(filename).Replace(".", "#") & _
" WHERE Col5 LIKE '10%'"
Dim cnString As String = "Provider=Microsoft.Jet.OLEDB.4.0;" & _
"Data Source='" & IO.Path.GetDirectoryName(filename) & _
"';Extended Properties=" & Chr(34) & _
"Text;HDR=Yes;FMT=CSVDelimited" & Chr(34) & ";"
Dim cmd As New OleDb.OleDbCommand(SQL, New OleDb.OleDbConnection(cnString))
cmd.Connection.Open()
Dim reader As OleDb.OleDbDataReader = cmd.ExecuteReader()
'Loop through the reader, accessing columns by index:
Do While reader.Read
records += 1
Loop
reader.Close()
cmd.Connection.Close()
Label4.Text = watch.Elapsed.TotalSeconds.ToString & " for " & records & " records"
watch.Stop()
PS. It's my (UK) bed-time. Any more, I'll pick it up in the morning ;-)
Roger
ASKER
Thank you all for the great support. I was able to combine your solutions to get the best scenario.
Best regards,
Keith
Sancler ~ That was EXACTLY the point!
After 1.48 seconds (best time), you have a DataTable loaded with the records you want... But you STILL have to iterate through them to process them, and you didn't calculate the time it would take to do that.
However, after 0.1790938 seconds, you have a DataReader loaded with the records you want...
Much, much faster!
Great comments though!
Cheerio
VBRocks - we'll have to agree to differ ;-)
Although "you have a DataReader loaded with the records you want" you don't have the data: that's still in the database. Before you can do _anything_ with it, you'll still need to read each record over.
To try to take account of your point about having to iterate through in order to process, I've rejigged the modified Bob's method to count the records in the same way as in the datareader code. It's below. I'm still getting similar figures.
But, in reality, the figures - whichever they are - hardly qualify as "SLOW". And the questioner seems to be happy.
Roger
Imports System.IO
Public Class Form1
Private filename As String = "C:\Test\Large.csv"
Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
Dim records As Integer = 0
Dim watch As New Stopwatch()
watch.Start()
Dim dt1 As New DataTable
For i As Integer = 1 To 5
Dim dc As New DataColumn("Col" & i.ToString, GetType(String))
dt1.Columns.Add(dc)
Next
Using reader As New StreamReader(filename)
While Not reader.EndOfStream()
Dim input As String = reader.ReadLine()
Dim fields As String() = input.Split(","c)
If fields(4).Substring(0, 2) = "10" Then
dt1.Rows.Add(fields)
End If
End While
End Using
For Each dr As DataRow In dt1.Rows
records += 1
Next
Label1.Text = watch.Elapsed.TotalSeconds.ToString & " for " & records & " records"
watch.Stop()
End Sub
Private Sub Button2_Click(ByVal sender As Object, ByVal e As System.EventArgs) Handles Button2.Click
Dim records As Integer = 0
Dim watch As New Stopwatch()
watch.Start()
Dim SQL As String = "SELECT *" & _
" FROM " & IO.Path.GetFileName(filename).Replace(".", "#") & _
" WHERE Col5 LIKE '10%'"
Dim cnString As String = "Provider=Microsoft.Jet.OLEDB.4.0;" & _
"Data Source='" & IO.Path.GetDirectoryName(filename) & _
"';Extended Properties=" & Chr(34) & _
"Text;HDR=Yes;FMT=CSVDelimited" & Chr(34) & ";"
Dim cmd As New OleDb.OleDbCommand(SQL, New OleDb.OleDbConnection(cnString))
cmd.Connection.Open()
Dim reader As OleDb.OleDbDataReader = cmd.ExecuteReader()
'Loop through the reader, accessing columns by index:
Do While reader.Read
records += 1
Loop
reader.Close()
cmd.Connection.Close()
Label2.Text = watch.Elapsed.TotalSeconds.ToString & " for " & records & " records"
watch.Stop()
End Sub
End Class
Yes, the DataReader is the best way to go. And if you want to make the query any faster, create a stored procedure on your SQL Server; it returns the data set faster because its execution plan is pre-compiled.
ASKER
I will give that a try right now. I'm guessing I could then set an index on the PartNumber field, which should speed the queries up.
Thanks for the quick reply,
Keith
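For completeness, calling such a stored procedure from VB.NET looks roughly like the sketch below. The procedure name GetStockLevel, the @PartNumber parameter, and the connection string are all hypothetical, not part of the thread; the assumption is that the part data has been loaded into a SQL Server table with an index on its PartNumber column:

```vb
Imports System.Data.SqlClient

' Sketch only: "GetStockLevel", "@PartNumber", and the connection string
' are illustrative names, not from the original discussion.
Dim cnString As String = "Data Source=.;Initial Catalog=Inventory;Integrated Security=True"
Using cn As New SqlConnection(cnString)
    Using cmd As New SqlCommand("GetStockLevel", cn)
        cmd.CommandType = CommandType.StoredProcedure
        cmd.Parameters.AddWithValue("@PartNumber", "ABC123")
        cn.Open()
        ' With an index on PartNumber, this is a single-row seek.
        Dim stock As Object = cmd.ExecuteScalar()
    End Using
End Using
```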