Date,Description,FITID,AmoAnd since we intend to load this data into a datatable, here is the setup of the table in code. (You can achieve the same using the designer).unt
30/10/2011,First transaction,001,9.99
01/11/2011,"Second transaction, Withdraw",002,-3.26
03/11/2011,Third transaction,003,1.08
Dim gTable As New DataTable("MyTable")
With gTable
.Columns.Add("Date").DataType = GetType(System.DateTime)
.PrimaryKey = New DataColumn() {.Columns("DelayID")}
.Columns.Add("Description").DataType = GetType(System.String)
.Columns.Add("FITID").DataType = GetType(System.String)
.Columns.Add("Amount").DataType = GetType(System.Double)
.Columns("FITID").AllowDBNull = False
.Columns("Amount").DefaultValue = 0
End With
If you are wondering why the "overkill" in using regex, then the answer lies on the second line of the file which contains a qualified field (a field enclosed in quotes) that contains a comma (which in turn is the file delimiter). The regex ensures that this line is parsed properly, and the code for this is:
Dim pattern As String = ",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))"
Dim r As System.Text.RegularExpressions.Regex = New System.Text.RegularExpressions.Regex(pattern)
Since the file has a header row, we need to define this and exclude it from the data using LINQ, but firstly, we should declare it as a string constant to later incorporate into the LINQ query.
Dim header As String = "Date,Description,FITID,Amount"
Though the title says using LINQ and Regex to parse a file, we still have to depend on the System.IO namespace to read the contents of the file and also to split the records in the file in order for LINQ to be used. Additionally, it is not un-common for CSV files to contain a blank / empty row, and we need to bear that in mind when constructing the LINQ query.
Dim lines As String() = System.IO.File.ReadAllLines(strCustomerFile)
Dim custs = From line In lines Where line <> header AndAlso Not String.IsNullOrEmpty(line) Let data = r.Split(line)
Select New With {.Date = data(0), .Description = data(1), .FITID = data(2), .Amount = CDbl(data(3).Trim)}
NOTE: We use LINQ to "declare" new columns with names corresponding to the file header (these are the same names we applied to the columns in the datatable above, though you can choose different names).
Dim xRow As DataRow
For Each row In custs
xRow = gTable.NewRow()
xRow.ItemArray = {row.Date, row.Description, row.FITID, row.Amount}
gTable.Rows.Add(xRow)
Next
Note that you can reference the items columns in from the LINQ query by their names.
Function readLINQ(ByVal strCustomerFile As String) As DataTable
Dim gTable As New DataTable("MyTable")
With gTable
.Columns.Add("Date").DataType = GetType(System.DateTime)
.PrimaryKey = New DataColumn() {.Columns("DelayID")}
.Columns.Add("Description").DataType = GetType(System.String)
.Columns.Add("FITID").DataType = GetType(System.String)
.Columns.Add("Amount").DataType = GetType(System.Double)
.Columns("FITID").AllowDBNull = False
.Columns("Amount").DefaultValue = 0
End With
Dim lines As String() = System.IO.File.ReadAllLines(strCustomerFile)
Dim pattern As String = ",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))"
Dim header As String = "Date,Description,FITID,Amount"
Dim r As System.Text.RegularExpressions.Regex = New System.Text.RegularExpressions.Regex(pattern)
Dim custs = From line In lines Where line <> header AndAlso Not String.IsNullOrEmpty(line) Let data = r.Split(line)
Select New With {.Date = data(0), .Description = data(1), .FITID = data(2), .Amount = CDbl(data(3).Trim)}
Dim xRow As DataRow
For Each row In custs
xRow = gTable.NewRow()
xRow.ItemArray = {row.Date, row.Description, row.FITID, row.Amount}
gTable.Rows.Add(xRow)
Next
Return gTable
End Function
Have a question about something in this article? You can receive help directly from the article author. Sign up for a free trial to get started.
Comments (6)
Author
Commented:Bad practice? I do not agree with you on that.
The code will parse the string to date based on the culture of the thread it is running on. Thus the answer to your question is, it will depend on the thread culture.
To add to the above specific responses, I hope you appreciate that the article addresses parsing a CSV file using regex and LINQ to a datatable, and NOT handling of dates. Indeed, I could have used a file example that did not contain date fields.
The regex WILL handle all permutations of a comma delimited file with qualified fields without any issue. For your contraption of a record missing a closing quote on a "qualified" field.... well, that is not a qualified field, is it?
Less overhead? Absolutely not. More concise and less descriptive, possibly. But then again, the article is meant to be instructive and detailed yet concise. Thus, an overhead in word count, MOST DEFINITELY!
Commented:
Option String On seems to disagree with you : \
The intention of my comment is to address possible error scenarios one could encounter when parsing CSV files. It is in no way safe to assume that a computer was always the entity creating the CSV file--your code seems to make this assumption. Would it not be possible for an end user to upload a "hand-built" CSV file to a website? If the user forgot to properly qualify the field (i.e. the aforementioned missing quote), will your code handle that? You seemed to imply that this code was directed at "newbies". You and I may know that the date is parsed based on the current thread's culture: would a newbie?
It is not my intention to offend. I am simply trying to ask questions form the point of view of a newbie.
Commented:
Option Strict*
Author
Commented:And last time I checked, there was NO option strict errors with the code, or am I missing something here?
Commented:
I do have a follow up question. The CSV files I'm dealing with are huge, in the order of 10's of GigaBytes. Rather than readalllines, can you adapt this function to read one line at a time and return the an array as a data table would be over-kill in this situation.
One last thing. Any pointers to a good LINQ tutorial?
View More