Parsing a CSV file is a task that we are confronted with regularly, and although there are a vast number of means to do this, as a newbie, the field can be confusing and the tools can seem complex.
A simple solution to parsing a customized CSV file is to use this function which returns the datatable. You will have to first setup the datatable, and in this simplified scenario, all the fields of the file are imported into the table. (here is a snapshot of the file).
Date,Description,FITID,Amount
30/10/2011,First transaction,001,9.99
01/11/2011,"Second transaction, Withdraw",002,-3.26
03/11/2011,Third transaction,003,1.08
And since we intend to load this data into a datatable, here is the setup of the table in code. (You can achieve the same using the designer).
Dim gTable As New DataTable("MyTable")
With gTable
.Columns.Add("Date").DataType = GetType(System.DateTime)
.PrimaryKey = New DataColumn() {.Columns("DelayID")}
.Columns.Add("Description").DataType = GetType(System.String)
.Columns.Add("FITID").DataType = GetType(System.String)
.Columns.Add("Amount").DataType = GetType(System.Double)
.Columns("FITID").AllowDBNull = False
.Columns("Amount").DefaultValue = 0
End With
1:
2:
3:
4:
5:
6:
7:
8:
9:
10:
Select allOpen in new window
If you are wondering why the "overkill" in using regex, then the answer lies on the second line of the file which contains a qualified field (a field enclosed in quotes) that contains a comma (which in turn is the file delimiter). The regex ensures that this line is parsed properly, and the code for this is:
Dim pattern As String = ",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))"
Dim r As System.Text.RegularExpressions.Regex = New System.Text.RegularExpressions.Regex(pattern)
1:
2:
Select allOpen in new window
Since the file has a header row, we need to define this and exclude it from the data using LINQ, but firstly, we should declare it as a string constant to later incorporate into the LINQ query.
Though the title says using LINQ and Regex to parse a file, we still have to depend on the System.IO namespace to read the contents of the file and also to split the records in the file in order for LINQ to be used. Additionally, it is not un-common for CSV files to contain a blank / empty row, and we need to bear that in mind when constructing the LINQ query.
Dim lines As String() = System.IO.File.ReadAllLines(strCustomerFile)
Dim custs = From line In lines Where line <> header AndAlso Not String.IsNullOrEmpty(line) Let data = r.Split(line)
Select New With {.Date = data(0), .Description = data(1), .FITID = data(2), .Amount = CDbl(data(3).Trim)}
1:
2:
3:
Select allOpen in new window
NOTE: We use LINQ to "declare" new columns with names corresponding to the file header (these are the same names we applied to the columns in the datatable above, though you can choose different names).
Finally, we can iterate through the results of the LINQ query to populate the datatable, first declaring a datarow.
Dim xRow As DataRow
For Each row In custs
xRow = gTable.NewRow()
xRow.ItemArray = {row.Date, row.Description, row.FITID, row.Amount}
gTable.Rows.Add(xRow)
Next
1:
2:
3:
4:
5:
6:
Select allOpen in new window
Note that you can reference the items columns in from the LINQ query by their names.
Finally, here is the complete function (including datatable setup) that you can call to return a populated datatable.
Function readLINQ(ByVal strCustomerFile As String) As DataTable
Dim gTable As New DataTable("MyTable")
With gTable
.Columns.Add("Date").DataType = GetType(System.DateTime)
.PrimaryKey = New DataColumn() {.Columns("DelayID")}
.Columns.Add("Description").DataType = GetType(System.String)
.Columns.Add("FITID").DataType = GetType(System.String)
.Columns.Add("Amount").DataType = GetType(System.Double)
.Columns("FITID").AllowDBNull = False
.Columns("Amount").DefaultValue = 0
End With
Dim lines As String() = System.IO.File.ReadAllLines(strCustomerFile)
Dim pattern As String = ",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))"
Dim header As String = "Date,Description,FITID,Amount"
Dim r As System.Text.RegularExpressions.Regex = New System.Text.RegularExpressions.Regex(pattern)
Dim custs = From line In lines Where line <> header AndAlso Not String.IsNullOrEmpty(line) Let data = r.Split(line)
Select New With {.Date = data(0), .Description = data(1), .FITID = data(2), .Amount = CDbl(data(3).Trim)}
Dim xRow As DataRow
For Each row In custs
xRow = gTable.NewRow()
xRow.ItemArray = {row.Date, row.Description, row.FITID, row.Amount}
gTable.Rows.Add(xRow)
Next
Return gTable
End Function
1:
2:
3:
4:
5:
6:
7:
8:
9:
10:
11:
12:
13:
14:
15:
16:
17:
18:
19:
20:
21:
22:
23:
24:
25:
Select allOpen in new window
by: kaufmed on 2011-09-12 at 20:37:18ID: 31463
1) Isn't it bad practice to rely on implicit conversion when turning text into DateTime objects? Along the same token, how does your code know if the value "01/11/2011" is "January 11, 2011" or "November 1, 2011"?
2) Does your regex handle all permutations of comma placment? What happens if a quote is left off of a qualified field?
3) Would it not be less overhead to do a Skip(1) call on the ReadAllLines call rather than line <> header for every line in the file?