Parsing a CSV file into a datatable using LINQ and Regex

AID: 7329
  • Status: Published

1900 points

  • Bynepaluz
  • TypeTips/Tricks
  • Posted on2011-09-02 at 17:31:50
Parsing a CSV file is a task that we are confronted with regularly, and although there are a vast number of means to do this, as a newbie, the field can be confusing and the tools can seem complex.

A simple solution to parsing a customized CSV file is to use this function which returns the datatable. You will have to first setup the datatable, and in this simplified scenario, all the fields of the file are imported into the table. (here is a snapshot of the file).
Date,Description,FITID,Amount
30/10/2011,First transaction,001,9.99
01/11/2011,"Second transaction, Withdraw",002,-3.26
03/11/2011,Third transaction,003,1.08

And since we intend to load this data into a datatable, here is the setup of the table in code. (You can achieve the same using the designer).
Dim gTable As New DataTable("MyTable")
With gTable
    .Columns.Add("Date").DataType = GetType(System.DateTime)
    .PrimaryKey = New DataColumn() {.Columns("DelayID")}
    .Columns.Add("Description").DataType = GetType(System.String)
    .Columns.Add("FITID").DataType = GetType(System.String)
    .Columns.Add("Amount").DataType = GetType(System.Double)
    .Columns("FITID").AllowDBNull = False
    .Columns("Amount").DefaultValue = 0
End With
                                    
1:
2:
3:
4:
5:
6:
7:
8:
9:
10:

Select allOpen in new window


If you are wondering why the "overkill" in using regex, then the answer lies on the second line of the file which contains a qualified field (a field enclosed in quotes) that contains a comma (which in turn is the file delimiter). The regex ensures that this line is parsed properly, and the code for this is:
Dim pattern As String = ",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))"
Dim r As System.Text.RegularExpressions.Regex = New System.Text.RegularExpressions.Regex(pattern)
                                    
1:
2:

Select allOpen in new window


Since the file has a header row, we need to define this and exclude it from the data using LINQ, but firstly, we should declare it as a string constant to later incorporate into the LINQ query.
Dim header As String = "Date,Description,FITID,Amount"
                                    
1:

Select allOpen in new window


Though the title says using LINQ and Regex to parse a file, we still have to depend on the System.IO namespace to read the contents of the file and also to split the records in the file in order for LINQ to be used. Additionally, it is not un-common for CSV files to contain a blank / empty row, and we need to bear that in mind when constructing the LINQ query.
Dim lines As String() = System.IO.File.ReadAllLines(strCustomerFile)
Dim custs = From line In lines Where line <> header AndAlso Not String.IsNullOrEmpty(line) Let data = r.Split(line)
        Select New With {.Date = data(0), .Description = data(1), .FITID = data(2), .Amount = CDbl(data(3).Trim)}
                                    
1:
2:
3:

Select allOpen in new window


NOTE: We use LINQ to "declare" new columns with names corresponding to the file header (these are the same names we applied to the columns in the datatable above, though you can choose different names).

Finally, we can iterate through the results of the LINQ query to populate the datatable, first declaring a datarow.
Dim xRow As DataRow
For Each row In custs
    xRow = gTable.NewRow()
    xRow.ItemArray = {row.Date, row.Description, row.FITID, row.Amount}
    gTable.Rows.Add(xRow)
Next
                                    
1:
2:
3:
4:
5:
6:

Select allOpen in new window


Note that you can reference the items columns in from the LINQ query by their names.
Finally, here is the complete function (including datatable setup) that you can call to return a populated datatable.
Function readLINQ(ByVal strCustomerFile As String) As DataTable
    Dim gTable As New DataTable("MyTable")
    With gTable
        .Columns.Add("Date").DataType = GetType(System.DateTime)
        .PrimaryKey = New DataColumn() {.Columns("DelayID")}
        .Columns.Add("Description").DataType = GetType(System.String)
        .Columns.Add("FITID").DataType = GetType(System.String)
        .Columns.Add("Amount").DataType = GetType(System.Double)
        .Columns("FITID").AllowDBNull = False
        .Columns("Amount").DefaultValue = 0
    End With
    Dim lines As String() = System.IO.File.ReadAllLines(strCustomerFile)
    Dim pattern As String = ",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))"
    Dim header As String = "Date,Description,FITID,Amount"
    Dim r As System.Text.RegularExpressions.Regex = New System.Text.RegularExpressions.Regex(pattern)
    Dim custs = From line In lines Where line <> header AndAlso Not String.IsNullOrEmpty(line) Let data = r.Split(line)
            Select New With {.Date = data(0), .Description = data(1), .FITID = data(2), .Amount = CDbl(data(3).Trim)}
    Dim xRow As DataRow
    For Each row In custs
        xRow = gTable.NewRow()
        xRow.ItemArray = {row.Date, row.Description, row.FITID, row.Amount}
        gTable.Rows.Add(xRow)
    Next
    Return gTable
End Function
                                    
1:
2:
3:
4:
5:
6:
7:
8:
9:
10:
11:
12:
13:
14:
15:
16:
17:
18:
19:
20:
21:
22:
23:
24:
25:

Select allOpen in new window

Asked On
2011-09-02 at 17:31:50ID7329
Tags

Parsing a CSV file into a datatable using LINQ and Regex

Topic

Microsoft Visual Basic.Net

Views
982

Comments

Expert Comment

by: kaufmed on 2011-09-12 at 20:37:18ID: 31463

Questions

    1) Isn't it bad practice to rely on implicit conversion when turning text into DateTime objects? Along the same token, how does your code know if the value "01/11/2011" is "January 11, 2011" or "November 1, 2011"?
    2) Does your regex handle all permutations of comma placment? What happens if a quote is left off of a qualified field?
    3) Would it not be less overhead to do a Skip(1) call on the ReadAllLines call rather than line <> header for every line in the file?

Author Comment

by: nepaluz on 2011-09-13 at 10:30:18ID: 31475

thanks for the comment kaufmed. I shall address your questions below.
1) Isn't it bad practice to rely on implicit conversion when turning text into DateTime objects?

Bad practice? I do not agree with you on that.
how does your code know if the value "01/11/2011" is "January 11, 2011" or "November 1, 2011"?

The code will parse the string to date based on the culture of the thread it is running on. Thus the answer to your question is, it will depend on the thread culture.

To add to the above specific responses, I hope you appreciate that the article addresses parsing a CSV file using regex and LINQ to a datatable, and NOT handling of dates. Indeed, I could have used a file example that did not contain date fields.
2) Does your regex handle all permutations of comma placment? What happens if a quote is left off of a qualified field?

The regex WILL handle all permutations of a comma delimited file with qualified fields without any issue. For your contraption of a record missing a closing quote on a "qualified" field.... well, that is not a qualified field, is it?
3) Would it not be less overhead to do a Skip(1)  ...

Less overhead? Absolutely not. More concise and less descriptive, possibly. But then again, the article is meant to be instructive and detailed yet concise. Thus, an overhead in word count, MOST DEFINITELY!

Expert Comment

by: kaufmed on 2011-09-13 at 10:52:22ID: 31477

nepaluz,

Bad practice? I do not agree with you on that.

Option String On seems to disagree with you  : \

The intention of my comment is to address possible error scenarios one could encounter when parsing CSV files. It is in no way safe to assume that a computer was always the entity creating the CSV file--your code seems to make this assumption. Would it not be possible for an end user to upload a "hand-built" CSV file to a website? If the user forgot to properly qualify the field (i.e. the aforementioned missing quote), will your code handle that? You seemed to imply that this code was directed at "newbies". You and I may know that the date is parsed based on the current thread's culture: would a newbie?

It is not my intention to offend. I am simply trying to ask questions form the point of view of a newbie.

Expert Comment

by: kaufmed on 2011-09-13 at 10:53:13ID: 31478

*sheesh*  I cannot seem to type the word "strict" today...

Option String On seems to disagree with you

Option Strict*

Author Comment

by: nepaluz on 2011-09-13 at 11:05:41ID: 31481

At the risk of sounding like I am banging on about the obvious, may I add that aside from the fictitious situation whereby a newbie user forgot to include a closing quotation character on a field that is supposed to be a qualified field because he had no access to a computer, the article does address and provide a solution for parsing of a CSV file using Regex and LINQ. I am sure you agree with that.

And last time I checked, there was NO option strict errors with the code, or am I missing something here?

Add your Comment

Please Sign up or Log in to comment on this article.

Join Experts Exchange Today

Gain Access to all our Tech Resources

Get personalized answers

Ask unlimited questions

Access Proven Solutions

Search 3.2 million solutions

Read In-Depth How-To Guides

1000+ articles, demos, & tips

Watch Step by Step Tutorials

Learn direct from top tech pros

And Much More!

Your complete tech resource

See Plans and Pricing

30-day free trial. Register in 60 seconds.

Loading Advertisement...

Top Visual Basic.NET Experts

  1. CodeCruiser

    1,541,075

    Genius

    8,400 points yesterday

    Profile
    Rank: Genius
  2. kaufmed

    303,871

    Wizard

    500 points yesterday

    Profile
    Rank: Genius
  3. Idle_Mind

    230,817

    Guru

    2,010 points yesterday

    Profile
    Rank: Savant
  4. nepaluz

    192,076

    Guru

    0 points yesterday

    Profile
    Rank: Sage
  5. PaulHews

    161,438

    Guru

    520 points yesterday

    Profile
    Rank: Genius
  6. BuggyCoder

    150,598

    Guru

    0 points yesterday

    Profile
    Rank: Sage
  7. JamesBurger

    123,179

    Master

    0 points yesterday

    Profile
    Rank: Sage
  8. emoreau

    112,211

    Master

    0 points yesterday

    Profile
    Rank: Genius
  9. Masteraco

    102,128

    Master

    0 points yesterday

    Profile
    Rank: Wizard
  10. TheLearnedOne

    80,982

    Master

    0 points yesterday

    Profile
    Rank: Savant
  11. Dhaest

    63,803

    Master

    2,000 points yesterday

    Profile
    Rank: Genius
  12. MlandaT

    53,803

    Master

    2,100 points yesterday

    Profile
    Rank: Genius
  13. wdosanjos

    53,796

    Master

    0 points yesterday

    Profile
    Rank: Genius
  14. mlmcc

    53,048

    Master

    0 points yesterday

    Profile
    Rank: Savant
  15. RolandDeschain

    41,679

    10 points yesterday

    Profile
    Rank: Sage
  16. srosebabu

    31,025

    2,000 points yesterday

    Profile
    Rank: Guru
  17. mas_oz2003

    28,400

    0 points yesterday

    Profile
    Rank: Genius
  18. sedgwick

    27,350

    0 points yesterday

    Profile
    Rank: Genius
  19. jacko72

    26,596

    0 points yesterday

    Profile
    Rank: Genius
  20. tommyBoy

    25,850

    0 points yesterday

    Profile
    Rank: Genius
  21. dlmille

    22,160

    0 points yesterday

    Profile
    Rank: Genius
  22. imnorie

    21,664

    1,600 points yesterday

    Profile
    Rank: Genius
  23. Cluskitt

    21,418

    0 points yesterday

    Profile
    Rank: Wizard
  24. robert_schutt

    20,440

    0 points yesterday

    Profile
    Rank: Guru
  25. navneethegde

    20,332

    0 points yesterday

    Profile
    Rank: Wizard

Hall Of Fame