We help IT Professionals succeed at work.

Tool/program to validate large csv file

stevelucy
stevelucy used Ask the Experts™
on
We need to write a program or use a tool to validate a csv file. Periodically we receive a csv file of roughly 10 million rows that we need to upload into our data mart.

Prior to the upload we would like to validate that the file is formatted correctly and that the content appears to be valid. The validations would be:

1. That the column headings are correct and match the expected names
2. That the content of numeric variables are numeric and date variables are dates

A report should be produced showing which rows and fields are in error.

The solution will be installed on a Windows server that we will dedicate to this process.

The program should be user friendly enough that a typical office person could run it and review the results. A network admin person will install the solution on the server.  Speed is not critical as this will run on a dedicated server, and overnight processing is acceptable.

What's the best tool and/or language to use to get something like this developed and deployed?
Comment
Watch Question

Do more with

Expert Office
EXPERT OFFICE® is a registered trademark of EXPERTS EXCHANGE®
Commented:
For code suggestion, we'd need to know the layout of the CSV file. Having said that, this can be accomplished fairly easily using vb.net (or C#, VBScript as suggested above, etc), and since CSV files differ in their layout, you'd need a custom program to handle your CSV file that implements your custom validations. To elaborate, here is an example class that uses a backgroundworker to read a CSV file with two columns (Date and value). It tests for the date in the first column and the value in the second column (in the system culture) and logs an error in the error log (if there is one). you can change this to display errors in a grid etc

1. First the class refere to above
    Public Class ReadFile
        Public Class LineError
            Public LineNumber As Integer
            Public DateCheck1 As Boolean
            Public NumberCheck1 As Boolean
        End Class
        Public Sub CountWords(ByVal worker As System.ComponentModel.BackgroundWorker, ByVal e As System.ComponentModel.DoWorkEventArgs)
            Dim state As New LineError
            Dim LinesCounted As Integer = 0
            Dim LineRead As String() = Nothing
            Using FileReader As New FileIO.TextFieldParser("YourFileName") With {.Delimiters = {","}}
                'first the header
                LineRead = FileReader.ReadFields()
                'here validate the header fields
                While Not FileReader.EndOfData
                    If worker.CancellationPending Then
                        e.Cancel = True
                        Exit While
                    Else
                        LineRead = FileReader.ReadFields()
                        'first get the line number
                        LinesCounted += 1
                        'now check the fields in the line
                        state.LineNumber = LinesCounted
                        state.DateCheck1 = IsDate(LineRead(0))
                        state.NumberCheck1 = IsNumeric(LineRead(1))
                        If Not state.DateCheck1 OrElse Not state.NumberCheck1 Then
                            worker.ReportProgress(0, state)
                        End If
                    End If
                End While
            End Using
        End Sub
    End Class

Open in new window

2. You'd need to declare the background worker to use to run the class and also add a handler to cater for the error logging.
WithEvents MyWorker As New System.ComponentModel.BackgroundWorker
    Private Sub BackgroundWorker1_ProgressChanged(ByVal sender As Object, ByVal e As System.ComponentModel.ProgressChangedEventArgs) Handles MyWorker.ProgressChanged
        ' This method runs on the main thread.
        Dim state As ReadFile.LineError = CType(e.UserState, ReadFile.LineError)
        Using ErLoger As New IO.StreamWriter("YourLogFile.csv")
            ErLoger.WriteLine(Join({state.LineNumber, state.DateCheck1, state.NumberCheck1}, ","))
        End Using
      
    End Sub

    Private Sub MyWorker_DoWork(ByVal sender As Object, ByVal e As System.ComponentModel.DoWorkEventArgs) Handles MyWorker.DoWork
        ' This event handler is where the actual work is done.
        Dim worker As System.ComponentModel.BackgroundWorker
        worker = CType(sender, System.ComponentModel.BackgroundWorker)
        Dim WC As ReadFile = CType(e.Argument, ReadFile)
        WC.StartValidating(worker, e)

    End Sub

Open in new window

3. Finally, to invoke the routine you'd have to call
Dim BWork As New ReadFile
MyWorker.RunWorkerAsync(BWork)

Open in new window

Look at MS Access if you like to have user friendly product, especially if you want to correct errors.