We help IT Professionals succeed at work.

Excel Macro to delete duplicate row based on column value...

I have a CSV file that is pipe delimited. I can only have one row per id. The entire column is not duplicated since the results are from an outer join but I can't have more than one result per ID so I just want to take the first result and delete the all others.

I looked at the Duplicate Master Add on, but since the rows aren't always exact I can't use it.

I am attaching a sample of data. So the basic idea would be to check the value before the first delimiter and delete any subsequent rows with that same id.




sample.csv
Comment
Watch Question

Glenn RayExcel VBA Developer
Top Expert 2014

Commented:
If you are using Excel 2007 you can use the Remove Duplicates function (Menu: Data, Remove Duplicates) and only filter on the "id" value in the first column.  This can be done in VBA as follows:
 
Sub Remove_Duplicate_IDs()
   Dim rng As Range
   Set rng = ActiveSheet.Range("A1").CurrentRegion
   rng.RemoveDuplicates Columns:=1, Header:=xlYes
End Sub

Open in new window


If you are using Excel 2003, you will either have to add a formula column to help identify the duplicates and then filter and delete them, or remove the duplicates programmatically.  If this is the case, let us know and I'm sure we can provide those solutions.

-Glenn
Glenn RayExcel VBA Developer
Top Expert 2014

Commented:
If you can't use the Remove Duplicates utility (i.e., earlier than Excel 2007) then you can do the following:

1) Import your pipe-delimited data into Excel
2) Insert a new column before column A.  For convenience, label it "Unique" (cell A1)
3) Insert the following formula in cell A2:
=MATCH(B2,B:B,0)=ROW()
This will display TRUE if the value is unique or FALSE if it is a duplicate.
4) Copy the formula down through column A
5) Turn on data filtering and show only FALSE values in column A
6) Delete all visible rows
7) Turn off data Filtering

Author

Commented:
It's excel 2010, I put in that macro code but it doesn't appear to be deleting anything.
Excel VBA Developer
Top Expert 2014
Commented:
The code will only work if the csv file has been parsed.  

When I opened your file, the data was still represented as text in column A.  You should use the "Text to Columns" function first, setting the delimiter to the pipe (you'll have to manually identify this.
 text to column, delimiter
You could always use a VBA function to preprocess the sample.csv file and remove all but the first occurrence of an ID.  (I would choose Python for this, but since this is an Excel question....)

If you have a lot of data, or several files to process, I find this method easier than processing each file from Excel via formulas or menu-driven wizards.  It's just a personal perference though.  In addition, if you do have several files, you can create a small table of input filenames and output filenames and modify my macro to loop through each line to process all the files.

Keep in mind I've not added any real error checking (like to see if the file really exists).  Also, you will need to add a reference (Tools/References) to the Microsoft Scripting Runtime since the routine depends upon the use of the Dictionary object.

The code reads each line of the input file and parses the text up to the first "|" to determine the key.  It checks the dictionary object to see if that key has already been found in the file.  If not, the key is added to the dictionary and the input line is written to the output file.  If the key has previously been processed, the line is skipped (it isn't written to the output file).   If there isn't a "|" found in the line, the entire line is treated as the key and written to the output file.

Hopefully this helps.


Option Explicit

Const INFILE = "C:\sample.csv"
Const OUTFILE = "C:\out.csv"

Sub preprocess()
    Dim innum As Integer, outnum As Integer, i As Integer
    Dim sbuf As String, searchkey As String
    Dim d As New Dictionary
    
    innum = FreeFile()
    Open INFILE For Input Access Read As #innum
    
    outnum = FreeFile()
    Open OUTFILE For Output Access Write As #outnum
    
    While Not EOF(innum)
        Line Input #innum, sbuf
        i = InStr(sbuf, "|")
        If i <> 0 Then
            searchkey = Left(sbuf, i - 1)
        Else
            searchkey = sbuf
        End If
        If Not (d.Exists(searchkey)) Then
            d(searchkey) = ""
            Print #outnum, sbuf
        End If
    Wend

    Close #innum
    Close #outnum

End Sub

Open in new window