Dedupe a file based on special criteria, any program/code that does this? VB.net/VB6?

Valleriani
Valleriani used Ask the Experts™
on
I have a file with 3mil records. an example of them are

http://blah.com/test.php?email=john@blah.com&cid=320
http://blah.com/test.php?email=mike@blah.com&cid=320
http://blah.com/test.php?email=john@blah.com&cid=322
http://blah.com/test.php?email=marly@blah.com&cid=320
http://blah.com/test.php?email=john@blah.com&cid=323

If you look at the example above, there are three pointed to 'john@blah.com', which is considered a duplicate, though the exact line isn't the same (CID is different). What I want to do is remove the line itself based on a duplicate email, so the result would be:


http://blah.com/test.php?email=john@blah.com&cid=320
http://blah.com/test.php?email=mike@blah.com&cid=320
http://blah.com/test.php?email=marly@blah.com&cid=320


Is there any coding in VB that allows this? Or a program that can do this?.. Even if its more then 'one step' I'm all for it.  Maybe some sorta regex?

Thanks!
Comment
Watch Question

Do more with

Expert Office
EXPERT OFFICE® is a registered trademark of EXPERTS EXCHANGE®
What is the file format?
Is it a CSV/TXT doc?
Or a local database?

It sounds possible, but without more info we can't make much progress
Top Expert 2010
Commented:
This is for VB6, and uses RegExp...

Sub GetTheNames()

    Dim fso As Object, ts As Object, coll As Collection, TheLine As String, Addr As String, Counter as Long

    Set fso = CreateObject("Scripting.FileSystemObject")
    Set ts = fso.OpenTextFile("c:\folder\subfolder\Input.txt")
    Set coll = New Collection

    Do Until ts.AtEndOfStream
        TheLine = ts.ReadLine
        Addr = RegExpFind(TheLine, "[\w-]+(\.[\w-]+)*@[\w-]+(\.[\w-]+)*[a-z]{2,4}", 1, False)
        If Addr <> "" Then
            On Error Resume Next
            coll.Add TheLine, Addr
            On Error GoTo 0
        End If
    Loop
    ts.Close
    Set ts = fso.CreateTextFile("c:\folder\subfolder\output.txt")
    For Counter = 1 To coll.Add
        ts.WriteLine coll(Counter)
    Next
    ts.Close

    Set ts = Nothing
    Set fso = Nothing
    Set coll = Nothing

End Sub

Function RegExpFind(LookIn As String, PatternStr As String, Optional Pos, _
    Optional MatchCase As Boolean = True, Optional ReturnType As Long = 0, _
    Optional MultiLine As Boolean = False)
   
    ' Function written by Patrick G. Matthews.  You may use and distribute this code freely,
    ' as long as you properly credit and attribute authorship and the URL of where you
    ' found the code
   
    ' For more info, please see:
    ' http://www.experts-exchange.com/articles/Programming/Languages/Visual_Basic/Using-Regular-Expressions-in-Visual-Basic-for-Applications-and-Visual-Basic-6.html
   
    ' This function relies on the VBScript version of Regular Expressions, and thus some of
    ' the functionality available in Perl and/or .Net may not be available.  The full extent
    ' of what functionality will be available on any given computer is based on which version
    ' of the VBScript runtime is installed on that computer
   
    ' This function uses Regular Expressions to parse a string (LookIn), and return matches to a
    ' pattern (PatternStr).  Use Pos to indicate which match you want:
    ' Pos omitted               : function returns a zero-based array of all matches
    ' Pos = 1                   : the first match
    ' Pos = 2                   : the second match
    ' Pos =   : the Nth match
    ' Pos = 0                   : the last match
    ' Pos = -1                  : the last match
    ' Pos = -2                  : the 2nd to last match
    ' Pos =   : the Nth to last match
    ' If Pos is non-numeric, or if the absolute value of Pos is greater than the number of
    ' matches, the function returns an empty string.  If no match is found, the function returns
    ' an empty string.  (Earlier versions of this code used zero for the last match; this is
    ' retained for backward compatibility)
   
    ' If MatchCase is omitted or True (default for RegExp) then the Pattern must match case (and
    ' thus you may have to use [a-zA-Z] instead of just [a-z] or [A-Z]).
   
    ' ReturnType indicates what information you want to return:
    ' ReturnType = 0            : the matched values
    ' ReturnType = 1            : the starting character positions for the matched values
    ' ReturnType = 2            : the lengths of the matched values
   
    ' If you use this function in Excel, you can use range references for any of the arguments.
    ' If you use this in Excel and return the full array, make sure to set up the formula as an
    ' array formula.  If you need the array formula to go down a column, use TRANSPOSE()
   
    ' Note: RegExp counts the character positions for the Match.FirstIndex property as starting
    ' at zero.  Since VB6 and VBA has strings starting at position 1, I have added one to make
    ' the character positions conform to VBA/VB6 expectations
   
    ' Normally as an object variable I would set the RegX variable to Nothing; however, in cases
    ' where a large number of calls to this function are made, making RegX a static variable that
    ' preserves its state in between calls significantly improves performance
   
    Static RegX As Object
    Dim TheMatches As Object
    Dim Answer()
    Dim Counter As Long
   
    ' Evaluate Pos.  If it is there, it must be numeric and converted to Long
   
    If Not IsMissing(Pos) Then
        If Not IsNumeric(Pos) Then
            RegExpFind = ""
            Exit Function
        Else
            Pos = CLng(Pos)
        End If
    End If
   
    ' Evaluate ReturnType
   
    If ReturnType < 0 Or ReturnType > 2 Then
        RegExpFind = ""
        Exit Function
    End If
   
    ' Create instance of RegExp object if needed, and set properties
   
    If RegX Is Nothing Then Set RegX = CreateObject("VBScript.RegExp")
    With RegX
        .Pattern = PatternStr
        .Global = True
        .IgnoreCase = Not MatchCase
        .MultiLine = MultiLine
    End With
       
    ' Test to see if there are any matches
   
    If RegX.Test(LookIn) Then
       
        ' Run RegExp to get the matches, which are returned as a zero-based collection
       
        Set TheMatches = RegX.Execute(LookIn)
       
        ' Test to see if Pos is negative, which indicates the user wants the Nth to last
        ' match.  If it is, then based on the number of matches convert Pos to a positive
        ' number, or zero for the last match
       
        If Not IsMissing(Pos) Then
            If Pos < 0 Then
                If Pos = -1 Then
                    Pos = 0
                Else
                   
                    ' If Abs(Pos) > number of matches, then the Nth to last match does not
                    ' exist.  Return a zero-length string
                   
                    If Abs(Pos) <= TheMatches.Count Then
                        Pos = TheMatches.Count + Pos + 1
                    Else
                        RegExpFind = ""
                        GoTo Cleanup
                    End If
                End If
            End If
        End If
       
        ' If Pos is missing, user wants array of all matches.  Build it and assign it as the
        ' function's return value
       
        If IsMissing(Pos) Then
            ReDim Answer(0 To TheMatches.Count - 1)
            For Counter = 0 To UBound(Answer)
                Select Case ReturnType
                    Case 0: Answer(Counter) = TheMatches(Counter)
                    Case 1: Answer(Counter) = TheMatches(Counter).FirstIndex + 1
                    Case 2: Answer(Counter) = TheMatches(Counter).Length
                End Select
            Next
            RegExpFind = Answer
       
        ' User wanted the Nth match (or last match, if Pos = 0).  Get the Nth value, if possible
       
        Else
            Select Case Pos
                Case 0                          ' Last match
                    Select Case ReturnType
                        Case 0: RegExpFind = TheMatches(TheMatches.Count - 1)
                        Case 1: RegExpFind = TheMatches(TheMatches.Count - 1).FirstIndex + 1
                        Case 2: RegExpFind = TheMatches(TheMatches.Count - 1).Length
                    End Select
                Case 1 To TheMatches.Count      ' Nth match
                    Select Case ReturnType
                        Case 0: RegExpFind = TheMatches(Pos - 1)
                        Case 1: RegExpFind = TheMatches(Pos - 1).FirstIndex + 1
                        Case 2: RegExpFind = TheMatches(Pos - 1).Length
                    End Select
                Case Else                       ' Invalid item number
                    RegExpFind = ""
            End Select
        End If
   
    ' If there are no matches, return empty string
   
    Else
        RegExpFind = ""
    End If
   
Cleanup:
    ' Release object variables
   
    Set TheMatches = Nothing
   
End Function
What's the format of the file containing the records? Is it text, or are you reading a table from a database, or what?
If text, it's not hard to read each line one at a time and store lines with unique emails to an array. Your test for uniqueness would depend on extracting the email address and comparing it to those already in the array, so you might consider setting up the array in two dimensions, the first holding the entire line and the second just the email address.

With 3mil records, you might want to break the job down a bit to avoid memory issues (maybe a thousand records at a time, then append the array values to another text file (e.g., "UniqueDump.txt", clear the array and grab the next thousand unique lines.
When you're all through examining your original record set, re-run your code on UniqueDump.txt, this time dumping the filled arrays to a different file.

Obviously, this would be simpler if you had a database to store the data. Do you?
Acronis in Gartner 2019 MQ for datacenter backup

It is an honor to be featured in Gartner 2019 Magic Quadrant for Datacenter Backup and Recovery Solutions. Gartner’s MQ sets a high standard and earning a place on their grid is a great affirmation that Acronis is delivering on our mission to protect all data, apps, and systems.

Top Expert 2010

Commented:
In VB6 (or VBA or VBScript, for that matter), you most definitely do *not* want to use an array, as that introduces
these problems:

1) The only way to see if an item is already in the array is to loop through the array and check each element
2) ReDim Preserve only allows the last dimension to be expanded.  No big deal, that can be accommodated,
but it is a gotcha

In those languages, IMHO a Collection (VBA or VB6) or a Dictionary (VBScript) are far better options.

I have no idea if any of what I wrote above applies to VB.Net, as I am woefully ignorant of that language :)
When I get into work, i'll put together a sample for VB.NET
matthewspatrick: Good point re: arrays and their limitations. I was sort of typing in stream-of-consciousness mode at first and didn't recognize the number of records we were dealing with. I have a number of utility apps in VB6 that use arrays for small record sets, which is why they popped into my head. ReDim Preserve is a gotcha if you're not ready for it.

All in all, I agree that a Collection is definitely the better way to go, at least for VB6.

I'd just like to note that, when I began my post, yours hadn't shown up yet or I wouldn't have raised my head at all!
Top Expert 2010

Commented:
PandaPants,

I don't mind that you posted one bit :)

Patrick
The author has said that there is 3mil+ records. But how many of those would be unique emails?
Top Expert 2010

Commented:
bromy2004 said:
>>The author has said that there is 3mil+ records. But how many of those would be unique emails?

No idea.  But at least in VB6 or VBA, once your array has more than a handful of elements, or at most a few
dozen, a Collection will perform better.

Author

Commented:
In a quick dedupe just using the emails (cant do this in the final output becuase I need the full line) theres about 200k that are not unqiue.

This is a text file, nothing more right now. With lines from the above :) For the most case I haven't had issues processing large files like this, but I know after a certain limit I do. But I know theres been ways around it, not sure about 'deduping' ways though.

Thanks everyone for your inputs so far!
I tried to upload to the EE sister site www.ee-stuff.com without any success.
i had a form in VB.NET with all the buttons/text fields

attached is the code for it.

From1 contains:
Textbox1
Textbox2
Button1
Button2
OpenFileDialog1

Button1 opens the file and assigns a new name for the De-Duped file

Button2 De-Dupes the file.

Imports System.IO
Public Class Form1

  Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
    Dim tmpArray As String()
    With Me.OpenFileDialog1
      .Multiselect = False
      .Title = "Duplicate File"
      .SafeFileNames(0) = ""
      .DereferenceLinks = True
      .Filter = "Text Files (*.txt,*.csv)|*.txt;*.csv|All files (*.*)|*.*"
      .FileName = ""
      If .ShowDialog = Windows.Forms.DialogResult.OK Then
        Me.TextBox1.Text = .FileName
        tmpArray = Split(.FileName, ".")
        tmpArray(UBound(tmpArray) - 1) &= " - New"
        Me.TextBox2.Text = Join(tmpArray, ".")
      End If
      .Dispose()
    End With

  End Sub

  Private Sub Button3_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button3.Click
    Dim Line As String
    Dim readFile As System.IO.TextReader
    Dim writeFile As System.IO.TextWriter
    Dim MyCollection As New Collection
    Dim Fail As Boolean

    'Copies the file to the "New" one
    If Me.TextBox1.Text <> "" Then

      File.Delete(Me.TextBox2.Text)

      If System.IO.File.Exists(TextBox1.Text) = True Then

        'Read File
        readFile = New System.IO.StreamReader(TextBox1.Text)
        'Write File
        writeFile = New System.IO.StreamWriter(TextBox2.Text)

        'Add email to collection
        'Remove if duplicate
        Do
          Line = readFile.ReadLine()

          'Try add to collection
          'if this fails its a duplicate
          Try
            MyCollection.Add(Mid(Line, Line.IndexOf("email=") + 7, Line.IndexOf("&cid=") - (Line.IndexOf("email=") + 6)), _
                               Mid(Line, Line.IndexOf("email=") + 7, Line.IndexOf("&cid=") - (Line.IndexOf("email=") + 6)))

          Catch ex As Exception
            Fail = True
          Finally
            If Not Fail Then
              writeFile.WriteLine(Line)
            End If
            Fail = False
          End Try

        Loop Until Line Is Nothing
        writeFile.Close()
        readFile.Close()
        writeFile = Nothing
        readFile = Nothing

      End If
    End If
    MsgBox("Done")
  End Sub
End Class

Open in new window

EE-De-Dupe.bmp
forgot to mention,
I haven't tested on huge amounts of data (only your sample.)
Top Expert 2010

Commented:
Valleriani,

My apologies, my original code had a stupid typo.  Please replace the line:

    For Counter = 1 To coll.Add

with:

    For Counter = 1 To coll.Count


Also, please note that that code (once the correction is applied) will run in VBA as well as VB6.  If you want VBScript
I can adapt it, just let me know.

Patrick

Do more with

Expert Office
Submit tech questions to Ask the Experts™ at any time to receive solutions, advice, and new ideas from leading industry professionals.

Start 7-Day Free Trial