Valleriani asked:

Dedupe a file based on special criteria: is there any program or code that does this? VB.NET/VB6?

I have a file with 3 million records. Some examples:

http://blah.com/test.php?email=john@blah.com&cid=320
http://blah.com/test.php?email=mike@blah.com&cid=320
http://blah.com/test.php?email=john@blah.com&cid=322
http://blah.com/test.php?email=marly@blah.com&cid=320
http://blah.com/test.php?email=john@blah.com&cid=323

In the example above, three lines point to 'john@blah.com'. I consider those duplicates even though the lines are not identical (the CID differs). What I want is to keep only the first line for each email and remove the rest, so the result would be:


http://blah.com/test.php?email=john@blah.com&cid=320
http://blah.com/test.php?email=mike@blah.com&cid=320
http://blah.com/test.php?email=marly@blah.com&cid=320


Is there any VB code that allows this? Or a program that can do this? Even if it takes more than one step, I'm all for it. Maybe some sort of regex?

Thanks!
bromy2004

What is the file format?
Is it a CSV/TXT doc?
Or a local database?

It sounds possible, but without more info we can't make much progress.
ASKER CERTIFIED SOLUTION
Patrick Matthews

This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
SOLUTION
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
In VB6 (or VBA or VBScript, for that matter), you most definitely do *not* want to use an array, as that introduces
these problems:

1) The only way to see if an item is already in the array is to loop through the array and check each element
2) ReDim Preserve only allows the last dimension to be expanded.  No big deal, that can be accommodated,
but it is a gotcha

In those languages, IMHO a Collection (VBA or VB6) or a Dictionary (VBScript) is a far better option, because both can look up an item by key instead of looping.

I have no idea if any of what I wrote above applies to VB.Net, as I am woefully ignorant of that language :)
When I get into work, I'll put together a sample for VB.NET.
matthewspatrick: Good point re: arrays and their limitations. I was sort of typing in stream-of-consciousness mode at first and didn't recognize the number of records we were dealing with. I have a number of utility apps in VB6 that use arrays for small record sets, which is why they popped into my head. ReDim Preserve is a gotcha if you're not ready for it.

All in all, I agree that a Collection is definitely the better way to go, at least for VB6.

I'd just like to note that, when I began my post, yours hadn't shown up yet or I wouldn't have raised my head at all!
PandaPants,

I don't mind that you posted one bit :)

Patrick
The author has said that there are 3 million+ records. But how many of those would be unique emails?
bromy2004 said:
>>The author has said that there are 3 million+ records. But how many of those would be unique emails?

No idea.  But at least in VB6 or VBA, once your array has more than a handful of elements, or at most a few
dozen, a Collection will perform better.
Valleriani (ASKER)

In a quick dedupe using just the emails (I can't do this for the final output because I need the full line), there are about 200k that are not unique.

This is a plain text file, nothing more right now, with lines like the ones above :) For the most part I haven't had issues processing large files like this, but I know that past a certain size I do. I know there have been ways around that, though I'm not sure about ways of deduping.

Thanks everyone for your inputs so far!
I tried to upload to the EE sister site www.ee-stuff.com without any success.
I had a form in VB.NET with all the buttons/text fields.

Attached is the code for it.

Form1 contains:
Textbox1
Textbox2
Button1
Button2
OpenFileDialog1

Button1 opens the file and assigns a new name for the De-Duped file

Button2 De-Dupes the file.

Imports System.IO

Public Class Form1

  ' Button1: pick the source file and suggest a name for the de-duped copy.
  Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
    With Me.OpenFileDialog1
      .Multiselect = False
      .Title = "Duplicate File"
      .DereferenceLinks = True
      .Filter = "Text Files (*.txt,*.csv)|*.txt;*.csv|All files (*.*)|*.*"
      .FileName = ""
      If .ShowDialog() = Windows.Forms.DialogResult.OK Then
        Me.TextBox1.Text = .FileName
        ' Insert " - New" before the extension. The Path helpers also cope
        ' with file names that have no extension or contain extra dots.
        Me.TextBox2.Text = Path.Combine(Path.GetDirectoryName(.FileName), _
          Path.GetFileNameWithoutExtension(.FileName) & " - New" & Path.GetExtension(.FileName))
      End If
      ' The dialog is not disposed here, so the button can be clicked again.
    End With
  End Sub

  ' Button2: copy the source file to the new file, keeping only the
  ' first line seen for each email address.
  Private Sub Button2_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button2.Click
    Dim Line As String
    Dim Email As String
    Dim StartPos As Integer
    Dim EndPos As Integer
    Dim MyCollection As New Collection

    If Me.TextBox1.Text <> "" AndAlso File.Exists(Me.TextBox1.Text) Then

      ' StreamWriter overwrites an existing output file, and the
      ' Using blocks close both files even if an error occurs.
      Using readFile As New StreamReader(Me.TextBox1.Text)
        Using writeFile As New StreamWriter(Me.TextBox2.Text)

          Line = readFile.ReadLine()
          Do While Line IsNot Nothing
            ' The key is the text between "email=" and "&cid="
            StartPos = Line.IndexOf("email=")
            EndPos = Line.IndexOf("&cid=")

            If StartPos >= 0 AndAlso EndPos > StartPos + 6 Then
              Email = Line.Substring(StartPos + 6, EndPos - (StartPos + 6))

              ' Write the line only the first time this email is seen
              If Not MyCollection.Contains(Email) Then
                MyCollection.Add(Email, Email)
                writeFile.WriteLine(Line)
              End If
            End If
            ' Lines without a parsable email are skipped, as before

            Line = readFile.ReadLine()
          Loop

        End Using
      End Using

    End If
    MsgBox("Done")
  End Sub
End Class


EE-De-Dupe.bmp
Forgot to mention:
I haven't tested this on huge amounts of data (only your sample).
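For a file of this size, a variant that does not depend on the form may be easier to test. The following is only a sketch, not the code from the accepted solution: it swaps the Collection for a HashSet(Of String), which assumes .NET 3.5 or later, and reads and writes one line at a time so the whole file is never held in memory. The names DedupeSketch, ExtractEmail and DedupeByEmail are made up for illustration, and passing through lines that have no email parameter is an assumption about what is wanted.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO

Module DedupeSketch

  ' Hypothetical helper: returns the text between "email=" and the next "&"
  ' (or the end of the line), or Nothing if the line has no email parameter.
  Function ExtractEmail(ByVal line As String) As String
    Const Marker As String = "email="
    Dim startPos As Integer = line.IndexOf(Marker, StringComparison.OrdinalIgnoreCase)
    If startPos < 0 Then Return Nothing
    startPos += Marker.Length
    Dim endPos As Integer = line.IndexOf("&"c, startPos)
    If endPos < 0 Then endPos = line.Length
    Return line.Substring(startPos, endPos - startPos)
  End Function

  ' Streams inputPath to outputPath, keeping the first line for each email.
  Sub DedupeByEmail(ByVal inputPath As String, ByVal outputPath As String)
    ' Emails are compared case-insensitively (an assumption).
    Dim seen As New HashSet(Of String)(StringComparer.OrdinalIgnoreCase)
    Using reader As New StreamReader(inputPath)
      Using writer As New StreamWriter(outputPath)
        Dim line As String = reader.ReadLine()
        While line IsNot Nothing
          Dim email As String = ExtractEmail(line)
          ' HashSet.Add returns False when the email was already present.
          ' Lines without an email are written unchanged (an assumption).
          If email Is Nothing OrElse seen.Add(email) Then
            writer.WriteLine(line)
          End If
          line = reader.ReadLine()
        End While
      End Using
    End Using
  End Sub

  Sub Main(ByVal args As String())
    If args.Length < 2 Then
      Console.WriteLine("Usage: DedupeSketch <input file> <output file>")
      Return
    End If
    DedupeByEmail(args(0), args(1))
  End Sub

End Module
```

On the five sample lines from the question this keeps the cid=320 lines for john, mike and marly and drops the two later john lines. Memory use grows with the number of unique emails rather than with the number of lines.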
Valleriani,

My apologies, my original code had a stupid typo.  Please replace the line:

    For Counter = 1 To coll.Add

with:

    For Counter = 1 To coll.Count


Also, please note that that code (once the correction is applied) will run in VBA as well as VB6.  If you want VBScript
I can adapt it, just let me know.

Patrick