Valleriani
asked on
Dedupe a file based on special criteria, any program/code that does this? VB.net/VB6?
I have a file with 3 million records. Some examples:
http://blah.com/test.php?email=john@blah.com&cid=320
http://blah.com/test.php?email=mike@blah.com&cid=320
http://blah.com/test.php?email=john@blah.com&cid=322
http://blah.com/test.php?email=marly@blah.com&cid=320
http://blah.com/test.php?email=john@blah.com&cid=323
In the example above, three lines point to 'john@blah.com'. I consider those duplicates even though the lines aren't identical (the CID differs). What I want to do is remove the whole line based on a duplicate email, so the result would be:
http://blah.com/test.php?email=john@blah.com&cid=320
http://blah.com/test.php?email=mike@blah.com&cid=320
http://blah.com/test.php?email=marly@blah.com&cid=320
Is there any code in VB that can do this? Or a program that can? Even if it takes more than one step, I'm all for it. Maybe some sort of regex?
Thanks!
In VB6 (or VBA or VBScript, for that matter), you most definitely do *not* want to use an array, as that introduces
these problems:
1) The only way to see if an item is already in the array is to loop through the array and check each element
2) ReDim Preserve only allows the last dimension to be expanded. No big deal, that can be accommodated,
but it is a gotcha
In those languages, IMHO a Collection (VBA or VB6) or a Dictionary (VBScript) is a far better option.
I have no idea if any of what I wrote above applies to VB.Net, as I am woefully ignorant of that language :)
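To put a rough number on problem 1, here is a small sketch. It is Python rather than VB, purely to illustrate the idea: a plain list stands in for the array, and a `set` stands in for the key index of a Collection or Dictionary. The function names and the sample data are made up for this example.

```python
def dedupe_with_list(emails):
    """Array-style de-dupe: scan everything kept so far for each new item."""
    kept = []
    comparisons = 0
    for email in emails:
        found = False
        for existing in kept:
            comparisons += 1
            if existing == email:
                found = True
                break
        if not found:
            kept.append(email)
    return kept, comparisons


def dedupe_with_set(emails):
    """Keyed de-dupe: one hashed lookup per item, however many are stored."""
    seen = set()
    kept = []
    lookups = 0
    for email in emails:
        lookups += 1
        if email not in seen:
            seen.add(email)
            kept.append(email)
    return kept, lookups


if __name__ == "__main__":
    # Hypothetical data: 3,000 addresses, only 1,000 of them distinct.
    sample = ["user%d@blah.com" % (i % 1000) for i in range(3000)]
    _, scan_cost = dedupe_with_list(sample)
    _, key_cost = dedupe_with_set(sample)
    print("linear scan comparisons:", scan_cost)
    print("keyed lookups:", key_cost)
```

Both functions return the same list of unique items. The difference is the work done: the scan count grows roughly with the square of the number of unique items, while the keyed version does one lookup per input line.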
When I get into work, I'll put together a sample for VB.NET.
matthewspatrick: Good point re: arrays and their limitations. I was sort of typing in stream-of-consciousness mode at first and didn't recognize the number of records we were dealing with. I have a number of utility apps in VB6 that use arrays for small record sets, which is why they popped into my head. ReDim Preserve is a gotcha if you're not ready for it.
All in all, I agree that a Collection is definitely the better way to go, at least for VB6.
I'd just like to note that, when I began my post, yours hadn't shown up yet or I wouldn't have raised my head at all!
PandaPants,
I don't mind that you posted one bit :)
Patrick
The author has said that there are 3mil+ records. But how many of those would be unique emails?
bromy2004 said:
>>The author has said that there are 3mil+ records. But how many of those would be unique emails?
No idea. But at least in VB6 or VBA, once your array has more than a handful of elements, or at most a few
dozen, a Collection will perform better.
ASKER
In a quick dedupe using just the emails (I can't do that for the final output because I need the full line), there are about 200k that are not unique.
This is a plain text file, nothing more right now, with lines like the ones above :) For the most part I haven't had issues processing large files like this, but I know that past a certain size I do. I know there are ways around that, but I'm not sure about ways of deduping.
Thanks everyone for your inputs so far!
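For a plain text file of this size, one way to keep the full line while keying on the email is to stream the file line by line and remember only the emails already seen. The sketch below is Python, not VB, and is meant as an illustration under a few assumptions: `email_key` and `dedupe_file` are names invented here, emails are compared case-insensitively, the first line seen for each email is the one kept, and lines with no `email=` parameter are copied through unchanged. Only the set of unique emails is held in memory, not the whole file.

```python
import os
import tempfile


def email_key(line):
    """Return the lower-cased value of the email= parameter, or None."""
    start = line.find("email=")
    if start < 0:
        return None
    start += len("email=")
    end = line.find("&", start)
    if end < 0:
        end = len(line)
    return line[start:end].strip().lower()


def dedupe_file(src_path, dst_path):
    """Copy src to dst, keeping the first full line seen for each email."""
    seen = set()
    written = 0
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:  # reads one line at a time, not the whole file
            key = email_key(line)
            if key is not None:
                if key in seen:
                    continue  # duplicate email: drop the whole line
                seen.add(key)
            dst.write(line)
            written += 1
    return written


if __name__ == "__main__":
    # Demo on the five sample lines from the question, in a temp folder.
    sample = (
        "http://blah.com/test.php?email=john@blah.com&cid=320\n"
        "http://blah.com/test.php?email=mike@blah.com&cid=320\n"
        "http://blah.com/test.php?email=john@blah.com&cid=322\n"
        "http://blah.com/test.php?email=marly@blah.com&cid=320\n"
        "http://blah.com/test.php?email=john@blah.com&cid=323\n"
    )
    folder = tempfile.mkdtemp()
    src_path = os.path.join(folder, "list.txt")
    dst_path = os.path.join(folder, "list - New.txt")
    with open(src_path, "w", encoding="utf-8") as handle:
        handle.write(sample)
    print("lines written:", dedupe_file(src_path, dst_path))
```

I have not measured this on a 3 million line file, so treat the memory behaviour as an expectation rather than a result.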
I tried to upload to the EE sister site www.ee-stuff.com without any success.
I had a form in VB.NET with all the buttons/text fields;
the code for it is attached.
Form1 contains:
Textbox1
Textbox2
Button1
Button2
OpenFileDialog1
Button1 opens the file and assigns a new name for the De-Duped file
Button2 De-Dupes the file.
Imports System.IO

Public Class Form1

    ' Button1: pick the input file and suggest a name for the de-duped copy
    Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
        With Me.OpenFileDialog1
            .Multiselect = False
            .Title = "Duplicate File"
            .DereferenceLinks = True
            .Filter = "Text Files (*.txt,*.csv)|*.txt;*.csv|All files (*.*)|*.*"
            .FileName = ""
            If .ShowDialog() = Windows.Forms.DialogResult.OK Then
                Me.TextBox1.Text = .FileName
                ' Insert " - New" before the extension, e.g. "list.txt" becomes "list - New.txt"
                Me.TextBox2.Text = Path.Combine(Path.GetDirectoryName(.FileName), _
                    Path.GetFileNameWithoutExtension(.FileName) & " - New" & Path.GetExtension(.FileName))
            End If
        End With
    End Sub

    ' Button2: copy the file line by line, keeping only the first line seen for each email
    Private Sub Button2_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button2.Click
        Dim Line As String
        Dim Email As String
        Dim StartPos As Integer
        Dim EndPos As Integer
        Dim IsDuplicate As Boolean
        Dim Seen As New Collection

        If Me.TextBox1.Text = "" OrElse Not File.Exists(Me.TextBox1.Text) Then
            MsgBox("Input file not found")
            Exit Sub
        End If

        ' Using closes both files even if an error occurs part-way through
        Using readFile As New StreamReader(Me.TextBox1.Text), _
              writeFile As New StreamWriter(Me.TextBox2.Text, False)
            Line = readFile.ReadLine()
            Do While Line IsNot Nothing
                IsDuplicate = False
                StartPos = Line.IndexOf("email=")
                If StartPos >= 0 Then
                    ' The email runs from just after "email=" up to "&cid=" (or end of line)
                    StartPos += 6
                    EndPos = Line.IndexOf("&cid=", StartPos)
                    If EndPos < 0 Then EndPos = Line.Length
                    Email = Line.Substring(StartPos, EndPos - StartPos)
                    If Email <> "" Then
                        ' The email is both item and key; Contains checks the key
                        If Seen.Contains(Email) Then
                            IsDuplicate = True
                        Else
                            Seen.Add(Email, Email)
                        End If
                    End If
                End If
                ' Lines without an email are written unchanged
                If Not IsDuplicate Then writeFile.WriteLine(Line)
                Line = readFile.ReadLine()
            Loop
        End Using
        MsgBox("Done")
    End Sub
End Class
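Since the question asked about regex: the position arithmetic used to cut out the email can also be written as a single pattern. Below is a minimal sketch of that idea in Python (not VB). The pattern, `extract_email` and `dedupe_lines` are illustrative names invented here, and the sketch assumes emails should be compared case-insensitively and that the parameter may sit anywhere in the query string.

```python
import re

# Capture whatever follows "?email=" or "&email=" up to the next "&" or space.
EMAIL_PARAM = re.compile(r"[?&]email=([^&\s]*)", re.IGNORECASE)


def extract_email(line):
    """Return the email parameter's value, or None if the line has none."""
    match = EMAIL_PARAM.search(line)
    return match.group(1) if match else None


def dedupe_lines(lines):
    """Keep the first full line for each email; pass other lines through."""
    seen = set()
    kept = []
    for line in lines:
        email = extract_email(line)
        key = email.lower() if email else None
        if key is not None:
            if key in seen:
                continue  # same email already written: skip this line
            seen.add(key)
        kept.append(line)
    return kept


if __name__ == "__main__":
    sample = [
        "http://blah.com/test.php?email=john@blah.com&cid=320",
        "http://blah.com/test.php?email=mike@blah.com&cid=320",
        "http://blah.com/test.php?email=john@blah.com&cid=322",
        "http://blah.com/test.php?email=marly@blah.com&cid=320",
        "http://blah.com/test.php?email=john@blah.com&cid=323",
    ]
    for line in dedupe_lines(sample):
        print(line)
```

Unlike a search for "&cid=", the pattern does not depend on the order of the parameters, so a line such as `...test.php?cid=320&email=john@blah.com` is still matched.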
forgot to mention,
I haven't tested on huge amounts of data (only your sample.)
Valleriani,
My apologies, my original code had a stupid typo. Please replace the line:
For Counter = 1 To coll.Add
with:
For Counter = 1 To coll.Count
Also, please note that that code (once the correction is applied) will run in VBA as well as VB6. If you want VBScript
I can adapt it, just let me know.
Patrick
Is it a CSV/TXT doc?
Or a local database?
It sounds possible, but without more info we can't make much progress