Octalys asked:
I have two very large lists of strings, and I am comparing list2 with list1. I need to merge the two lists, but duplicates are not allowed. How do I do this?
I need a checkDuplicate function or something like that.
S-Twilley:
Well, I'm not sure what sort of code you have in place already, but as a pointer to you, or to someone else (unless I get there before them), I think the fastest way to do this would be:

Append list2 to list1 and call it list3
Sort list3  (different implementations can be used)
Remove duplicates (should be easy since the list is sorted)
Output list3

I'd advise splitting this up into separate methods/functions so you can easily replace, say, the sorting function with a more efficient one.

I'll try and work on something now, but thought I'd post this up so you might work on something yourself.
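
To make that concrete, here is a minimal sketch of the append/sort/remove-duplicates approach, assuming both lists are plain ArrayLists of strings (the method and variable names are just illustrative):

    Private Function MergeWithoutDuplicates(ByVal list1 As ArrayList, ByVal list2 As ArrayList) As ArrayList
        'Append list2 to list1 to form list3
        Dim list3 As New ArrayList(list1)
        list3.AddRange(list2)
        'Sort list3 so that any duplicates end up adjacent
        list3.Sort()
        'Walk the sorted list and keep an item only if it differs from the previous one
        Dim result As New ArrayList
        Dim i As Integer
        For i = 0 To list3.Count - 1
            If i = 0 OrElse Not list3(i).Equals(list3(i - 1)) Then
                result.Add(list3(i))
            End If
        Next
        Return result
    End Function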
Octalys (Asker):
It must be live: it must check for duplicates as it compares each line from list2 with list1.

OK, in your first post you say you are comparing list2 to list1... does that mean that list1 remains unchanged while list2 is getting items added (since you say it needs to be live)? Sorry if I'm missing the point here; I'm trying to see the flaw in my approach as mentioned... it'll be inefficient if both lists are constantly changing. If possible, could you explain how these lists are being built or changed, so that we can find the most efficient method of getting the merged "clean" list?
Octalys (Asker):
Yes, list1 doesn't change, and list1 is duplicate-free.

The app I am making is actually like this: we have a very big local intranet website, like 50,000 pages(!). I am making a crawler that goes through our intranet to collect "links", "image links", etc. I am storing the links in... a string array, I guess? Not sure where I would store them yet.

So the crawler appends links/image links to a list, and every time a new link is found it compares it to the list to see whether it already exists or not.

That's the function I need to make; it's part of a bigger project.
OK, so my understanding so far:

List 1 = Clean list with no duplicates
List 2 = what exactly? The links collected from one page, or just a dynamic list built whilst scanning pages? Can list2 be "cleared" once items have been checked off against list1, or must it be a complete "dirty" list with duplicates?
Octalys (Asker):
It's a dynamic list built while scanning; it holds the links actually found.
A string in list2 can be cleared once it has been compared.
I think for list1, since it appears it could be very large, some sort of linked list or dataset may be preferable to an array, especially when it comes to sorting. It may be more efficient to sort by the domain, then by the subdomain, of any link/image URLs, and to keep pointers to the different groups of domains (the first domain beginning with an A, the first one beginning with a B, and so on) to speed up the search when checking for duplicates. The other advantage of using a linked list or dataset would be that you can easily insert a new item in its correct position without reshuffling all the items.

As for list2, it doesn't really matter; some sort of stack would be good enough: take off the most recently added item, remove it, then check whether it's in list1 and add it if necessary.

Before I code any of this (it might not be tonight, just so you know), I just wanted to check that you were familiar with the concepts.
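
To show what I mean, here is a rough sketch of that approach: list1 kept sorted and duplicate-free, list2 as a stack of links waiting to be checked. The collection types (ArrayList and Stack) and the names are just assumptions for illustration, not working crawler code:

    'list1: kept sorted and duplicate-free
    Private knownLinks As New ArrayList
    'list2: links found while scanning, still to be checked
    Private pendingLinks As New Stack

    Private Sub DrainPendingLinks()
        Do While pendingLinks.Count > 0
            'Take the most recently found link off the stack
            Dim link As String = CStr(pendingLinks.Pop())
            'BinarySearch returns a negative value (the bitwise complement
            'of the insertion index) when the link is not already present
            Dim pos As Integer = knownLinks.BinarySearch(link)
            If pos < 0 Then
                knownLinks.Insert(Not pos, link)
            End If
        Loop
    End Sub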
Mike Tomlinson:
If order is not important, then just use a Hashtable with each string as both the key and the value. You can then check the table to see if the string already exists...
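
Something along these lines, for example (a rough sketch; the field and method names are mine, purely illustrative):

    Private seenLinks As New Hashtable

    'Returns True if the link was new and has been stored,
    'False if it was already in the table
    Private Function AddIfNew(ByVal link As String) As Boolean
        If seenLinks.ContainsKey(link) Then
            Return False
        End If
        seenLinks.Add(link, link)   'the string is used as both key and value
        Return True
    End Function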
The only problem I can foresee with using a hashtable, ArrayList or something similar is the size of the object. If there are 50,000+ pages, with say 20 URLs (from links and images) on each, the collection will have 1,000,000 items, and I'm not sure how well hashtables handle that. I think the biggest problem here isn't really the implementation as such, but performance and limitations (due to memory or the like).
@S-Twilley:
Steve, I wouldn't concern myself with memory issues, since most data structures will have similar overheads. The biggest advantage of hash tables is speed: the best case is O(1), so they don't normally slow down even when dealing with very large element counts.

Bob
Why not use a ListBox, or equivalent, and check whether it contains the item? For example:

        With ListBox1.Items
            .Add("One")
            .Add("Two")
            .Add("Three")
            If .Contains("Two") Then
                MessageBox.Show("Contains Two")
            Else
                MessageBox.Show("Does not contain Two")
            End If
            If .Contains("Four") Then
                MessageBox.Show("Contains Four")
            Else
                MessageBox.Show("Does not contain Four")
            End If
        End With

This shows the first message box ("Contains Two") and the second message box ("Does not contain Four"). In your case, if the entry does not already exist, add it.
Because, when there is a large number of items, searching through the Items collection using Contains will get slower and slower, making it unusable for larger counts.

Bob
True with hash tables, too. Tables have to be searched in some fashion.
ASKER CERTIFIED SOLUTION
Mike Tomlinson
(solution text available only to Experts Exchange members)
SOLUTION
(solution text available only to Experts Exchange members)
Yes...but HashTables are designed specifically with speed of searching for keys in mind:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpref/html/frlrfsystemcollectionshashtableclasstopic.asp

    "When an element is added to the Hashtable, the element is placed into a bucket based on the hash code of the key. Subsequent lookups of the key use the hash code of the key to search in only one particular bucket, thus substantially reducing the number of key comparisons required to find an element."

Yes, yes, I'm familiar with Big-O. I agree with Idle_Mind that Hashtables are optimized for key lookups, which can be significant.
SOLUTION
(solution text available only to Experts Exchange members)