guyneo
asked on
Add dynamically to an array and compare
Hi Experts,
I need advice on the best and fastest way to do this. I have a huge text file with millions of rows. I maintain two lists, A and B. For each line, I check whether the value in column A already exists in list A: if not, I add it; if it is already there, I skip it and move on. I do the same with column B and list B. Once list A or list B reaches a threshold, say 500 entries, I output both lists, reinitialize them, and keep going.
Which data structure is best suited for this, arrays or collections or something else? I'd like to process the whole log file as quickly as possible.
Any help is very much appreciated.
hi guyneo,
I guess that for all array-like implementations, 'contains' takes linear time, while 'add' is amortized constant time (i.e. appending n elements takes O(n) time in total).
Set implementations using hash indexing offer expected constant time for both 'add' and 'contains'.
So from my point of view, your choice is between an array-like implementation (ArrayList is one) and a hash-indexed implementation.
cheers
JS
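Putting the advice above together, here is a minimal sketch of the loop described in the question, using a HashSet per column so each membership check is expected O(1). The class name, the tab delimiter, and the sample input lines are assumptions for illustration; in practice the lines would come from a BufferedReader over the log file.

```java
import java.util.HashSet;
import java.util.Set;

public class ColumnDedup {
    static final int THRESHOLD = 500; // flush threshold from the question

    // Add each column value to its set. HashSet.add is a no-op
    // (and returns false) when the value is already present, so
    // no separate contains() call is needed.
    static void process(String line, Set<String> a, Set<String> b) {
        String[] cols = line.split("\t"); // assumed tab-delimited columns
        a.add(cols[0]);
        b.add(cols[1]);
    }

    public static void main(String[] args) {
        Set<String> listA = new HashSet<>();
        Set<String> listB = new HashSet<>();
        // Hypothetical stand-in for the log file.
        String[] lines = { "a1\tb1", "a1\tb2", "a2\tb1" };
        for (String line : lines) {
            process(line, listA, listB);
            if (listA.size() >= THRESHOLD || listB.size() >= THRESHOLD) {
                System.out.println(listA + " " + listB); // output the lists
                listA = new HashSet<>();                 // reinitialize
                listB = new HashSet<>();
            }
        }
        // Flush whatever remains after the last line.
        System.out.println(listA + " " + listB);
    }
}
```

With millions of rows, the win over an ArrayList is that each lookup stays constant time instead of scanning up to 500 elements per line.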
ASKER
Thanks for all your help, guys. I got help in time. For the sake of learning, I tried various approaches. For my particular task and dataset, the difference was not big enough to prefer one over the other. Of course other people might find it different; if so, please do share your experiences.
ASKER
@jurobot: HashSet offers constant time, but is it faster overall than the other options, say ArrayList? The operations I would mostly do are contains and add.
@gurvinder: The automatic removal of duplicates in HashSet seems like a nice feature. However, for my current task I need to know when there is a duplicate. I did not explain all my requirements, for the sake of keeping things small and clear. I did implement it with ArrayLists and it worked fine, definitely usable. Of course, looking for a better solution keeps the learning going :)
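For what it's worth, a HashSet does let you detect duplicates as they happen: `Set.add` returns `false` when the element was already present, so you can react to a duplicate without a separate `contains` call. A small sketch (class name and sample values are made up for illustration):

```java
import java.util.HashSet;
import java.util.Set;

public class DupCheck {
    public static void main(String[] args) {
        Set<String> seen = new HashSet<>();
        String[] values = { "a1", "a2", "a1" };
        for (String v : values) {
            // add() returns false if v was already in the set,
            // which is exactly the "duplicate found" signal.
            if (!seen.add(v)) {
                System.out.println("duplicate: " + v);
            }
        }
    }
}
```

So needing to know about duplicates does not rule HashSet out; the return value of `add` carries that information.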