Solved

find common data in two large files

Posted on 2008-10-07
Last Modified: 2012-05-05
How can I find the data common to two large files?
Suppose two files each hold billions of usernames (one username appended per line).
How can we efficiently find the usernames that appear in both files?
Is it possible using a B-tree?
Question by:shwetasingh206
3 Comments
 

Accepted Solution

by: ozo (earned 500 total points)
ID: 22657347
Yes, it is possible using a B-tree: build the tree from one file, then look up each name from the other file in it.
If the files differ in size and you have a choice, building the B-tree from the smaller file should be more efficient.
A Patricia trie or suffix tree may be more efficient for some distributions of names.
A hash table could give linear expected time, though the worst case may be quadratic.
But if you handle collisions with a B-tree, the worst case would be O(n log n), the same as the plain B-tree approach.
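
As a rough illustration of the hash table option, here is a minimal Python sketch. It is not from the answer above: the function name common_lines and the file layout (one username per line, UTF-8) are assumptions, and it only works if the distinct names of the smaller file fit in memory. With billions of names that may not hold, in which case a disk-based structure such as a B-tree or a sort-based approach is the safer choice.

```python
def common_lines(small_path, large_path):
    """Return the set of lines that appear in both files.

    Assumes one username per line and that the distinct names
    of the smaller file fit in memory (hypothetical helper).
    """
    # Build a hash set from the smaller file: expected O(1) per insert.
    with open(small_path, encoding="utf-8") as f:
        seen = {line.rstrip("\n") for line in f}

    # Stream the larger file line by line; it is never fully loaded.
    common = set()
    with open(large_path, encoding="utf-8") as f:
        for line in f:
            name = line.rstrip("\n")
            if name in seen:  # expected O(1) lookup
                common.add(name)
    return common
```

Loading the smaller file keeps memory use tied to the smaller input, which matches the advice above to build the index from the smaller of the two files.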


Expert Comment

by:libin_v
ID: 22657361
If you are looking for a solution using existing tools, the following Linux commands will do this for you.

sort -u FILE1 > FILE1.sorted
sort -u FILE2 > FILE2.sorted
comm -12 FILE1.sorted FILE2.sorted > commonfile

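The common lines are written to the file commonfile. Run sort and comm under the same locale (for example, LC_ALL=C) so that comm sees the ordering it expects.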
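
For readers curious about what the final step does, the sketch below shows the general merge-style intersection of two sorted inputs, which is the idea behind comm -12. It is an illustration in Python, not the actual comm implementation; the name intersect_sorted is hypothetical, and it assumes both inputs are already sorted in the same order and free of duplicates (as sort -u guarantees).

```python
def intersect_sorted(a, b):
    """Yield items present in both sorted, duplicate-free iterables.

    Hypothetical helper: assumes items are never None and that both
    inputs use the same ordering (Python's default comparison here).
    """
    it_a, it_b = iter(a), iter(b)
    x = next(it_a, None)
    y = next(it_b, None)
    while x is not None and y is not None:
        if x == y:
            # Same name on both sides: emit it and advance both inputs.
            yield x
            x = next(it_a, None)
            y = next(it_b, None)
        elif x < y:
            # x cannot appear later in b, so skip it.
            x = next(it_a, None)
        else:
            # y cannot appear later in a, so skip it.
            y = next(it_b, None)
```

Because each input is read once from front to back, only one item per input is held in memory at a time, so the merge itself scales to files far larger than RAM; the sorting step dominates the cost at O(n log n).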


Question has a verified solution.
