• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 157

Finding differences between very large files

I have two files, let's call them "New" and "Old", each containing 25,000,000 or so lines in this format:

a
b
c
d
e

That's grossly oversimplified; each file actually contains a list of unique names, i.e. there is no duplication within an individual file. What I want is a list of all items that are in Old but not in New. So if the files were like this:

New:
a
b
d
e
j
m

Old:
a
b
c
d
e
f
g
h
i
j
k
l
m

I would want a file containing:
c
f
g
h
i
k
l

I've tried using diff in the following format:

diff New Old | egrep "^<" | cut -d" " -f2 > Difference

Unfortunately this returns a number of false matches, including items that exist in both files as well as items that are in New but not Old. I'm not sure whether my command is wrong or whether diff just can't handle files of this magnitude. I've also tried loading both files into MySQL and running a query with a left join, but it takes nearly two hours to execute! The server is a dual Athlon MP 2000+ with 1GB RAM, so it isn't underpowered.
Asked by Speedie
1 Solution
 
rdutaCommented:
First, don't use MySQL; you shouldn't have to.

Second, diff should work, but you will probably need to run both files through sort first. diff compares the files line by line, in order, so lines that appear at different positions in the two files get reported as differences even when they exist in both.
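A minimal sketch of that preprocessing step, assuming POSIX sort and the file names from the question (the `.sorted` names are just illustrative):

```shell
# Sort both files so identical lines line up for diff.
# LC_ALL=C forces plain byte ordering, which avoids locale-dependent
# collation surprises and is usually faster on files this size.
LC_ALL=C sort -o New.sorted New
LC_ALL=C sort -o Old.sorted Old
diff New.sorted Old.sorted
```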
 
ahoffmannCommented:
> .. including items that existed in both files
Then your files are not sorted. If both files, New and Old, are sorted, the following should work:
   diff New Old | grep '^>' | cut -c3-
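For sorted inputs, comm is a simpler alternative (not mentioned in the thread) that computes the set difference directly, with no grep/cut post-processing; file names follow the question:

```shell
# Assumes New and Old are already sorted (e.g. LC_ALL=C sort -o New New).
# comm prints three columns: lines only in the first file, lines only
# in the second file, and lines common to both. The -13 flags suppress
# columns 1 and 3, leaving exactly the lines in Old but not in New.
comm -13 New Old > Difference
```

Using the sample data from the question, Difference would contain c, f, g, h, i, k, l.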
