Solved

Finding differences between very large files

Posted on 2003-03-13
Last Modified: 2010-04-21
I have two files, let's call them "New" and "Old", each containing 25,000,000 or so lines in this format:

a
b
c
d
e

That's grossly oversimplified: each file actually contains a list of unique names, i.e. there is no duplication within an individual file. What I want is a list of all items that are in Old but not in New. So if the files were like this:

New:
a
b
d
e
j
m

Old:
a
b
c
d
e
f
g
h
i
j
k
l
m

I would want a file containing:
c
f
g
h
i
k
l

I've tried using diff in the following format:

diff New Old | egrep "^<" | cut -d" " -f2 > Difference

Unfortunately this returns a number of false matches, including items that exist in both files as well as items that are in New but not in Old. I'm not sure whether my command is wrong or whether diff just can't handle files of this size. I've also tried loading both files into MySQL and running a query with a left join, but it takes nearly two hours to execute! The server is a dual Athlon MP 2000+ with 1GB RAM, so it isn't underpowered.
Question by:Speedie
2 Comments
 
LVL 1

Expert Comment

by:rduta
ID: 8131226
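First, don't use MySQL. You shouldn't have to.

Second, diff should work, but you will probably need to run both files through sort first. diff compares the files line by line in order, so a name that sits at different positions in the two files is reported as a difference even though it exists in both.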
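A minimal sketch of that sorting step, assuming a POSIX shell and a standard sort. The helper name sort_names and the output file names are hypothetical, and LC_ALL=C is an assumption made here so that both files are ordered bytewise regardless of locale:

```shell
# Hypothetical helper: sort a list of names bytewise and confirm the result.
sort_names() {
    in_file=$1
    out_file=$2
    # LC_ALL=C forces a locale-independent byte-order sort.
    LC_ALL=C sort "$in_file" > "$out_file" || return 1
    # sort -c exits non-zero if the file is not in sorted order.
    LC_ALL=C sort -c "$out_file"
}

# Possible usage (output names are illustrative only):
#   sort_names New New.sorted
#   sort_names Old Old.sorted
```

The sorted copies, rather than the original files, would then be passed to diff.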
 
LVL 51

Accepted Solution

by:
ahoffmann earned 300 total points
ID: 8131595
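> .. including items that existed in both files
Then your files are not sorted.
Also, '^<' selects lines that appear only in the first file (New); lines that appear only in the second file (Old) are marked with '>'.
If both files, New and Old, are sorted, the following should work:
   diff New Old | grep '^>' | cut -c3-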
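As an alternative that is not from this thread, the standard comm utility can produce the same list from two sorted files without parsing diff output. This is only a sketch: the function name old_only and the file names are hypothetical, and it assumes both inputs were sorted with LC_ALL=C so that comm sees the same ordering:

```shell
# Hypothetical helper: print lines that appear only in the second (Old) file.
# Both inputs must already be sorted in the same byte order.
old_only() {
    new_sorted=$1
    old_sorted=$2
    # -1 hides lines unique to the first file, -3 hides lines common to both,
    # leaving only the lines unique to the second file.
    LC_ALL=C comm -13 "$new_sorted" "$old_sorted"
}

# Possible usage (file names are illustrative only):
#   old_only New.sorted Old.sorted > Difference
```

comm reads both inputs sequentially, which may suit files of this size, though that has not been measured here.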