Process 90MB text file

I need to quickly process very large text files (90 MB up to 1 GB). I have found a solution, but I need a faster one.

1. Remove info from the beginning to the code word (; NS) **Taking too long
2. Remove info from every line after the first space **Taking way too long
3. Find records found in yesterday's list, but not today's. **Works well
4. Remove duplicates **Works well

My Solution:
sed -e '1,/; NS/d' -e 's/ .*//' new_file | diff -e - last_file | sed '/\.\|,\|a/d'| sort -u >done_file

sed -e '1,/; NS/d' (Remove info from the beginning up to "; NS")
-e 's/ .*//' new_file (Remove info from every line after the first space)
| diff -e - last_file (Compare to yesterday's list)
| sed '/\.\|,\|a/d' (Remove unneeded info from diff output)
| sort -u >done_file (Remove Duplicates)

Here is what I would like:
1. A better way to accomplish the first 2 parts.
2. A way to not output the line numbers from diff. Just need additions. No nums, changes, deletions

***This is needed ASAP. 250 POINTS***
Asked by: itcdr

wesly_chen commented:
> A way to not output the line numbers from diff. Just need additions. No nums, changes, deletions
diff -q

> 1. Remove info from the beginning to the code word (; NS) **Taking too long
grep -v "; NS" filename > new_file

> 2. Remove info from every line after the first space  
Can you give more details?

Regards,

Wesly

wesly_chen commented:
grep -v "; NS" new_file | sed -e 's/ .*//' | diff -eq - last_file | sed '/\.\|,\|a/d'| sort -u >done_file

Wesly

itcdr (author) commented:
1. diff -q only reports whether the two files differ. I need the differences themselves, but only the additions: no line numbers, no changes, no deletions.

2. grep -v "; NS" filename > new_file only removed the lines containing the code word. I need it to delete everything from the very beginning of the file up to the code word. I haven't used grep before. What is it mostly used for?

3. Remove info from every line after the first space:
  For example:
name1 extra info after a space
name2 more extra info not needed
name3 more unneeded info after each record
name4 I only need the names
...

I was using (sed -e 's/ .*//' file_name) to accomplish this, but it seems to be the slowest step of the whole pipeline.
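
One possible alternative for this step, as a minimal sketch: cut with a space as the delimiter keeps only the text before the first space, which is the same result the sed expression gives. The sample lines and the temporary directory are made up for illustration, and whether cut is actually faster on a given system is an assumption worth checking with time.

```shell
# Hypothetical sample data standing in for the real file.
tmp=$(mktemp -d)
printf '%s\n' \
  'name1 extra info after a space' \
  'name2 more extra info not needed' \
  'name3 more unneeded info after each record' > "$tmp/new_file"

# Keep only the first space-delimited field of every line.
# Lines that contain no space at all are passed through unchanged.
names=$(cut -d ' ' -f 1 "$tmp/new_file")
printf '%s\n' "$names"

rm -rf "$tmp"
```

Unlike awk '{print $1}', cut treats only a single space as the separator, so a line that starts with a space yields an empty field rather than the first word.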

wesly_chen commented:
> 3. Remove info from every line after the first space:
awk '{print $1}' new_file

Wesly
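
A sketch of how steps 1 and 2 from the question might be folded into a single awk pass, assuming the code word appears on a line of its own kind and everything up to and including that line should be dropped. The flag name seen and the sample data are hypothetical.

```shell
# Hypothetical sample data: a header, the "; NS" marker line, then records.
tmp=$(mktemp -d)
printf '%s\n' \
  'header line to discard' \
  '; NS records start here' \
  'name1 extra info' \
  'name2 more info' > "$tmp/new_file"

# The print rule runs first, so the marker line itself is never printed.
# Once "; NS" has been seen, the first field of every later line is printed.
names=$(awk 'seen { print $1 } /; NS/ { seen = 1 }' "$tmp/new_file")
printf '%s\n' "$names"

rm -rf "$tmp"
```

This reads the file once instead of running two separate edits over it. Note that awk splits on runs of spaces and tabs, so it is not byte-for-byte identical to sed 's/ .*//' on lines with leading whitespace or tabs.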

itcdr (author) commented:
Amazing!!! I didn't think I could make it any faster. Processing a 90 MB file used to take 25 seconds; now that I have replaced the first 2 commands with yours, it only takes 10 seconds. That is a huge improvement. Thanks so much.

Here is the new solution:
awk '{print $1}' new_file | diff -e - old_file | sed '/\.\|a\|c\|d/d'| sort -u >done_file
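
The second wish from the question (only the added lines, no line numbers) could also be met with comm instead of diff plus sed. This is a sketch under assumptions, not the accepted answer: the sample names and the intermediate file names new_sorted and old_sorted are invented, and comm requires both inputs to be sorted in the same locale, which is why LC_ALL=C is set on each command.

```shell
# Hypothetical sample data: today's raw file and yesterday's list of names.
tmp=$(mktemp -d)
printf '%s\n' 'name1 x' 'name3 y' 'name3 z' > "$tmp/new_file"
printf '%s\n' 'name1' 'name2' 'name4' > "$tmp/old_file"

# Reduce today's file to unique, sorted names; sort yesterday's list the same way.
awk '{ print $1 }' "$tmp/new_file" | LC_ALL=C sort -u > "$tmp/new_sorted"
LC_ALL=C sort -u "$tmp/old_file" > "$tmp/old_sorted"

# -1 hides lines only in new_sorted, -3 hides lines common to both,
# leaving the names that appear in old_file but not in new_file.
LC_ALL=C comm -13 "$tmp/new_sorted" "$tmp/old_sorted" > "$tmp/done_file"
result=$(cat "$tmp/done_file")
printf '%s\n' "$result"

rm -rf "$tmp"
```

One reason to consider this: the filter sed '/\.\|a\|c\|d/d' deletes every line that contains a period or the letters a, c, or d, so it is only safe when the names themselves never contain those characters. The comm approach makes no such assumption and its output is already sorted and free of duplicates.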

Question has a verified solution.