Process 90MB text file

I need to quickly process very large text files (90MB up to 1GB). I have found a solution, but I need a faster one.

1. Remove info from beginning to code word (; NS) **Taking too long
2. Remove info from every line after the first space  **Taking way too long
3. Find records found in yesterday's list, but not today's. **Works good
4. Remove Duplicates **Works good

My Solution:
sed -e '1,/; NS/d' -e 's/ .*//' new_file | diff -e - last_file | sed '/\.\|,\|a/d'| sort -u >done_file

sed -e '1,/; NS/d' (Remove info from beginning up to "; NS")
-e 's/ .*//' new_file (Remove info from every line after first space)
| diff -e - last_file (Compare to yesterday's list)
| sed '/\.\|,\|a/d' (Remove unneeded info from diff output)
| sort -u >done_file (Remove Duplicates)

Here is what I would like:
1. A better way to accomplish the first 2 parts.
2. A way to not output the line numbers from diff. Just need additions. No nums, changes, deletions

***This is needed ASAP. 250 POINTS***
Asked by: itcdr
 
wesly_chen Commented:
> A way to not output the line numbers from diff. Just need additions. No nums, changes, deletions
diff -q

> 1. Remove info from beginning to code word (; NS) **Taking too long
grep -v "; NS" filename > new_file

> 2. Remove info from every line after the first space  
Can you give more details?

Regards,

Wesly
 
wesly_chen Commented:
grep -v "; NS" new_file | sed -e 's/ .*//' | diff -eq - last_file | sed '/\.\|,\|a/d'| sort -u >done_file

Wesly
 
itcdr (Author) Commented:
1. diff -q only reports whether the two files differ. I need the differences themselves, but only the additions: no line numbers, no changes, no deletions.

2. grep -v "; NS" filename > new_file just deleted the lines containing the code word. I need it to delete everything from the very beginning of the file up to the code word. I haven't used grep before. What is it mostly used for?

3. Remove info from every line after the first space:
  ie:
name1 extra info after a space
name2 more extra info not needed
name3 more unneeded info after each record
name4 I only need the names
...

I was using (sed -e 's/ .*//' file_name) to accomplish this, but this seems to be the slowest procedure of the whole project.
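
On point 1 above (only the records themselves, with no line numbers or change markers), one option that is not mentioned in this thread is comm, which compares two sorted files line by line. The following is only a sketch under assumptions: last_names and new_names are hypothetical files that already hold one name per line, and the sample data is made up for illustration.

```shell
# Hypothetical sample data standing in for yesterday's and today's name lists.
printf '%s\n' name1 name2 name3 name4 > last_names
printf '%s\n' name2 name4 name5 > new_names

# comm expects sorted input; sort -u also removes duplicates.
# LC_ALL=C keeps the sort order and comm's comparison consistent.
LC_ALL=C sort -u last_names > last_sorted
LC_ALL=C sort -u new_names > new_sorted

# -2 hides lines found only in the second file, -3 hides lines common to both,
# leaving only the lines that are in last_sorted but not in new_sorted.
LC_ALL=C comm -23 last_sorted new_sorted > done_file
cat done_file
```

Because both inputs are already sorted and de-duplicated, the output needs no further filtering or sort -u step. Whether this is faster than the diff approach on a 90MB file would have to be measured.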
 
wesly_chen Commented:
> 3. Remove info from every line after the first space:
awk '{print $1}' new_file

Wesly
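
The awk command above covers step 2 only. As a sketch, the same awk pass could also handle step 1 (dropping everything up to the "; NS" line) by setting a flag once the marker has been seen. The flag name seen and the sample file contents are assumptions for illustration, not something from the thread.

```shell
# Hypothetical input: header lines, the "; NS" marker line, then records.
cat > new_file <<'EOF'
header line one
another header line
; NS records follow
name1 extra info after a space
name2 more extra info not needed
name1 duplicate record
EOF

# Print the first field of every line that comes after the first marker line.
# The print rule runs before the flag is set, so the marker line itself
# is not printed.
LC_ALL=C awk 'seen { print $1 } /; NS/ { seen = 1 }' new_file > names
cat names
```

One difference from sed -e 's/ .*//' worth keeping in mind: awk splits on any run of spaces or tabs and ignores leading whitespace, so lines that begin with a space give a different result. LC_ALL=C is optional here; it often speeds up text tools on plain ASCII data, but that should be measured on the real file.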
 
itcdr (Author) Commented:
Amazing!!! I didn't think I could make it any faster. Processing a 90MB file used to take 25 seconds; now that I have replaced the first 2 commands with yours, it only takes 10 seconds. That is a huge improvement. Thanks so much.

Here is the new solution:
awk '{print $1}' new_file | diff -e - old_file | sed '/\.\|a\|c\|d/d'| sort -u >done_file
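
A caution on the sed filter in the pipeline above: it deletes every line that contains a ".", "a", "c" or "d" anywhere, so it only works if the names themselves never contain those characters (for example, all-uppercase names without dots). A stricter filter could match only the ed command lines that diff -e emits. The sketch below uses made-up lowercase names and hypothetical file names (new_list, loose_out) to show the difference; it assumes a sed that supports -E.

```shell
# Hypothetical sample lists; note the lowercase names.
printf '%s\n' alpha BETA gamma > old_file   # yesterday's names
printf '%s\n' BETA > new_list               # today's names, already cut to field 1

# Filter from the thread: with GNU sed this drops any line containing
# ".", "a", "c" or "d", so "alpha" and "gamma" are lost too.
diff -e new_list old_file | sed '/\.\|a\|c\|d/d' | sort -u > loose_out

# Stricter filter: delete only whole-line ed commands such as "3a", "2,5c"
# or "7d", plus the lone "." that ends each block of text.
diff -e new_list old_file |
  sed -E -e '/^[0-9]+(,[0-9]+)?[acd]$/d' -e '/^\.$/d' |
  sort -u > done_file

cat done_file
```

This is still a heuristic: a record that itself looks like an ed command (say, "12a") would be removed as well, so it is only safe when the names cannot take that form.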