Perl Help

Hello,

I have a file with approximately 700,000 records. Each record is a phrase of roughly 3-10 words on a line, sometimes more.

I'm looking for a way to remove words which occur less than 1,000 times across the entire file.

I use the following command to generate a list of words with their occurrence counts:

tr ' ' '\n' |sort |uniq -c

I tried using the following command but it did not produce any results:

perl -ne 's/(\S+)/$s{$1}/g,print,next if !@ARGV; ++$s{$_} for split; if( eof ){ $s{$_}=$s{$_}>=1000&&$_ for keys %s}' file.txt > newfile.txt

Thanks in advance.
faithless1 asked:
namethis commented:
Your perl script is okay; you just need to specify file.txt twice, because the one-liner reads the file in two passes: the first pass counts the words, and the second pass prints each line with the low-frequency words removed:
perl -ne 's/(\S+)/$s{$1}/g,print,next if !@ARGV; ++$s{$_} for split; if( eof ){ $s{$_}=$s{$_}>=1000&&$_ for keys %s}' file.txt file.txt > newfile.txt

 
tel2 commented:
Hi FL1,

If you're happy with the word frequency list that your "tr..." line is producing, then there's probably not much need for Perl here.

Please replace your entire "tr..." line with the following, where 'wordfile' is the input file:
    tr ' ' '\n' <wordfile | sort | uniq -c | grep -v '^      '
and if you're happy with that, but you don't want the frequencies, use this instead:
    tr ' ' '\n' <wordfile | sort | uniq -c | grep -v '^      ' | cut -c9-
Note: There are 6 spaces after the "^".

If that doesn't work, please post the exact code you're running, and tell us what went wrong.
 
tel2 commented:
...and in case your version of 'uniq' spaces things differently from mine, here are alternatives to the above 2 commands:
    tr ' ' '\n' <in1 | sort | uniq -c | grep "[0-9]\{4,\}"
and to remove the frequencies:
    tr ' ' '\n' <in1 | sort | uniq -c | grep "[0-9]\{4,\}" | awk '{print $2}'
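The digit-count trick above can be sketched on a toy input, with the interval lowered from `\{4,\}` (counts >= 1000) to `\{2,\}` (counts >= 10) so it works on a small file:

```shell
# Toy input: "a" occurs 10 times, "b" only 3 times.
printf 'a a a a a a a a a a b b b\n' > in1

# Keep only lines whose uniq -c count has at least 2 digits (>= 10),
# then strip the counts; with \{4,\} this is the original >= 1000 filter.
tr ' ' '\n' < in1 | sort | uniq -c | grep "[0-9]\{2,\}" | awk '{print $2}'
# → a
```

One caveat: the grep matches digits anywhere on the line, so a word that itself contains a long enough run of digits would slip through regardless of its count.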
Question has a verified solution.
