Perl Help


I have a file with approximately 700,000 records. Each record is a phrase of roughly 3-10 words on a line, sometimes more.

I'm looking for a way to remove words which occur less than 1,000 times across the entire file.

I use the following command to generate a list of words with their instance counts:

tr ' ' '\n' |sort |uniq -c

I tried using the following command but it did not produce any results:

perl -ne 's/(\S+)/$s{$1}/g,print,next if !@ARGV; ++$s{$_} for split; if( eof ){ $s{$_}=$s{$_}>=1000&&$_ for keys %s}' file.txt > newfile.txt

Thanks in advance.
namethis Commented:
Your Perl script is okay; you just need to specify file.txt twice, because the one-liner needs two passes over the data: the first pass counts every word, and the second pass (entered once @ARGV is empty) reprints each line with the words seen fewer than 1,000 times removed:
perl -ne 's/(\S+)/$s{$1}/g,print,next if !@ARGV; ++$s{$_} for split; if( eof ){ $s{$_}=$s{$_}>=1000&&$_ for keys %s}' file.txt file.txt > newfile.txt


Hi FL1,

If you're happy with the word frequency list that your "tr..." line is producing, then there's probably not much need for Perl here.

Please replace your entire "tr... " line with the following, where 'wordfile' is the input file:
    tr ' ' '\n' <wordfile | sort | uniq -c | grep -v '^    '
and if you're happy with that, but you don't want the frequencies, use this instead:
    tr ' ' '\n' <wordfile | sort | uniq -c | grep -v '^    ' | cut -c9-
Note: There are 4 spaces after the "^". With uniq's usual 7-character count field, a count below 1,000 has at least 4 leading spaces, so those lines are filtered out, while 4-digit (and larger) counts survive.

If that doesn't work, please post the exact code you're running, and tell us what went wrong.
...and in case your version of 'uniq' spaces things differently from mine, here are alternatives to the above two commands (the pattern is anchored so it matches the count itself, not digits that happen to appear inside a word):
    tr ' ' '\n' <in1 | sort | uniq -c | grep "^ *[0-9]\{4,\}"
and to remove the frequencies:
    tr ' ' '\n' <in1 | sort | uniq -c | grep "^ *[0-9]\{4,\}" | awk '{print $2}'
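As a further fall-back that sidesteps uniq's column padding entirely (an editor's suggestion, not from the thread above), you can let awk compare the count numerically. The sample data, file names, and the threshold of 3 below are hypothetical stand-ins for the real file and the 1,000 cutoff:

```shell
# Hypothetical sample; in the real case the threshold would be 1000.
cat > words.txt <<'EOF'
red blue red
green red blue
red green
EOF

# Count the words, then keep only those whose count (awk field 1)
# meets the threshold; field 2 is the word itself.
tr ' ' '\n' < words.txt | sort | uniq -c | awk '$1 >= 3 {print $2}' > frequent.txt
cat frequent.txt
```

Only "red" appears 3 or more times in the sample, so it is the only word kept. The numeric comparison works regardless of how many spaces uniq pads the count with.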