Perl Help

Posted on 2012-08-26
Last Modified: 2012-10-30

I have a file with approximately 700,000 records. Each record is one line containing a phrase of roughly 3 to 10 words, sometimes more.

I'm looking for a way to remove words which occur fewer than 1,000 times across the entire file.

I use the following command to generate a list of words with their instance counts:

tr ' ' '\n' < file.txt | sort | uniq -c

I tried using the following command but it did not produce any results:

perl -ne 's/(\S+)/$s{$1}/g,print,next if !@ARGV; ++$s{$_} for split; if( eof ){ $s{$_}=$s{$_}>=1000&&$_ for keys %s}' file.txt > newfile.txt

Thanks in advance.
Question by:faithless1
    LVL 11

    Expert Comment

    Hi FL1,

    If you're happy with the word frequency list that your "tr..." line is producing, then there's probably not much need for Perl here.

    Please replace your entire "tr... " line with the following, where 'wordfile' is the input file:
        tr ' ' '\n' <wordfile | sort | uniq -c | grep -v '^    '
    and if you're happy with that, but you don't want the frequencies, use this instead:
        tr ' ' '\n' <wordfile | sort | uniq -c | grep -v '^    ' | cut -c9-
    Note: There are 4 spaces after the "^". This assumes GNU 'uniq', which right-aligns the count in a 7-character column, so any count below 1,000 has at least 4 leading spaces.

    If that doesn't work, please post the exact code you're running, and tell us what went wrong.
    LVL 11

    Expert Comment

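    ...and in case your version of 'uniq' spaces things differently from mine, here are alternatives to the above 2 commands. The pattern is anchored to the count column so that words which themselves contain 4 or more digits are not matched by accident:
        tr ' ' '\n' <wordfile | sort | uniq -c | grep '^ *[0-9]\{4,\} '
    and to remove the frequencies:
        tr ' ' '\n' <wordfile | sort | uniq -c | grep '^ *[0-9]\{4,\} ' | awk '{print $2}'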
    LVL 2

    Accepted Solution

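    Your Perl script is okay; you just need to specify file.txt twice. The first pass counts the words while @ARGV still holds the second file name, and the second pass rewrites each line, replacing every word that occurs fewer than 1,000 times with an empty string: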
    perl -ne 's/(\S+)/$s{$1}/g,print,next if !@ARGV; ++$s{$_} for split; if( eof ){ $s{$_}=$s{$_}>=1000&&$_ for keys %s}' file.txt file.txt > newfile.txt
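
For anyone who finds the one-liner hard to follow, here is a rough sketch of the same two-pass idea written with awk inside a shell function. It is not the accepted Perl solution, only an illustration of the technique; the function name filter_rare_words and its MIN argument are made up for this example. Unlike the Perl version, it rebuilds each line with single spaces rather than leaving gaps where words were removed.

```shell
# Hypothetical helper: keep only the words that occur at least MIN times in FILE.
# Usage: filter_rare_words MIN FILE
filter_rare_words() {
    awk -v min="$1" '
        # First pass (NR == FNR holds only while reading the first copy):
        # count every whitespace-separated word.
        NR == FNR { for (i = 1; i <= NF; i++) count[$i]++; next }
        # Second pass: rebuild each line from the words that meet the threshold.
        {
            out = ""
            for (i = 1; i <= NF; i++)
                if (count[$i] >= min)
                    out = (out == "" ? $i : out " " $i)
            print out
        }
    ' "$2" "$2"   # the same file is read twice, as in the Perl command
}

# Example: filter_rare_words 1000 file.txt > newfile.txt
```

As with the Perl command, the file name is passed twice so that the counts are complete before any line is rewritten. A line whose words are all below the threshold comes out empty.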
