Solved

Help with fixing a script

Posted on 2011-02-10
Last Modified: 2012-05-11
Hi,

Can someone please provide a working version of this script? I can't seem to figure out how to make this work. Also, if there is a better way to do this with another language that will return the same output, that would be great as well.

The main function of the script is to take a file as input (keyword.txt) with space-delimited keywords on each line, compare it against the Unix built-in dictionary, and output only the keywords that are found in the dictionary. (At least that's my understanding of it.)

Thanks

main.sh
#!/bin/sh
cat $1 | tr A-Z a-z| grep keyword | cat -v|\
sed  -e 's/\([a-z]\+\.\)\+\(com\|org\|net\)[^ ]*//g' -e 's/[^a-z ]/ /g' -e 's/[         ]\+/ /g' |\
awk '{j=-1;for (i=1;i<=NF;i++)if(length($i) < 3 || match($i,"^with$|^from$|^txt$|^and$|^for$|^the$|^com$") ) $i="";print }'|\
sed -e 's/^[        ]\+//g'|\
awk '(NF>2) {j=-1;for (i=1;i<NF;i++) if($i=="keyword"){ j=i;i=1000;};if(j>0) {if(j==1) j++;if(j==NF) j--;k=j-1;l=j+1; print $k "\t" $j "\t" $l;}}'|\
sort -u |sh ll.sh |sort -k +3|sh llc.sh |sort


llc.sh

#!/bin/sh
olda=""
while read c b a
do
if [ "$a" != "$olda" ]
 then
grep -q "^$a" /usr/share/dict/words >/dell/null
  valid=$?
 fi
 olda=$a
 [ $valid  -eq 0 ] && echo -e "$c\t$b\t$a"
 done
 

ll.sh

#!/bin/sh
olda=""
while read a b c
do
if [ "$a" != "$olda" ]
 then
grep -q "^$a" /usr/share/dict/words >/dell/null
  valid=$?
 fi
 olda=$a
 [ $valid  -eq 0 ] && echo -e "$a\t$b\t$c"
 done
 
Question by:faithless1
Accepted Solution

by:
simon3270 earned 500 total points
ID: 34870123
What this script appears to do is take the input file and remove any web addresses (text followed by a dot followed by com, net or org), any non-alphabetic characters, some common words (with, from, txt, and, for, the and com) and any 1- or 2-character words.
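
As a rough illustration of that clean-up stage, here is a sketch that does the same filtering in one awk program instead of the sed/awk chain in main.sh. The function name clean_line and the sample input are made up for this example; they are not part of the original script.

```shell
#!/bin/sh
# clean_line: hypothetical helper that mimics the clean-up stage of main.sh.
clean_line() {
  printf '%s\n' "$1" | tr 'A-Z' 'a-z' | awk '
  {
    gsub(/([a-z0-9]+\.)+(com|org|net)[^ ]*/, "")   # drop web addresses
    gsub(/[^a-z ]/, " ")                            # non-letters become spaces
    out = ""; sep = ""
    for (i = 1; i <= NF; i++) {
      if (length($i) < 3) continue                  # 1- and 2-letter words
      if ($i ~ /^(with|from|txt|and|for|the|com)$/) continue
      out = out sep $i; sep = " "
    }
    print out
  }'
}

clean_line 'Buy the Keyword tool from www.example.com/page at 50% off!'
```

With that sample line, only "buy", "keyword", "tool" and "off" survive the filtering.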

It then looks for the first word "keyword" on the line, and prints out the word before and after it (or the two words after it if it is the first word on the line), so three words.  It was supposed to print out the last three words if "keyword" was last on the line, but the "for (i=1;i<NF;i++)" loop stops before the last field.
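
To see the three-word window in isolation, the sketch below wraps just that awk step in a function, with the loop running up to and including the last field. pick_window is a hypothetical name used only for this example, and the input lines are invented.

```shell
#!/bin/sh
# pick_window: hypothetical helper; prints the three-word window around the
# first occurrence of the literal word "keyword" on a line.
pick_window() {
  printf '%s\n' "$1" | awk 'NF > 2 {
    for (i = 1; i <= NF; i++) {
      if ($i == "keyword") {
        j = i
        if (j == 1)  j++     # keyword first: use words 1 to 3
        if (j == NF) j--     # keyword last: use the last three words
        print $(j-1) "\t" $j "\t" $(j+1)
        next                 # only the first match on the line counts
      }
    }
  }'
}

pick_window 'cheap keyword tool online'   # keyword in the middle
pick_window 'keyword tool online'         # keyword first
pick_window 'best cheap keyword'          # keyword last
```

Lines with fewer than three words print nothing, matching the (NF>2) guard in main.sh.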

It then prints out the three words if the first and third word (usually either side of "keyword") are in the dictionary.  To save effort, it doesn't check the dictionary if a word is repeated, just uses the previous result.

There are some scripting changes I would make to the provided scripts:
- "/dell/null" should be /dev/null, and is anyway not required since you are using "grep -q" which doesn't produce any output.
- the pattern you are looking for in the dictionary needs a trailing "$" to mark the end of the word - otherwise it will treat a prefix as a valid word (e.g. "produc" will appear to be a valid word because it is a prefix of "produce")
- In main.sh, the second awk should be "for (i=1;i<=NF;i++) {", not "for (i=1;i<NF;i++) {", so that "keyword" at the end of the line is matched.
- The way of stopping that "for (i=1;" loop (setting i to 1000) is untidy - it would fail if there were more than 1000 fields on the line, and is just a bit obscure.  Just put "next;" after you have printed out that first keyword.
- When checking for .com, .net and .org, you should include 0-9 in your pattern, in case the domain name ends with a digit.
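
The difference the trailing "$" makes can be checked with a small stand-in word list, so the example does not depend on /usr/share/dict/words being installed. The helper names in_dict and is_prefix and the three-word list are assumptions for this sketch only.

```shell
#!/bin/sh
# in_dict: hypothetical helper; succeeds only if WORD is a whole line of DICT.
in_dict() {
  grep -q "^$1\$" "$2"
}
# is_prefix: the original, looser test; succeeds if any line starts with WORD.
is_prefix() {
  grep -q "^$1" "$2"
}

# Tiny stand-in word list (not the real system dictionary).
dict=$(mktemp)
printf '%s\n' produce product tool > "$dict"

is_prefix produc  "$dict" && echo "prefix test accepts: produc"
in_dict   produc  "$dict" || echo "whole-word test rejects: produc"
in_dict   produce "$dict" && echo "whole-word test accepts: produce"
rm -f "$dict"
```

grep -x (match the whole line) gives the same result as anchoring with "^" and "$", and may read more clearly.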

Modified scripts (with a little reformatting to make the logic more obvious) are:
main.sh
#!/bin/sh
cat $1 | tr A-Z a-z| grep keyword | cat -v|\
sed  -e 's/\([a-z0-9]\+\.\)\+\(com\|org\|net\)[^ ]*//g' -e 's/[^a-z ]/ /g' -e 's/[         ]\+/ /g' |\
awk '{j=-1;for (i=1;i<=NF;i++)if(length($i) < 3 || match($i,"^with$|^from$|^txt$|^and$|^for$|^the$|^com$") ) $i="";print }'|\
sed -e 's/^[        ]\+//g'|\
awk '(NF>2) {j=-1;
             for (i=1;i<=NF;i++) {
               if($i=="keyword") {
                 j=i;
                 if(j==1) j++;
                 if(j==NF) j--;
                 k=j-1;
                 l=j+1;
                 print $k "\t" $j "\t" $l;
                 next;
               }
             }
            }'|\
sort -u |sh ll.sh |sort -k +3|sh llc.sh |sort 



llc.sh
#!/bin/sh
olda=""
while read c b a
do
  if [ "$a" != "$olda" ]
  then
    grep -q "^$a$" /usr/share/dict/words
    valid=$?
  fi
  olda=$a
  # printf instead of "echo -e", which is not portable under /bin/sh
  [ $valid  -eq 0 ] && printf '%s\t%s\t%s\n' "$c" "$b" "$a"
done



ll.sh
#!/bin/sh
olda=""
while read a b c
do
  if [ "$a" != "$olda" ]
  then
    grep -q "^$a$" /usr/share/dict/words
    valid=$?
  fi
  olda=$a
  # printf instead of "echo -e", which is not portable under /bin/sh
  [ $valid  -eq 0 ] && printf '%s\t%s\t%s\n' "$a" "$b" "$c"
done



Assisted Solution

by:simon3270
simon3270 earned 500 total points
ID: 34870231
One drawback of the above is that if keyword is the first or last word on the line, you end up checking that "keyword" is in the dictionary and never check the middle word of the three that are output.  The main.sh below always puts keyword in the middle of the output, so both other words on the output line are always checked.

main.sh
#!/bin/sh
cat $1 | tr A-Z a-z| grep keyword | cat -v|\
sed  -e 's/\([a-z0-9]\+\.\)\+\(com\|org\|net\)[^ ]*//g' -e 's/[^a-z ]/ /g' -e 's/[         ]\+/ /g' |\
awk '{j=-1;for (i=1;i<=NF;i++)if(length($i) < 3 || match($i,"^with$|^from$|^txt$|^and$|^for$|^the$|^com$") ) $i="";print }'|\
sed -e 's/^[        ]\+//g'|\
awk '(NF>2) {j=-1;
             for (i=1;i<=NF;i++) {
               if($i=="keyword") {
                 j=i;k=j-1;l=j+1;
                 if(i==1) {k=2;l=3;}
                 if(i==NF) {k=NF-2;l=NF-1;}
                 print $k "\t" $j "\t" $l;
                 next;
               }
             }
            }'|\
sort -u |sh ll.sh |sort -k +3|sh llc.sh |sort
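
Since the question also asked about other ways to get the same output, here is one possible sketch that replaces the final "sort -u | sh ll.sh | sort | sh llc.sh | sort" stage with a single awk pass. It loads the word list into an array once instead of running grep per word. filter_triples is a hypothetical name, and the sketch assumes the word list is not empty (an empty first file would confuse the NR==FNR test).

```shell
#!/bin/sh
# filter_triples: hypothetical replacement for the ll.sh / llc.sh pair.
# Reads three-word lines on stdin and prints those whose first and third
# words both appear as whole lines in the word list given as $1.
filter_triples() {
  awk '
    NR == FNR { dict[$0] = 1; next }   # first file: load the word list
    ($1 in dict) && ($3 in dict) { print $1 "\t" $2 "\t" $3 }
  ' "$1" - | sort -u
}

# Example with a small stand-in word list; in the real pipeline the
# argument would be /usr/share/dict/words.
words=$(mktemp)
printf '%s\n' cheap tool online > "$words"
printf 'cheap\tkeyword\ttool\nxyzzy\tkeyword\ttool\n' | filter_triples "$words"
rm -f "$words"
```

Only the first sample line is printed, because "xyzzy" is not in the stand-in list. Like the grep version, the lookup is case-sensitive.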



Author Closing Comment

by:faithless1
ID: 34878001
Superb, thank you very much!!!!!! It took me a while to understand everything above and I think it now makes perfect sense. Thanks again for your help