Solved

Help with fixing a script

Posted on 2011-02-10
Last Modified: 2012-05-11
Hi,

Can someone please provide a working version of this script? I can't seem to figure out how to make this work. Also, if there is a better way to do this with another language that will return the same output, that would be great as well.

The main function of the script is to take a file as input (keyword.txt) with space-delimited keywords on each line, compare it against the built-in Unix dictionary, and output only the keywords that are found in the dictionary. (At least, that's my understanding of it.)

Thanks

main.sh
#!/bin/sh
cat $1 | tr A-Z a-z| grep keyword | cat -v|\
sed  -e 's/\([a-z]\+\.\)\+\(com\|org\|net\)[^ ]*//g' -e 's/[^a-z ]/ /g' -e 's/[         ]\+/ /g' |\
awk '{j=-1;for (i=1;i<=NF;i++)if(length($i) < 3 || match($i,"^with$|^from$|^txt$|^and$|^for$|^the$|^com$") ) $i="";print }'|\
sed -e 's/^[        ]\+//g'|\
awk '(NF>2) {j=-1;for (i=1;i<NF;i++) if($i=="keyword"){ j=i;i=1000;};if(j>0) {if(j==1) j++;if(j==NF) j--;k=j-1;l=j+1; print $k "\t" $j "\t" $l;}}'|\
sort -u |sh ll.sh |sort -k +3|sh llc.sh |sort


llc.sh

#!/bin/sh
olda=""
while read c b a
do
if [ "$a" != "$olda" ]
 then
grep -q "^$a" /usr/share/dict/words >/dell/null
  valid=$?
 fi
 olda=$a
 [ $valid  -eq 0 ] && echo -e "$c\t$b\t$a"
 done
 

ll.sh

#!/bin/sh
olda=""
while read a b c
do
if [ "$a" != "$olda" ]
 then
grep -q "^$a" /usr/share/dict/words >/dell/null
  valid=$?
 fi
 olda=$a
 [ $valid  -eq 0 ] && echo -e "$a\t$b\t$c"
 done
 
Question by:faithless1

Accepted Solution

by:
simon3270 earned 500 total points
ID: 34870123
What this script appears to do is take the input file, lowercase it, keep only the lines containing "keyword", and then remove any web addresses (text followed by a dot followed by com, net or org), any non-alphabetic characters, some common words (with, from, txt, and, for, the and com) and any 1- or 2-character words.
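
As a rough illustration of that clean-up stage on its own, here is a minimal sketch. The function name clean_line and the sample sentence are made up for the example, and it uses "sed -E" (extended regular expressions) rather than the GNU-specific \+ and \| escapes in the script, so treat it as an approximation of the pipeline rather than a drop-in replacement.

```shell
#!/bin/sh
# clean_line is a hypothetical helper name, not part of the original scripts.
# It reads text on stdin and writes the cleaned-up line(s) to stdout.
clean_line() {
  tr 'A-Z' 'a-z' |
  # 1st expression: drop anything that looks like name.com / name.org / name.net
  # 2nd expression: turn every character that is not a-z or space into a space
  sed -E -e 's/([a-z0-9]+\.)+(com|org|net)[^ ]*//g' -e 's/[^a-z ]/ /g' |
  awk '{
    out = ""
    for (i = 1; i <= NF; i++) {
      if (length($i) < 3) continue                               # 1- and 2-letter words
      if ($i ~ /^(with|from|txt|and|for|the|com)$/) continue     # common words
      out = out (out == "" ? "" : " ") $i
    }
    print out
  }'
}

echo "Visit www.example.com for the BEST keyword tools, v2!" | clean_line
# prints: visit best keyword tools
```

Unlike the original, this sketch rebuilds each line with single spaces inside awk, so the extra sed passes that squeeze and trim whitespace are not needed here.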

It then looks for the first word "keyword" on the line, and prints out the word before and after it (or the two words after it if it is the first word on the line), so three words.  It was supposed to print out the last three words if "keyword" was last on the line, but the "for (i" loop stops before the last field.
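
To see the effect of the loop bound, the sketch below runs the same selection logic with both bounds. pick_three and its skip_last argument are hypothetical names introduced only for this demonstration; the awk body is a reformatted approximation of the second awk in main.sh, not a copy of it.

```shell
#!/bin/sh
# pick_three is a hypothetical helper.
# $1 = 1 mimics the original loop bound (i<NF), $1 = 0 uses the corrected bound (i<=NF).
pick_three() {
  awk -v skip_last="$1" 'NF > 2 {
    last = NF - skip_last
    for (i = 1; i <= last; i++) {
      if ($i == "keyword") {
        j = i
        if (j == 1) j++       # keyword is first: use fields 1..3
        if (j == NF) j--      # keyword is last: use the last three fields
        print $(j-1) "\t" $j "\t" $(j+1)
        next                  # only the first "keyword" on a line counts
      }
    }
  }'
}

echo "best cheap keyword" | pick_three 1   # original bound: prints nothing
echo "best cheap keyword" | pick_three 0   # corrected bound: best, cheap, keyword (tab separated)
```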

It then prints out the three words if the first and third word (usually either side of "keyword") are in the dictionary.  To save effort, it doesn't check the dictionary if a word is repeated, just uses the previous result.
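
The "reuse the previous result" idea can be sketched as below. filter_by_first_word is a hypothetical name, the dictionary path is passed as an argument so the example can run against a small temporary file instead of /usr/share/dict/words, and it uses "grep -qxF" (whole-line, literal match) and printf, which are my choices rather than what the original scripts do.

```shell
#!/bin/sh
# filter_by_first_word is a hypothetical helper.
# $1 = dictionary file, one word per line; stdin = lines of three words.
filter_by_first_word() {
  dict_file=$1
  prev=""
  valid=1
  while read -r a b c
  do
    if [ "$a" != "$prev" ]
    then
      # Only consult the dictionary when the word changes.
      if grep -qxF -- "$a" "$dict_file"; then valid=0; else valid=1; fi
      prev=$a
    fi
    [ "$valid" -eq 0 ] && printf '%s\t%s\t%s\n' "$a" "$b" "$c"
  done
  return 0
}

# Small stand-in dictionary (assumption: the real one is /usr/share/dict/words).
tmp_dict=$(mktemp)
printf '%s\n' cheap tools > "$tmp_dict"
printf '%s\n' "cheap keyword tools" "cheap keyword zzz" "xqz keyword tools" |
  filter_by_first_word "$tmp_dict"
rm -f "$tmp_dict"
```

The caching only pays off when equal words are adjacent, which is why main.sh sorts the lines before piping them into each checking script.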

There are some scripting changes I would make to the provided scripts:
- "/dell/null" should be /dev/null, and is in any case not required, since "grep -q" doesn't produce any output.
- The pattern you are looking for in the dictionary needs a trailing "$" to mark the end of the word. Otherwise it will treat a prefix as a valid word (e.g. "produc" will appear to be a valid word because it is a prefix of "produce").
- In main.sh, the loop in the second awk should run while "i<=NF", not "i<NF", so that "keyword" at the end of the line is matched.
- The way of stopping that "for (i=1;" loop (setting i to 1000) is untidy: it would fail if there were more than 1000 fields on the line, and is just a bit obscure.  Just put "next;" after you have printed out that first keyword.
- When checking for .com, .net and .org, you should include 0-9 in your pattern, in case the domain name contains a digit.
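
The prefix problem from the list above is easy to reproduce with a two-word stand-in dictionary. prefix_match and whole_match are hypothetical helper names, and the temporary file stands in for /usr/share/dict/words.

```shell
#!/bin/sh
# Both helpers take: $1 = word to look up, $2 = dictionary file.
# Anchored only at the start, as in the original scripts.
prefix_match() { grep -q "^$1" "$2"; }
# Anchored at both ends, as suggested above.
whole_match() { grep -q "^$1\$" "$2"; }

tmp_dict=$(mktemp)
printf '%s\n' produce product > "$tmp_dict"

for w in produc produce
do
  if prefix_match "$w" "$tmp_dict"; then echo "$w: accepted by ^word"; fi
  if whole_match "$w" "$tmp_dict"; then echo "$w: accepted by ^word\$"; fi
done
rm -f "$tmp_dict"
```

With this dictionary, "produc" is accepted only by the start-anchored pattern, while "produce" is accepted by both.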

Modified scripts (with a little reformatting to make the logic more obvious) are:
main.sh
#!/bin/sh
cat $1 | tr A-Z a-z| grep keyword | cat -v|\
sed  -e 's/\([a-z0-9]\+\.\)\+\(com\|org\|net\)[^ ]*//g' -e 's/[^a-z ]/ /g' -e 's/[         ]\+/ /g' |\
awk '{j=-1;for (i=1;i<=NF;i++)if(length($i) < 3 || match($i,"^with$|^from$|^txt$|^and$|^for$|^the$|^com$") ) $i="";print }'|\
sed -e 's/^[        ]\+//g'|\
awk '(NF>2) {j=-1;
             for (i=1;i<=NF;i++) {
               if($i=="keyword") {
                 j=i;
                 if(j==1) j++;
                 if(j==NF) j--;
                 k=j-1;
                 l=j+1;
                 print $k "\t" $j "\t" $l;
                 next;
               }
             }
            }'|\
sort -u |sh ll.sh |sort -k +3|sh llc.sh |sort 

llc.sh
#!/bin/sh
olda=""
while read c b a
do
  if [ "$a" != "$olda" ]
  then
    grep -q "^$a$" /usr/share/dict/words
    valid=$?
  fi
  olda=$a
  [ $valid  -eq 0 ] && echo -e "$c\t$b\t$a"
done

ll.sh
#!/bin/sh
olda=""
while read a b c
do
  if [ "$a" != "$olda" ]
  then
    grep -q "^$a$" /usr/share/dict/words
    valid=$?
  fi
  olda=$a
  [ $valid  -eq 0 ] && echo -e "$a\t$b\t$c"
done

Assisted Solution

by:simon3270
simon3270 earned 500 total points
ID: 34870231
One drawback of the above is that if "keyword" is the first or last word on the line, it ends up in the first or third output column, so you check that "keyword" itself is in the dictionary and never check the word in the middle column.  The main.sh below always puts "keyword" in the middle output column, so both other words on the output line are always checked.

main.sh
#!/bin/sh
cat $1 | tr A-Z a-z| grep keyword | cat -v|\
sed  -e 's/\([a-z0-9]\+\.\)\+\(com\|org\|net\)[^ ]*//g' -e 's/[^a-z ]/ /g' -e 's/[         ]\+/ /g' |\
awk '{j=-1;for (i=1;i<=NF;i++)if(length($i) < 3 || match($i,"^with$|^from$|^txt$|^and$|^for$|^the$|^com$") ) $i="";print }'|\
sed -e 's/^[        ]\+//g'|\
awk '(NF>2) {j=-1;
             for (i=1;i<=NF;i++) {
               if($i=="keyword") {
                 j=i;k=j-1;l=j+1;
                 if(i==1) {k=2;l=3;}
                 if(i==NF) {k=NF-2;l=NF-1;}
                 print $k "\t" $j "\t" $l;
                 next;
               }
             }
            }'|\
sort -u |sh ll.sh |sort -k +3|sh llc.sh |sort
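
For a quick check of the three cases (keyword first, in the middle, and last), the selection step can be run by itself. pick_around is a hypothetical wrapper name and the sample lines are invented; the awk body follows the logic of the script above.

```shell
#!/bin/sh
# pick_around is a hypothetical wrapper around the keyword-in-the-middle logic.
pick_around() {
  awk 'NF > 2 {
    for (i = 1; i <= NF; i++) {
      if ($i == "keyword") {
        k = i - 1; l = i + 1                      # default: neighbours on each side
        if (i == 1)  { k = 2;      l = 3 }        # keyword first: the two words after it
        if (i == NF) { k = NF - 2; l = NF - 1 }   # keyword last: the two words before it
        print $k "\t" $i "\t" $l
        next
      }
    }
  }'
}

printf '%s\n' "keyword tools online" "best keyword tools" "best cheap keyword" | pick_around
# Output (tab separated), one line per input line:
#   tools  keyword  online
#   best   keyword  tools
#   best   keyword  cheap
```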


Author Closing Comment

by:faithless1
ID: 34878001
Superb, thank you very much!!!!!! It took me a while to understand everything above and I think it now makes perfect sense. Thanks again for your help