Solved

Help with fixing a script

Posted on 2011-02-10
Last Modified: 2012-05-11
Hi,

Can someone please provide a working version of this script? I can't seem to figure out how to make this work. Also, if there is a better way to do this with another language that will return the same output, that would be great as well.

The main function of the script is to take a file as input (keyword.txt) with space-delimited keywords on each line, compare it against the Unix built-in dictionary, and output only the keywords that are found in the dictionary. (At least that's my understanding of it.)

Thanks

main.sh
#!/bin/sh
cat $1 | tr A-Z a-z| grep keyword | cat -v|\
sed  -e 's/\([a-z]\+\.\)\+\(com\|org\|net\)[^ ]*//g' -e 's/[^a-z ]/ /g' -e 's/[         ]\+/ /g' |\
awk '{j=-1;for (i=1;i<=NF;i++)if(length($i) < 3 || match($i,"^with$|^from$|^txt$|^and$|^for$|^the$|^com$") ) $i="";print }'|\
sed -e 's/^[        ]\+//g'|\
awk '(NF>2) {j=-1;for (i=1;i<NF;i++) if($i=="keyword"){ j=i;i=1000;};if(j>0) {if(j==1) j++;if(j==NF) j--;k=j-1;l=j+1; print $k "\t" $j "\t" $l;}}'|\
sort -u |sh ll.sh |sort -k +3|sh llc.sh |sort


llc.sh

#!/bin/sh
olda=""
while read c b a
do
if [ "$a" != "$olda" ]
 then
grep -q "^$a" /usr/share/dict/words >/dell/null
  valid=$?
 fi
 olda=$a
 [ $valid  -eq 0 ] && echo -e "$c\t$b\t$a"
 done
 

ll.sh

#!/bin/sh
olda=""
while read a b c
do
if [ "$a" != "$olda" ]
 then
grep -q "^$a" /usr/share/dict/words >/dell/null
  valid=$?
 fi
 olda=$a
 [ $valid  -eq 0 ] && echo -e "$a\t$b\t$c"
 done
 
Question by:faithless1
LVL 20

Accepted Solution

by:
simon3270 earned 2000 total points
ID: 34870123
What this script appears to do is take the input file and remove any web addresses (text followed by a dot followed by com, net or org), any non-alphabetic characters, some common words (with, from, txt, and, for, the and com) and any 1- or 2-character words.

It then looks for the first occurrence of the word "keyword" on the line and prints it together with the word before it and the word after it (or with the two words after it if it is the first word on the line), so three words in total.  It was supposed to print out the last three words if "keyword" was last on the line, but the "for (i=1;i<NF;i++)" loop stops before the last field.

It then prints out the three words if the first and third word (usually either side of "keyword") are in the dictionary.  To save effort, it doesn't check the dictionary if a word is repeated, just uses the previous result.

There are some scripting changes I would make to the provided scripts:
- "/dell/null" should be /dev/null, and is anyway not required since you are using "grep -q" which doesn't produce any output.
- the pattern you are looking for in the dictionary needs a trailing "$" to mark the end of the word - otherwise it will treat a prefix as a valid word (e.g. "produc" will appear to be a valid word because it is a prefix of "produce")
- In main.sh, the second awk should be "for (i=1;i<=NF;i++) {", not "for (i=1;i<NF;i++) {", so that "keyword" at the end of the line is matched.
- The way of stopping that "for (i=1;" loop (setting i to 1000) is untidy - it would fail if there were more than 1000 fields on the line, and is just a bit obscure.  Just put "next;" after you have printed out that first keyword.
- When checking for .com, .net and .org, you should include 0-9 in your pattern, in case the domain name ends with a digit.
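The prefix problem from the second point can be seen in a small sketch. The three-word dictionary file and the function names prefix_match and exact_match are made up for illustration; the real scripts read /usr/share/dict/words, and mktemp is assumed to be available.

```shell
#!/bin/sh
# Hypothetical three-word dictionary standing in for /usr/share/dict/words.
dict=$(mktemp)
printf '%s\n' produce product apple > "$dict"

# Succeeds if the word starts any dictionary line (the original behaviour).
prefix_match() { grep -q "^$1" "$dict"; }

# Succeeds only if the word is a whole dictionary line (trailing "$" added).
exact_match() { grep -q "^$1\$" "$dict"; }

prefix_match produc && prefix_result=found || prefix_result=missing
exact_match produc && exact_result=found || exact_result=missing
exact_match produce && whole_result=found || whole_result=missing

echo "prefix: $prefix_result, exact: $exact_result, whole word: $whole_result"
rm -f "$dict"
```

With this made-up dictionary, "produc" is accepted by the unanchored pattern and rejected by the anchored one, while "produce" is still accepted.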

Modified scripts (with a little reformatting to make the logic more obvious) are:
main.sh
#!/bin/sh
cat $1 | tr A-Z a-z| grep keyword | cat -v|\
sed  -e 's/\([a-z0-9]\+\.\)\+\(com\|org\|net\)[^ ]*//g' -e 's/[^a-z ]/ /g' -e 's/[         ]\+/ /g' |\
awk '{j=-1;for (i=1;i<=NF;i++)if(length($i) < 3 || match($i,"^with$|^from$|^txt$|^and$|^for$|^the$|^com$") ) $i="";print }'|\
sed -e 's/^[        ]\+//g'|\
awk '(NF>2) {j=-1;
             for (i=1;i<=NF;i++) {
               if($i=="keyword") {
                 j=i;
                 if(j==1) j++;
                 if(j==NF) j--;
                 k=j-1;
                 l=j+1;
                 print $k "\t" $j "\t" $l;
                 next;
               }
             }
            }'|\
sort -u |sh ll.sh |sort -k +3|sh llc.sh |sort 



llc.sh
#!/bin/sh
olda=""
while read c b a
do
  if [ "$a" != "$olda" ]
  then
    grep -q "^$a$" /usr/share/dict/words
    valid=$?
  fi
  olda=$a
  [ $valid  -eq 0 ] && echo -e "$c\t$b\t$a"
done



ll.sh
#!/bin/sh
olda=""
while read a b c
do
  if [ "$a" != "$olda" ]
  then
    grep -q "^$a$" /usr/share/dict/words
    valid=$?
  fi
  olda=$a
  [ $valid  -eq 0 ] && echo -e "$a\t$b\t$c"
done
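
As a possible alternative to running grep once per line, the two dictionary checks could be sketched as a single awk pass that loads the dictionary into an array first. This is only a sketch under assumptions: filter_by_dict is a hypothetical helper, the sample dictionary is made up, and it checks the first and third words together rather than in two sorted passes as ll.sh and llc.sh do.

```shell
#!/bin/sh
# filter_by_dict DICT: read three-word lines on stdin and print only those
# whose first and third words appear as whole lines in the file DICT.
filter_by_dict() {
  awk -v OFS='\t' '
    NR==FNR { words[$0] = 1; next }     # first file: load the dictionary
    ($1 in words) && ($3 in words) { print $1, $2, $3 }
  ' "$1" -
}

# Made-up dictionary and input, standing in for /usr/share/dict/words
# and the output of the second awk in main.sh.
dict=$(mktemp)
printf '%s\n' alpha beta gamma > "$dict"
result=$(printf 'alpha\tkeyword\tbeta\nzzzq\tkeyword\tgamma\n' | filter_by_dict "$dict")
rm -f "$dict"
echo "$result"
```

Because the array lookup is an exact match on the whole word, no "^" or "$" anchors are needed, and a repeated word costs nothing extra to check.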


 
LVL 20

Assisted Solution

by:simon3270
simon3270 earned 2000 total points
ID: 34870231
One drawback of the above is that if "keyword" is the first or last word on the line, you end up checking that "keyword" itself is in the dictionary, and never check the middle word of the three that are output.  If you use the main.sh below, it will always put "keyword" in the middle of the output line, so both of the other words on that line will always be checked.

main.sh
#!/bin/sh
cat $1 | tr A-Z a-z| grep keyword | cat -v|\
sed  -e 's/\([a-z0-9]\+\.\)\+\(com\|org\|net\)[^ ]*//g' -e 's/[^a-z ]/ /g' -e 's/[         ]\+/ /g' |\
awk '{j=-1;for (i=1;i<=NF;i++)if(length($i) < 3 || match($i,"^with$|^from$|^txt$|^and$|^for$|^the$|^com$") ) $i="";print }'|\
sed -e 's/^[        ]\+//g'|\
awk '(NF>2) {j=-1;
             for (i=1;i<=NF;i++) {
               if($i=="keyword") {
                 j=i;k=j-1;l=j+1;
                 if(i==1) {k=2;l=3;}
                 if(i==NF) {k=NF-2;l=NF-1;}
                 print $k "\t" $j "\t" $l;
                 next;
               }
             }
            }'|\
sort -u |sh ll.sh |sort -k +3|sh llc.sh |sort
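
To see what the revised awk step does on its own, here is a small sketch that feeds it three made-up lines, with "keyword" first, in the middle, and last. The function name pick_neighbours and the sample words are assumptions for illustration only.

```shell
#!/bin/sh
# The final awk stage of main.sh, wrapped in a hypothetical helper function.
pick_neighbours() {
  awk '(NF>2) {
         for (i=1;i<=NF;i++) {
           if ($i=="keyword") {
             j=i; k=j-1; l=j+1;
             if (i==1)  { k=2; l=3; }        # keyword first: use next two words
             if (i==NF) { k=NF-2; l=NF-1; }  # keyword last: use previous two
             print $k "\t" $j "\t" $l;       # keyword always in the middle
             next;
           }
         }
       }'
}

first=$(echo "keyword alpha beta gamma" | pick_neighbours)
middle=$(echo "alpha keyword beta gamma" | pick_neighbours)
last=$(echo "alpha beta gamma keyword" | pick_neighbours)

printf '%s\n' "$first" "$middle" "$last"
```

In all three cases "keyword" lands in the second column, so ll.sh and llc.sh end up checking the two neighbouring words rather than "keyword" itself.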


 

Author Closing Comment

by:faithless1
ID: 34878001
Superb, thank you very much!!!!!! It took me a while to understand everything above and I think it now makes perfect sense. Thanks again for your help