Solved

Google Keyword Research

Posted on 2011-03-08
427 Views
Last Modified: 2012-05-11
Hello,

I was looking at Google's keyword tool earlier and noticed that if you input a website, the tool generates a list of keywords.

https://adwords.google.com/select/KeywordToolExternal

I have a large list of domains and thought it would be easier to accomplish this with 'wget', passing a list of domains as an argument and printing one word before "credit" and one word after it to generate a list.

I came across this online:

wget -m -i input_file (domains)

Is there a better way of doing this without having to download each site locally? Also, what would be the command to grep for "credit" and output the surrounding words?

Thanks
Question by:faithless1
11 Comments
 
LVL 12

Expert Comment

by:mwochnick
ID: 35078635
Here's a great article on grep with examples: http://www.thegeekstuff.com/2009/03/15-practical-unix-grep-command-examples/
Still thinking about the other part of your question.
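For the "one word before and one word after" part, a rough sketch using grep's -o option (print only the matching text) with an extended regex could look like the following; page.txt is just a placeholder for whatever text you are searching:

# Print one word before and one word after each "credit" (case-insensitive)
# -o prints only the matching part; page.txt is a placeholder file name
grep -oiE '[[:alnum:]]+ +credit +[[:alnum:]]+' page.txt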
 
LVL 12

Expert Comment

by:mwochnick
ID: 35078667
You could try using curl to POST the form for each website, but I'm not sure you can get past the CAPTCHA part of the form.
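Just to illustrate the shape of that approach, a generic curl form POST looks like the sketch below; the endpoint and the "website" field are placeholders, not the real keyword tool parameters, and the CAPTCHA would still block an automated run:

# Hypothetical sketch only: the URL and the form field name are placeholders
curl -d "website=http://www.example.com" http://www.example.com/keyword-form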
 
LVL 31

Expert Comment

by:farzanj
ID: 35082382
So you have one URL per line in your file, right?

This should do close to what you want.
You'll need lynx installed on your system.
 
urls=$(<input_file)
for url in $urls
do
    lynx -dump | grep -i credit >> output_file
done

 

Author Comment

by:faithless1
ID: 35088180
Thanks.

If possible, can you provide usage instructions, since I'm fairly new to this.

Do I place this into a file with a .php name and then run 'php script.php'? Thank you


urls=$(<domains.txt)
for url in $urls
do
    lynx -dump | grep -i credit >> domains_output.txt
done
 
LVL 31

Expert Comment

by:farzanj
ID: 35088220
This is bash.
You can paste or type it at the shell, if your default shell is bash.
Or, you can paste it into a file, script.sh.

 
#!/bin/bash

urls=$(<domains.txt)
for url in $urls
do
    lynx -dump | grep -i credit >> domains_output.txt
done

 

Author Comment

by:faithless1
ID: 35089809
Thanks,

I'm getting an empty output file.



Here's the script I'm using (script.sh)

#!/bin/bash



urls=$(<domains.txt)
for url in $urls
do
    links -dump | grep -i credit >> domains_output.txt
done


domains.txt includes

http://www.site.com

command
sh script.sh

Can't seem to figure out why there is no output

Thanks
 
LVL 31

Assisted Solution

by:farzanj
farzanj earned 500 total points
ID: 35089973
You were not supposed to run it like this.

You should have done this

#make it executable--just once
chmod +x script.sh

#run it
./script.sh


Second, try some domain where you know you should find "credit" and do this:
lynx -dump | grep -i credit
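A concrete version of that check, passing the URL to lynx explicitly (example.com is only a placeholder), would be:

# Test one domain you know mentions "credit"; the URL is a placeholder
lynx -dump http://www.example.com | grep -i credit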


Also, I am using lynx, while you are using links.

If you don't have lynx, install it or use curl.
 

Author Comment

by:faithless1
ID: 35090710
I just tried running it, but there's still no output.

#!/bin/bash

urls=$(<domains.txt)
for url in $urls
do
    lynx -dump | grep -i credit >> domains_output.txt
done

I did test lynx on the command line and it worked fine
lynx -dump | grep -i credit
 
LVL 31

Accepted Solution

by:farzanj
farzanj earned 500 total points
ID: 35093779
Try this:
#!/bin/bash

urls=$(<domains.txt)
for url in $urls
do
    lynx -dump $url | grep -i credit >> domains_output.txt
done
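As a small follow-up sketch (a variation of my own, not part of the accepted script), the same loop can tag each matching line with the domain it came from, which helps when domains.txt lists many sites:

#!/bin/bash
# Variation: prefix each match with its URL so the sites stay distinguishable
urls=$(<domains.txt)
for url in $urls
do
    lynx -dump "$url" | grep -i credit | sed "s|^|$url: |" >> domains_output.txt
done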

 

Author Comment

by:faithless1
ID: 35178451
Thanks, that worked! One other question: how would I include the entire site and not just the home page (it's currently only including the index page)? Thank you.
 
LVL 31

Assisted Solution

by:farzanj
farzanj earned 500 total points
ID: 35180037
Please read the following docs.

http://daniel.haxx.se/docs/curl-vs-wget.html
http://williamjxj.wordpress.com/2010/12/17/curl-vs-wget-vs-lynx/
http://linux.die.net/man/1/lynx

You can try wget instead of lynx. It downloads recursively, except where the web host restricts it.
I think the easiest way to do it would be to know your complete URLs.
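A minimal sketch of that wget approach, assuming the host allows recursive fetching (the domain, depth, and directory name below are placeholders):

# Mirror one site up to 2 links deep into site_copy/, then grep the pages
# robots.txt or server limits may block recursive downloads
wget -r -l 2 -nv -P site_copy http://www.example.com
grep -ri credit site_copy >> domains_output.txt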
