Solved

Scraping AdWords Sponsored Search Listings. Is It Possible?

Posted on 2010-09-17
Medium Priority
486 Views
Last Modified: 2012-05-10
Hi,

Is it now possible to scrape sponsored search listings?  

I am clueless about the JS encryption. It now seems that it is not possible. If so, how do companies like Keyword Spy get their data? When did Google change the page encryption? Understood that it breaks the TOS, but I just want to find out how companies like Keyword Spy get their data, unless they are using manual labor.

Thanks  
Question by:dmontgom
20 Comments
 
LVL 3

Expert Comment

by:T1750
ID: 33711590
The easiest way to do it is to use iMacros:

http://www.iopus.com/

Though if you want to call it from Python you'll have to shell out a fair bit of money.

The second alternative would be to use a Python scraping tool (I'd recommend Scrapy or twill; twill is more stable and easier to use) together with a JavaScript interpreter such as:

http://www.mozilla.org/rhino/

A third solution is to simply trace the decryption in your web browser (e.g. Firefox with Venkman), then copy it in Python. However, if they change their encryption you'd need to redo your code.
0
 
LVL 3

Expert Comment

by:T1750
ID: 33711600
You could be super naughty and use iMacros normal edition and control it from Python by sending it emulated keypresses.

This will get you started:

http://stackoverflow.com/questions/1262310/simulate-keypress-in-a-linux-c-console-application

I don't really want to help any more with that, though, as it probably violates their TOS and they are a good company.
0
 
LVL 3

Expert Comment

by:T1750
ID: 33711639
Another solution would be to control a real web browser with AutoIt instead of iMacros;

http://www.autoitscript.com/autoit3/index.shtml

Run a small Windows VM that does nothing but the scraping and has a Python XML-RPC or similar server from which your host machine can get the results.
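
As a rough illustration of the "Python XML-RPC server inside the VM" idea, here is a minimal sketch using only the standard library. The `ResultStore` class, its `add_result` and `fetch_results` methods, and the sample row are hypothetical names chosen for this example, not part of any tool mentioned in this thread; a real setup would also need authentication and a non-loopback address.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


class ResultStore:
    """Hypothetical in-memory buffer: holds rows until the host collects them."""

    def __init__(self):
        self._rows = []
        self._lock = threading.Lock()

    def add_result(self, keyword, ad_text):
        # Called by whatever automation runs inside the VM.
        with self._lock:
            self._rows.append({"keyword": keyword, "ad_text": ad_text})
        return True

    def fetch_results(self):
        # Called by the host; returns everything collected so far and clears it.
        with self._lock:
            rows, self._rows = self._rows, []
        return rows


def start_server(host="127.0.0.1", port=0):
    # port=0 asks the OS for any free port (convenient for a local demo).
    server = SimpleXMLRPCServer((host, port), logRequests=False)
    server.register_instance(ResultStore())
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


if __name__ == "__main__":
    server = start_server()
    host, port = server.server_address
    proxy = ServerProxy(f"http://{host}:{port}/")
    proxy.add_result("example keyword", "Example ad text")
    print(proxy.fetch_results())
    server.shutdown()
```

The host side only ever needs the `ServerProxy` part, so the VM can be rebuilt or swapped without touching the code that consumes the results.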
0

 
LVL 3

Expert Comment

by:T1750
ID: 33711656
And a final solution would be to have your own "real web browser" by using WebKit and integrating it with Python.

Any one of these will work.
0
 

Author Comment

by:dmontgom
ID: 33713315
Thanks for the comments. I will evaluate them all. Again... I'm not interested in doing it; I just want to know if it is possible. I am really interested in whether this is the actual route by which companies like Keyword Spy acquire their data.

Thanks
0
 
LVL 3

Expert Comment

by:T1750
ID: 33723007
Taking an educated guess, I'd say they almost certainly run real web browsers in virtual machines under a hypervisor and use iMacros.
0
 
LVL 3

Expert Comment

by:T1750
ID: 33723014
It's 100% possible and not very hard.
0
 

Author Comment

by:dmontgom
ID: 33786972
T1750...

Wow... that would be easy then. Would not a scripting language like Python be easier? Can it still be done using something like mechanize?
0
 

Author Comment

by:dmontgom
ID: 33787019
Well... it does not really do the decryption (iMacros, that is).

This is really about how to decrypt. Just try doing a view page source when you do a Google search.
0
 
LVL 3

Expert Comment

by:T1750
ID: 33789271
You're missing a couple of points. They almost certainly do it with iMacros (or maybe autoit) in a vm farm because:

1) It's fast and easy to set up.
2) That IS how they do the decrypt, they just let the browser do it for them
3) If they reversed the encryption algorithm in-use it and the encryption algorithm gets changed then it doesn't matter because the browser will still decrypt it.

I could have a setup like that going in under a day, with Python driving iMacros, or AutoIt driving Python. Had I known there was such an easy business opportunity there I might have set up such a service myself!
0
 
LVL 3

Expert Comment

by:T1750
ID: 33789293
Point 3 above was meant to read: "If they had reverse-engineered the encryption algorithm in use and copied it in Python, then any change to the algorithm would make that effort a waste and they'd have to do it again. With the AutoIt/iMacros methods it doesn't matter at all, because the browser will still decrypt it."
0
 
LVL 3

Expert Comment

by:T1750
ID: 33789364
I think you may not understand how easy it is to control virtual machine farms. You can have scripts turning them on and off and bringing them online and offline as needed, and it's not hard work. Around here $500 would buy you a lot of second-hand P4 desktops, which will work fine as slaves hosting several stripped-bare OS installs (probably Windows XP ripped to shreds with nLite) with tiny RAM allocations, just doing the scraping and reporting. At a decent coder's hourly rate it would cost them a hell of a lot more to do it by any other method. While other methods are possible, they're simply not sensible.

Think about it.
0
 
LVL 3

Expert Comment

by:T1750
ID: 33789421
Just to clarify: They are not sitting there watching the screens, the VMs are all running in the background doing their duty, and there are not rows of monitors of web pages being viewed. The only time they even look inside a VM is when it reports it has an exception or has stopped responding for some reason, then they adjust their code to fix the issue so that in future it isn't repeated.
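
The "only look inside a VM when it reports an exception or stops responding" idea could be sketched as a small heartbeat tracker on the controlling machine. This is an assumption about how one might build it, not a description of what any company actually runs; `HeartbeatMonitor`, the VM names, and the timeout value are all made up for illustration.

```python
import time


class HeartbeatMonitor:
    """Hypothetical tracker: remembers when each VM last checked in."""

    def __init__(self, timeout_seconds=120.0, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self._clock = clock          # injectable so the logic is easy to test
        self._last_seen = {}         # vm_name -> time of last contact
        self._errors = {}            # vm_name -> last reported error message

    def heartbeat(self, vm_name):
        # A healthy check-in clears any earlier error for that VM.
        self._last_seen[vm_name] = self._clock()
        self._errors.pop(vm_name, None)

    def report_error(self, vm_name, message):
        self._last_seen[vm_name] = self._clock()
        self._errors[vm_name] = message

    def needs_attention(self):
        # Returns only the VMs a human should look at, with the reason.
        now = self._clock()
        flagged = {}
        for vm_name, seen in self._last_seen.items():
            if vm_name in self._errors:
                flagged[vm_name] = "error: " + self._errors[vm_name]
            elif now - seen > self.timeout_seconds:
                flagged[vm_name] = "no heartbeat"
        return flagged


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_seconds=120.0)
    monitor.heartbeat("vm-01")
    monitor.report_error("vm-02", "element not found")
    print(monitor.needs_attention())
```

Everything that is not in the `needs_attention()` result can be left running unattended in the background.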
0
 
LVL 3

Expert Comment

by:T1750
ID: 33789471
And one more clarification: the MASSIVE advantage of iMacros/AutoIt VM farms controlled by Python scripts, rather than scraping directly with Python, is that for AdWords to be visible to a user they have to appear in the browser. So the browser is ALWAYS going to be able to get the AdWords no matter what happens and no matter what changes in future. Once you set it up, the only thing you're ever going to have to change is maybe a couple of DOM ids every now and again if they move stuff about.
0
 

Author Comment

by:dmontgom
ID: 33789756
Yes... you can actually do this on AWS EC2 Windows instances, but still... you have to save the ads to a file. One would have to parse the HTML and save it to a database. That I don't get. Or am I missing something?
0
 
LVL 3

Expert Comment

by:T1750
ID: 33789924
No, you (mostly) avoid parsing the HTML and read the AdWords from the box they are in. iMacros offers a very high-level way of parsing the HTML if you like. You do still need to tell it where to get the data from, and yes, you need to tell it to store it in a database, but it doesn't really parse the HTML and JS at all. The browser does, and when it's done decrypting, iMacros just has to read what's in the box the decryption produced, which is very, very simple.
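
To make "read what's in the box and store it in a database" concrete, here is a hedged standard-library sketch. It assumes the browser automation has already handed back a snapshot of the rendered page, so no decryption happens in Python. The element id `ad_box`, the `ads` table, and the sample markup are invented for the example and do not correspond to any real page.

```python
import sqlite3
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class BoxTextReader(HTMLParser):
    """Collects the text inside the element whose id equals box_id."""

    def __init__(self, box_id):
        super().__init__()
        self.box_id = box_id
        self._depth = 0          # 0 means we are outside the box
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1
        elif dict(attrs).get("id") == self.box_id:
            self._depth = 1

    def handle_endtag(self, tag):
        if self._depth and tag not in VOID_TAGS:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth and data.strip():
            self.chunks.append(data.strip())


def read_box(rendered_html, box_id):
    reader = BoxTextReader(box_id)
    reader.feed(rendered_html)
    reader.close()
    return " ".join(reader.chunks)


def save_ad(conn, keyword, ad_text):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ads ("
        "keyword TEXT, ad_text TEXT, seen_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    conn.execute(
        "INSERT INTO ads (keyword, ad_text) VALUES (?, ?)", (keyword, ad_text)
    )
    conn.commit()


if __name__ == "__main__":
    snapshot = '<div id="ad_box"><b>Acme</b> Widgets<br>Buy now</div>'
    conn = sqlite3.connect(":memory:")
    save_ad(conn, "widgets", read_box(snapshot, "ad_box"))
    print(conn.execute("SELECT keyword, ad_text FROM ads").fetchall())
```

Keeping the id as a parameter means that if the page layout moves, only the `box_id` value passed in has to change.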
0
 

Author Comment

by:dmontgom
ID: 33923597
No response from iMacros. Don't think they can do it.
0
 
LVL 3

Expert Comment

by:T1750
ID: 33925892
I've used both approaches to scraping and recommend letting the browser do the work. For a static site it makes sense to DIY with mechanize, twill, Scrapy, whatever, but if they are encrypting and obfuscating code you will save yourself a headache by running the most natural experience possible and just taking from the browser whatever they chose to do today. No more comments from me in this thread: you know two ways to do it, and you know which I think is sensible against someone who is trying to stop you doing it. iMacros can be controlled with AutoIt if you're not getting any response (hard to believe; they've been very responsive with me).

You know what cards you hold; now place your bet.
0
 
LVL 3

Accepted Solution

by:
T1750 earned 250 total points
ID: 33925899
Here's a free gift if you insist on doing it wrong:

http://jsunpack.jeek.org/dec/go
0
 

Author Closing Comment

by:dmontgom
ID: 34066330
No solution
0
