Solved

Scraping AdWords Sponsored Search Listings: Is It Possible?

Posted on 2010-09-17
Last Modified: 2012-05-10
Hi,

Is it now possible to scrape sponsored search listings?  

I am clueless about the JS encrypting. It now seems that it is not possible. If so, how do companies like KeywordSpy get their data? When did Google change the page encrypting? I understand that it breaks the TOS, but I just want to find out how companies like KeywordSpy get their data, unless they are using manual labor.

Thanks  
Question by:dmontgom
20 Comments
 
LVL 3

Expert Comment

by:T1750
The easiest way to do it is to use iMacros:

http://www.iopus.com/

Though if you want to call it from Python you'll have to shell out a fair bit of money.

The second alternative would be to use a Python scraping tool. I'd recommend Scrapy or twill (twill is more stable and easier to use), together with a JavaScript interpreter such as:

http://www.mozilla.org/rhino/

A third solution is to simply trace the decryption in your web browser (e.g. Firefox with Venkman), then copy it in Python. However, if they change their encryption you'd need to redo your code.
 
LVL 3

Expert Comment

by:T1750
You could be super naughty and use iMacros normal edition and control it from Python by sending it emulated keypresses.

This will get you started:

http://stackoverflow.com/questions/1262310/simulate-keypress-in-a-linux-c-console-application

I don't really want to help any further with that, though, as it probably violates their TOS and they are a good company.
 
LVL 3

Expert Comment

by:T1750
Another solution would be to control a real web browser with AutoIt instead of iMacros:

http://www.autoitscript.com/autoit3/index.shtml

Run a small Windows VM that does nothing but the scraping and has a Python XML-RPC (or similar) server from which your host machine can get the results.
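A minimal sketch of the results-server half of that idea, using only Python's standard library. The function names `add_result` and `get_results`, the record layout, and the in-memory list are assumptions made for illustration; the part that actually drives the browser is deliberately left out.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

# Hypothetical in-memory store; a real worker would fill this from the browser.
_results = []


def add_result(keyword, ad_text):
    """Called by the scraping side inside the VM to record one result."""
    _results.append({"keyword": keyword, "ad_text": ad_text})
    return True


def get_results():
    """Called by the host machine to fetch everything collected so far."""
    return list(_results)


def start_server(host="127.0.0.1", port=0):
    """Serve both functions over XML-RPC in a background thread.

    Port 0 asks the OS for any free port; read it back from
    server.server_address[1].
    """
    server = SimpleXMLRPCServer((host, port), logRequests=False, allow_none=True)
    server.register_function(add_result)
    server.register_function(get_results)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

On the host side, something like `ServerProxy("http://<vm-address>:<port>").get_results()` would pull the collected rows, where the address and port are whatever the VM was started with.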
 
LVL 3

Expert Comment

by:T1750
And a final solution would be to have your own "real web browser" by using WebKit and integrating it with Python.

Any one of these will work.
 

Author Comment

by:dmontgom
Thanks for the comments. I will evaluate them all. Again, I'm not interested in doing it; I just want to know if it is possible. I am really interested in whether this is the actual route by which companies like KeywordSpy acquire their data.

Thanks
 
LVL 3

Expert Comment

by:T1750
Taking an educated guess, I'd say they almost certainly run real web browsers in virtual machines under a hypervisor and use iMacros.
 
LVL 3

Expert Comment

by:T1750
It's 100% possible and not very hard.
 

Author Comment

by:dmontgom
T1750,

Wow, that would be easy then. Wouldn't a script in something like Python be easier? Can it still be done using something like mechanize?
 

Author Comment

by:dmontgom
Well, iMacros does not really do the decryption.

This is really about how to decrypt. Just try doing a "view page source" when you do a Google search.
 
LVL 3

Expert Comment

by:T1750
You're missing a couple of points. They almost certainly do it with iMacros (or maybe AutoIt) in a VM farm because:

1) It's fast and easy to set up.
2) That IS how they do the decrypt: they just let the browser do it for them.
3) If they reversed the encryption algorithm in use and the algorithm then gets changed, it doesn't matter because the browser will still decrypt it.

I could have a setup like that going in under a day, and Python would be driving iMacros, or AutoIt would be driving Python. Had I known there was such an easy business opportunity there I might have set up such a service myself!
 
LVL 3

Expert Comment

by:T1750
Point 3 above was meant to read: "If they reversed the encryption algorithm in use and copied it in Python, and the encryption algorithm then gets changed, it would have been a waste of effort and they'd have to do it again. Using the AutoIt/iMacros methods it doesn't matter at all, because the browser will still decrypt it."
 
LVL 3

Expert Comment

by:T1750
I think you may not understand how easy it is to control virtual machine farms. You can have scripts turning them on and off and bringing them online and offline as needed, and it's not hard work. Around here $500 would buy you a lot of second-hand P4 desktops, which will work fine as slaves hosting several stripped-bare OS installs (probably Windows XP ripped to shreds with nLite) with tiny RAM allocations, just doing the scraping and reporting. At a decent coder's hourly rate it would cost them a hell of a lot more to do it any other way. Other methods are possible, but they're simply not sensible.

Think about it.
 
LVL 3

Expert Comment

by:T1750
Just to clarify: They are not sitting there watching the screens, the VMs are all running in the background doing their duty, and there are not rows of monitors of web pages being viewed. The only time they even look inside a VM is when it reports it has an exception or has stopped responding for some reason, then they adjust their code to fix the issue so that in future it isn't repeated.
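The "only look when something breaks" part could be sketched roughly as below. This is an illustration, not a description of how any particular company does it: the class name `WorkerMonitor`, the worker names, and the 60-second timeout are all made up, and how the VMs deliver heartbeats or exception reports is left open.

```python
import time


class WorkerMonitor:
    """Tracks heartbeats and exception reports from worker VMs (hypothetical design)."""

    def __init__(self, timeout_seconds=60.0):
        self.timeout_seconds = timeout_seconds
        self._last_seen = {}   # worker name -> time of last heartbeat
        self._errors = {}      # worker name -> last reported exception text

    def heartbeat(self, worker, now=None):
        """Record that a worker is alive; `now` is injectable for testing."""
        self._last_seen[worker] = time.monotonic() if now is None else now

    def report_exception(self, worker, message):
        """Record an exception a worker reported from inside its VM."""
        self._errors[worker] = message

    def needs_attention(self, now=None):
        """Return {worker: reason} for workers that went silent or reported an error."""
        now = time.monotonic() if now is None else now
        flagged = {}
        for worker, seen in self._last_seen.items():
            silent_for = now - seen
            if silent_for > self.timeout_seconds:
                flagged[worker] = "no heartbeat for %.0f s" % silent_for
        # A reported exception is more specific than a timeout, so it wins.
        flagged.update(self._errors)
        return flagged
```

A controller script would call `needs_attention()` on a schedule and only then would anyone open the flagged VM.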
 
LVL 3

Expert Comment

by:T1750
And one more clarification: the MASSIVE advantage of iMacros/AutoIt VM farms controlled by Python scripts, rather than scraping directly with Python, is that for AdWords ads to be visible to a user they have to appear in the browser. So the browser is ALWAYS going to be able to get the ads, no matter what changes in future. Once you set it up, the only thing you're ever going to have to change is maybe a couple of DOM ids every now and again if they move stuff about.
 

Author Comment

by:dmontgom
Yes, you can actually do this on AWS EC2 Windows instances, but you still have to save the ads to a file. One would have to parse the HTML and save to a database. That I don't get. Or am I missing something?
 
LVL 3

Expert Comment

by:T1750
No, you (mostly) avoid parsing the HTML and read the ads from the box they are in. iMacros offers a very high-level way of parsing the HTML if you like. You do still need to tell it where to get the data from, and yes, you need to tell it to store it in a database, but it doesn't really parse the HTML and JS at all. The browser does, and when it's done decrypting, iMacros just has to read what's in the box the decryption produced, which is very, very simple.
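To make "read what's in the box and store it" concrete, here is a rough standard-library sketch. It assumes the automation tool has already handed over the rendered DOM as a string, after the browser ran the page's JavaScript. The element id `ads-box`, the table name `ads`, and its columns are hypothetical; iMacros itself is not involved in this code.

```python
import sqlite3
from html.parser import HTMLParser

# Tags that never get a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class BoxTextExtractor(HTMLParser):
    """Collects the text found inside the element with a given id."""

    def __init__(self, target_id):
        super().__init__(convert_charrefs=True)
        self.target_id = target_id
        self.depth = 0      # 0 means we are outside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def extract_box_text(rendered_html, target_id):
    """Return the text fragments inside the element whose id is target_id."""
    parser = BoxTextExtractor(target_id)
    parser.feed(rendered_html)
    parser.close()
    return parser.chunks


def save_rows(conn, keyword, texts):
    """Store one row per text fragment in a simple (hypothetical) table."""
    conn.execute("CREATE TABLE IF NOT EXISTS ads (keyword TEXT, ad_text TEXT)")
    conn.executemany("INSERT INTO ads VALUES (?, ?)", [(keyword, t) for t in texts])
    conn.commit()
```

If the page layout moves, the only thing that should need changing in a setup like this is the id passed to `extract_box_text`.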
 

Author Comment

by:dmontgom
No response from iMacros. I don't think they can do it.
 
LVL 3

Expert Comment

by:T1750
I've used both approaches to scraping and recommend letting the browser do the work. For a static site it makes sense to DIY with mechanize, twill, Scrapy, whatever, but if they are encrypting and obfuscating code you will save yourself a headache by running the most natural experience possible and just taking from the browser whatever they chose to do today. No more comments from me in this thread: you know two ways to do it, and you know which I think is sensible against someone who is trying to stop you doing it. iMacros can be controlled with AutoIt if you're not getting any response (hard to believe; they've been very responsive with me).

You know what cards you hold; now place your bet.
 
LVL 3

Accepted Solution

by:
T1750 earned 125 total points
Here's a free gift if you insist on doing it wrong:

http://jsunpack.jeek.org/dec/go
 

Author Closing Comment

by:dmontgom
No solution