Solved

Scraping AdWords Sponsored Search Listings. Is it possible?

Posted on 2010-09-17
20
462 Views
Last Modified: 2012-05-10
Hi,

Is it now possible to scrape sponsored search listings?  

I am clueless about the JS encryption. It now seems that it is not possible. If so, how do companies like KeywordSpy get their data? When did Google change the page encryption? I understand that it breaks the TOS, but I just want to find out how companies like KeywordSpy get their data, unless they use manual labor.

Thanks  
Question by:dmontgom
20 Comments
 
LVL 3

Expert Comment

by:T1750
ID: 33711590
The easiest way to do it is to use iMacros:

http://www.iopus.com/

Though if you want to call it from Python you'll have to shell out a fair bit of money.

The second alternative would be to use a Python scraping tool; I'd recommend Scrapy or twill (twill is more stable and easier to use), plus a JavaScript interpreter such as:

http://www.mozilla.org/rhino/

A third solution is to simply trace the decryption in your web browser (i.e. Firefox with Venkman) and then copy-cat it in Python. However, if they change their encryption you'd need to redo your code.
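Just to make the trade-off concrete, here is a rough sketch of the direct-scraping route with Scrapy. Everything specific in it (the query URL, the "ads-ad" class, the XPaths) is a placeholder rather than Google's real markup, and on the raw HTML the sponsored blocks usually are not there at all because the browser builds them with JavaScript:

    # Minimal Scrapy spider sketch. The URL and XPaths are placeholders only.
    import scrapy

    class SponsoredSpider(scrapy.Spider):
        name = "sponsored"
        start_urls = ["http://www.google.com/search?q=example+query"]

        def parse(self, response):
            # In the raw (pre-JavaScript) HTML the ad blocks are usually
            # missing or obfuscated, which is the whole problem here.
            for ad in response.xpath('//li[@class="ads-ad"]'):
                yield {
                    "title": ad.xpath(".//h3//text()").get(),
                    "display_url": ad.xpath(".//cite//text()").get(),
                }

Run it with "scrapy runspider sponsored_spider.py -o ads.json" and you will see how little of the ad content survives without a JavaScript engine.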
 
LVL 3

Expert Comment

by:T1750
ID: 33711600
You could be super naughty and use iMacros normal edition and control it from Python by sending it emulated keypresses.

This will get you started:

http://stackoverflow.com/questions/1262310/simulate-keypress-in-a-linux-c-console-application
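As a very rough Python equivalent of the same idea (pyautogui here is my stand-in for the C approach in that link, and the focus delay and "play macro" shortcut are assumptions, not iMacros' documented keys):

    # Sketch: drive a macro tool / browser by emulated keystrokes.
    import time
    import pyautogui

    time.sleep(5)                                        # time to focus the browser window manually
    pyautogui.typewrite("example query", interval=0.05)  # type a search query
    pyautogui.press("enter")                             # submit it
    pyautogui.hotkey("ctrl", "shift", "r")               # hypothetical "play macro" shortcut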

I don't really want to help any further with that, though, as it probably violates their TOS and they are a good company.
 
LVL 3

Expert Comment

by:T1750
ID: 33711639
Another solution would be to control a real web browser with AutoIt instead of iMacros:

http://www.autoitscript.com/autoit3/index.shtml

Run a small Windows VM that does nothing but the scraping and has a Python XML-RPC (or similar) server from which your host machine can get the results.
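A bare-bones sketch of that reporting server, using the standard-library XML-RPC module (the fetch_results function and the port are just examples of what the scraper inside the VM might expose):

    # Inside the scraping VM: expose collected results over XML-RPC
    # so the host machine can poll for them.
    try:
        from xmlrpc.server import SimpleXMLRPCServer        # Python 3
    except ImportError:
        from SimpleXMLRPCServer import SimpleXMLRPCServer   # Python 2

    results = []   # the scraper appends ad records here

    def fetch_results():
        # Return and clear whatever has been queued up so far.
        batch, results[:] = list(results), []
        return batch

    server = SimpleXMLRPCServer(("0.0.0.0", 8000), allow_none=True)
    server.register_function(fetch_results)
    server.serve_forever()

The host side is one line with xmlrpclib.ServerProxy (xmlrpc.client in Python 3) pointed at the VM's address.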
 
LVL 3

Expert Comment

by:T1750
ID: 33711656
And a final solution would be to have your own "real web browser" by using WebKit and integrating it with Python.

Any one of these will work.
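A rough sketch of that last route, using PyQt4's QtWebKit bindings (a Python 2 era sketch, assuming PyQt4 with QtWebKit is installed; the URL is a placeholder):

    # Sketch: let a real WebKit engine execute the page's JavaScript,
    # then read the rendered markup back from Python.
    import sys
    from PyQt4.QtCore import QUrl
    from PyQt4.QtGui import QApplication
    from PyQt4.QtWebKit import QWebPage

    app = QApplication(sys.argv)
    page = QWebPage()
    page.loadFinished.connect(lambda ok: app.quit())   # stop the event loop once loaded
    page.mainFrame().load(QUrl("http://www.google.com/search?q=example"))
    app.exec_()

    rendered_html = unicode(page.mainFrame().toHtml())  # markup after JavaScript has run
    print rendered_html[:500]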
 

Author Comment

by:dmontgom
ID: 33713315
Thanks for the comments. I will evaluate them all. Again, I am not interested in doing it; I just want to know whether it is possible. I am really interested in whether this is the actual route by which companies like KeywordSpy acquire their data.

Thanks
 
LVL 3

Expert Comment

by:T1750
ID: 33723007
Taking an educated guess, I'd say they almost certainly run real web browsers in virtual machines under a hypervisor and use iMacros.
 
LVL 3

Expert Comment

by:T1750
ID: 33723014
It's 100% possible and not very hard.
 

Author Comment

by:dmontgom
ID: 33786972
T1750...

Wow, that would be easy then. Wouldn't a script in Python be easier? Can it still be done using something like mechanize?
 

Author Comment

by:dmontgom
ID: 33787019
Well, it does not really do the decryption (iMacros, that is).

This is really about how to decrypt. Just try doing a view-page-source when you do a Google search.
 
LVL 3

Expert Comment

by:T1750
ID: 33789271
You're missing a couple of points. They almost certainly do it with iMacros (or maybe AutoIt) in a VM farm because:

1) It's fast and easy to set up.
2) That IS how they do the decryption: they just let the browser do it for them.
3) If they reversed the encryption algorithm in use and the encryption algorithm gets changed, it doesn't matter because the browser will still decrypt it.

I could have a setup like that going in under a day, with Python driving iMacros, or AutoIt driving Python. Had I known there was such an easy business opportunity there, I might have set up such a service myself!
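To give a (heavily hedged) flavour of the Python-driving-iMacros part: the iMacros browser add-on can be started with an imacros:// run URL, and Python only has to launch it and wait. Treat the exact invocation as an assumption and check the iOpus documentation for your edition:

    # Sketch: kick off an iMacros macro from Python. The macro name and
    # the imacros://run URL form are assumptions; consult the iOpus docs.
    import subprocess

    MACRO = "scrape_sponsored.iim"   # hypothetical macro that scrapes one results page
    subprocess.call([
        r"C:\Program Files\Mozilla Firefox\firefox.exe",
        "imacros://run/?m=%s" % MACRO,
    ])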
 
LVL 3

Expert Comment

by:T1750
ID: 33789293
Point 3 above was meant to read: "If they reversed the encryption algorithm in use and the encryption algorithm gets changed, then it would have been a waste of effort if they had copy-catted it in Python; they'd have to do it again. Using the AutoIt/iMacros methods it doesn't matter at all, because the browser will still decrypt it."
 
LVL 3

Expert Comment

by:T1750
ID: 33789364
I think you may not understand how easy it is to control virtual machine farms. You can have scripts turning them on and off as needed, bringing them online and offline as required, and it's not hard work. Around here $500 would buy you a lot of second-hand P4 desktops, which would work fine as slaves hosting several stripped-bare OS installs (probably Windows XP ripped to shreds with nLite) with tiny RAM allocations, doing nothing but the scraping and reporting. At a decent coder's hourly rate it would cost them a hell of a lot more to do it any other way; other methods are possible, but they're simply not sensible.

Think about it.
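For the scripts-turning-VMs-on-and-off part, here is one concrete example, assuming VirtualBox and a VM named scraper-01 (any hypervisor with a command line works the same way):

    # Sketch: start and stop a headless scraping VM from Python.
    import subprocess

    VM = "scraper-01"   # hypothetical VM name

    def start_vm():
        subprocess.check_call(["VBoxManage", "startvm", VM, "--type", "headless"])

    def stop_vm():
        subprocess.check_call(["VBoxManage", "controlvm", VM, "poweroff"])

    start_vm()
    # ... poll the VM's XML-RPC endpoint for results, then ...
    stop_vm()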
 
LVL 3

Expert Comment

by:T1750
ID: 33789421
Just to clarify: they are not sitting there watching the screens. The VMs are all running in the background doing their duty, and there are no rows of monitors showing web pages being viewed. The only time they even look inside a VM is when it reports an exception or has stopped responding for some reason; then they adjust their code to fix the issue so that it isn't repeated in future.
 
LVL 3

Expert Comment

by:T1750
ID: 33789471
And one more clarification: the MASSIVE advantage of iMacros/AutoIt VM farms controlled by Python scripts, rather than scraping directly with Python, is that for AdWords ads to be visible to a user they have to appear in the browser. So the browser is ALWAYS going to be able to get the ads, no matter what happens and no matter what changes in future. Once you set it up, the only thing you will ever have to change is maybe a couple of DOM ids every now and again if they move stuff about.
 

Author Comment

by:dmontgom
ID: 33789756
Yes, you can actually do this on AWS EC2 Windows instances, but still, you have to save the ads to a file. One would have to parse the HTML and save it to a database. That part I don't get. Or am I missing something?
 
LVL 3

Expert Comment

by:T1750
ID: 33789924
No, you (mostly) avoid parsing the HTML and read the ad text straight from the box it appears in. iMacros offers a very high-level way of parsing the HTML if you like; you do still need to tell it where to get the data from, and yes, you need to tell it to store the data in a database. But it doesn't really parse the HTML and JS at all: the browser does that, and when the decryption is done iMacros just has to read what's in the box the decryption produced, which is very, very simple.
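The database end really is the easy bit. A sketch with the standard-library sqlite3 module (the table layout is only an example):

    # Sketch: persist whatever text the browser/iMacros extracted.
    import sqlite3

    conn = sqlite3.connect("adwords.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS sponsored (
                        keyword   TEXT,
                        ad_title  TEXT,
                        ad_url    TEXT,
                        seen_at   TIMESTAMP DEFAULT CURRENT_TIMESTAMP)""")
    conn.execute("INSERT INTO sponsored (keyword, ad_title, ad_url) VALUES (?, ?, ?)",
                 ("example query", "Example Ad Title", "www.example.com"))
    conn.commit()
    conn.close()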
 

Author Comment

by:dmontgom
ID: 33923597
No response from iMacros. I don't think they can do it.
 
LVL 3

Expert Comment

by:T1750
ID: 33925892
I've used both approaches to scraping and recommend letting the browser do the work. For a static site it makes sense to DIY with mechanize, twill, Scrapy, whatever, but if they are encrypting and obfuscating code you will save yourself a headache by running the most natural browsing experience possible and just taking from the browser whatever they choose to serve up today. No more comments from me in this thread; you know two ways to do it, and you know which one I think is sensible against someone who is trying to stop you doing it. iMacros can be controlled with AutoIt if you're not getting any response from them (hard to believe, they've been very responsive with me).

You know what cards you hold; now place your bet.
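For comparison, the pure-Python route you asked about looks like this with mechanize. It is a sketch only; mechanize never executes JavaScript, so the sponsored blocks generally will not be in the HTML it sees:

    # Sketch: plain mechanize fetch. Fine for static sites, useless for
    # content that only exists after the browser runs JavaScript.
    import mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)                        # illustration only
    br.addheaders = [("User-agent", "Mozilla/5.0")]
    response = br.open("http://www.google.com/search?q=example+query")
    html = response.read().decode("utf-8", "ignore")
    print("ads-ad" in html)                            # usually False in the raw HTML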
 
LVL 3

Accepted Solution

by:
T1750 earned 125 total points
ID: 33925899
Here's a free gift if you insist on doing it wrong:

http://jsunpack.jeek.org/dec/go
 

Author Closing Comment

by:dmontgom
ID: 34066330
No solution.
