Solved

Scraping AdWords Sponsored Search Listings. Is It Possible?

Posted on 2010-09-17
Last Modified: 2012-05-10
Hi,

Is it now possible to scrape sponsored search listings?

I am clueless about the JS encrypting. It now seems that it is not possible. If so, how do companies like KeywordSpy get their data? When did Google change the page encrypting? I understand that it breaks the TOS, but I just want to find out how companies like KeywordSpy get their data, unless they are using manual labor.

Thanks  
Question by:dmontgom
20 Comments
 
LVL 3

Expert Comment

by:T1750
ID: 33711590
The easiest way to do it is to use iMacros:

http://www.iopus.com/

Though if you want to call it from python you'll have to shell out a fair bit of money.

The second alternative would be to use a python scraping tool, I'd recommend scrapy or twill (twill is more stable and easier to use) and a javascript interpreter such as:

http://www.mozilla.org/rhino/

A third solution is to simply trace the decryption in your web browser (i.e. Firefox with venkman) then copy-cat it in Python. However if they change their encryption you'd need to re-do your code.
0
 
LVL 3

Expert Comment

by:T1750
ID: 33711600
You could be super naughty and use iMacros normal edition and control it from Python by sending it emulated keypresses.

This will get you started:

http://stackoverflow.com/questions/1262310/simulate-keypress-in-a-linux-c-console-application

I don't really want to help anymore to do that though as it probably violates their TOS and they are a good company.
0
 
LVL 3

Expert Comment

by:T1750
ID: 33711639
Another solution would be to control a real web browser with AutoIt instead of iMacros;

http://www.autoitscript.com/autoit3/index.shtml

Run a small windows VM that does nothing but the scraping and has a Python XMLRPC or similar server where your host machine can get the results.
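A rough sketch of the XML-RPC part of that idea, using only the Python standard library. The function names (`store_result`, `get_results`, `start_server`, `fetch_from_vm`) and the in-memory `RESULTS` dict are made up for illustration; the browser automation that would actually fill in the results is not shown.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

# Hypothetical in-memory store; the scraping script inside the VM would fill it.
RESULTS = {}


def store_result(keyword, ads):
    """Record the ad texts the browser showed for a keyword."""
    RESULTS[keyword] = list(ads)
    return True


def get_results(keyword):
    """Return the recorded ad texts for a keyword (empty list if none)."""
    return RESULTS.get(keyword, [])


def start_server(host="127.0.0.1", port=0):
    """Serve the two functions over XML-RPC in a background thread.

    Port 0 lets the OS pick a free port; read it from server.server_address.
    """
    server = SimpleXMLRPCServer((host, port), logRequests=False)
    server.register_function(store_result)
    server.register_function(get_results)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_from_vm(url, keyword):
    """Host side: ask the VM's XML-RPC server for a keyword's results."""
    with ServerProxy(url) as proxy:
        return proxy.get_results(keyword)
```

On the host you would call `fetch_from_vm("http://<vm-address>:<port>", "some keyword")`; in a real setup the server should listen on an address the host can reach rather than 127.0.0.1.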
0
 
LVL 3

Expert Comment

by:T1750
ID: 33711656
And a final solution would be to have your own "real web browser" by using WebKit and integrate that with python.

Any one of these will work.
0
 

Author Comment

by:dmontgom
ID: 33713315
Thanks for the comments. I will evaluate them all. Again, I am not interested in doing it; I just want to know if it is possible. I am really interested in whether this is the actual route by which companies like KeywordSpy acquire their data.

Thanks
0
 
LVL 3

Expert Comment

by:T1750
ID: 33723007
Taking an educated guess, I'd say they almost certainly run real web browsers in virtual machines under a hypervisor and use iMacros.
0
 
LVL 3

Expert Comment

by:T1750
ID: 33723014
It's 100% possible and not very hard.
0
 

Author Comment

by:dmontgom
ID: 33786972
T1750...

Wow, that would be easy then. Would a scripting language like Python not be easier? Can it still be done using something like mechanize?
0
 

Author Comment

by:dmontgom
ID: 33787019
Well... iMacros does not really do the decryption.

This is really about how to decrypt. Just try doing a "view page source" when you do a Google search.
0
 
LVL 3

Expert Comment

by:T1750
ID: 33789271
You're missing a couple of points. They almost certainly do it with iMacros (or maybe autoit) in a vm farm because:

1) It's fast and easy to set up.
2) That IS how they do the decrypt, they just let the browser do it for them.
3) If they reversed the encryption algorithm in use and the encryption algorithm gets changed, then it doesn't matter because the browser will still decrypt it.

I could have a setup like that going in under a day, and python would be driving iMacros, or autoit would be driving python. Had I known there was such an easy business opportunity there I might have set up such a service myself!
0
 
LVL 3

Expert Comment

by:T1750
ID: 33789293
Point 3 above was meant to read: "If they reversed the encryption algorithm in use and the encryption algorithm gets changed, then it would have been a waste of effort if they copy-catted it in Python, as they'd have to do it again. Using the AutoIt/iMacros methods it doesn't matter at all, because the browser will still decrypt it."
0
 
LVL 3

Expert Comment

by:T1750
ID: 33789364
I think you may not understand how easy it is to control virtual machine farms. You can have scripts turning them on and off as needed and bringing them online and offline as needed, and it's not hard work. Around here $500 would buy you a lot of second-hand P4 desktops, which will work fine as slaves to host several stripped-bare OS installs (probably Windows XP ripped to shreds with nLite) with tiny RAM allocations, just doing the scraping and reporting. At a decent coder's hourly rate it would cost them a hell of a lot more to do it by any other method. While other methods are possible, they're simply not sensible.

Think about it.
0
 
LVL 3

Expert Comment

by:T1750
ID: 33789421
Just to clarify: They are not sitting there watching the screens, the VMs are all running in the background doing their duty, and there are not rows of monitors of web pages being viewed. The only time they even look inside a VM is when it reports it has an exception or has stopped responding for some reason, then they adjust their code to fix the issue so that in future it isn't repeated.
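To make the "only look when something breaks" idea concrete, here is a minimal sketch of the bookkeeping a controller script might do. The `WorkerMonitor` class, its method names, and the timeout value are all hypothetical; how the VMs actually send heartbeats or error reports (XML-RPC, files, a queue) is left open.

```python
import time


class WorkerMonitor:
    """Tracks which scraping VMs have gone quiet or reported an error."""

    def __init__(self, timeout_seconds=300):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # vm name -> timestamp of last heartbeat
        self.errors = {}     # vm name -> list of reported error messages

    def heartbeat(self, vm_name, now=None):
        """Record that a VM checked in (now is injectable for testing)."""
        self.last_seen[vm_name] = time.time() if now is None else now

    def report_error(self, vm_name, message):
        """Record an exception message sent by a VM."""
        self.errors.setdefault(vm_name, []).append(message)

    def needs_attention(self, now=None):
        """Return sorted names of VMs that stalled or reported an error."""
        now = time.time() if now is None else now
        stalled = {
            name
            for name, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        }
        return sorted(stalled | set(self.errors))
```

A controller would call `needs_attention()` on a timer and only then open the VMs it names, which matches the workflow described above.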
0
 
LVL 3

Expert Comment

by:T1750
ID: 33789471
And one more clarification: the MASSIVE advantage of iMacros/AutoIt VM farms being controlled by Python scripts, rather than directly scraping with Python, is that for AdWords to be visible to a user they have to appear in the browser. So the browser is ALWAYS going to be able to get the AdWords no matter what happens and no matter what changes in future. Once you set it up, the only thing you are ever going to have to change is maybe a couple of DOM IDs every now and again if they move stuff about.
0
 

Author Comment

by:dmontgom
ID: 33789756
Yes... you can actually do this on AWS EC2 Windows instances, but still, you have to save the ads to a file. One would have to parse the HTML and save it to a database. That I don't get. Or am I missing something?
0
 
LVL 3

Expert Comment

by:T1750
ID: 33789924
No, you (mostly) avoid parsing the HTML and read the AdWords from the box they are in. iMacros offers a very high-level way of parsing the HTML if you like: you do still need to tell it where to get the data from, and yes, you need to tell it to store it in a database. But it doesn't really parse the HTML and JS at all. The browser does, and when it's done decrypting, iMacros just has to read what's in the box the decryption produced, which is very, very simple.
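For the "store it in a database" half, a minimal sketch with Python's built-in sqlite3 module. It assumes the browser automation has already handed over the ad texts as plain strings; the `ads` table layout and the `open_store` / `save_ads` names are invented for the example.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the database and make sure the ads table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS ads (
            keyword    TEXT NOT NULL,
            position   INTEGER NOT NULL,
            ad_text    TEXT NOT NULL,
            scraped_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        )
        """
    )
    return conn


def save_ads(conn, keyword, ad_texts):
    """Store the non-empty ad texts for a keyword; return how many were saved."""
    cleaned = [text.strip() for text in ad_texts if text.strip()]
    rows = [(keyword, pos, text) for pos, text in enumerate(cleaned, start=1)]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO ads (keyword, position, ad_text) VALUES (?, ?, ?)",
            rows,
        )
    return len(rows)
```

Pass a file path to `open_store` to keep the data between runs; the default `:memory:` database disappears when the connection closes.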
0
 

Author Comment

by:dmontgom
ID: 33923597
No response from iMacros. I don't think they can do it.
0
 
LVL 3

Expert Comment

by:T1750
ID: 33925892
I've used both approaches to scraping and recommend letting the browser do the work. For a static site it makes sense to DIY with mechanize, twill, scrapy, whatever, but if they are encrypting and obfuscating code you will save yourself a headache by running the most natural experience possible and just taking from the browser whatever they chose to do today. No more comments from me in this thread: you know two ways to do it, and you know which I think is sensible when you're up against someone who is trying to stop you doing it. iMacros can be controlled with AutoIt if you're not getting any response (hard to believe, they've been very responsive with me).

You know what cards you hold, now place your bet.
0
 
LVL 3

Accepted Solution

by:
T1750 earned 250 total points
ID: 33925899
Here's a free gift if you insist on doing it wrong:

http://jsunpack.jeek.org/dec/go
0
 

Author Closing Comment

by:dmontgom
ID: 34066330
No solution
0
