gdemaria

asked on

How can I identify/stop a screen scraper

Hi Experts,

Our subscription service allows website owners to display our pages on their websites.  The pages contain industry-specific information that is hard to come by, so we collect and make it available to website owners.

I am noticing a lot of activity and errors coming through one of our feeds.  The errors are unusual, such as undefined variables and dropped sessions.  They are only happening via this one feed, and the owners of the feed are not complaining of seeing any errors.  I am suspicious that we have someone out there running code to scrape our screens and pull our data.

My question is two-part:

Is there some way I can tell if it's a human or software accessing the site?

Is there a way to prevent this (excluding CAPTCHA, which would not be practical, as we don't use form submits and can't ask a user for this every time they view a page)?

Thanks for any help!


I am using MS IIS 6, Windows Server 2003, and ColdFusion 8.
strickdd

A couple of things you can do:

1) Monitor traffic by IP address, look for any address with an unusually high load or number of errors, and block it.

2) Have each group that uses your information pass a GUID on the query string, and make sure it's a valid GUID before you serve them the information (see the sketch below).

3) Have your users send you their IP addresses and only allow those IPs.
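
For the key check in point 2, one possible ColdFusion 8 sketch (the asker's stated stack) is below. The datasource, table, and column names are hypothetical, and this only verifies that a known key was supplied:

<!--- Sketch: validate a per-subscriber key passed on the query string.
      "myDSN", "subscribers", and "access_key" are hypothetical names. --->
<cfparam name="url.key" default="">

<cfquery name="qSubscriber" datasource="myDSN">
    SELECT subscriber_id
    FROM   subscribers
    WHERE  access_key = <cfqueryparam value="#url.key#" cfsqltype="cf_sql_varchar">
</cfquery>

<cfif NOT qSubscriber.recordCount>
    <!--- Unknown or missing key: refuse the request --->
    <cfheader statuscode="403" statustext="Forbidden">
    <cfabort>
</cfif>

As gdemaria points out in the next comment, a static key embedded in a shared URL is only a weak hurdle, but it does give you something to revoke per subscriber.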
gdemaria (Asker)

Thanks strickdd.

Regarding the IP address: I see the errors always coming from the same IP address.  I blocked it, but then the people working at that company (our subscribers) could no longer get access.  So it seems the source is within their network, but when I spoke to people there, they had no idea about any software that might be running.  I had to turn them back on.  If I can prove they are doing something, then I can cut them off.

The GUID sounds like an interesting idea, but I'm not sure how to implement it.  We give each subscribing company a URL; if that URL has an ID on it, then everyone would have that ID and it would be static.  I guess I need a way to generate a GUID each time the code is run.  Perhaps instead of a link, I can give them some JavaScript to run, like Google Analytics does, or something like that?  Thoughts?

Thanks again!!
It may be that your client has a service that passes on the requests; in that case, they are being scraped and the requests are simply forwarded to you.

If you believe that your system is being compromised, then you can take steps to terminate the connection, or at least tell them that the requests from their system are not in compliance with the agreement (assuming you have one).

Don't make it threatening or legalese, just say that you are noticing erratic behaviour and that may indicate a problem at their end.

If you can provide them with usage logs and errors, they are more likely to listen.

If this service costs a lot of money and they require it, then this may lead to a solution.


One solution is to simply NOT feed anything unless the request is 100% accurate.

So, missing a property on the request?

die('Illegal request');

Property values not in expected range?

die('Illegal request');


For those within the rules, nothing changes.

For those outside the rules, they get nothing. No explanation. Nothing.

Also, make sure you set ...

ini_set('display_errors', 0);

in your code so nothing is output when the junk data comes in.

Ideally, use full exception handling and/or error checking of everything. Don't allow the junk to get to your code.
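
The snippets above are PHP-flavoured; a rough ColdFusion 8 equivalent of the same idea, with hypothetical parameter names, might be:

<cfsetting showdebugoutput="false">

<!--- Missing or malformed parameters: stop, with no explanation --->
<cfif NOT structKeyExists(url, "subscriberID")
      OR NOT structKeyExists(url, "page")
      OR NOT reFind("^[0-9]{1,10}$", url.subscriberID)>
    <cfabort>
</cfif>

<cftry>
    <!--- ...normal page processing... --->
    <cfcatch type="any">
        <!--- Swallow the error so nothing useful leaks back to a scraper --->
        <cfabort>
    </cfcatch>
</cftry>

In CF8 the same error-swallowing can also be done globally with an onError handler in Application.cfc, which keeps the junk away from the page code entirely.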
What web server do you use?

If you use Apache, then maybe you can block specific user agents, assuming the scraper does not masquerade as a valid browser.

http://httpd.apache.org/docs/2.0/misc/rewriteguide.html#access

To see what kind of agent string is provided by the rogue site, you have to configure your logs properly.

ShalomC
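
Since the asker is on IIS 6 rather than Apache, a comparable check could be done at the application level instead. A hypothetical ColdFusion sketch with a made-up blocklist (and, as noted later in the thread, a scraper can simply spoof a browser-like agent string):

<cfset badAgents = "curl,wget,python,libwww,httpclient">

<cfloop list="#badAgents#" index="fragment">
    <cfif findNoCase(fragment, cgi.http_user_agent)>
        <cfheader statuscode="403" statustext="Forbidden">
        <cfabort>
    </cfif>
</cfloop>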
Thanks for the comments!


RQuadling - I like the idea of blocking any failed attempts, but I am assuming that this scraping is succeeding most of the time and failing at random.  For example, I am getting an error on one out of every 50 requests, as it randomly provides an invalid string or loses its session.  I can certainly kill the error ones, but how can I identify it so that I can kill the 49 successful ones?


Shalomc - great idea about the user agent.  Below is the user agent; I don't know much about it, but it seems pretty ordinary to me.  Your thoughts?  (I am using Windows 2003, IIS 6)

Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; GTB6.3; SLCC1; .NET CLR 2.0.50727; .NET CLR 1.1.4322; .NET CLR 3.5.21022; .NET CLR 3.5.30729; .NET CLR 3.0.30618; MS-RTC LM 8; CT2206653_5.0.1.3)



Also, any suggestions on prevention?  Is there a way to open a URL on my website and ensure it is coming from their website only?  An IP-based solution won't work.

Perhaps some JavaScript code that authenticates somehow and creates the session?
Without some idea of what the requests/responses look like, we can only guess.

There is next to nothing you can rely on from a user that will prove anything.

That is why screen scraping is so easy.

Can you give some more detail about how the system works?



One option.

Tell this particular client that there has been a change to the URL...

Supply them a URL which is slightly different.

Use mod_rewrite to rewrite that URL back to the proper one.

And use another rule based upon their public IP to block requests from the old URL with that IP.
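
IIS 6 has no mod_rewrite built in, but the same trick could be approximated inside the ColdFusion request itself. A sketch only; the old path and the IP address are placeholders:

<!--- Requests to the old URL from the suspect network get nothing --->
<cfif cgi.script_name CONTAINS "/subscriber123/"
      AND cgi.remote_addr EQ "203.0.113.45">
    <cfheader statuscode="404" statustext="Not Found">
    <cfabort>
</cfif>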


Without knowing more about what is happening, this is just guess work.

ASKER CERTIFIED SOLUTION
Richard Quadling
> Without some idea of what the requests/responses look like, we can only guess.

They are nothing special.  The pages simply list information.    The initial request has nothing special on it either, just a URL such as  www.oursite.com/subscriber123/show-data-list.htm

> There is next to nothing you can rely on from a user that will prove anything.
> That is why screen scraping is so easy.

Seems that way!

> Can you give some more detail about how the system works?

Hopefully, the above link helps.  It's very basic.  We have a link www.oursite.com/xxxx/the-page.html, where xxxx is the subscriber, followed by the page they want to see.  They put this link in their iframe and they see the data.



>  Tell this particular client that there has been a change to the URL...
>  Supply them a URL which is slightly different.

Thanks, I am doing this now.


> Referrer

We have looked at the referer, but it is not consistent.  It is too often empty, and its value varies depending on how the subscriber chooses to launch our page.

>  If you have authentication, is it secure? Inform all users of a new password scheme requiring a new password.

Users do not get a username; they have open access to the page.  We are only trying to control which websites can display our page.  Websites do not have usernames/passwords, only the unique URL.  That is the goal here: to add some control.


I have seen this done somewhere, I think using JavaScript; I have to keep searching for that example.


Website A launches a page from website B, but if website C attempts to launch the same page on B, it cannot.  Also, if you try to go directly to B, you cannot.  You must go to website A to get to website B's pages.  But once you have access to B, you can navigate freely on B until your session expires.

In a nutshell...







But whatever Site A does, so can I. Unless Site A talks to a script on Site A first. A proxy.
> Unless Site A talks to a script on Site A first. A proxy

Right, what could this look like?    

Perhaps a script run by Site A that places a token (cookie) in the browser, allowing the session.
The token comes from Site B when Site A requests authentication.  Site A uses JavaScript (the only language we are confident Site A can run) to call Site B, pass in its domain, and get back the token?

I dunno...


Ah. I've just read through all of this.

In this film there is...

Me and my browser
Site A (your client)
Site B (you)

My browser makes a request to Site A.

It gets a bunch of HTML from Site A.

In the HTML there is an IFRAME to Site B.

My browser makes a request to Site B.

Site A does _NOT_ request the data from Site B.

My browser does.

So, if someone is scraping, it is a browser _AT_ site A.

With code I can access the full HTML of any site, and it can't be stopped.  Ultimately you will send HTML in a stream I can read, and I can simulate cookies and referrer pages.  Even IFRAMEs are easy to manage.  Oh, and I know how to simulate any user agent.

If you monitor the logs closely you might spot my guesses, but if I scope the site out carefully with my browser first, you would never know I'm doing it.

Sorry but without authentication the web is truly anonymous.
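
To illustrate tedbilly's point, a scraper can present whatever user agent, referer, or cookie it likes. A hypothetical ColdFusion example (the referer value and cookie are made up; the URL follows the pattern quoted earlier in the thread):

<cfhttp url="http://www.oursite.com/subscriber123/show-data-list.htm"
        method="get"
        useragent="Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)">
    <cfhttpparam type="header" name="Referer" value="http://www.subscriber-site.example/some-page.htm">
    <cfhttpparam type="cookie" name="CFID" value="12345">
</cfhttp>

<!--- cfhttp.fileContent now holds the full HTML, ready to parse --->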
100% agree with tedbilly.

To put it another way ...

Anything you do that DOESN'T enforce authentication is a waste of time. It is 100% bypassable.

Providing your data so it can be displayed in an iframe means the end user is accessing the data, not the host providing the iframe.


SOLUTION
dgrafx
Thanks dgrafx, you bring up a good point; I should just try to block bots... except how can you tell?

> function to determine if a user is a BOT or a human

Does this exist?  That's why they have CAPTCHA: because the server can't tell whether it's a human or not... unless they play nice and identify themselves, such as the Google bot.



tedbilly,

You're most likely correct; 100% security is a futile mission.

However, it is still a quest worth pursuing.

Let's say, then, that my mission is to make it difficult, so that someone needs some tech savvy to access our pages.

Google Maps is doing it.  I'm not sure how secure their approach is, but that's the essence of what I want to accomplish.

>>function to determine if a user is a BOT or a human

I thought about it and decided I didn't want to post my "classified" secrets here :)
Even though they are not worthy of a Nobel Prize, I see no reason to post code that could possibly make my work harder...

I will email you, hopefully sometime today, with an explanation.

Essentially, though, I do modify it now and then; as new-generation bots come out and do "new stuff", I need to counter with code modifications.



@dgrafx: If you finish it, tell me where to look.  I'll bet I can get past it.

@gdemaria: Have you considered that the problem might not be a problem?  For example, Microsoft ISA Server has a feature that can prefetch frequently visited sites at regular intervals to warm the cache for users.  Other products have that feature.  Some will also try to refresh the cache before the next user asks for it.  Maybe what appears to be malicious is actually completely innocent.

http:#26472009 is correct.  A web server cannot execute a Turing test without using CAPTCHA (and, according to the asker, CAPTCHA cannot be used here).
Thanks for the ideas, folks.  I will continue to work on a mechanism to restrict access to the site without authentication.  I understand this cannot be 100% effective, but I don't need it at that level; after all, it's completely open right now, so anything is better.  The model I am looking into is along the lines of placing Google Maps on your site using an encrypted key.
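
For what it's worth, that Google-Maps-style model could be sketched roughly as follows: each subscriber gets a key derived from their domain plus a secret known only to the data provider, and every request is checked against it. This is only an illustration; the secret, the helper names, and the decision to key on a domain parameter are all assumptions, and as noted above a determined scraper can still forge the accompanying referer:

<!--- Issue each subscriber a key offline (all names and the secret are hypothetical) --->
<cfset secret = "replace-with-a-long-random-secret">
<cfset subscriberDomain = "www.subscriber-site.example">
<cfset issuedKey = hash(subscriberDomain & secret, "SHA-256")>
<!--- They embed a URL like: http://www.oursite.com/show-data-list.htm?domain=...&key=... --->

<!--- On each request, recompute and compare --->
<cfparam name="url.domain" default="">
<cfparam name="url.key" default="">

<cfif NOT len(url.domain)
      OR compare(url.key, hash(url.domain & secret, "SHA-256")) NEQ 0>
    <cfheader statuscode="403" statustext="Forbidden">
    <cfabort>
</cfif>

<!--- Optionally, also require that any referer sent matches the claimed domain --->
<cfif len(cgi.http_referer) AND NOT findNoCase(url.domain, cgi.http_referer)>
    <cfheader statuscode="403" statustext="Forbidden">
    <cfabort>
</cfif>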