How can I identify/stop a screen scraper

gdemaria
Hi Experts,

Our subscription service allows website owners to display our pages on their websites.  The pages contain industry-specific information that is hard to come by, so we collect and make it available to website owners.

I am noticing a lot of activity and errors coming through one of our feeds. The errors are unusual: undefined variables, dropped sessions, and the like. They are only happening via this one feed, and the owners of the feed are not complaining of seeing any errors. I am suspicious that someone out there is running code to scrape our screens and pull our data.

My question is two-part:

Is there some way I can tell whether it's a human or software accessing the site?

Is there a way to prevent this (excluding CAPTCHA, which would not be practical, since we don't use form submits and can't ask a user for this every time they view a page)?

Thanks for any help!


I am using MS IIs 6, Windows 2003 server, Coldfusion 8

strickdd

Commented:
A couple things you can do:

1) Monitor traffic by IP address and see if one has an unusually high load or number of errors, then block it.

2) Have each group that uses your information pass a GUID on the query string, and make sure it's a valid GUID before you serve them the information (see the sketch after this list).

3) Have your users send you their IP addresses and allow only those IPs.
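
For point 2, a rough ColdFusion sketch (getSubscriberKey() is a hypothetical lookup against wherever you store each subscriber's key):

<!--- Sketch only: validate a per-subscriber key passed on the query string. --->
<cfparam name="url.subscriber" default="">
<cfparam name="url.key" default="">

<cfset validKey = getSubscriberKey(url.subscriber)>

<cfif NOT len(url.key) OR url.key NEQ validKey>
    <cfheader statuscode="403" statustext="Forbidden">
    <cfabort>
</cfif>
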
Thanks strickdd.

Regarding the IP address: I see the errors always coming from the same IP address. I blocked it, but then the people working at that company (our subscribers) could no longer get access. So it seems the source is within their network, but when I spoke to people there, they had no idea about any software that might be running. I had to turn them back on. If I can prove they are doing something, then I can cut them off.

The GUID sounds like an interesting idea, but I'm not sure how to implement it. We give each subscribing company a URL; if that URL has an ID on it, then everyone would have that ID and it would be static. I guess I need a way to generate a GUID each time the code is run. Perhaps instead of a link, I could give them some JavaScript to run, like Google Analytics does, or something like that? Thoughts?

Thanks again!!
Richard Quadling, Senior Software Developer

Commented:
It may be that your client has a service that passes on the requests: they are being scraped, and the requests are then passed to you.

If you believe that your system is being compromised, then you can take steps to terminate the connection, or at least tell them that the requests from their system are not in compliance with the agreement (assuming you have one).

Don't make it threatening or full of legalese; just say that you are noticing erratic behaviour which may indicate a problem at their end.

If you can provide them with usage logs and errors, they are more likely to listen.

If this service costs a lot of money and they depend on it, then that alone may lead to a solution.


One solution is to simply NOT feed anything unless the request is 100% accurate.

So, missing a property on the request?

die('Illegal request');

Property values not in expected range?

die('Illegal request');


For those within the rules, nothing changes.

For those outside the rules, they get nothing. No explanation. Nothing.

Also, make sure you set ...

ini_set('display_errors', 0);

in your code so nothing is output when the junk data comes in.

Ideally, use full exception handling and/or error checking of everything. Don't allow the junk to get to your code.
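
In ColdFusion 8 terms, the same idea might look like this (a sketch only; the parameter name and the allowed pattern are assumptions about your feed):

<!--- Sketch: reject anything that is not exactly what a legitimate request looks like. --->
<cfif NOT structKeyExists(url, "subscriber")
      OR NOT reFind("^[A-Za-z0-9_-]+$", url.subscriber)>
    <cfheader statuscode="400" statustext="Bad Request">
    <cfabort>
</cfif>

<!--- And in Application.cfc (if you use one), swallow unexpected errors so the
      caller never sees ColdFusion's debug output. --->
<cffunction name="onError" returntype="void">
    <cfargument name="exception" required="true">
    <cfargument name="eventName" type="string" required="true">
    <cflog file="feed_errors" text="#arguments.exception.message#">
    <cfheader statuscode="500" statustext="Error">
    <cfabort>
</cffunction>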

What web server do you use?

If you use Apache, then maybe you can block specific user agents, assuming the scraper does not masquerade as a valid browser.

http://httpd.apache.org/docs/2.0/misc/rewriteguide.html#access

To see what kind of agent string is provided by the rogue site, you have to configure your logs properly.
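
If you want your own record independent of the web server's logs, a ColdFusion sketch that captures the agent string and source IP on every feed request:

<!--- Sketch: log who is asking, so unusual agents and IPs stand out. --->
<cflog file="feed_requests"
       text="ip=#cgi.remote_addr# agent=#cgi.http_user_agent# uri=#cgi.script_name#?#cgi.query_string#">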

ShalomC
Thanks for the comments!


RQuadling - I like the idea of blocking any failed attempts, but I am assuming this scraping is mostly succeeding and only failing at random times. For example, I am getting an error on one out of every 50 requests as it randomly provides an invalid string or loses its session. I can certainly kill the error requests, but how can I identify it so that I can kill the 49 successful ones?


Shalomc - great idea about the user agent. Below is the user agent; I don't know much about these, but it looks pretty ordinary to me. Your thoughts? (I am using Windows 2003, IIS 6.)

Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; GTB6.3; SLCC1; .NET CLR 2.0.50727; .NET CLR 1.1.4322; .NET CLR 3.5.21022; .NET CLR 3.5.30729; .NET CLR 3.0.30618; MS-RTC LM 8; CT2206653_5.0.1.3)



Also, any suggestions on prevention?   Is there a way to open a URL on my website to ensure it is coming from their website only?    An IP based solution won't work.  

Perhaps some javascript code that authenticates somehow and creates the session?
Richard Quadling, Senior Software Developer

Commented:
Without some idea of what the requests/responses look like, we can only guess.

There is next to nothing you can rely on from a user that will prove anything.

That is why screen scraping is so easy.

Can you give some more detail about how the system works?



One option.

Tell this particular client that there has been a change to the URL...

Supply them a URL which is slightly different.

Use mod_rewrite to rewrite that url back to the proper one.

And use another rule based upon their public IP to block requests from the old URL with that IP.
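
If a rewrite rule is not an option on your server, a sketch of the same trap in application code (the old path and the IP address here are placeholders):

<!--- Sketch: a hit on the retired URL from the suspect public IP is almost certainly the scraper. --->
<cfif findNoCase("/subscriber123/show-data-list.htm", cgi.script_name)
      AND cgi.remote_addr EQ "203.0.113.42">
    <cflog file="feed_suspect" text="retired URL requested from #cgi.remote_addr#">
    <cfheader statuscode="404" statustext="Not Found">
    <cfabort>
</cfif>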


Without knowing more about what is happening, this is just guess work.

Richard Quadling, Senior Software Developer
Commented:
"Is there a way to open a URL on my website to ensure it is coming from their website only?"

You would normally look at $_SERVER['HTTP_REFERER'] (yes REFERER and NOT REFERRER - http://en.wikipedia.org/wiki/HTTP_referrer), but that is just another header on the request.

With PHP and contexts, or PHP and cURL, you can create all of that anyway.

In fact, anything you can do I can do in PHP and contexts or curl.

Without authentication, you are running a public site which pretty much will have to respond to whatever it is asked to do.

If you have authentication, is it secure? Inform all users of a new password scheme requiring a new password.

Block all accounts which haven't got a new password after a week (or whatever time period is suitable).

If the scraper (if that is what it is) is using old details, they'll soon be blocked.



Can you show the logs that you are concerned about?
> Without some idea of what the requests/responses look like, we can only guess.

They are nothing special.  The pages simply list information.    The initial request has nothing special on it either, just a URL such as  www.oursite.com/subscriber123/show-data-list.htm

> There is next to nothing you can rely on from a user that will prove anything.
> That is why screen scraping is so easy.

Seems that way!

> Can you give some more detail about how the system works?

Hopefully, the above link helps. It's very basic. We have a link www.oursite.com/xxxx/the-page.html, where xxxx is the subscriber and it is followed by the page they want to see. They put this link in their iframe and they see the data.



>  Tell this particular client that there has been a change to the URL...
>  Supply them a URL which is slightly different.

Thanks, I am doing this now.


> Referrer

We have looked at the referer, but it is not consistent. It is too often empty, and its value varies with how the subscriber chooses to launch our page.

>  If you have authentication, is it secure? Inform all users of a new password scheme requiring a new password.

Users do not get a username; they have open access to the page. We are only trying to control which websites can display our pages. Websites do not have usernames/passwords, only the unique URL. That is the goal here: to add some control.


I have seen this done somewhere, I think using javascript, I have to keep searching for that example.  


Website A launches a page from website B, but if website C attempts to launch the same page on B, it cannot. Also, if you try to go directly to B, you cannot. You must go through website A to get to website B's pages. But once you have access to B, you can navigate freely on B until your session expires.

In a nutshell...







Richard Quadling, Senior Software Developer

Commented:
But whatever Site A does, so can I. Unless Site A talks to a script on Site A first. A proxy.
> Unless Site A talks to a script on Site A first. A proxy

Right, what could this look like?    

Perhaps a script run by Site A that places a token (cookie) in the browser, allowing the session. The token comes from site B when site A requests it and authenticates. Site A uses JavaScript (the only language we are confident site A can run) to call site B, pass in its domain, and get back the token?

I dunno...
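
Maybe something along these lines, assuming the token request comes from Site A's server rather than browser JavaScript, since anything browser script can do a scraper can replay (every name below is made up, just to sketch the shape):

<!--- Sketch, on Site B: issue-token.cfm. Site A's server calls this with its shared
      secret and gets back a short-lived token to embed in the iframe URL. --->
<cfparam name="application.tokens" default="#structNew()#">
<cfparam name="url.subscriber" default="">
<cfparam name="url.secret" default="">

<cfif url.secret NEQ getSubscriberSecret(url.subscriber)>  <!--- hypothetical lookup --->
    <cfheader statuscode="403" statustext="Forbidden">
    <cfabort>
</cfif>

<cfset token = createUUID()>
<cfset application.tokens[token] = dateAdd("n", 10, now())>  <!--- good for 10 minutes --->
<cfoutput>#token#</cfoutput>

<!--- Sketch, on Site B: at the top of every feed page, honour the token once,
      then rely on the normal session until it expires. --->
<cfif structKeyExists(url, "token")
      AND structKeyExists(application.tokens, url.token)
      AND dateCompare(now(), application.tokens[url.token]) LT 0>
    <cfset session.authorised = true>
    <cfset structDelete(application.tokens, url.token)>
</cfif>

<cfif NOT structKeyExists(session, "authorised")>
    <cfheader statuscode="403" statustext="Forbidden">
    <cfabort>
</cfif>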


Richard Quadling, Senior Software Developer

Commented:
Ah. I've just read through all of this.

In this film there is...

Me and my browser
Site A (your client)
Site B (you)

My browser makes a request to Site A.

It gets a bunch of HTML from Site A.

In the HTML there is an IFRAME to Site B.

My browser makes a request to Site B.

Site A does _NOT_ request the data from Site B.

My browser does.

So, if someone is scraping, it is a browser _AT_ site A.

Ted Bouskill, Senior Software Developer
Top Expert 2009

Commented:
With code I can access the full HTML of any site, and it can't be stopped. Ultimately you will send HTML in a stream I can read, and I can simulate cookies and referrer pages. Even IFRAMEs are easy to manage. Oh, and I know how to simulate any user agent.

If you monitor the logs closely you might catch my guesses, but if I study the site with my browser first and am careful, you would never know I'm doing it.

Sorry but without authentication the web is truly anonymous.
Richard Quadling, Senior Software Developer

Commented:
100% agree with tedbilly.

To put it another way ...

Anything you do that DOESN'T enforce authentication is a waste of time. It is 100% bypassable.

Providing your data so it can be displayed in an iframe means the end user is accessing the data, not the host providing the iframe.


gd
I only read partway through the posts here, but I'm posting because this may be similar to an issue I experience, and I thought I'd post general notes on what I do to solve it.

1) A function to determine if a user is a BOT or a human.
The user agent alone isn't effective, because anyone can identify themselves as any user agent.
You then set a cookie (cookie.isBot = 1 or 0) based on whether it is a bot or not,
with expires = never, so human users don't need to call the function with every page request (even if bots do).

2) Concerning sessions.
There is no reason to set session variables for bots, as they acquire a new session with each request.
So in onSessionStart():
<cfif NOT structKeyExists(cookie, "isBot")>
    <!--- run your bot-detection function once and remember the result --->
    <cfcookie name="isBot" value="#getBot()#" expires="never">
</cfif>
<cfif NOT cookie.isBot>
    <!--- set up whatever session variables real users need --->
</cfif>

3) And then in the page code, since no session vars are set for bots, you'd write something like:
<cfif NOT cookie.isBot>
    #session.stuff#
<cfelse>
    Hello Bot
</cfif>

To get to the point: when I receive error emails from a site that are confusing, in that they don't make sense as to why they are happening, I just Google the IP address (which is in the error email), and usually it belongs to a known "bad" bot. I then add it to the bot list, and from then on that entity is treated as a bot.
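
The list part itself is nothing fancy; the bare-bones shape (the addresses below are made up) is just something like:

<cffunction name="getBot" returntype="boolean" output="false">
    <!--- Sketch: treat anything on a manually maintained IP list as a bot. --->
    <cfset var badIPs = "203.0.113.42,198.51.100.17">
    <cfreturn listFind(badIPs, cgi.remote_addr) GT 0>
</cffunction>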

Now, I see your problem, where the offender is coming from within a customer's network. In your error emails, are you saving the query string and dumping the session and/or form scopes? You might be able to determine a few things by looking at those.

In this case, though, you may need to move this over to a system where customers need to log in and data cannot be accessed unless the user is logged in.

just thought i'd post ...
good luck ...
Thanks dgrafx, you bring up a good point: I should just try to block bots... except how can you tell?

> function to determine if a user is a BOT or a human

Does this exist? That's why CAPTCHA exists: the server can't tell whether it's a human or not... unless they play nice and identify themselves, such as the Google bot.



tedbilly,

You're most likely correct; the mission is futile if the goal is 100% security.

However, it is still a quest worth pursuing.

Let's say, then, that my mission is to make it difficult: someone would need some tech savvy to scrape our pages.

Google Maps is doing it; I'm not sure how secure their approach is, but that's the essence of what I want to accomplish.

>>function to determine if a user is a BOT or a human

I thought about it and decided I didn't want to post my "classified" secrets here :)
Even though they are not worthy of a Nobel prize, I see no reason to post code that could possibly make my work harder...

I will email you, hopefully sometime today, with an explanation.

Essentially, though, I do modify it now and then: as new-generation bots come out and do "new stuff", I need to counter with code modifications.



Ted Bouskill, Senior Software Developer
Top Expert 2009

Commented:
@dgrafx: If you finish it, tell me where to look. I'll bet I can get past it.

@gdemaria: Have you considered that the problem might not be a problem? For example, Microsoft ISA Server has a feature that prefetches frequently visited sites at regular intervals to warm the cache for users. Other products have that feature too. Some will also try to refresh the cache before the next user asks for it. Maybe what appears to be malicious is actually completely innocent.

Ted Bouskill, Senior Software Developer
Top Expert 2009

Commented:
http:#26472009 is correct. A web server cannot execute a Turing test without using CAPTCHA (which, according to the asker, cannot be used here).
Thanks for the ideas, folks. I will continue to work on a mechanism to restrict access to the site without authentication. I understand this cannot be 100%, but I don't need it to be at that level; after all, it's completely open right now, so anything is better. The model I am looking into is along the lines of placing Google Maps on your site using an encrypted key.
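
Roughly, the shape I have in mind (a sketch only; the names are hypothetical, and a plain SHA-256 hash stands in for a proper HMAC, which ColdFusion 8 doesn't ship with):

<!--- The subscriber's server builds the link, e.g.
      /subscriber123/show-data-list.htm?client=xyz&exp=1265000000&sig=sha256(secret & path & exp) --->

<!--- Sketch, on our side: verify the expiring, signed link before serving anything. --->
<cfparam name="url.client" default="">
<cfparam name="url.exp" default="0">
<cfparam name="url.sig" default="">

<cfset secret = getSubscriberSecret(url.client)>  <!--- hypothetical lookup --->
<cfset expected = lCase(hash(secret & cgi.script_name & url.exp, "SHA-256"))>
<cfset nowEpoch = dateDiff("s", createDateTime(1970, 1, 1, 0, 0, 0), dateConvert("local2Utc", now()))>

<cfif NOT isNumeric(url.exp) OR url.exp LT nowEpoch OR lCase(url.sig) NEQ expected>
    <cfheader statuscode="403" statustext="Forbidden">
    <cfabort>
</cfif>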
