gdemaria asked:
Detect Thief! Screen scrapers

We have a client who has a lot of valuable data on their site.  We have discovered a competitor site is somehow electronically copying the data.  We have tested by making a change to a small data element and within a couple of days the exact change appears on their other site.

We are trying to figure out how to find/track their software that is pulling from our site.

We want to know what URL on our site they may be using (complex reason, but it would help)
We want to be able to track their activity and the source IP address
Once we identify them, we are planning strategies to stop it (such as feeding them bad data)


Thanks!
Ron Malmstead

Can you post the two sites?
Yours and the one that is spoofing you?

That might make it easier to give you the answer.
They could be copying table data or entire elements, or even just using an iframe of your whole site.  Blocking their IP may help temporarily, but then they could simply proxy it.

Also, if this information is "public" then you're going to have trouble keeping them from doing it.  If it were private, and you could say.. password protect it.. that would be easier obviously.
Put the URL of your client site in Copyscape.com. There is a free and paid version of copyscape.

To stop the scraping you would have to get legal counsel and then contact the owner of the other site who can be found by domain name lookup.

Good luck!
It's very easy to change the IP address that the screen scraping is done from.  Google got tired of it so their solution was to put up search results using javascript which most screen scrapers are unable to run.  Go do a Google search and then look at the "View Source" for the page.
gdemaria (ASKER)

Thanks for the responses everyone!
@xuserx2000 - this isn't an iFrame, they are pulling specific data elements from our site into their own database and reformatting them as part of their own output.  An example would be scraping Yahoo Finance for Stock quote values and then posting those values on your own site in your own format.

@gkrew - copyscrape will help me identify which website is posting my material.  I already know this.  Thanks.

@DaveBaldwin - just checked google, that's clever.  I was thinking about encoding or something and will try to hunt down some techniques on doing this.   Thanks!   But first, I want to track them and identify them..

@Everyone -
Part of the reason I want to try and track this is because I cannot be sure what IP address the bot is coming from.   If I can see what page they are scraping and what IP address is doing the deed, then I can do something.   Anyone know how to track it?
'Scraping' looks exactly like someone viewing your page in a browser.  You will not be able to tell the difference unless they are very stupid.  Since I don't think you can successfully track them down, I have to recommend that you find a way to load your pages that can't be scraped.
If you have copyrights, then I would contact my lawyer.  You might also check out their webhost's TOS and send a note to them as well.  That has worked once for me.  You can look in your logs and detect what IP's spend the most time and visits.  

The problem with trying to block them is that you would most likely kill your SEO, so a lawyer is the best way.

Things to try that would hurt SEO would include:
Obfuscating the HTML with JS, something like document.write(unobfuscate(pagecontent))
Injecting JavaScript into your code: inside <script></script> tags, call a home-made web service on your site that detects which domain the page is being served from and, if it is not yours, replaces all the text with big bold letters "Illegal content" or, even better, automatically redirects back to your site.

I wouldn't try and ban IP's because they can be spoofed.
You could also employ captcha, if a client requests a page x number of times in x amount of time.. require captcha input to view the page, if it fails redirect to "sorry no-bot's" page.
I suggested copyscape not copyscrape to see if other sites other than the one you found was also doing it.

Are you willing to explore contacting the owner of the scraper site using the domain name registration information?
Lots of good comments - thanks to everyone!   Not sure many of those are applicable to my situation, I'll try to explain that.

IDs - my data is in a table format, the cells/rows don't have IDs, but I understand the concept of changing things up to make it harder.

Blocking IPs may not work - I understand that they can get around this, but I wouldn't mind trying to find their IP in case they aren't that clever, that way  I could be sure who they are and set up something to monitor.

Scraping looks like any user - really?  there's no difference in user agent or maybe the identification of the browser?

Contact Lawyer - we may very well do this, but it will only help this once.  We were hoping for a mechanism (logs or something) that may help us notice suspicious activity and find the culprit.

Obfuscating the HTML may be a great approach for us provided we can do that to only the data tables and not the text around it.  The data table is useless to search engines anyway (just a bunch of numbers really).  

Inserting javascript into scraped info - I don't think this will work because they are scraping a table of data (like stock quotes).   So they would strip out any alpha characters just to get the numeric data anyway and that would lose the js.

Captcha - can't use it in my case as we don't do form posts to get the database, just go to the page and view it.  

Contacting owner  - we can do this, just working on an approach that would work for the next one who we don't know about.  It's really amazing that we found this one (remember is mostly just data with a little text so it's hard to identify as ours)


Thanks again - any thoughts on how to easily obfuscate the data table and still allow it to be viewed on browsers and mobile devices?
or again...  software I can use to search logs to identify the thief's activity either by frequency of hits, user agents, pages or any other activity..
"my data is in a table format"..
Right, but the <table> element may have a static ID ..(ie. <table id="Stocktable1">)
They may simply be taking that ID with getElementById.. to get the page object.
You could also make "empty" table element right before or after the actual table, ..sometimes have it appear, sometimes not.. which would "trick" their code if all they are looking for is a <table> tag.  They might hit the first table tag and come up empty.
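A rough sketch of the changing-ID and decoy-table idea, written in Python only for illustration (the actual site may well be ASP on IIS). The function name render_table and the "t" + random hex ID scheme are made up for this example, and this would only slow down a scraper that keys on a fixed ID or on the first <table> tag.

```python
import random
import secrets
from html import escape


def render_table(rows, decoy_probability=0.5, rng=random):
    """Render rows as an HTML table whose id changes on every call.

    With probability decoy_probability an empty, hidden decoy table
    is emitted first, so "take the first <table>" comes up empty.
    """
    parts = []
    if rng.random() < decoy_probability:
        # Decoy: no rows, hidden from human visitors.
        decoy_id = "t" + secrets.token_hex(4)
        parts.append('<table id="%s" style="display:none"></table>' % decoy_id)

    # Real table: a fresh random id instead of a static one.
    table_id = "t" + secrets.token_hex(4)
    parts.append('<table id="%s">' % table_id)
    for row in rows:
        cells = "".join("<td>%s</td>" % escape(str(cell)) for cell in row)
        parts.append("<tr>%s</tr>" % cells)
    parts.append("</table>")
    return "\n".join(parts)
```

Calling render_table([["ABC", 12.5], ["XYZ", 3.75]]) on each page view gives the same visible data with different markup each time.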

"Captcha - can't use it in my case as we don't do form posts to get the database, just go to the page and view it.   "

Right, but ..you could track if a page is being called over and over again from the same client/IP address within a short period of time, and THEN have a captcha appear... to prove it's human.

I would definitely contact a lawyer though, no matter what.
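The "same IP, many requests, short period" check could look something like the sketch below. It is a minimal in-memory version, assuming a single server process; the class name RequestTracker and the thresholds (30 requests per 60 seconds) are illustrative guesses, not recommendations. The captcha itself is not shown, only the decision of when to ask for one.

```python
import time
from collections import defaultdict, deque


class RequestTracker:
    """Sliding-window request counter per client IP."""

    def __init__(self, max_requests=30, window_seconds=60):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent hits

    def needs_captcha(self, client_ip, now=None):
        """Record one request and report whether this IP is over the limit."""
        now = time.time() if now is None else now
        hits = self._hits[client_ip]
        hits.append(now)
        # Forget requests that have fallen out of the window.
        while hits and hits[0] <= now - self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests
```

The page handler would call needs_captcha(ip) on every view and show the captcha only when it returns True, so ordinary visitors never see it. A scraper that loads the page once a day would not trip this at all.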
Scraping looks like any user - really?  there's no difference in user agent or maybe the identification of the browser?

If they're "doing it right", it just looks like a browser visited your page.  If they are lazy, they might be doing it every day, so you can find them because they come from the same IP address all the time.  If they are good and are doing this to others, they will have multiple IP addresses that they use.  They only have to load it once a day or however often you change your page.
skullnobrains

software I can use to search logs to identify the thief's activity either by frequency of hits, user agents, pages or any other activity..

assuming your server logs in apache log format, simple commands will let you see if the attacker is dumb enough to be easily found by ip or user agent

assuming your log is in NCSA format (which is a pain to parse), here are a few examples

ips that produced the greatest number of queries
cut -d ' ' -f 1 logfile | sort | uniq -c | sort -n

same thing hour by hour (will be messed up if you use ipv6 but easy to change)
sed 's/\([0-9.]*\)[^:]*:\([^:]*\).*/\2 \1/' logfile | sort | uniq -c | sort -n

if you are reasonably fluent with the command line, it should be pretty easy to adapt. we'll help if needed.

other than that, you'll find plenty of log analysers around. awstats is pretty good
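If the shell is not an option, the same two counts can be sketched in Python. This assumes NCSA common log lines of the form `ip ident user [dd/Mon/yyyy:hh:mm:ss zone] "request" status bytes`; the regular expression and function names are just for this example, and lines that do not match are skipped.

```python
import re
from collections import Counter

# Captures: client address, date (dd/Mon/yyyy), hour (hh).
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[([^:]+):(\d{2}):')


def hits_per_ip(lines):
    """Count requests per client address."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def hits_per_ip_hour(lines):
    """Count requests per (client address, date, hour)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[match.groups()] += 1
    return counts
```

Usage would be along the lines of `with open("access.log") as f: print(hits_per_ip(f).most_common(10))`, where access.log is a placeholder file name. Because the address is matched as any run of non-space characters, IPv6 addresses are counted too.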
Do you have any updates to share on this?
Thanks for checking back.    My first task is to find who is doing it, then use one of the suggestions to change it (such as ajax), but that will take a little while.

I attempted to place hidden text of the IP address of the scraper into the text field, but the scraper seems to strip out all HTML from the text.

I would really like to find logs to see who is likely scraping, Google Analytics does not seem to go down to the level of individual access so I can't see IP x.x.x.x accessed the site every 2 minutes between midnight and 8 AM... something like that.   I am trying to hunt down a log viewer that may give me that info.. any thoughts?  IIS 5.
You need to use your server stats package to get that info.
"server stats package" for IIS5?  Think Windows 2000.  You should really get out more @padas.
awstats will do the trick on windows as well
http://www.howtogeek.com/50526/setting-up-awstats-on-windows-server-and-iis/

you can also run the commands i provided in cygwin or bash-for-windows but if you're not used to the unix shell, it will likely be difficult to adapt them. i'll give you explanations if you want to go down that road.

if you think it is acceptable to provide a log file in whatever non binary format available, i'm ready to extract the information you want and post back. (W3C format would be great)
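For W3C extended logs, here is a rough Python sketch of the "which IP hit which page between midnight and 8 AM" question. It assumes the log has a `#Fields:` header line and that the time, c-ip and cs-uri-stem fields are enabled in the IIS logging properties; which fields are actually present depends on the server configuration. As far as I know IIS writes W3C times in UTC, so the hour range may need shifting to local time.

```python
from collections import Counter


def parse_w3c(lines):
    """Yield one dict per request, keyed by the latest #Fields: header."""
    fields = []
    for line in lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
        elif line and not line.startswith("#") and fields:
            values = line.split()
            if len(values) == len(fields):  # skip malformed rows
                yield dict(zip(fields, values))


def hits_in_hours(records, start_hour=0, end_hour=8):
    """Count requests per (client IP, page) with start_hour <= hour < end_hour."""
    counts = Counter()
    for rec in records:
        if "time" not in rec or "c-ip" not in rec:
            continue
        hour = int(rec["time"][:2])  # time is logged as hh:mm:ss
        if start_hour <= hour < end_hour:
            counts[(rec["c-ip"], rec.get("cs-uri-stem", "?"))] += 1
    return counts
```

Running `hits_in_hours(parse_w3c(open("ex110501.log")))` (the file name is a placeholder) and looking at `.most_common(20)` should surface any address that hammers one page overnight.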
Dave, I have been using that package since about 1999/2000 on a shared service.  You would know better about the legacy stuff than I would.   The old version of smarterstats 3.5 will run under iis5 http://smarterstats.en.softonic.com/ (not 100% sure of that link since it is not from the ss site)

If it works, it will tell you what is requested rather than what is served like google.  You can see where people are trying to access even when it is not there (like /wp-admin/).  There is some basic data mining.

Good luck.