Solved

Detect Thief!  Screen scrapers

Posted on 2013-07-01
23
230 Views
Last Modified: 2014-05-13
We have a client who has a lot of valuable data on their site.  We have discovered that a competitor site is somehow electronically copying the data.  We tested by making a change to a small data element, and within a couple of days the exact change appeared on the competitor's site.

We are trying to figure out how to find/track their software that is pulling from our site.

We want to know what URL on our site they may be using (complex reason, but it would help)
We want to be able to track their activity and the source IP address
Once we identify them, we are planning strategies to stop it (such as feeding them bad data)


Thanks!
0
Comment
Question by:gdemaria
23 Comments
 
LVL 25

Expert Comment

by:Ron M
ID: 39290694
Can you post the two sites?
Yours and the one that is spoofing you?

That might make it easier to give you the answer.
They could be copying table data or entire elements, or even just using an iframe of your whole site.  Blocking their IP may help temporarily, but then they could simply proxy it.

Also, if this information is "public" then you're going to have trouble keeping them from doing it.  If it were private and you could, say, password protect it, that would obviously be easier.
0
 
LVL 9

Expert Comment

by:David Carr
ID: 39290720
Put the URL of your client's site into Copyscape.com. There are free and paid versions of Copyscape.

To stop the scraping you would have to get legal counsel and then contact the owner of the other site who can be found by domain name lookup.

Good luck!
0
 
LVL 82

Expert Comment

by:Dave Baldwin
ID: 39290726
It's very easy to change the IP address that the screen scraping is done from.  Google got tired of it, so their solution was to render search results using javascript, which most screen scrapers are unable to run.  Go do a Google search and then look at the "View Source" for the page.
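A bare-bones sketch of the same trick, with made-up data and a made-up element id: the page ships an empty <table id="quotes"></table> and a script fills it in at load time, so anything that doesn't execute JavaScript sees an empty table in "View Source".

// The served HTML contains only the empty placeholder <table id="quotes"></table>.
// The values below are sample data; in real use they would come from the server.
var rows = [
  ["ABC", 12.34],
  ["XYZ", 56.78]
];

window.onload = function () {
  var table = document.getElementById("quotes");
  for (var i = 0; i < rows.length; i++) {
    var tr = document.createElement("tr");
    for (var j = 0; j < rows[i].length; j++) {
      var td = document.createElement("td");
      td.appendChild(document.createTextNode(String(rows[i][j])));
      tr.appendChild(td);
    }
    table.appendChild(tr);
  }
};

A determined scraper can still read the array out of the script or run a headless browser, but it does break the simple fetch-and-parse tools.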
0
 
LVL 39

Author Comment

by:gdemaria
ID: 39290753
Thanks for the responses everyone!
@xuserx2000 - this isn't an iFrame, they are pulling specific data elements from our site into their own database and reformatting them as part of their own output.  An example would be scraping Yahoo Finance for Stock quote values and then posting those values on your own site in your own format.

@gkrew - copyscrape will help me identify which website is posting my material.  I already know this.  Thanks.

@DaveBaldwin - just checked google, that's clever.  I was thinking about encoding or something and will try to hunt down some techniques for doing this.  Thanks!  But first, I want to track them and identify them.

@Everyone -
Part of the reason I want to try and track this is because I cannot be sure what IP address the bot is coming from.   If I can see what page they are scraping and what IP address is doing the deed, then I can do something.   Anyone know how to track it?
0
 
LVL 25

Accepted Solution

by:
Ron M earned 125 total points
ID: 39290775
One thing you could try is to make the IDs of the elements they are copying dynamic and randomly generated, along with the containers those elements reside in.  You could also add "dummy" elements of the same type which are empty or commented out.  That will make it trickier for the programmers to figure out exactly which elements they want to scrape.
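Just to show the shape of the idea (the ids, data and function names here are all made up, and since your pages are ColdFusion, treat this JavaScript as pseudo-code for whatever builds the page on the server):

// Each request gets a different, meaningless id plus an empty decoy table,
// so a scraper can't rely on a fixed id or on "the first <table> in the page".
function randomId(prefix) {
  return prefix + Math.random().toString(36).slice(2, 10);
}

function renderQuoteTable(rows) {            // rows: [["ABC", 12.34], ...]
  var realId = randomId("t");
  var decoy  = '<table id="' + randomId("t") + '"></table>';

  var body = rows.map(function (r) {
    return "<tr><td>" + r[0] + "</td><td>" + r[1] + "</td></tr>";
  }).join("");
  var real = '<table id="' + realId + '">' + body + "</table>";

  // decoy appears before the real table roughly half the time
  return Math.random() < 0.5 ? decoy + real : real + decoy;
}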

Trying to block by IP is useless, because they could proxy their connection through any other machine on the internet.  It could be a client machine running on DSL, where the IP will change every so often.  It won't be a reliable way to block them.  There is no way to distinguish their software's page request from any other request by a normal client, unless they have something identifiable in their request packets (other than the IP).

Again, if this is public information, they can always adapt their code to get what they want.
Take the example you provided, stock quotes on Yahoo: people do that often, and Yahoo knows it is basically powerless to stop it.  The only thing it can do is make minor changes in the code that "break" the competitors' code from time to time (thus rendering the competition unreliable).
0
 
LVL 82

Expert Comment

by:Dave Baldwin
ID: 39290819
'Scraping' looks exactly like someone viewing your page in a browser.  You will not be able to tell the difference unless they are very stupid.  Since I don't think you can successfully track them down, I have to recommend that you find a way to load your pages that can't be scraped.
0
 
LVL 52

Expert Comment

by:Scott Fell, EE MVE
ID: 39290937
If you have copyrights, then I would contact a lawyer.  You might also check out their web host's TOS and send a note to the host as well; that has worked once for me.  You can also look in your logs and see which IPs account for the most time and visits.

The problem with trying to block them is that you would most likely kill your SEO, so a lawyer is the best way.

Things to try that would hurt SEO include:
Obfuscating the HTML with JS, something like document.write(unobfuscate(pagecontent))
Injecting JavaScript into your code (<script></script>) where, inside the script tags, you hit a home-made web service on your site that can detect the domain the page is served from; if it is not yours, replace all the text with big bold letters reading "Illegal content" or, even better, auto-redirect back to your site (rough sketch below).
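A rough sketch of that second idea, simplified to a pure client-side hostname check rather than a web service call ("example.com" is a placeholder for your client's domain):

// If this markup ends up being served from any other host, wipe the page
// and send the visitor back to the original site.
window.onload = function () {
  var realHost = "example.com";
  if (window.location.hostname.indexOf(realHost) === -1) {
    document.body.innerHTML = "<h1>This content was copied without permission.</h1>";
    window.location.href = "http://www." + realHost + window.location.pathname;
  }
};

Of course it only fires if they re-serve your markup with the script intact; if they strip everything but the raw numbers it never runs.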

I wouldn't try to ban IPs because they can be spoofed.
0
 
LVL 25

Expert Comment

by:Ron M
ID: 39290962
You could also employ a CAPTCHA: if a client requests a page x number of times in x amount of time, require CAPTCHA input to view the page, and if it fails, redirect to a "sorry, no bots" page.
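The counting side of that is simple.  A rough sketch of the logic in plain JavaScript, just to illustrate (you would port it to whatever your server runs, and the limits are arbitrary):

// If one IP asks for the data page more than LIMIT times inside WINDOW_MS,
// flag it and serve a CAPTCHA instead of the table.
var WINDOW_MS = 60 * 1000;   // 1 minute
var LIMIT     = 30;          // requests per window before we challenge
var hits      = {};          // ip -> array of request timestamps

function shouldChallenge(ip) {
  var now  = Date.now();
  var list = (hits[ip] || []).filter(function (t) { return now - t < WINDOW_MS; });
  list.push(now);
  hits[ip] = list;
  return list.length > LIMIT;
}

// per request:
//   if (shouldChallenge(requestIp)) serve the CAPTCHA page
//   else                            serve the data page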
0
 
LVL 9

Expert Comment

by:David Carr
ID: 39290981
I suggested copyscape, not copyscrape, to see if sites other than the one you found are also doing it.

Are you willing to explore contacting the owner of the scraper site using the domain name registration information?
0
 
LVL 39

Author Comment

by:gdemaria
ID: 39291223
Lots of good comments - thanks to everyone!  I'm not sure many of those are applicable to my situation, so I'll try to explain.

IDs - my data is in a table format, the cells/rows don't have IDs, but I understand the concept of changing things up to make it harder.

Blocking IPs may not work - I understand that they can get around this, but I wouldn't mind trying to find their IP in case they aren't that clever; that way I could be sure who they are and set up something to monitor.

Scraping looks like any user - really?  there's no difference in user agent or maybe the identification of the browser?

Contact Lawyer - we may very well do this, but it will only help this once.  We were hoping for a mechanism (logs or something) that may help us notice suspicious activity and find the culprit.

Obfuscating the HTML may be a great approach for us provided we can do that to only the data tables and not the text around it.  The data table is useless to search engines anyway (just a bunch of numbers really).  

Inserting javascript into scraped info - I don't think this will work because they are scraping a table of data (like stock quotes).   So they would strip out any alpha characters just to get the numeric data anyway and that would lose the js.

Captcha - can't use it in my case, as we don't do form posts to get the data; you just go to the page and view it.

Contacting the owner - we can do this; I'm just working on an approach that would work for the next one, who we don't know about.  It's really amazing that we found this one (remember, it's mostly just data with a little text, so it's hard to identify as ours).


Thanks again - any thoughts on how to easily obfuscate the data table and still allow it to be viewed on browsers and mobile devices?
0
 
LVL 39

Author Comment

by:gdemaria
ID: 39291226
Or again... software I can use to search logs to identify the thief's activity, either by frequency of hits, user agents, pages or any other pattern..
0
 
LVL 25

Expert Comment

by:Ron M
ID: 39291250
"my data is in a table format"..
Right, but the <table> element may have a static ID (i.e. <table id="Stocktable1">).
They may simply be grabbing that ID with getElementById to get the page object.
You could also place an "empty" table element right before or after the actual table, sometimes having it appear and sometimes not, which would "trick" their code if all they are looking for is a <table> tag.  They might hit the first table tag and come up empty.

"Captcha - can't use it in my case as we don't do form posts to get the database, just go to the page and view it.   "

Right, but ..you could track if a page is being called over and over again from the same client/IP address within a short period of time, and THEN have a captcha appear... to prove it's human.

I would definitely contact a lawyer though, no matter what.
0
 
LVL 52

Assisted Solution

by:Scott Fell, EE MVE
Scott Fell,  EE MVE earned 125 total points
ID: 39291264
>Scraping looks like any user - really?  there's no difference in user agent or maybe the identification of the browser?

You might be able to pick something up from your logs.  

Scraping typically goes by a pattern, or simply by finding an ID or unique class.  Since it is just the table, some other options:

Change the code from using a table to divs with a grid system, which is typically just floating your divs to the left and adding some right margin.  There are frameworks for this, from the simple http://www.blueprintcss.org/ to responsive-ready http://foundation.zurb.com/ or http://twitter.github.io/bootstrap/
Hide the table data with CSS on load, then use jQuery to make the data appear after a delay.  I have never tried this, but a spider is going to hit the page and probably ignore the JS; because the data is hidden, it will not be rendered.  Actually, I would suggest not hiding it but rather using ajax to bring the information in from another page that contains only the data/table (rough sketch below).
Create a button that is not a real button or hyperlink and use JavaScript to detect the click.  Once clicked, display the data via ajax just as above; the only difference is a button instead of a delay.
Since SEO is not an issue, make your data table an image.  If you need to generate it dynamically, you can use server-side code to do this along with installing something like http://www.imagemagick.org/script/index.php on the server, or check whether ColdFusion supports this natively.
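A bare-bones sketch of the ajax option (the URL and element id are placeholders, and it assumes jQuery is already on the page):

// The page ships with an empty container, e.g. <div id="quotes"></div>.
// After load, the table fragment is pulled in from a separate URL that
// returns only the table markup.
$(function () {
  $.get("/quote-table.cfm", function (fragment) {
    $("#quotes").html(fragment);
  });
});

A scraper that only fetches the main page and never runs the script sees no data, although one that watches the network traffic can still hit the fragment URL directly.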
0
 
LVL 52

Assisted Solution

by:_agx_
_agx_ earned 125 total points
ID: 39291316
> Scraping looks like any user - really?  

If you're not doing it already, it'd be helpful to track stuff like the user agent and other headers anyway.  Whether it turns up anything useful depends on how savvy their scraping tool is.  For example, if they didn't know any better and used cfhttp, it'd send something like "ColdFusion ...." as the user agent by default.

Maybe a stupid idea, but any chance of embedding some unique value, based on the current IP, in the data being scraped? Something a bot wouldn't recognize, but hidden from a normal user.  Ultimately the hidden value would show up on the scam site.  If you combine it with a regular logging/stats table, at least you have an idea of when it occurred and from which IP (for what it's worth).
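Rough illustration of what I mean; everything here is made up, and in reality you'd generate it server-side in your ColdFusion code rather than in Node:

// Build a short token from the requesting IP, log it with a timestamp, and
// tuck it into the page where a person won't notice it.  If the token later
// shows up on the other site, the log tells you which IP pulled that copy.
var crypto = require("crypto");   // Node built-in, used here only for the sketch

function watermarkFor(ip) {
  return crypto.createHash("md5").update(ip + "some-secret").digest("hex").slice(0, 6);
}

function watermarkedCell(value, ip) {
  var token = watermarkFor(ip);
  console.log(new Date().toISOString(), ip, token);   // stand-in for your real logging/stats table
  // hidden marker next to the real value; invisible to a normal visitor
  return value + '<span style="display:none">' + token + "</span>";
}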

(Edit) I've only got half a working brain today, so don't waste too much time on this idea if it sounds completely unfeasible. Just throwing it out there...
0
 
LVL 82

Expert Comment

by:Dave Baldwin
ID: 39291591
> Scraping looks like any user - really?  there's no difference in user agent or maybe the identification of the browser?

If they're "doing it right", it just looks like a browser visited your page.  If they are lazy, they might be doing it every day, so you can find them because they come from the same IP address all the time.  If they are good and are doing this to others, they will have multiple IP addresses that they use.  They only have to load it once a day, or however often you change your page.
0
 
LVL 26

Expert Comment

by:skullnobrains
ID: 39292264

> software I can use to search logs to identify the thief's activity either by frequency of hits, user agents, pages or any other activity..

assuming your server logs are in apache log format, a few simple commands will let you see whether the attacker is dumb enough to be easily found by ip or user agent

assuming your log is in NCSA format (which is a pain to parse), here are a few examples

ips that produced the greatest number of queries
cat logfile | cut -d ' ' -f 1 | sort | uniq -c | sort -n

same thing hour by hour (will be messed up if you use ipv6 but easy to change)
cat logfile | sed 's/\([0-9.]*\)[^:]*:\([^:]*\).*/\2 \1/' | sort | uniq -c | sort -n
(the sort before uniq -c matters: uniq only counts adjacent duplicates)

if you are reasonably fluent with the command line, it should be pretty easy to adapt. we'll help if needed.

other than that, you'll find plenty of log analysers around. awstats is pretty good
0
 
LVL 26

Assisted Solution

by:skullnobrains
skullnobrains earned 125 total points
ID: 39292277

> any thoughts on how to easily obfuscate the data table and still allow it to be viewed on browsers and mobile devices?

loading the data through ajax polls should work on mobile devices and should be rather efficient

writing it with js works as well: the idea is not just to embed js, but to write the data into the cells using js, so stripping out the script strips out the data with it

you can also use the CSS content property in a similar way

you can generate an image with the data in the table

you can generate many partially overlapping images to achieve the same goal

whatever such means you use, if they are writing code specifically to scrape your site, they can probably adapt. any chance they are doing the scraping manually?

---

as suggested above, if you can actually make them load unique information (maybe by storing an incremental number in some data digits that are not significant to most people) and log the ips and numbers, you can both find the ip(s) and use that as evidence later on.
0
 
LVL 9

Expert Comment

by:David Carr
ID: 39327486
Do you have any updates to share on this?
0
 
LVL 39

Author Comment

by:gdemaria
ID: 39327515
Thanks for checking back.    My first task is to find who is doing it, then use one of the suggestions to change it (such as ajax), but that will take a little while.

I attempted to place the scraper's IP address as hidden text inside the text field, but the scraper seems to strip out all HTML from the text.

I would really like to find logs to see who is likely scraping.  Google Analytics does not seem to go down to the level of individual accesses, so I can't see that IP x.x.x.x accessed the site every 2 minutes between midnight and 8 AM, something like that.  I am trying to hunt down a log viewer that may give me that info.. any thoughts?  IIS 5.
0
 
LVL 52

Expert Comment

by:Scott Fell, EE MVE
ID: 39327541
You need to use your server stats package to get that info.
0
 
LVL 82

Expert Comment

by:Dave Baldwin
ID: 39327550
"server stats package" for IIS5?  Think Windows 2000.  You should really get out more @padas.
0
 
LVL 26

Expert Comment

by:skullnobrains
ID: 39327631
awstats will do the trick on windows as well
http://www.howtogeek.com/50526/setting-up-awstats-on-windows-server-and-iis/

you can also run the commands i provided in cygwin or bash-for-windows but if you're not used to the unix shell, it will likely be difficult to adapt them. i'll give you explanations if you want to go down that road.

if you think it is acceptable to provide a log file in whatever non binary format available, i'm ready to extract the information you want and post back. (W3C format would be great)
0
 
LVL 52

Expert Comment

by:Scott Fell, EE MVE
ID: 39327837
Dave, I have been using that package since about 1999/2000 on a shared service.  You would know better about the legacy stuff than I would.  The old version of SmarterStats 3.5 will run under IIS 5: http://smarterstats.en.softonic.com/ (not 100% sure of that link since it is not from the SS site)

If it works, it will tell you what is requested rather than what is served, which is what Google Analytics shows.  You can see where people are trying to access even when it is not there (like /wp-admin/).  There is some basic data mining.

Good luck.
0
