Solved

Too much traffic?

Posted on 2008-06-10
633 Views
Last Modified: 2013-12-07
A colleague (domain obfuscated by me) writes:

"Much as we love the idea of higher traffic, the big numbers become less desirable when it becomes clear that there are no actual eyeballs behind many of those visits.

We're seeing a surge in non-human traffic on Domain Online, and are not quite sure what to do about it. These are not day-to-day spikes separated by days of "normal" traffic. Since March of this year, there has been a steady increase in stress on our servers. Four weeks ago, traffic that was already double or triple our "normal" numbers quadrupled. This traffic has been so significant that our web servers (two IIS servers) have, on occasion, been knocked or nearly knocked offline.

For the month of May, our Urchin Tracking Monitor, which counts only those sessions and page views from browsers accepting JavaScript, showed about 35,000 sessions, 3,400,000 hits and 80,000 page views per day. Over the same period our non-UTM stats, which include traffic of all sorts, show daily traffic of 105,000 sessions, 3,400,000 hits and 1,175,000 page views. Of the total hits, 3,035,000 come from robots (63% from "Mozilla compatible" agents, with 1.3 million identifying themselves as Googlebot).

We're using a SQL Server database on Windows Server 2003. Our tech folks ruled out the possibility of an SQL injection attack because we weren't being hit from a single domain range.

Anybody have experience with these kinds of ratios of human to non-human traffic? Will adding web servers help us? Other solutions?"

Any suggestions would be appreciated.

ep
Question by:ericpete
14 Comments
 
LVL 6

Assisted Solution

by:Jk387
Jk387 earned 65 total points
They may be getting hit with a denial-of-service attack: there are methods out there that use multiple computers or servers to bombard a web server with traffic and prevent the website from being viewed or from functioning reasonably. I am no expert on fixing this type of thing, unfortunately; I just thought you would appreciate the idea.
 
LVL 3

Assisted Solution

by:sajain84
sajain84 earned 65 total points
A friend of mine set up and managed a very popular forum. He would complain that, because of how regularly Google indexed his website, his bandwidth quota would get used up quickly.

If you have a similar site (a forum, a blog, etc.) with lots of pages, you could try a robots.txt file, which carries instructions about when and how frequently to index your site; all the major search engines will follow these rules.
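A minimal example, placed at the web root as /robots.txt (the /search/ path here is hypothetical; the idea is to keep crawlers off whatever dynamic pages are expensive for you). One caveat: Crawl-delay is honored by some crawlers but not by Googlebot, whose crawl rate is set separately in Google Webmaster Tools:

    User-agent: *
    Crawl-delay: 10
    Disallow: /search/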

The other possibility is that someone is scraping your website: running bots that extract data from your web pages and store or use it elsewhere. This is not uncommon for directory-listing-type sites.

In that case, if you can zoom in on the IPs that are hitting your website, you can block access to them at the server (reject requests from IPs on a list).
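To find candidates for that list, you can tally client IPs straight from the IIS logs. A rough Python sketch, assuming W3C-format logs (the log file name is hypothetical, and the column layout is read from the #Fields: header that IIS writes before the data rows):

    # Count requests per client IP in an IIS W3C-format log file.
    from collections import Counter

    counts = Counter()
    with open('ex080601.log') as log:      # hypothetical log file name
        fields = []
        for line in log:
            if line.startswith('#Fields:'):
                fields = line.split()[1:]  # column names for the data rows
            elif not line.startswith('#') and line.strip():
                row = line.split()
                counts[row[fields.index('c-ip')]] += 1  # c-ip = client IP

    # The heaviest hitters are the ones worth investigating or blocking.
    for ip, hits in counts.most_common(20):
        print(hits, ip)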

The final possibility is a denial-of-service attack: people getting together and hitting a website with tons of requests so that the servers go down.

In your case, I have a strong feeling it's the first, the second, or both. If it does turn out to be a DoS attack, you'll probably have to contact security experts who can guide you to a better solution.

I'm sorry I haven't been able to solve this cleanly for you, but I do hope this helps.
 
LVL 15

Author Comment

by:ericpete
It's likely that the site is being Dugg and so on, which could account for a lot of the traffic, I suppose. They seem to doubt that it's a DoS attack based on the logs, as noted in the question, but I'll pass that along too.

I'm going to leave this open to see if I get any other ideas/suggestions. I'll also answer any questions as best I can in order to get some other specifics.

ep
 
LVL 3

Expert Comment

by:sajain84
Could you let us in on the nature of the website? Is it a forum, a blog, something else? That would throw some more light on the situation.

Getting Dugg is unlikely, because you mentioned that most of the traffic is automated bot traffic. The Digg effect (when your site gets Dugg) comes from actual people opening the site and looking at it, and those visits would register in your Urchin records as actual people, not automated bots.
 
LVL 15

Author Comment

by:ericpete
www.poynter.org (realized that there's no compelling reason to hide it).

There's a lot going on there as you can see.

ep
 
LVL 50

Assisted Solution

by:Steve Bink
Steve Bink earned 95 total points
It could be because your popularity is skyrocketing.  :)

If people are linking to your website, spiders will follow those links back to you. If you have generated content (a blog, for example), the spider will crawl every link it finds, where a human will only follow one or two. That would obviously cause this kind of disparity in the numbers: in the roughly 70,000 non-Urchin sessions, approximately 1.1 million page views were generated. That certainly sounds like being spidered.

Depending on the nature of your site, and how well it places in search engines, you could be getting scraped by spammers. These guys just set up a botnet to pull all the information it can find from your site; that info is then used to populate a bait site for search engines. These scraping clients will generally show up as non-JS bots.

I doubt it is a DoS attack. If it were a real DoS attack, your server would be unresponsive within minutes each and every time you put it up. Also, Google does not generally participate in such attacks, and it accounts for a large number of the hits in your stats.

Do you subscribe to any third-party certification services, such as ScanAlert?
 
LVL 51

Assisted Solution

by:Keith Alabaster
Keith Alabaster earned 95 total points
Hey mate - seems to have been a while since I saw your name on a question :)

Can you provide some more detail on the hits being seen? Are they all arriving on port 80 of this site, or is the protocol spread wider than that? That is, is the number of hits seen on the published web sites less than the number of hits on the external interface of the router/firewall outside them?

For example, our (Government Agency) web site sees approximately 500K hits per day, but we also block over 100,000 spam mails per day, and god knows how many potential hits arrive on ports that are not even open on the external firewall (they still get logged, though).

What are the ranges of IP addresses? Can they be tracked back to a particular country or continent?

We have had to cop out and introduce a pair of Cisco load balancers; we have not yet found a way to strip out unwanted traffic that still meets the IP/port requirements. We did try putting up an 'accept terms' page that timed out and dropped the connection after 30 seconds, to try to stall some of the bots, and we also put in the traditional options to exclude spiders, crawlers, etc. The exclusions helped a lot, but the terms page had no real noticeable effect.

Keith
 
LVL 19

Assisted Solution

by:Redimido
Redimido earned 65 total points
I would check with the Google webmaster tools (https://www.google.com/webmasters/tools) to see whether Google really is the one hitting the web site. If not, I would reject requests from any agent calling itself Googlebot that does not come from Google's network, and continue like that until you have sorted out all the visits you do not want.
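One way to check that an agent claiming to be Googlebot really comes from Google's network is the forward-confirmed reverse DNS test that Google itself recommends. A minimal Python sketch (the sample IP is just an illustration):

    import socket

    def is_real_googlebot(ip):
        # Reverse lookup: genuine Googlebot IPs resolve to a host under
        # googlebot.com or google.com.
        try:
            host = socket.gethostbyaddr(ip)[0]
        except socket.herror:
            return False
        if not host.endswith(('.googlebot.com', '.google.com')):
            return False
        # Forward-confirm: the name must resolve back to the same IP,
        # otherwise anyone could fake the PTR record.
        try:
            return ip in socket.gethostbyname_ex(host)[2]
        except socket.gaierror:
            return False

    print(is_real_googlebot('66.249.66.1'))  # sample address for illustration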

As said, your primary tool is the robots.txt file. You will need to analyze the web logs to see how much traffic comes from each crawler, and decide whether you want to be indexed by all of them. You can also see whether adding meta tags to the pages you do not want listed helps. This is an ongoing effort, and I suggest you get in touch with somebody at Google.
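For reference, the meta tag in question is the standard robots tag; it goes in the <head> of each page you want kept out of the index:

    <meta name="robots" content="noindex, nofollow">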
 
LVL 15

Author Comment

by:ericpete
routinet,

I've asked if they subscribe to something like ScanAlert, but your explanation of spiders seems more plausible.

keith_alabaster,

At this point, I ask questions more for other people than I do for myself; that means that either I'm not doing anything I haven't been doing for a while, or I have all my acquaintances so buffaloed that they think I know everything.

I've not received a reply to the message I sent them (see my comment to routinet; your questions were in the same email), but when I do, I will post immediately. I have also asked if they have considered load balancers.

Redimido,

Thanks. I've already sent them some information on using robots.txt to limit the Googlebot-type scans; in it, I did mention non-Google googlebots, though I've never personally heard of such a thing. That's not to say they don't exist; just that I've never seen one.

ep
 
LVL 3

Expert Comment

by:sajain84
I tend to agree with routinet: it could very well be that your popularity is increasing.

A search on Alexa and Compete does show an upward trend in visitors to the website since January 2008.

So my guess is that as more people read the content, then blog about it and link back to the articles, spiders end up crawling your website more.
 
LVL 23

Accepted Solution

by:Mysidia
Mysidia earned 115 total points
It doesn't sound like an attack; an ongoing, blatant attack would generate many more requests than indicated.

But "not getting hit by a single domain range" is not a valid basis for ruling an SQL injection attack in or out. (An increased hit count is not characteristic or indicative of an SQL injection problem, either.)

Something to keep in mind is that bot traffic has a role: when the site is indexed by search engines, you get more visitors from them. The more search engines that have indexed your content, the more search-engine users will follow links to you.

You can drop an entry into robots.txt to disallow all crawlers, or all crawlers except known ones, but that is probably a bad idea if you want the site to grow and attract many human visitors.
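For reference, the "all crawlers except known ones" variant would look something like this (an empty Disallow grants full access; which bots to allow is your call):

    User-agent: Googlebot
    Disallow:

    User-agent: *
    Disallow: /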

The really bad crawlers won't look at robots.txt anyway; the best thing to do is identify those by their unusual activity and ban them by IP.

If the servers are being nearly knocked offline by what is becoming ordinary traffic, then the next logical step is probably to work out a plan for scaling up the application, so there is at least a little breathing room for the site to grow and to survive unexpected bursts of traffic (flash crowds).

That may involve buying more bandwidth and adding more servers: load balancers, web servers, database servers, etc.

It may also involve development work: changes to the scripts that drive the site, increased use of memory caching, tuning of the database system, and possibly a re-examination of the choices of server software, database and API/framework, all in order to function acceptably at the larger scale.
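Since the platform here is IIS, take this only as a stack-neutral illustration of the memory-caching idea; a minimal Python sketch where query_database is a hypothetical stand-in for whatever expensive work builds a page:

    import time

    _cache = {}  # key -> (expiry timestamp, cached value)

    def query_database():
        # Hypothetical stand-in for an expensive database query.
        time.sleep(1)
        return ['article 1', 'article 2']

    def cached(key, ttl, compute):
        """Return a cached value, recomputing it only after ttl seconds."""
        now = time.time()
        entry = _cache.get(key)
        if entry and entry[0] > now:
            return entry[1]          # still fresh: skip the expensive call
        value = compute()
        _cache[key] = (now + ttl, value)
        return value

    # Serve the home page's article list from memory for 60 seconds, so a
    # burst of bot requests does not become a burst of database queries.
    articles = cached('homepage', 60, query_database)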

Many efficiency considerations that don't matter on a small site become a lot more important when you have more visitors.

Five million hits a month is actually not very many: only about two hits per second on average.
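For the arithmetic: 5,000,000 hits ÷ (30 days × 86,400 seconds per day) ≈ 1.9 hits per second.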

If a load that small is crashing web servers, then either the servers are old and slow, or there are some serious application inefficiency issues.


 
LVL 19

Expert Comment

by:Redimido
Indeed. You need to work on making the site scalable; maybe a load balancer or a caching network is in order.
 
LVL 15

Author Comment

by:ericpete
Thank you, all. I appreciate the ideas.

As I have not heard back, I'm going to close this question; if some specifics are requested of me, I will open a new question using the Ask A Related Question feature to ensure that you are all notified.

Great work, folks.

ep
 
LVL 51

Expert Comment

by:Keith Alabaster
:)
