web crawlers shutting down my site

I have a tomcat site running on a server2008r2 machine.

My site is no longer responsive.

a netstat -a command yields the results in the attached file

how do I block all those things from jamming my port 80?

thanks
netstata2.txt
gateguardAsked:
Who is Participating?

[Product update] Infrastructure Analysis Tool is now available with Business Accounts.Learn More

x
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

Guy LidbetterCommented:
Hi gateguard,

interesting read here on managing spiders on a site:
https://en.wikipedia.org/wiki/Robots_exclusion_standard

You could configure the CrawlerSessionManagerValve.
http://tomcat.apache.org/tomcat-7.0-doc/config/valve.html#Crawler_Session_Manager_Valve

you should also consider preventing Tomcat from putting the session ID into the URL.
Otherwise it appears in search engines.
From Tomcat 7 this should work...

<session-config>
   <tracking-mode>COOKIE</tracking-mode>
</session-config>

Open in new window

gateguardAuthor Commented:
Thanks, Guy.  I'm going to try these solutions.

I do have a question regarding the wikipedia article, something I don't understand.

The article talks about keeping crawlers out of the site or out of specific folders, but what I don't understand is why are the crawlers establishing all those port 80 connections, which is effectively shutting down the site.  Is that their intent?

Thanks again.
Guy LidbetterCommented:
Some search engine spiders are pretty aggressive, for example the "baiduspider" entries are from a Chinese Search Engine called "Baidu".

You hear a lot of stories about them where they will have 30 IP's all hitting your site, one after another, going through each and every page like that, and every page is 30 different sessions.

That's how the scrappers roll apparently...

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
Acronis True Image 2019 just released!

Create a reliable backup. Make sure you always have dependable copies of your data so you can restore your entire system or individual files.

Lucas BishopClick TrackerCommented:
Not every crawler will respect it, but you could use your robots.txt file to disallow crawling by most bots:
# Begin block Bad-Robots from robots.txt
User-agent: asterias
Disallow:/
User-agent: BackDoorBot/1.0
Disallow:/
User-agent: Black Hole
Disallow:/
User-agent: BlowFish/1.0
Disallow:/
User-agent: BotALot
Disallow:/
User-agent: BuiltBotTough
Disallow:/
User-agent: Bullseye/1.0
Disallow:/
User-agent: BunnySlippers
Disallow:/
User-agent: Cegbfeieh
Disallow:/
User-agent: CheeseBot
Disallow:/
User-agent: CherryPicker
Disallow:/
User-agent: CherryPickerElite/1.0
Disallow:/
User-agent: CherryPickerSE/1.0
Disallow:/
User-agent: CopyRightCheck
Disallow:/
User-agent: cosmos
Disallow:/
User-agent: Crescent
Disallow:/
User-agent: Crescent Internet ToolPak HTTP OLE Control v.1.0
Disallow:/
User-agent: DittoSpyder
Disallow:/
User-agent: EmailCollector
Disallow:/
User-agent: EmailSiphon
Disallow:/
User-agent: EmailWolf
Disallow:/
User-agent: EroCrawler
Disallow:/
User-agent: ExtractorPro
Disallow:/
User-agent: Foobot
Disallow:/
User-agent: Harvest/1.5
Disallow:/
User-agent: hloader
Disallow:/
User-agent: httplib
Disallow:/
User-agent: humanlinks
Disallow:/
User-agent: ia_archiver
Disallow:/
User-agent: InfoNaviRobot
Disallow:/
User-agent: JennyBot
Disallow:/
User-agent: Kenjin Spider
Disallow:/
User-agent: Keyword Density/0.9
Disallow:/
User-agent: LexiBot
Disallow:/
User-agent: libWeb/clsHTTP
Disallow:/
User-agent: LinkextractorPro
Disallow:/
User-agent: LinkScan/8.1a Unix
Disallow:/
User-agent: LinkWalker
Disallow:/
User-agent: LNSpiderguy
Disallow:/
User-agent: lwp-trivial
Disallow:/
User-agent: lwp-trivial/1.34
Disallow:/
User-agent: Mata Hari
Disallow:/
User-agent: Microsoft URL Control - 5.01.4511
Disallow:/
User-agent: Microsoft URL Control - 6.00.8169
Disallow:/
User-agent: MIIxpc
Disallow:/
User-agent: MIIxpc/4.2
Disallow:/
User-agent: Mister PiX
Disallow:/
User-agent: moget
Disallow:/
User-agent: moget/2.1
Disallow:/
User-agent: mozilla/4
Disallow:/
User-agent: Mozilla/4.0 (compatible; BullsEye; Windows 95)
Disallow:/
User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows 95)
Disallow:/
User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows 98)
Disallow:/
User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows NT)
Disallow:/
User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows XP)
Disallow:/
User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows 2000)
Disallow:/
User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows ME)
Disallow:/
User-agent: mozilla/5
Disallow:/
User-agent: NetAnts
Disallow:/
User-agent: NICErsPRO
Disallow:/
User-agent: Offline Explorer
Disallow:/
User-agent: Openfind
Disallow:/
User-agent: Openfind data gathere
Disallow:/
User-agent: ProPowerBot/2.14
Disallow:/
User-agent: ProWebWalker
Disallow:/
User-agent: QueryN Metasearch
Disallow:/
User-agent: RepoMonkey
Disallow:/
User-agent: RepoMonkey Bait & Tackle/v1.01
Disallow:/
User-agent: RMA
Disallow:/
User-agent: SiteSnagger
Disallow:/
User-agent: SpankBot
Disallow:/
User-agent: spanner
Disallow:/
User-agent: suzuran
Disallow:/
User-agent: Szukacz/1.4
Disallow:/
User-agent: Teleport
Disallow:/
User-agent: TeleportPro
Disallow:/
User-agent: Telesoft
Disallow:/
User-agent: The Intraformant
Disallow:/
User-agent: TheNomad
Disallow:/
User-agent: TightTwatBot
Disallow:/
User-agent: Titan
Disallow:/
User-agent: toCrawl/UrlDispatcher
Disallow:/
User-agent: True_Robot
Disallow:/
User-agent: True_Robot/1.0
Disallow:/
User-agent: turingos
Disallow:/
User-agent: URLy Warning
Disallow:/
User-agent: VCI
Disallow:/
User-agent: VCI WebViewer VCI WebViewer Win32
Disallow:/
User-agent: Web Image Collector
Disallow:/
User-agent: WebAuto
Disallow:/
User-agent: WebBandit
Disallow:/
User-agent: WebBandit/3.50
Disallow:/
User-agent: WebCopier
Disallow:/
User-agent: WebEnhancer
Disallow:/
User-agent: WebmasterWorldForumBot
Disallow:/
User-agent: WebSauger
Disallow:/
User-agent: Website Quester
Disallow:/
User-agent: Webster Pro
Disallow:/
User-agent: WebStripper
Disallow:/
User-agent: WebZip
Disallow:/
User-agent: WebZip/4.0
Disallow:/
User-agent: Wget
Disallow:/
User-agent: Wget/1.5.3
Disallow:/
User-agent: Wget/1.6
Disallow:/
User-agent: WWW-Collector-E
Disallow:/
User-agent: Xenu's
Disallow:/
User-agent: Xenu's Link Sleuth 1.1c
Disallow:/
User-agent: Zeus
Disallow:/
User-agent: Zeus 32297 Webster Pro V2.9 Win32
Disallow:/
User-agent: rogerbot
Disallow:/
User-agent: mj12bot
Disallow:/
User-agent: dotbot
Disallow:/
User-agent: ahrefsbot
Disallow:/

Open in new window

gateguardAuthor Commented:
where do I put a robots.txt file?  I don't have one right now.
Lucas BishopClick TrackerCommented:
It goes in the root folder of your web site.
Guy LidbetterCommented:
hi gateguard,

The wiki page in my very first post explains everything about the robots file.

Regards

Guy
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
Apache Web Server

From novice to tech pro — start learning today.