Would this be considered cloaking?

We run an inventory listing service with 2.1 million unique items. The script we originally had, which had worked for a long time, queried the database and built a table with "pages" of the data, 10,000 records each.
Something changed in the last couple of months: the bots began hitting that script a lot more, and the database requests began to bring our server to its knees.

We finally had to disable the script and find another way. We now create daily static HTML pages of the data, just like the old script created, but these are stored in one directory that I point the bots to via robots.txt.
The static pages list one column from each product in our inventory listing service, approx. 2.1 million items.
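Roughly, the nightly job works like the sketch below; the database, table, and column names are placeholders rather than our real schema.

# Rough sketch of the nightly static-page generator (hypothetical schema and paths).
# Assumes a table "inventory" with a "part_number" column; swap in the real DB driver.
import os
import sqlite3

PAGE_SIZE = 10_000   # records per static page, same as the old dynamic script
OUT_DIR = "bots"     # the directory the bots are pointed to

def generate_pages(db_path="inventory.db"):
    os.makedirs(OUT_DIR, exist_ok=True)
    conn = sqlite3.connect(db_path)
    cur = conn.execute("SELECT part_number FROM inventory ORDER BY part_number")
    page = 1
    while True:
        rows = cur.fetchmany(PAGE_SIZE)
        if not rows:
            break
        with open(os.path.join(OUT_DIR, f"listings_{page}.html"), "w", encoding="utf-8") as f:
            f.write("<html><body><table>\n")
            for (part,) in rows:
                f.write(f"<tr><td>{part}</td></tr>\n")
            f.write("</table></body></html>\n")
        page += 1
    conn.close()

if __name__ == "__main__":
    generate_pages()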

I do not want the bots to access anything else on our site because of the database interaction.

Human users can't see the "pages" we created in the bots directory, but the same data is viewable by human users in other areas of the site.

The script we formerly used did exactly the same thing but was dynamic in nature. However, that dynamic behavior made it unusable because of the interaction with the DB.
A human user cannot see the data in that manner.

Now could that be deemed cloaking?

If you want to check it out...

The site is http://www.listinventory.com

The first static page would be:
http://www.listinventory.com/bots/listings_1.html
Asked by Eddie Shipman (All-around developer)
Jeffrey Dake (Senior Director of Technology) commented:
Usually you would have just one sitemap index file called sitemap.xml. Within that file you would have links to the other sitemap files you want included. This means there is only one sitemap you have to submit to Google, and it makes the site easier to crawl.

So your robots.txt would look like this:

Sitemap: http://www.listinventory.com/sitemap.xml
User-agent: *
Crawl-delay: 10

Then within your sitemap.xml you would have something like

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.listinventory.com/sitemap_1.xml</loc>
    <lastmod>2011-04-01</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.listinventory.com/sitemap_2.xml</loc>
    <lastmod>2011-04-01</lastmod>
  </sitemap>
</sitemapindex>


You do not want to put Disallow: / in your robots.txt. That says to disallow every directory on your site. If you want to disallow something, you should put the specific file name or directory in the Disallow line.


As for Crawl-delay, I have also heard that Google ignores it, but it is still good to have since other bots will still read it.
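If it helps, here is a rough sketch of generating that sitemap index file from a list of child sitemaps (the URLs and dates are just examples; adapt to your setup):

# Rough sketch: write a sitemap index that references child sitemap files.
from datetime import date

def write_sitemap_index(child_urls, out_path="sitemap.xml"):
    today = date.today().isoformat()
    with open(out_path, "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for url in child_urls:
            f.write("  <sitemap>\n")
            f.write(f"    <loc>{url}</loc>\n")
            f.write(f"    <lastmod>{today}</lastmod>\n")
            f.write("  </sitemap>\n")
        f.write("</sitemapindex>\n")

write_sitemap_index([
    "http://www.listinventory.com/sitemap_1.xml",
    "http://www.listinventory.com/sitemap_2.xml",
])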
 
freshcontent commented:
The way I've heard Matt Cutts at Google describe cloaking, or what Google would consider actionable, is showing a bot something different from what a human would see. Now, if the text that the bot and the human see is identical, then it would be unlikely to be actionable, but I think that by steering the bots to a separate directory via your robots.txt, you are definitely taking a risk.

Why not have the static page that you are talking about get regenerated once a day, or even more frequently, like once an hour, and then have that static page be what both the users and the bots see?
 
Eddie Shipman (Author) commented:
No, because humans are able to SEARCH the listings and bots don't need to do that if they have the static files.
 
Eddie Shipman (Author) commented:
The bots cause problems when crawling the site through the search facility.
 
freshcontent commented:
Yes, I agree. If you have a search facility, the search page(s) should be kept out of the index, either by disallowing them in robots.txt or by giving those pages a <meta name="robots" content="noindex, nofollow"> tag.

If you are looking to create a static copy of all of your website pages, I would suggest creating a page called sitemap.html and having it be a static listing of all of your pages.  That is an acceptable and normal way for bots to crawl all of your pages.

Also, you can submit an XML sitemap using the process described at http://sitemaps.org, and both Google and Bing will use it to crawl your site.
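For example, besides the Sitemap line in robots.txt, sitemaps.org describes an HTTP "ping" for notifying the engines of an updated sitemap; a small sketch follows (these endpoints were documented by Google and Bing, but may have been retired since, so verify against current docs before relying on them):

# Sketch: notify search engines of an updated sitemap via an HTTP "ping".
from urllib.parse import quote
from urllib.request import urlopen

SITEMAP_URL = "http://www.listinventory.com/sitemap.xml"

for ping_base in (
    "http://www.google.com/ping?sitemap=",
    "http://www.bing.com/ping?sitemap=",
):
    # The sitemap URL is passed as a single encoded query parameter.
    with urlopen(ping_base + quote(SITEMAP_URL, safe="")) as resp:
        print(ping_base, resp.status)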
 
Jeffrey Dake (Senior Director of Technology) commented:
Looking at your site, I would address a few more things beyond whether you are cloaking or not. First of all, the pages you are sending to Google are just lists of product numbers, not the actual pages where you want to drive traffic. If I were you, I would have a unique URL and landing page for each item that is submitted. You want unique, descriptive information about each one, and you want to drive your users to a page that is usable.

I did a Google search for your site, and pretty much every result takes me to a page like http://www.listinventory.com/index.php?mode=search&term=E24400-9. There is nothing that drives me to action. In fact, every page looks much the same, with very little information that prompts the user to act. It also looks like your robots.txt says Disallow: /search, but your search results are displayed on index.php, as in the link above.

I would concentrate your work on having landing pages for each product that both your users and Google come in to, and on not having your list pages indexed. It is good practice not to have list pages indexed, as they can change and should just be used for navigation to discover the useful content.

Hope this helps.
 
Eddie Shipman (Author) commented:
@freshcontent - I have a sitemap.xml file already
@jman56 - We DO NOT want to drive people to a particular page because we do not sell the items listed. We are only a listing service: users search for parts and then, if they have an account, can find out who is listing them. We only want our site to come up in the SERPs when people search for a particular part number. On another note, the term you searched for results in an error: "We could not find any results for your search... Please try again."

Bots indexing 2.6 million records through the search facility put a big load on the DB. We are trying to avoid that, because it is what caused the last round of problems.

Any more ideas?
 
Jeffrey Dake (Senior Director of Technology) commented:
I wasn't trying to recommend that you not be a listing service. I was trying to recommend driving the traffic to a page that shows the information the visitor is looking for. For example, I went to http://www.listinventory.com/index.php?mode=search&term=-1DB and had to scroll the window down to see the results. That will usually drive traffic away when visitors come in from a search engine and do not immediately see what they want.

Also, I was recommending separate pages because you have other information about the listings. Each listing has a description, condition, and alt pin numbers that don't seem to be in what you are sending to Google at http://www.listinventory.com/bots/listings_1.html.

As for the search that returned "We could not find any results for your search... Please try again." - that was not an internal search I did on your site. I got that result by searching for your site in Google. I don't think that search result page is what you are hoping users come in on.

As for the static pages, there should be nothing wrong with that. As long as the bots see the same information as your users, you should be good.

To wrap it up, I don't think there is anything wrong with your static page strategy; I just think you are missing out on some SEO opportunities on your site.
 
Eddie Shipman (Author) commented:
OK, so looking at sitemaps.org, I figured it best to create sitemap text files covering each of the 2.6 million URLs, at 50,000 per file.

However, I'm unclear about the robots.txt entries. Should they be like this:

Sitemap: http://www.listinventory.com/sitemap_1.txt
Sitemap: http://www.listinventory.com/sitemap_2.txt
..
..
Sitemap: http://www.listinventory.com/sitemap_40.txt

Then can I also do this in robots.txt?
User-agent: *
Crawl-delay: 10
Disallow: /

BTW, Google says it ignores the crawl-delay. Is that correct?
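For reference, the chunking itself is simple enough; a rough sketch of what I'm planning (the URL source and file naming are placeholders):

# Rough sketch: split the URL list into plain-text sitemap files of 50,000 URLs each.
URLS_PER_FILE = 50_000

def write_text_sitemaps(urls, prefix="sitemap_"):
    chunk, index = [], 1
    for url in urls:
        chunk.append(url)
        if len(chunk) == URLS_PER_FILE:
            _flush(chunk, f"{prefix}{index}.txt")
            chunk, index = [], index + 1
    if chunk:
        _flush(chunk, f"{prefix}{index}.txt")

def _flush(urls, filename):
    # One URL per line, as the sitemaps.org text format requires.
    with open(filename, "w", encoding="utf-8") as f:
        f.write("\n".join(urls) + "\n")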
 
Eddie Shipman (Author) commented:
Ok, I am generating my sitemap listings and robots.txt file like this:
User-agent: *
Sitemap: http://www.listinventory.com/sitemap_1.txt
Sitemap: http://www.listinventory.com/sitemap_2.txt
Sitemap: http://www.listinventory.com/sitemap_40.txt
Sitemap: http://www.listinventory.com/sitemap_41.txt
Crawl-delay: 10
Disallow: /search
Allow: /


The sitemap text files (sitemap_1.txt, etc.) are about 3MB each and are in the root dir.
I'd rather not generate XML sitemaps because of the file size.
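As a sanity check on the rules, Python's built-in robots.txt parser can be pointed at the same directives. Its matching is not identical to Google's, but it catches obvious mistakes; the test URLs are examples, and the Sitemap lines are omitted since only the allow/disallow rules are being checked here.

# Sanity-check the generated robots.txt rules with Python's standard-library parser.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Crawl-delay: 10
Disallow: /search
Allow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "http://www.listinventory.com/bots/listings_1.html"))  # expect True
print(rp.can_fetch("*", "http://www.listinventory.com/search"))                # expect False
print(rp.crawl_delay("*"))                                                     # expect 10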
 
Eddie Shipman (Author) commented:
Could I have it like this:
<sitemap>
  <loc>http://www.listinventory.com/sitemap_1.txt</loc>
  <lastmod>2011-04-01</lastmod>
 </sitemap>
<sitemap>
  <loc>http://www.listinventory.com/sitemap_2.txt</loc>
  <lastmod>2011-04-01</lastmod>
 </sitemap>

 
Jeffrey Dake (Senior Director of Technology) commented:
Your second format is the XML format. If you have pages that are updated more often than others, the XML format would be good because you can let Google know which pages were updated when. Otherwise your text option above is fine.
 
Eddie Shipman (Author) commented:
XML is just too verbose compared to the text format. So could I have the sitemap index file that links to the sitemap files be XML, but have the sitemap files containing the URLs be text?
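In other words, the setup I have in mind is something like this sketch, with plain-text children and one small XML index (whether the index may point at .txt files is exactly what I'm asking; the 41-file count just mirrors the robots.txt above):

# Sketch of the combination in question: plain-text child sitemaps referenced from one XML index.
from datetime import date

children = [f"http://www.listinventory.com/sitemap_{i}.txt" for i in range(1, 42)]  # sitemap_1.txt .. sitemap_41.txt

with open("sitemap.xml", "w", encoding="utf-8") as f:
    f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    f.write('<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
    today = date.today().isoformat()
    for url in children:
        f.write(f"  <sitemap>\n    <loc>{url}</loc>\n    <lastmod>{today}</lastmod>\n  </sitemap>\n")
    f.write("</sitemapindex>\n")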