Solved

If no "head" response, will robots find a page?

Posted on 1997-06-02
266 Views
Last Modified: 2013-12-25
The server does not support the HEAD request. When robots
send one, it is simply ignored. The question:
how common is it for robots to make a query for
the HEAD only and not for the page content?
e.g. - how big is a user's loss on such a server?
Do Infoseek, WebCrawler, and other common robots make
queries for HEAD only, or for page content as well?

   
Question by:elena515
3 Comments
 
LVL 5

Accepted Solution

by:
julio011597 earned 100 total points
ID: 1854272
There's no robot around which takes just the head part of a page.
Actually, there is no 'head request' in the http protocol.

What a crawler usually does is get the page (the whole page is sent by the web server), then parse a given amount of it; how much of the page is kept depends on the settings, and may vary from the first few lines to the whole document.
Usually, the international search engines keep just part of a document for reasons of performance and resources, while a national search engine may keep the whole page, since the total count of pages is far smaller.

So don't worry too much about headers; they are usually useful just to give a document title and some extra keywords to the indexing engine.

Rgds, julio

Author Comment

by:elena515
ID: 1854273
Dear Julio,

i wouldn't rate the answer, because the answers i've gotten from
the w3 and robots mailing lists are quite different ;)

here are some of them:

> The server software we developed does not support
> 'head' request from the robots, it simply ignores
> the 'head' query.
>
> Is it an appropriate approach within current w3 standards?

No. HTTP servers MUST support HEAD method. Those are IETF standards, BTW.
See RFC 1945 and RFC 2068.
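The requirement those RFCs spell out is that HEAD must return exactly the status line and headers a GET would, just without the body. A minimal modern sketch of a conforming server (Python's standard library, written decades after this thread, purely for illustration):

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = b"<html><head><title>demo</title></head><body>hello</body></html>"

class Handler(BaseHTTPRequestHandler):
    def _send_head(self):
        # Status line and headers are identical for GET and HEAD.
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()

    def do_GET(self):
        self._send_head()
        self.wfile.write(PAGE)      # GET: headers plus the body

    def do_HEAD(self):
        self._send_head()           # HEAD: same headers, no body

    def log_message(self, *args):
        pass                        # keep the demo quiet

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
conn.request("HEAD", "/")
resp = conn.getresponse()
body = resp.read()
# Content-Length advertises the size a GET would return, but the body is empty.
print(resp.status, resp.getheader("Content-Length"), repr(body))
server.shutdown()
```

The point of the sketch: HEAD support is the same header-writing code path as GET minus one `write` call, which is why "it can't be too much work" below.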

> How common are robots that make query for head only
> and ignore the content of the document?

Common enough. Besides, there are browsers which generate them if the user
requests it.

> In other words, how much we have our users suffer,
> and how much we are in disaccordance with the standards,
> if our server will be released with this deficiency?

Very much. Implement that, please. It can't be too much work.

----------

On Jun 2, 12:51pm, elena danielyan wrote:
> Subject: 'head' request and robots?
> I apologise if this request is inappropriate here.
>
> The server software we developed does not support
> 'head' request from the robots, it simply ignores
> the 'head' query.
>
You mean it ignores HEAD requests.  Unless you are looking at the
browser type (the User-Agent header), you do not know whether the request
came from a robot or a real person.

> Is it an appropriate approach within current w3 standards?

Servers are required to support head.

> How common are robots that make query for head only
> and ignore the content of the document?
> In other words, how much we have our users suffer,
> and how much we are in disaccordance with the standards,
> if our server will be released with this deficiency?
>
Robots, caching proxies, and caching browsers will generally make a head
request to see if the document it has is older than the one currently on
the server.  In addition, robots will retrieve any <META> containers to use
for additional search criteria.
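That freshness check amounts to comparing the Last-Modified date stored at fetch time against the one a HEAD request returns. A hypothetical sketch in modern Python (the function name and the dates are illustrative, not from this thread):

```python
from email.utils import parsedate_to_datetime

def needs_refetch(cached_last_modified, server_last_modified):
    """Decide whether a cached copy is stale, given the Last-Modified
    value stored at fetch time and the one a fresh HEAD request returned.
    Both arguments are HTTP-date strings as they appear in the header."""
    cached = parsedate_to_datetime(cached_last_modified)
    current = parsedate_to_datetime(server_last_modified)
    return current > cached

# Illustrative header values:
print(needs_refetch("Sun, 01 Jun 1997 10:00:00 GMT",
                    "Mon, 02 Jun 1997 12:51:00 GMT"))  # True -> issue a new GET
```

If the server never answers the HEAD at all, the caller cannot run this comparison and has to guess, which is exactly the failure mode discussed below.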

------

HEAD is used primarily as a tool to check whether a URL has changed
since the last time it was retrieved.


Netscape Navigator and other browsers use HEAD to check whether
(and when) to reload pages, images, etc.  For example, when you
use the Reload button of Navigator, it will issue a HEAD request
to the server for that page in order to determine whether it has
to issue a new GET request, or whether it can re-use the data
in its cache.  Navigator also does that for the images on a page
when the user requests a Reload.  [I believe other browsers do
this as well, but I haven't checked them to be sure.]

By not providing HEAD data for browsers, several things will happen:
people who view the pages on your server will experience longer
time-outs when they go to Reload a page (under some circumstances,
the same thing happens simply when they go back into their history
list to return to a page as well).  And the browser software then
has to decide on its own whether or not to reload the page, and/or
the other elements (images, etc.).  I haven't experimented to find
out what each different version of browser actually does, but
whether they load from cache or do another GET to the server, each
will be wrong under some circumstances.  If the browser doesn't
GET the page again from the server, the user could well be shown
an old version of the page; if the browser does an unnecessary GET,
that will unnecessarily increase the traffic for your server.


However, the primary use of HEAD is probably not by browsers, but by
search and index sites.  At least some, probably most of them (certainly
the regional index site that I run) use HEAD to check:  1) whether a
linked-to-page still exists; 2) whether it has changed (if so, the
search engine spider should and usually actually does a GET in order
to check the page's possibly-new title and content).

If your server does not respond to a HEAD request, there is a fairly
high probability that pages on the server will end up being deleted
from at least some of the indexing and cataloging sites
within--typically--about a month or so.  I.e., when the cataloging
site re-checks the page via HEAD to see if it's still there.
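The spider behavior described here boils down to a mapping from the HEAD result to an indexing decision. A hypothetical sketch (the function and its return labels are invented for illustration; real engines of the era differed in detail):

```python
def index_action(head_status):
    """Map the result of a spider's periodic HEAD check to an indexing
    decision.  head_status is the HTTP status code, or None if no response
    came back (e.g. a server that ignores HEAD and lets the request
    time out)."""
    if head_status is None:
        return "drop"       # looks like a dead page: remove from the catalog
    if head_status == 200:
        return "recheck"    # still there; compare dates and maybe re-GET
    if head_status in (301, 302):
        return "follow"     # moved: index the new location
    return "drop"           # 404 and friends: de-list

print(index_action(None))   # "drop" -- the fate of a server with no HEAD support
```

Note that a server which silently ignores HEAD lands in the same bucket as a dead host, which is why such pages fall out of the catalogs within a re-check cycle.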

There are a few other HTTP server software packages which do not respond
properly to HEAD requests (mostly old Macintosh HTTP server implementations).
I know people who used to run servers like that, and who found out the
hard way what the effects were when the index and catalog sites kept
dropping their pages...


Nonetheless, there are situations where you might want to have some
specialized kinds of server software that doesn't respond to HEAD
requests, or responds in special ways.  For example, if ALL the
content you're serving out is dynamic, you arguably might always
want to respond to a HEAD request for a potential page with an
expiration date that is in the past.  Alternatively, if you are
serving out pages which should never be catalogued anywhere, in
addition to using an appropriate robots.txt file in your server
root, you might also always want to respond to HEAD requests with
an error code saying the requested page doesn't exist.  That makes
it fairly likely that a page accidentally indexed
will (whenever it's re-checked) end up being removed from
most search sites.
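For the dynamic-content case, an always-stale response just means sending an Expires header with a date safely in the past. A one-line sketch with the modern standard library (epoch zero is an arbitrary choice, any past date works):

```python
from email.utils import formatdate

# formatdate(0, usegmt=True) renders the Unix epoch as an HTTP-date,
# which is guaranteed to be in the past for any cache checking it.
expired = formatdate(0, usegmt=True)
print("Expires:", expired)   # Expires: Thu, 01 Jan 1970 00:00:00 GMT
```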
-----


Regards

Elena

 
LVL 5

Expert Comment

by:julio011597
ID: 1854274
Dear Elena,

i didn't realize you were developing a web server, because you didn't mention it in your question.

Yours shows up as a 'user' question, mainly about "how should i make my pages visible to search engines, while my (ISP) web server doesn't support the head request".
The answer you got is consistent with that question.

Please, next time, spend a few more seconds formulating your question, so that neither you nor anybody else wastes their time.

Cheers, julio

P.S. thanks for the enlightenments.