Need a way to check a web site's cacheability status from inside a C# script.


I have a C# script, and I need to add some code that will take a URL, test all of the embedded HTML objects at that URL, and give me a readable (XML) response
describing which objects are cacheable and which are not, including the file size and content type of each file.
Using C# to run an external application is also fine, and even a CGI application (though I don't like CGI).

There was once a tool on the internet called Cacheability; you could also download a command-line version of it.
The tool was taken down and replaced with Redbot. The problem with Redbot is that it doesn't respond with XML,
but rather with HTML, which would need heavy parsing to use.

If anyone has an idea for an application, or maybe a simple way to check this, I'd be glad to hear it.

Tony McCreath (Technical SEO Consultant) commented:
You could write a crawler in C# to do it:

  1. Load the HTML page (WebClient or HttpWebRequest)
  2. Parse the HTML to find the referenced URLs (Regex)
  3. Request each discovered URL and capture its cache settings from the response headers (HttpWebRequest); the relevant headers are Cache-Control and Expires
  4. Create XML from your captured data and store it (XmlTextWriter or XmlDocument)
References to help on how to do each part:
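The steps above can be sketched end to end as follows. This is a minimal, assumption-laden sketch: it only looks for `img` tags, uses HEAD requests (some servers mishandle those), resolves relative URLs naively, and omits the error handling and throttling a real crawler would need. The output file name `cacheability.xml` is just an example.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;
using System.Xml;

class CacheabilityCrawler
{
    static void Main(string[] args)
    {
        string pageUrl = args.Length > 0 ? args[0] : "http://example.com/";

        // Step 1: load the HTML page.
        string html;
        using (var client = new WebClient())
        {
            html = client.DownloadString(pageUrl);
        }

        // Step 2: parse the HTML for referenced URLs (img tags only, as a start).
        var matches = Regex.Matches(html,
            @"<img[^>]+src\s*=\s*[""']?([^""'\s>]+)",
            RegexOptions.IgnoreCase);

        // Step 4: write the captured data out as XML as we go.
        var settings = new XmlWriterSettings { Indent = true };
        using (var writer = XmlWriter.Create("cacheability.xml", settings))
        {
            writer.WriteStartElement("objects");
            foreach (Match m in matches)
            {
                // Resolve relative URLs against the page URL.
                var objectUri = new Uri(new Uri(pageUrl), m.Groups[1].Value);

                // Step 3: request the object and read its cache-related headers.
                var request = (HttpWebRequest)WebRequest.Create(objectUri);
                request.Method = "HEAD";
                using (var response = (HttpWebResponse)request.GetResponse())
                {
                    writer.WriteStartElement("object");
                    writer.WriteAttributeString("url", objectUri.ToString());
                    writer.WriteElementString("contentType", response.ContentType);
                    writer.WriteElementString("contentLength",
                        response.ContentLength.ToString());
                    writer.WriteElementString("cacheControl",
                        response.Headers["Cache-Control"] ?? "");
                    writer.WriteElementString("expires",
                        response.Headers["Expires"] ?? "");
                    writer.WriteEndElement();
                }
            }
            writer.WriteEndElement();
        }
    }
}
```

Deciding "cacheable vs not" from those headers (no-store, no-cache, max-age=0, an Expires date in the past) is the logic you would then layer on top of the raw XML.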

eitama (Author) commented:
Thank you Leo,
I'll cross my fingers (:
eitama (Author) commented:
Inside the HTML,

how do I know which URLs are objects that should be loaded, and which are just links that are only loaded
when a user clicks them in a browser?

If I just go around looking for "http://...", I will probably fail.

Tony McCreath (Technical SEO Consultant) commented:
The following answer (my 4th reference) relates to finding images within a page.

You may also be interested in other embedded objects, such as included files (JavaScript, CSS), Flash files, etc. You will need to determine the pattern of HTML that is used and create regular expressions to extract the URLs from it.

It is very hard to create a system that's 100% accurate. Google hasn't done it yet! HTML can be written in many ways and is often invalid. One of the hardest areas is content loaded from within JavaScript.

eitama (Author) commented:

I'd like to mimic a browser:
whatever objects a browser would decide to fetch, I'd like to fetch as well; excluding AJAX and JavaScript-generated GETs is fine.
I am very familiar with regular expressions, so that's not a problem.

The only thing I'm now missing is how to distinguish, inside the HTML, between items a browser would fetch (images/sounds/icons/CSS/JS/Flash) and links that would not be accessed
until the user clicks them.

The link you posted helps me understand the img tag format, but what about all the others? Flash/CSS/JS... etc.

Tony McCreath (Technical SEO Consultant) commented:
There is a way to embed an IE control within .NET and control it. You may be able to use that to determine which files get loaded. That is outside my skill set, though.
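For reference, a rough sketch of that approach using the WinForms WebBrowser control. This lets the IE engine do the HTML parsing, so malformed markup is handled for you; the tag and attribute names are the standard DOM ones, the target URL is a placeholder, and a real version would need to handle frames, timeouts, and relative URLs.

```csharp
using System;
using System.Collections.Generic;
using System.Windows.Forms;

class BrowserScraper
{
    [STAThread]
    static void Main()
    {
        var urls = new List<string>();
        var browser = new WebBrowser { ScriptErrorsSuppressed = true };

        browser.DocumentCompleted += (sender, e) =>
        {
            HtmlDocument doc = browser.Document;

            // Images, scripts and stylesheets as the parsed DOM sees them.
            foreach (HtmlElement img in doc.GetElementsByTagName("img"))
                urls.Add(img.GetAttribute("src"));
            foreach (HtmlElement script in doc.GetElementsByTagName("script"))
                urls.Add(script.GetAttribute("src"));
            foreach (HtmlElement link in doc.GetElementsByTagName("link"))
                urls.Add(link.GetAttribute("href"));

            foreach (string url in urls)
                if (!string.IsNullOrEmpty(url))
                    Console.WriteLine(url);

            Application.Exit();
        };

        browser.Navigate("http://example.com/");
        Application.Run(); // pump messages until DocumentCompleted fires
    }
}
```

Note that even this won't catch requests made dynamically from JavaScript; for those you would have to observe actual network traffic, e.g. through a local proxy.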

Telling you how to scrape all of the different element types is getting a little too close to my business. You can probably find examples around the net, or research the HTML syntax for the tags you're interested in. As hints, you may be looking at the following tags:


and their patterns will be similar to the img example, with variations in the attribute used to store the URL.

You will find that you have to tweak the expressions as you use them on more websites before they work consistently.
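Putting the hints together: assuming the tags in question are the usual resource-bearing ones (img, script, link, object, embed — the original list did not survive, so this set is an assumption), the attribute variation can be handled with a small tag-to-attribute map and one regex template. A sketch, subject to exactly the per-site tweaking mentioned above:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class EmbeddedUrlExtractor
{
    // Each resource-bearing tag stores its URL in a different attribute.
    static readonly Dictionary<string, string> TagAttributes =
        new Dictionary<string, string>
        {
            { "img",    "src"  },
            { "script", "src"  },
            { "link",   "href" },  // stylesheets, favicons
            { "embed",  "src"  },
            { "object", "data" },  // flash and other plugins
        };

    public static List<string> Extract(string html)
    {
        var urls = new List<string>();
        foreach (var pair in TagAttributes)
        {
            // Same pattern as the img example, parameterised by tag and attribute.
            string pattern = string.Format(
                @"<{0}\b[^>]*\b{1}\s*=\s*[""']?([^""'\s>]+)",
                pair.Key, pair.Value);
            foreach (Match m in Regex.Matches(html, pattern, RegexOptions.IgnoreCase))
                urls.Add(m.Groups[1].Value);
        }
        return urls;
    }
}
```

For example, `EmbeddedUrlExtractor.Extract("<link rel=\"stylesheet\" href=\"site.css\">")` returns a list containing "site.css".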
eitama (Author) commented:

Thank you for your help.
Question has a verified solution.