
Need a way to check a web site's cacheability status from inside a C# script.

Posted on 2009-12-26
Last Modified: 2012-06-27
Hi,

I have a C# script, and I need to add code to it that takes a URL,
tests all the embedded HTML objects on that page, and gives me a readable (XML) response
showing which objects are cacheable and which are not, including the file size and content type of each.
Having the C# script run an external application is also fine, and even a CGI application would do (though I don't like CGI).

There was once a tool on the internet called cacheability; you could also download a command-line version of it.
The tool was taken down and replaced with Redbot. The problem with Redbot is that it doesn't respond in XML,
but in HTML, which would need heavy parsing to use.

If anyone has an idea for an application, or a simple way to check this, I'd be glad to hear it.

Thanks,
Eitam.
Question by:eitama
 
LVL 3

Author Comment

by:eitama
ID: 26134689
Thank you Leo,
I'll cross my fingers (:
 
LVL 23

Accepted Solution

by:
Tony McCreath earned 2000 total points
ID: 26135286
You could write a crawler in C# to do it:

  1. Load the HTML page (WebClient or HttpWebRequest).
  2. Parse the HTML to find the referenced URLs (Regex).
  3. Request each discovered URL (HttpWebRequest) and capture its cache settings from the response headers. The relevant headers are Cache-Control and Expires.
  4. Build XML from the captured data and store it (XmlTextWriter or XmlDocument).

References to help with each part:

http://msdn.microsoft.com/en-us/library/system.net.webclient.aspx
http://msdn.microsoft.com/en-us/library/system.net.httpwebrequest.aspx
http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.aspx
http://www.experts-exchange.com/Programming/Languages/Regular_Expressions/Q_23964786.html
http://www.mnot.net/cache_docs/
http://msdn.microsoft.com/en-us/library/system.xml.xmltextwriter.aspx
http://paulsiu.wordpress.com/2007/04/04/creating-a-xml-document-from-scratch-without-using-a-file-in-c/
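
As a rough illustration of steps 3 and 4, here is a minimal sketch. The class and method names (CacheCheck, IsCacheable, Describe, Inspect) and the XML layout are made up for this example. The cacheability rule is a deliberate simplification and not full HTTP caching semantics: it only looks at no-store, no-cache, max-age and Expires. It also builds the XML with XElement (LINQ to XML) instead of XmlTextWriter or XmlDocument, purely to keep the example short.

```csharp
using System;
using System.Globalization;
using System.Net;
using System.Xml.Linq;

public static class CacheCheck
{
    // Simplified rule (an assumption for this sketch):
    //   no-store or no-cache            => not cacheable
    //   max-age=N                       => cacheable when N > 0
    //   otherwise an Expires date later than "now" => cacheable
    // Heuristic freshness (e.g. from Last-Modified) is ignored.
    public static bool IsCacheable(string cacheControl, string expires, DateTime nowUtc)
    {
        string cc = (cacheControl ?? "").ToLowerInvariant();
        if (cc.Contains("no-store") || cc.Contains("no-cache"))
            return false;

        foreach (string part in cc.Split(','))
        {
            string directive = part.Trim();
            if (directive.StartsWith("max-age=", StringComparison.Ordinal))
            {
                long seconds;
                if (long.TryParse(directive.Substring("max-age=".Length), out seconds))
                    return seconds > 0;
            }
        }

        DateTime expiry;
        return DateTime.TryParse(expires, CultureInfo.InvariantCulture,
                   DateTimeStyles.AssumeUniversal | DateTimeStyles.AdjustToUniversal,
                   out expiry)
               && expiry > nowUtc;
    }

    // Builds one <object> element holding the captured data for a URL.
    public static XElement Describe(string url, string contentType, long size,
                                    string cacheControl, string expires, DateTime nowUtc)
    {
        return new XElement("object",
            new XAttribute("url", url),
            new XAttribute("contentType", contentType ?? ""),
            new XAttribute("size", size),
            new XAttribute("cacheable", IsCacheable(cacheControl, expires, nowUtc)));
    }

    // Sends a HEAD request and describes the response headers.
    // Note: GetResponse throws WebException on 4xx/5xx responses, and
    // ContentLength is -1 when the server sends no Content-Length header.
    public static XElement Inspect(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "HEAD";
        using (var response = (HttpWebResponse)request.GetResponse())
        {
            return Describe(url, response.ContentType, response.ContentLength,
                            response.Headers["Cache-Control"],
                            response.Headers["Expires"], DateTime.UtcNow);
        }
    }
}
```

Calling something like Console.WriteLine(CacheCheck.Inspect("http://example.com/logo.png")) would print one element per object; a real crawler would collect these under a root element. Some servers do not answer HEAD requests properly, in which case a GET may be needed instead.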

 
LVL 3

Author Comment

by:eitama
ID: 26136166
Inside the HTML,

How do I know which URLs are objects that the browser loads with the page, and which are just links that are only loaded
when a user clicks them?

If I just go around looking for "http://..." I will probably fail.

 
LVL 23

Expert Comment

by:Tony McCreath
ID: 26136534
The following answer (my 4th reference) relates to finding images within a page.

http://www.experts-exchange.com/Programming/Languages/Regular_Expressions/Q_23964786.html

You may also be interested in other embedded objects such as included files (JavaScript, CSS), Flash files, etc. You will need to work out the HTML pattern used for each and create regular expressions to extract the URLs from them.

It is very hard to create a system that's 100% accurate. Google hasn't done it yet! HTML can be written in many ways and is often invalid. One of the hardest areas is content that is loaded from within JavaScript.

 
LVL 3

Author Comment

by:eitama
ID: 26136592
Well,

I'd like to mimic a browser:
all the objects that a browser would decide to fetch, I'd like to fetch as well. Excluding AJAX and JavaScript-generated GETs is fine.
I am very familiar with regular expressions, so that's not a problem.

The only thing I'm missing now is how to distinguish, inside the HTML, between items a browser would fetch (images, sounds, icons, CSS, JS, Flash) and links that would not be accessed
until the user clicks them.

The link you posted helps me understand the img tag format, but what about all the others (Flash, CSS, JS, etc.)?

Thanks,
Eitam.
 
LVL 23

Expert Comment

by:Tony McCreath
ID: 26136723
There is a way to embed an IE control within .NET and control it. You may be able to use that to determine which files get loaded. That is outside my skill base though.

Telling you how to scrape all the different element types is getting a little too close to my business. You can probably find examples around the net, or research the HTML syntax for the tags you're interested in. As hints, you may be looking at the following tags:

img
link
script
object
frame
iframe

Their patterns will be similar to the img example, with variations in the attribute used to store the URL.

You will find you have to tweak the expressions as you use them on more websites so that they work consistently.
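
To make the hint concrete, here is a minimal sketch of one such expression. The attribute choices are assumptions based on common HTML usage rather than anything stated above: src for img, script, frame and iframe; href for link; data for object. The class name EmbeddedUrls and method name Find are made up for the example. Plain a href links are deliberately not matched, which is what separates fetched objects from click-through links. It only handles quoted attribute values and will need the kind of per-site tweaking already mentioned.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class EmbeddedUrls
{
    // One alternative per tag family; <a href="..."> is intentionally absent.
    // \s before the attribute name avoids matching things like data-src.
    private static readonly Regex Pattern = new Regex(
        @"<(?:(?:img|script|frame|iframe)\b[^>]*?\ssrc" +
        @"|link\b[^>]*?\shref" +
        @"|object\b[^>]*?\sdata)" +
        @"\s*=\s*[""'](?<url>[^""'>]+)[""']",
        RegexOptions.IgnoreCase);

    // Returns the distinct embedded URLs, in document order,
    // resolved against the address of the page they came from.
    public static List<string> Find(string html, Uri pageUri)
    {
        var seen = new HashSet<string>();
        var result = new List<string>();
        foreach (Match m in Pattern.Matches(html))
        {
            Uri absolute;
            if (Uri.TryCreate(pageUri, m.Groups["url"].Value, out absolute)
                && seen.Add(absolute.AbsoluteUri))
            {
                result.Add(absolute.AbsoluteUri);
            }
        }
        return result;
    }
}
```

Note that link also covers things a browser may not download, such as rel="canonical" or rel="alternate", so filtering on the rel attribute is probably worth adding. URLs referenced from inside CSS files or built in JavaScript are not found by this approach.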
 
LVL 3

Author Comment

by:eitama
ID: 26137037
Ok,

Thank you for your help.