Solved

How to track ALL web browser http requests of a certain url?

Posted on 2011-02-16
1,151 Views
Last Modified: 2013-11-05
Hi folks,

I'm trying to develop a quick software tool which (primary) purpose is to monitor the performance of the advertising placed on our website by an external ad server (by javascript).

This tool should run continuously and check a variety of pages on our website.

My aim is to simulate web browser behaviour so that I can track ALL HTTP requests that follow the initial request, which only fetches the HTML source of the page.
Most of you might know tools like "Fiddler" or "Tamper Data"; they work exactly the way I need. I'd like to automate this and write the results into a log file or a database.

Do any of you guys have experience in developing such a tool in .NET? I tried the embedded IE webbrowser control (SHDocVw.WebBrowser) and Skybound Gecko, but I haven't succeeded so far. Are there any tutorials about that topic on the web?
Unfortunately, googling this hasn't led me to proper help or usable documentation.

Thank you in advance,
Dani
Question by:kickeronline
7 Comments
 
LVL 23

Expert Comment

by:wdosanjos
ID: 34905778
I think the System.Net.HttpWebRequest class provides what you need.  You can craft an HTTP request with it (including cookies, etc.) and then read the corresponding response (headers, content, etc.).

Here is a sample code:
var req = System.Net.HttpWebRequest.Create("http://www.yahoo.com/");
var res = req.GetResponse();
var sr = new System.IO.StreamReader(res.GetResponseStream());
var html = sr.ReadToEnd();
Console.WriteLine(html);



More details about System.Net.HttpWebRequest:
http://msdn.microsoft.com/en-us/library/system.net.httpwebrequest.aspx

I hope this helps.

 

Author Comment

by:kickeronline
ID: 34905849
Thanks but... no, I'm sorry. That's too simple. I certainly know how to get the source code for a url.

Regarding your example, the source code of http://www.yahoo.com/ contains lots of references to images, text files (.js, .css), videos (.swf) and banners (.jpg, .swf). It's all that kind of stuff your browser will load right AFTER it got the source code for the address.

What I want to do is to gather information about all those resources, however they get loaded: Some resources are embedded directly into the source code (pictures, stylesheets, scripts), but others get included by e.g. javascript. That's what your browser does when you open a website.

I could try to parse the source code, but that would mean, on the one hand, a tremendous effort, and on the other hand, reinventing the wheel. I'm sure there's a much better solution out there...
 

Expert Comment

by:wdosanjos
ID: 34907183
I think I got it.  

The following sample code retrieves the URL of the resources used on the page (LINK, SCRIPT, and IMG tags only).  It uses the Html Agility Pack to parse the HTML.  You should be able to easily add other tags if needed.  

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using HtmlAgilityPack;

namespace ReadHtml
{
    class Program
    {
        static void Main(string[] args)
        {
            var tags = new Dictionary<string, string>()
            {
                {"IMG",     "SRC"},
                {"LINK",    "HREF"},
                {"SCRIPT",  "SRC"},
            };

            var resources = new List<string>();

            var req = System.Net.HttpWebRequest.Create("http://www.yahoo.com/");
            var res = req.GetResponse();
            HtmlDocument doc = new HtmlDocument();
            doc.Load(res.GetResponseStream());

            doc.Save(@"c:\temp\yahoo.html");

            foreach (var node in doc.DocumentNode.Descendants())
            {
                string attrName;

                if (tags.TryGetValue(node.Name.ToUpper(), out attrName))
                {
                    var attr = node.Attributes[attrName];

                    if (attr != null)
                    {
                        resources.Add(attr.Value);
                    }
                }
            }

            foreach (var url in resources)
            {
                Console.WriteLine(url);
                // You can add your code to download and save the resource here.
            }
        }
    }
}



 

Author Comment

by:kickeronline
ID: 34908936
Okay, that brings me to all resources which are embedded directly into the source code of that page.

But what about the execution of javascript?
For instance, there might be a script which is executed right after the source code has finished loading and which inserts additional <img> tags by using

 
document.write('<img src=\'/somepath/someimage.jpg\'>');



That's a very common way to place advertising on a web page.
I wouldn't dare to even think about trying to solve all these cases programmatically, that's why I tried to use existing browser technologies.

Maybe there's a way to derive a web proxy class in order to attach it to some browser control...?
 

Accepted Solution

by:
wdosanjos earned 250 total points
ID: 34909466
Humm... in that case you do need to execute the page to get them and somehow capture the requests.

On IE or FF, if you just select 'File / Save' you don't get resources referenced by the JavaScript code.

If you can write a proxy and update the Web Browser Control configuration to use it, you might be able to achieve what you need.  You'll also need some way to identify which page is being loaded, because the proxy doesn't normally get that information (you might be able to infer it from the Referer header).

I found this Mini C# Proxy Server, maybe you could leverage it.
http://miniproxyserver.codeplex.com/
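As a rough illustration of the idea (this is my own sketch, not code from that project): a forward proxy just has to read the request line and headers from the browser, log them, and relay the request upstream. The sketch below handles plain HTTP GETs only, always answers with a 200 status, and drops the upstream headers, so it's only enough to demonstrate the logging; a real proxy like the one linked above must forward headers, status codes, POST bodies, and HTTPS CONNECT tunnels.

```csharp
// Minimal logging-proxy sketch (plain HTTP GET only, no HTTPS).
// Assumes the browser control is configured to use 127.0.0.1:8090
// as its HTTP proxy, as in the setup discussed in this thread.
using System;
using System.IO;
using System.Net;
using System.Net.Sockets;
using System.Text;
using System.Threading;

class LoggingProxy
{
    static void Main()
    {
        var listener = new TcpListener(IPAddress.Loopback, 8090);
        listener.Start();
        Console.WriteLine("Proxy listening on 127.0.0.1:8090");

        while (true)
        {
            var client = listener.AcceptTcpClient();
            new Thread(() => Handle(client)).Start();
        }
    }

    static void Handle(TcpClient client)
    {
        try
        {
            using (client)
            using (var stream = client.GetStream())
            {
                var reader = new StreamReader(stream, Encoding.ASCII);

                // Proxied request line, e.g. "GET http://example.com/ad.js HTTP/1.1"
                string requestLine = reader.ReadLine();
                if (requestLine == null) return;

                // Scan the headers for the Referer, which hints at the
                // page that triggered this resource request.
                string referer = null, line;
                while (!string.IsNullOrEmpty(line = reader.ReadLine()))
                {
                    if (line.StartsWith("Referer:", StringComparison.OrdinalIgnoreCase))
                        referer = line.Substring(8).Trim();
                }

                // This is the log entry the asker wants: every resource
                // the browser fetches, plus the page that caused it.
                Console.WriteLine("{0:o}  {1}  (referer: {2})",
                    DateTime.UtcNow, requestLine, referer ?? "-");

                // Fetch the resource upstream and relay the body back.
                // Simplification: real proxies forward the original
                // status line and headers instead of a fixed 200.
                string url = requestLine.Split(' ')[1];
                var upstream = WebRequest.Create(url);
                using (var response = upstream.GetResponse())
                using (var body = response.GetResponseStream())
                {
                    var head = Encoding.ASCII.GetBytes(
                        "HTTP/1.1 200 OK\r\nConnection: close\r\n\r\n");
                    stream.Write(head, 0, head.Length);
                    body.CopyTo(stream);
                }
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine("Proxy error: " + ex.Message);
        }
    }
}
```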

Sorry, but I don't have any other suggestions.
 

Assisted Solution

by:kickeronline
kickeronline earned 0 total points
ID: 34916027
This tool is a good approach to solve that problem.

I started the mini proxy server at localhost:8090 and managed to set it as the proxy for the Gecko browser, which I embedded into my Windows Form (VB.NET):

Public Class Form1
    Public Sub New()
        InitializeComponent()
        Skybound.Gecko.Xpcom.Initialize(<xmlRunnerPath>)
        setSetting("network.proxy.http", "127.0.0.1")
        setSetting("network.proxy.http_port", 8090)
        setSetting("network.proxy.share_proxy_settings", True)
        setSetting("network.proxy.type", 1)
        setSetting("network.proxy.no_proxies_on", "")
    End Sub
    Private Sub setSetting(ByVal key As String, ByVal value As Object)
        Skybound.Gecko.GeckoPreferences.Default.Item(key) = value
    End Sub
End Class

Open in new window


What I now have to do is log the requests into a database, which should be no problem as I have the source code for that project. Apart from that, I'm receiving several 404/400 errors, so I assume the software is still a bit buggy. Maybe I'll be able to fix that.

I noticed that the SQLite library was already added to that project. Unfortunately, there's no documentation about it.
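For anyone finding this later: with the System.Data.SQLite library that ships with the project, the logging could look roughly like this. The table name and columns below are my own invention, not anything from the Mini C# Proxy Server itself.

```csharp
// Hedged sketch: persisting one row per proxied request into SQLite
// via System.Data.SQLite. The schema here is purely illustrative.
using System;
using System.Data.SQLite;

class RequestLog
{
    private readonly string connectionString = "Data Source=requests.db";

    // Create the log table once at startup if it doesn't exist yet.
    public void EnsureSchema()
    {
        using (var conn = new SQLiteConnection(connectionString))
        {
            conn.Open();
            using (var cmd = new SQLiteCommand(
                "CREATE TABLE IF NOT EXISTS requests (" +
                "  loggedAt TEXT, method TEXT, url TEXT, referer TEXT)", conn))
            {
                cmd.ExecuteNonQuery();
            }
        }
    }

    // Call this from the proxy's request handler for every request seen.
    public void Add(string method, string url, string referer)
    {
        using (var conn = new SQLiteConnection(connectionString))
        {
            conn.Open();
            using (var cmd = new SQLiteCommand(
                "INSERT INTO requests VALUES (@t, @m, @u, @r)", conn))
            {
                // Parameterized values avoid quoting/injection issues
                // with arbitrary URLs and referers.
                cmd.Parameters.AddWithValue("@t", DateTime.UtcNow.ToString("o"));
                cmd.Parameters.AddWithValue("@m", method);
                cmd.Parameters.AddWithValue("@u", url);
                cmd.Parameters.AddWithValue("@r", referer ?? "");
                cmd.ExecuteNonQuery();
            }
        }
    }
}
```

Opening a fresh connection per insert is fine at this request rate; for heavier traffic you'd keep one connection and batch inserts inside a transaction.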

If you have any experience with logging to a database from the mini proxy server, or anybody knows a well-working alternative solution, I'd appreciate any feedback.

Thanks a lot!
 

Author Closing Comment

by:kickeronline
ID: 34949670
The solution is to automate/script a browser, which can be accomplished by adding a browser control to a Windows Form and navigating to web pages repeatedly.

To trace all resources requested by the browser, one has to use a proxy server which logs the entire traffic into a log file or a database.

wdosanjos gives a good example of an open source proxy server. However, it has to be modified to work accurately.



