  • Status: Solved
  • Priority: Medium
  • Security: Public

How to track ALL web browser http requests of a certain url?

Hi folks,

I'm trying to develop a quick software tool whose primary purpose is to monitor the performance of the advertising placed on our website by an external ad server (via JavaScript).

This tool should run continuously and check a variety of pages on our website.

My aim is to simulate web browser behaviour in order to track ALL HTTP requests that follow the first request, which only fetches the HTML source of the page.
Most of you probably know tools like "Fiddler" or "Tamper Data"; they work the way I need. I'd like to automate this and write the results to a log file or a database.

Do any of you have experience developing such a tool in .NET? I tried the embedded IE web browser control (SHDocVw.WebBrowser) and Skybound Gecko, but I haven't succeeded so far. Are there any tutorials on this topic on the web?
Unfortunately, none of my Google searches turned up proper help or usable documentation.

Thank you in advance,
Dani
Asked by: kickeronline
 
wdosanjos Commented:
I think the System.Net.HttpWebRequest class provides what you need.  You can craft an HTTP request with it (including cookies, etc.) and then get the corresponding response (headers, content, etc.).

Here is some sample code:
// Fetch the HTML source of a single URL (this issues exactly one request).
var req = System.Net.HttpWebRequest.Create("http://www.yahoo.com/");
using (var res = req.GetResponse())
using (var sr = new System.IO.StreamReader(res.GetResponseStream()))
{
    var html = sr.ReadToEnd();
    Console.WriteLine(html);
}

More details about System.Net.HttpWebRequest:
http://msdn.microsoft.com/en-us/library/system.net.httpwebrequest.aspx

I hope this helps.

 
kickeronline (Author) Commented:
Thanks but... no, I'm sorry. That's too simple. I certainly know how to get the source code for a URL.

Regarding your example, the source code of http://www.yahoo.com/ contains lots of references to images, text files (.js, .css), videos (.swf) and banners (.jpg, .swf). It's all the stuff your browser loads right AFTER it has received the source code for the address.

What I want to do is gather information about all those resources, however they get loaded: some resources are embedded directly in the source code (pictures, stylesheets, scripts), but others get included at runtime, e.g. by JavaScript. That's what your browser does when you open a website.

I could try to parse the source code, but that would mean a tremendous effort on the one hand and reinventing the wheel on the other. I'm sure there's a much better solution out there...
 
wdosanjos Commented:
I think I got it.  

The following sample code retrieves the URLs of the resources referenced in the page's static HTML (LINK, SCRIPT, and IMG tags only).  It uses the Html Agility Pack to parse the HTML.  You should be able to easily add other tags if needed.  

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using HtmlAgilityPack;

namespace ReadHtml
{
    class Program
    {
        static void Main(string[] args)
        {
            var tags = new Dictionary<string, string>()
            {
                {"IMG",     "SRC"},
                {"LINK",    "HREF"},
                {"SCRIPT",  "SRC"},
            };

            var resources = new List<string>();

            // Download the page and parse it straight from the response stream.
            var req = System.Net.HttpWebRequest.Create("http://www.yahoo.com/");
            HtmlDocument doc = new HtmlDocument();
            using (var res = req.GetResponse())
            using (var stream = res.GetResponseStream())
            {
                doc.Load(stream);
            }

            // Optional: keep a local copy of the HTML (the directory must already exist).
            doc.Save(@"c:\temp\yahoo.html");

            foreach (var node in doc.DocumentNode.Descendants())
            {
                string attrName;

                if (tags.TryGetValue(node.Name.ToUpper(), out attrName))
                {
                    var attr = node.Attributes[attrName];

                    if (attr != null)
                    {
                        resources.Add(attr.Value);
                    }
                }
            }

            foreach (var url in resources)
            {
                Console.WriteLine(url);
                // You can add your code to download and save the resource here.
            }
        }
    }
}
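
A note on the collected values: SRC and HREF attributes are often relative ("/img/a.png", "style.css") or protocol-relative ("//cdn.example.com/lib.js"), so they usually need to be resolved against the page URL before they can be downloaded or logged. The following is only a minimal sketch of that step; the class and method names (ResourceUrls, ToAbsolute) are made up for illustration and are not part of the Html Agility Pack.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not from the sample above): turns raw SRC/HREF
// attribute values into absolute http/https URLs.
static class ResourceUrls
{
    public static List<string> ToAbsolute(string pageUrl, IEnumerable<string> rawValues)
    {
        var baseUri = new Uri(pageUrl);
        var result = new List<string>();

        foreach (var raw in rawValues)
        {
            if (string.IsNullOrWhiteSpace(raw))
            {
                continue; // empty attribute, nothing to resolve
            }

            Uri absolute;
            // Resolves relative and protocol-relative values against the page URL;
            // values that are already absolute are kept as they are.
            if (Uri.TryCreate(baseUri, raw.Trim(), out absolute)
                && (absolute.Scheme == Uri.UriSchemeHttp || absolute.Scheme == Uri.UriSchemeHttps))
            {
                result.Add(absolute.AbsoluteUri);
            }
            // Anything else (data:, javascript:, mailto:, ...) is skipped on purpose.
        }

        return result;
    }
}
```

In the sample above this could be called as ResourceUrls.ToAbsolute("http://www.yahoo.com/", resources) right before the final loop. It still only covers what is present in the static HTML.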

 
kickeronline (Author) Commented:
Okay, that gets me all resources that are embedded directly in the source code of that page.

But what about the execution of JavaScript?
For instance, a script that runs right after the source code has loaded might insert additional <img> tags by using

 
document.write('<img src=\'/somepath/someimage.jpg\'>');


That's a very common way to place advertising on a web page.
I wouldn't dare to even think about trying to solve all these cases programmatically, that's why I tried to use existing browser technologies.

Maybe there's a way to derive a web proxy class in order to attach it to some browser control...?
 
wdosanjos Commented:
Humm... in that case you do need to execute the page to get them and somehow capture the requests.

On IE or FF, if you just select 'File / Save' you don't get resources referenced by the JavaScript code.

If you can write a proxy and update the Web Browser Control configuration to use it, you might be able to achieve what you need.  You'll need to come up with some way to identify which page is being loaded, because the proxy normally doesn't get that information (maybe you can infer it from the Referer header).
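
To illustrate the Referer idea: a proxy sees the raw request head, so it can pull out the request line and the Referer header and log the two together. This is only a rough sketch; LoggedRequest and RequestHeadParser are hypothetical names, and it assumes the proxy already hands you the request head as a string. Keep in mind that the Referer is the document that triggered the request, which may be an iframe or a stylesheet rather than the top-level page, and that it can be missing entirely.

```csharp
using System;

// Hypothetical record for one request seen by the proxy.
sealed class LoggedRequest
{
    public string Method;
    public string Url;
    public string Referer; // null when the header is absent
}

static class RequestHeadParser
{
    // Extracts method, request target and Referer from a raw HTTP request head.
    public static LoggedRequest Parse(string rawHead)
    {
        if (string.IsNullOrEmpty(rawHead))
        {
            return null;
        }

        var lines = rawHead.Split(new[] { "\r\n", "\n" }, StringSplitOptions.None);

        // Request line: METHOD SP request-target SP HTTP-version
        var parts = lines[0].Split(' ');
        if (parts.Length < 2)
        {
            return null;
        }

        var request = new LoggedRequest { Method = parts[0], Url = parts[1] };

        for (int i = 1; i < lines.Length; i++)
        {
            if (lines[i].Length == 0)
            {
                break; // a blank line ends the header section
            }

            int colon = lines[i].IndexOf(':');
            if (colon <= 0)
            {
                continue; // not a "Name: value" line
            }

            string name = lines[i].Substring(0, colon).Trim();
            if (name.Equals("Referer", StringComparison.OrdinalIgnoreCase))
            {
                request.Referer = lines[i].Substring(colon + 1).Trim();
            }
        }

        return request;
    }
}
```

For plain HTTP, a browser talking to a proxy normally sends the full URL in the request line, so Url can be logged as-is in that case.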

I found this Mini C# Proxy Server, maybe you could leverage it.
http://miniproxyserver.codeplex.com/

Sorry, but I don't have any other suggestions.
 
kickeronline (Author) Commented:
This tool is a good approach to solving that problem.

I started the mini proxy server at localhost:8090 and managed to set it as the proxy address of the Gecko browser, which I embedded in my Windows Form (VB.NET):

Public Class Form1
    Public Sub New()
        InitializeComponent()
        Skybound.Gecko.Xpcom.Initialize(<xmlRunnerPath>)
        setSetting("network.proxy.http", "127.0.0.1")
        setSetting("network.proxy.http_port", 8090)
        setSetting("network.proxy.share_proxy_settings", True)
        setSetting("network.proxy.type", 1)
        setSetting("network.proxy.no_proxies_on", "")
    End Sub
    Private Sub setSetting(ByVal key As String, ByVal value As Object)
        Skybound.Gecko.GeckoPreferences.Default.Item(key) = value
    End Sub
End Class


What I have to do now is log the requests into a database, which should be no problem since I have the source code for that project. Apart from that, I'm receiving several 404/400 errors, so I assume that the software is still a bit buggy. Maybe I'll be able to fix that.

I noticed that the SQLite library has already been added to that project. Unfortunately, there's no documentation about it.

If you have any experience with logging into a database from the mini proxy server, or if anybody knows a well-working alternative solution, I'd appreciate any feedback.

Thanks a lot!
 
kickeronline (Author) Commented:
The solution is to automate/script a browser, which can be accomplished by adding a browser control to a Windows Form and navigating to the web pages repeatedly.

To trace all resources requested by the browser, one has to route it through a proxy server that logs the entire traffic to a log file or a database.

wdosanjos gives a good example of an open source proxy server. However, it has to be modified to work accurately.
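
For the log file variant, something along these lines could be a starting point. It is a sketch under assumptions, not code from the mini proxy server: RequestLog, FormatLine and Append are hypothetical names, and the column layout (UTC timestamp, page URL, requested URL, status code, tab-separated) is just one possible choice.

```csharp
using System;
using System.Globalization;
using System.IO;

// Hypothetical helper: writes one tab-separated line per proxied request.
static class RequestLog
{
    // Builds a log line: UTC timestamp, page URL, requested URL, HTTP status code.
    public static string FormatLine(DateTime utcTime, string pageUrl, string requestUrl, int statusCode)
    {
        return string.Join("\t", new[]
        {
            utcTime.ToString("o", CultureInfo.InvariantCulture), // ISO 8601 round-trip format
            Clean(pageUrl),
            Clean(requestUrl),
            statusCode.ToString(CultureInfo.InvariantCulture)
        });
    }

    // Appends the line to the log file, creating the file if it does not exist yet.
    public static void Append(string logPath, string line)
    {
        File.AppendAllText(logPath, line + Environment.NewLine);
    }

    // Keeps each entry on one line and uses "-" for unknown values.
    static string Clean(string value)
    {
        return (value ?? "-").Replace('\t', ' ').Replace('\r', ' ').Replace('\n', ' ');
    }
}
```

If the proxy handles connections on several threads, calls to Append would need to be serialized (for example with a lock), otherwise concurrent writes to the same file can fail.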