Solved

How to track ALL web browser http requests of a certain url?

Posted on 2011-02-16
7
1,157 Views
Last Modified: 2013-11-05
Hi folks,

I'm trying to develop a quick software tool which (primary) purpose is to monitor the performance of the advertising placed on our website by an external ad server (by javascript).

This tool should run contineously and check a variety of pages on our website.

My aim is to simulate webbrowser behaviour in order to be able to track ALL http requests following to the first request which only gets the html source of the page.
Most of you might know tools like "Fiddler" or "Tamper Data", they're working in a way I need. I'd like to automate this and write the results into a log file or a database.

Do any of you guys have experience in developing such a tool in .NET? I tried the embedded IE webbrowser control (SHDocVw.WebBrowser) and skybound gecko, but I didn't succeed so far. Are there any tutorials about that topic in the web?
Unfortunately, any approach to google this didn't get me to a proper help or a usable documentation.

Thank you in advance,
Dani
0
Comment
Question by:kickeronline
  • 4
  • 3
7 Comments
 
LVL 23

Expert Comment

by:wdosanjos
ID: 34905778
I think the System.Net.HttpWebRequest class provides what you need.  You can craft a http request with it (including cookies, etc) and then get the corresponding response (headers, content, etc).

Here is a sample code:
var req = System.Net.HttpWebRequest.Create("http://www.yahoo.com/");
var res = req.GetResponse();
var sr = new System.IO.StreamReader(res.GetResponseStream());
var html = sr.ReadToEnd();
Console.WriteLine(html);

Open in new window


More details about System.Net.HttpWebRequest:
http://msdn.microsoft.com/en-us/library/system.net.httpwebrequest.aspx

I hope this helps.

0
 

Author Comment

by:kickeronline
ID: 34905849
Thanks but... no, I'm sorry. That's too simple. I certainly know how to get the source code for a url.

Regarding your example, the source code of http://www.yahoo.com/ contains lots of references to images, text files (.js, .css), videos (.swf) and banners (.jpg, .swf). It's all that kind of stuff your browser will load right AFTER it got the source code for the address.

What I want to do is to gather information about all those resources, however they get loaded: Some resources are embedded directly into the source code (pictures, stylesheets, scripts), but others get included by e.g. javascript. That's what your browser does when you open a website.

I could try to parse the source code, put that would mean on the one hand a tremendous effort and on the other hand reinventing the wheel. I'm sure there's a much better solution out there...
0
 
LVL 23

Expert Comment

by:wdosanjos
ID: 34907183
I think I got it.  

The following sample code retrieves the URL of the resources used on the page (LINK, SCRIPT, and IMG tags only).  It uses the Html Agility Pack to parse the HTML.  You should be able to easily add other tags if needed.  

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using HtmlAgilityPack;

namespace ReadHtml
{
    class Program
    {
        static void Main(string[] args)
        {
            var tags = new Dictionary<string, string>()
            {
                {"IMG",     "SRC"},
                {"LINK",    "HREF"},
                {"SCRIPT",  "SRC"},
            };

            var resources = new List<string>();

            var req = System.Net.HttpWebRequest.Create("http://www.yahoo.com/");
            var res = req.GetResponse();
            //var sr = new System.IO.StreamReader(res.GetResponseStream());
            //var html = sr.ReadToEnd();
            HtmlDocument doc = new HtmlDocument();
            doc.Load(res.GetResponseStream());

            doc.Save(@"c:\temp\yahoo.html");

            foreach (var node in doc.DocumentNode.Descendants())
            {
                string attrName;

                if (tags.TryGetValue(node.Name.ToUpper(), out attrName))
                {
                    var attr = node.Attributes[attrName];

                    if (attr != null)
                    {
                        resources.Add(attr.Value);
                    }
                }
            }

            foreach (var url in resources)
            {
                Console.WriteLine(url);
                // You can add your code to download and save the resource here.
            }
        }
    }
}

Open in new window

0
Technology Partners: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 

Author Comment

by:kickeronline
ID: 34908936
Okay, that brings me to all resources which are embedded directly into the source code of that page.

But what about the execution of javascript?
For instance, if there is a script which is executed right after the source code has been completed, which puts additional <img> tags by using

 
document.write('<img src=\'/somepath/someimage.jpg\'>');

Open in new window


That's a very common way to place advertising on a web page.
I wouldn't dare to even think about trying to solve all these cases programmatically, that's why I tried to use existing browser technologies.

Maybe there's a way to derive a web proxy class in order to attach it to some browser control...?
0
 
LVL 23

Accepted Solution

by:
wdosanjos earned 250 total points
ID: 34909466
Humm... in that case you do need to execute the page to get them and somehow capture the requests.

On IE or FF, if you just select 'File / Save' you don't get resources referenced by the JavaScript code.

If you can write a proxy and update the Web Browser Control configuration to use the proxy, you might be able to achieve what you need.  You'll need to come up with some controls to identify what page is being loaded, because the proxy normally don't get that information. (maybe through the referrer header you can infer it).

I found this Mini C# Proxy Server, maybe you could leverage it.
http://miniproxyserver.codeplex.com/

Sorry, but I don't have any other suggestions.
0
 

Assisted Solution

by:kickeronline
kickeronline earned 0 total points
ID: 34916027
This tool is a good approach to solve that problem.

I started the mini proxy server at localhost:8090 and managed to assign it as proxy address of the gecko browser, which I embedded into my Windows Form (VB.NET):

Public Class Form1
    Public Sub New()
        InitializeComponent()
        Skybound.Gecko.Xpcom.Initialize(<xmlRunnerPath>)
        setSetting("network.proxy.http", "127.0.0.1")
        setSetting("network.proxy.http_port", 8090)
        setSetting("network.proxy.share_proxy_settings", True)
        setSetting("network.proxy.type", 1)
        setSetting("network.proxy.no_proxies_on", "")
    End Sub
    Private Sub setSetting(ByVal key As String, ByVal value As Object)
        Skybound.Gecko.GeckoPreferences.Default.Item(key) = value
    End Sub
End Class

Open in new window


What I now got to do is to log the requests into a database, which should be no problem as I got the source codes for hat project. Apart from that, I'm receiving several 404/400 errors, so I assume that the software is still a bit buggy. Maybe I'm able to fix that.

I noticed that the sql lite library was already added to that project. Unfortunately, there's no documentation about it.

If you got any expirence in logging into a database with the mini proxy server, or anybody knows some well-working alternative solution, I'd appreciate any feedback.

Thanks a lot!
0
 

Author Closing Comment

by:kickeronline
ID: 34949670
The solution is to automate/script a browser, which can be accomplished by adding a browser control to some windows form and navigate to web pages repeatedly.

To trace all ressources that are getting requested by a browser, one has to use a proxy server which logs the entire traffic into a log file or a database.

wdosanjos gives a good example for an open source proxy server. Anyway, it's got to be modified to work accurate.
0

Featured Post

Industry Leaders: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

IP addresses can be stored in a database in any of several ways.  These ways may vary based on the volume of the data.  I was dealing with quite a large amount of data for user authentication purpose, and needed a way to minimize the storage.   …
Many of us here at EE write code. Many of us write exceptional code; just as many of us write exception-prone code. As we all should know, exceptions are a mechanism for handling errors which are typically out of our control. From database errors, t…
A short tutorial showing how to set up an email signature in Outlook on the Web (previously known as OWA). For free email signatures designs, visit https://www.mail-signatures.com/articles/signature-templates/?sts=6651 If you want to manage em…
I've attached the XLSM Excel spreadsheet I used in the video and also text files containing the macros used below. https://filedb.experts-exchange.com/incoming/2017/03_w12/1151775/Permutations.txt https://filedb.experts-exchange.com/incoming/201…

735 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question