Solved

How to track ALL web browser http requests of a certain url?

Posted on 2011-02-16
1,140 Views
Last Modified: 2013-11-05
Hi folks,

I'm trying to develop a quick software tool whose primary purpose is to monitor the performance of the advertising placed on our website by an external ad server (via JavaScript).

This tool should run continuously and check a variety of pages on our website.

My aim is to simulate web browser behaviour so that I can track ALL HTTP requests that follow the initial request, which only fetches the HTML source of the page.
Most of you probably know tools like "Fiddler" or "Tamper Data"; they work exactly the way I need. I'd like to automate this and write the results to a log file or a database.

Do any of you have experience developing such a tool in .NET? I tried the embedded IE WebBrowser control (SHDocVw.WebBrowser) and Skybound Gecko, but I haven't succeeded so far. Are there any tutorials on this topic on the web?
Unfortunately, googling hasn't led me to any proper help or usable documentation.

Thank you in advance,
Dani
Question by:kickeronline
7 Comments
 
LVL 23

Expert Comment

by:wdosanjos
ID: 34905778
I think the System.Net.HttpWebRequest class provides what you need.  You can craft an HTTP request with it (including cookies, etc.) and then read the corresponding response (headers, content, etc.).

Here is a sample code:
// Fetch a page and dump its HTML to the console.
var req = System.Net.HttpWebRequest.Create("http://www.yahoo.com/");
var res = req.GetResponse();
var sr = new System.IO.StreamReader(res.GetResponseStream());
var html = sr.ReadToEnd();
Console.WriteLine(html);


More details about System.Net.HttpWebRequest:
http://msdn.microsoft.com/en-us/library/system.net.httpwebrequest.aspx

I hope this helps.

 

Author Comment

by:kickeronline
ID: 34905849
Thanks, but... no, I'm sorry. That's too simple. I certainly know how to get the source code for a URL.

Regarding your example, the source code of http://www.yahoo.com/ contains lots of references to images, text files (.js, .css), videos (.swf), and banners (.jpg, .swf). It's all the kind of stuff your browser loads right AFTER it gets the source code for the address.

What I want is to gather information about all those resources, however they get loaded: some resources are embedded directly in the source code (pictures, stylesheets, scripts), but others get included by e.g. JavaScript. That's what your browser does when you open a website.

I could try to parse the source code, but that would mean a tremendous effort on the one hand and reinventing the wheel on the other. I'm sure there's a much better solution out there...
 
LVL 23

Expert Comment

by:wdosanjos
ID: 34907183
I think I got it.  

The following sample code retrieves the URLs of the resources used on the page (LINK, SCRIPT, and IMG tags only).  It uses the Html Agility Pack to parse the HTML.  You should be able to easily add other tags if needed.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using HtmlAgilityPack;

namespace ReadHtml
{
    class Program
    {
        static void Main(string[] args)
        {
            var tags = new Dictionary<string, string>()
            {
                {"IMG",     "SRC"},
                {"LINK",    "HREF"},
                {"SCRIPT",  "SRC"},
            };

            var resources = new List<string>();

            var req = System.Net.HttpWebRequest.Create("http://www.yahoo.com/");
            var res = req.GetResponse();

            // Parse the response stream directly with the Html Agility Pack.
            HtmlDocument doc = new HtmlDocument();
            doc.Load(res.GetResponseStream());

            doc.Save(@"c:\temp\yahoo.html"); // optional: keep a local copy for inspection

            foreach (var node in doc.DocumentNode.Descendants())
            {
                string attrName;

                if (tags.TryGetValue(node.Name.ToUpper(), out attrName))
                {
                    var attr = node.Attributes[attrName];

                    if (attr != null)
                    {
                        resources.Add(attr.Value);
                    }
                }
            }

            foreach (var url in resources)
            {
                Console.WriteLine(url);
                // You can add your code to download and save the resource here.
            }
        }
    }
}

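Since the original goal is to monitor ad performance, a natural follow-up for that last loop would be to download each collected resource and time it. Here's a rough sketch of that idea (the URL in the list is just a placeholder for the collected resources; relative URLs would first need to be resolved against the page address, e.g. with `new Uri(baseUri, url)`):

```csharp
// Download each resource and measure how long it takes.
using System;
using System.Diagnostics;
using System.Net;

class ResourceTimer
{
    static void Main()
    {
        // Placeholder: in practice, feed in the list collected from the page.
        var urls = new[] { "http://www.yahoo.com/favicon.ico" };

        foreach (var url in urls)
        {
            var sw = Stopwatch.StartNew();
            try
            {
                var req = WebRequest.Create(url);
                using (var res = req.GetResponse())
                using (var stream = res.GetResponseStream())
                {
                    // Drain the stream so the full download is timed.
                    var buffer = new byte[8192];
                    long total = 0;
                    int read;
                    while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
                        total += read;
                    sw.Stop();
                    Console.WriteLine("{0}\t{1} bytes\t{2} ms",
                        url, total, sw.ElapsedMilliseconds);
                }
            }
            catch (WebException ex)
            {
                sw.Stop();
                Console.WriteLine("{0}\tFAILED ({1})", url, ex.Status);
            }
        }
    }
}
```

The timings could then be appended to a log file or database instead of the console.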
 

Author Comment

by:kickeronline
ID: 34908936
Okay, that gets me all the resources that are embedded directly in the source code of the page.

But what about the execution of JavaScript?
For instance, there could be a script that runs right after the source code has loaded and inserts additional <img> tags using

 
document.write('<img src=\'/somepath/someimage.jpg\'>');


That's a very common way to place advertising on a web page.
I wouldn't even dare to think about trying to solve all these cases programmatically; that's why I tried to use existing browser technologies.

Maybe there's a way to derive a web proxy class and attach it to some browser control...?
 
LVL 23

Accepted Solution

by:
wdosanjos earned 250 total points
ID: 34909466
Hmm... in that case you do need to execute the page and somehow capture the requests.

On IE or FF, if you just select 'File / Save', you don't get the resources referenced by the JavaScript code.

If you can write a proxy and update the Web Browser Control configuration to use it, you might be able to achieve what you need.  You'll also need some way to identify which page is being loaded, because the proxy normally doesn't get that information (maybe you can infer it from the Referer header).
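The core of such a proxy can be sketched in a few dozen lines: accept connections from the browser, log each request line, and forward the traffic. This is only a toy sketch under heavy assumptions: no HTTPS/CONNECT support, no keep-alive or request bodies, and the log path is just an example.

```csharp
// Minimal logging HTTP proxy sketch (plain HTTP GETs only).
using System;
using System.IO;
using System.Net;
using System.Net.Sockets;
using System.Text;
using System.Threading;

class LoggingProxy
{
    static void Main()
    {
        var listener = new TcpListener(IPAddress.Loopback, 8090);
        listener.Start();
        Console.WriteLine("Proxy listening on 127.0.0.1:8090");

        while (true)
        {
            var client = listener.AcceptTcpClient();
            new Thread(() => Handle(client)).Start();
        }
    }

    static void Handle(TcpClient client)
    {
        using (client)
        {
            try
            {
                var clientStream = client.GetStream();
                var head = ReadHead(clientStream); // request line + headers
                if (head.Length == 0) return;

                var requestLine = head.Substring(0, head.IndexOf('\r'));
                File.AppendAllText(@"c:\temp\requests.log",
                    DateTime.Now.ToString("s") + "\t" + requestLine + Environment.NewLine);

                // Proxy-style request lines carry an absolute URL:
                //   GET http://host/path HTTP/1.1
                var uri = new Uri(requestLine.Split(' ')[1]);

                using (var server = new TcpClient(uri.Host, uri.Port))
                {
                    var serverStream = server.GetStream();
                    var bytes = Encoding.ASCII.GetBytes(head);
                    serverStream.Write(bytes, 0, bytes.Length);

                    // Relay the response back to the browser. This naive copy
                    // assumes the server closes the connection afterwards.
                    serverStream.CopyTo(clientStream);
                }
            }
            catch (Exception)
            {
                // Toy sketch: swallow per-connection errors.
            }
        }
    }

    // Read bytes until the blank line that ends the HTTP request head.
    static string ReadHead(NetworkStream stream)
    {
        var sb = new StringBuilder();
        int b;
        while ((b = stream.ReadByte()) != -1)
        {
            sb.Append((char)b);
            if (sb.Length >= 4 && sb.ToString(sb.Length - 4, 4) == "\r\n\r\n")
                break;
        }
        return sb.ToString();
    }
}
```

A real implementation needs proper keep-alive and chunked-transfer handling, which is why starting from an existing project makes sense.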

I found this Mini C# Proxy Server; maybe you can leverage it.
http://miniproxyserver.codeplex.com/

Sorry, but I don't have any other suggestions.
 

Assisted Solution

by:kickeronline
kickeronline earned 0 total points
ID: 34916027
That tool is a good approach to solving the problem.

I started the mini proxy server at localhost:8090 and managed to assign it as the proxy address of the Gecko browser, which I embedded into my Windows Form (VB.NET):

Public Class Form1
    Public Sub New()
        InitializeComponent()
        Skybound.Gecko.Xpcom.Initialize(<xmlRunnerPath>)
        ' Route all Gecko traffic through the local mini proxy server.
        setSetting("network.proxy.http", "127.0.0.1")
        setSetting("network.proxy.http_port", 8090)
        setSetting("network.proxy.share_proxy_settings", True)
        setSetting("network.proxy.type", 1) ' 1 = manual proxy configuration
        setSetting("network.proxy.no_proxies_on", "")
    End Sub
    Private Sub setSetting(ByVal key As String, ByVal value As Object)
        Skybound.Gecko.GeckoPreferences.Default.Item(key) = value
    End Sub
End Class


What I now have to do is log the requests into a database, which should be no problem as I've got the source code for that project. Apart from that, I'm receiving several 404/400 errors, so I assume the software is still a bit buggy. Maybe I'll be able to fix that.

I noticed that the SQLite library was already added to that project. Unfortunately, there's no documentation about it.
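For reference, here's a minimal sketch of what I imagine the logging could look like with a SQLite ADO.NET library such as System.Data.SQLite (I'm assuming that's the bundled one; the table and column names here are my own invention):

```csharp
// Sketch of a small request log backed by SQLite.
using System;
using System.Data.SQLite; // assumed to be the SQLite library bundled with the proxy project

class RequestLog
{
    private readonly string _connectionString = "Data Source=requests.db";

    public RequestLog()
    {
        // Create the log table on first use.
        using (var conn = new SQLiteConnection(_connectionString))
        {
            conn.Open();
            using (var cmd = new SQLiteCommand(
                "CREATE TABLE IF NOT EXISTS requests (" +
                "  id INTEGER PRIMARY KEY AUTOINCREMENT," +
                "  logged_at TEXT NOT NULL," +
                "  url TEXT NOT NULL," +
                "  referrer TEXT)", conn))
            {
                cmd.ExecuteNonQuery();
            }
        }
    }

    // Called from the proxy for every request it sees.
    public void Add(string url, string referrer)
    {
        using (var conn = new SQLiteConnection(_connectionString))
        {
            conn.Open();
            using (var cmd = new SQLiteCommand(
                "INSERT INTO requests (logged_at, url, referrer) " +
                "VALUES (@t, @u, @r)", conn))
            {
                cmd.Parameters.AddWithValue("@t", DateTime.UtcNow.ToString("s"));
                cmd.Parameters.AddWithValue("@u", url);
                cmd.Parameters.AddWithValue("@r", referrer);
                cmd.ExecuteNonQuery();
            }
        }
    }
}
```

Logging the Referer alongside the URL would also help with the page-identification problem mentioned above.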

If you have any experience with logging to a database from the mini proxy server, or if anybody knows a well-working alternative solution, I'd appreciate any feedback.

Thanks a lot!
 

Author Closing Comment

by:kickeronline
ID: 34949670
The solution is to automate/script a browser, which can be accomplished by adding a browser control to a Windows Form and navigating to the web pages repeatedly.

To trace all resources requested by the browser, one has to use a proxy server that logs the entire traffic to a log file or a database.

wdosanjos gave a good example of an open source proxy server. However, it has to be modified to work accurately.
