Solved

How to get dynamic web page content in C#?

Posted on 2011-03-24
3
1,589 Views
Last Modified: 2012-05-11
I'm programming a Windows Forms based web crawler which should do the following:

1) Start from a URL defined by user (for example www.microsoft.com)
2) Download the content of that page and scrape specific data (strings)
3) After going through page content, add all the data found into an existing database
4) Find all the links on the page, select 5 of them and create new crawlers to crawl each of them.
5) Each crawler starts again from stage 1 with their unique URLs.

Now the problem is that when I download a webpage using WebClient-, HttpWebRequest-, or WebResponse class (all in System.Net) I only get the static content of the web page. Most websites contain scripts, php code and other dynamic content and I can't see them with these classes.

Simply:
Let's say I have a php page www.example.com/page.php and when shown in web browser it gets 100 names from a database and prints them on the page. I want to be able to read that dynamic content in my Windows Forms application using C#.

This is just an example, the page could be for example ASP.NET and contain news headlines or something like that. I can't define what URL's the users will scrape so I really have to be able to read static and dynamic content from any URL.

NOTICE! The problem is clearly stated above, I don't need help in implementing any other features listed on top of this question :)

Thanks!
0
Comment
  • 2
3 Comments
 
LVL 83

Accepted Solution

by:
Dave Baldwin earned 500 total points
ID: 35211654
Your example actually puts up 'static' content in that it will usually be included in the original page as HTML.  What you will have a problem with is the content loaded by javascript/AJAX methods.  You will have to find a way to read the javascript and make the requests that it does.
0
 

Author Closing Comment

by:SubsonicDesignOfficial
ID: 35211687
Thanks for your answer! Now that I recall I only saw some javascript regions unformatted (they were shown as code). Now I also remember that PHP and ASP.NET are translated to HTML at the server side (?), I should have thought of that!
0
 
LVL 83

Expert Comment

by:Dave Baldwin
ID: 35211707
Thanks for the points.  The other thing that will be a problem is Flash of course since it often loads it's own content.
0

Featured Post

Is Your Active Directory as Secure as You Think?

More than 75% of all records are compromised because of the loss or theft of a privileged credential. Experts have been exploring Active Directory infrastructure to identify key threats and establish best practices for keeping data safe. Attend this month’s webinar to learn more.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Article by: Ivo
C# And Nullable Types Since 2.0 C# has Nullable(T) Generic Structure. The idea behind is to allow value type objects to have null values just like reference types have. This concerns scenarios where not all data sources have values (like a databa…
This article introduced a TextBox that supports transparent background.   Introduction TextBox is the most widely used control component in GUI design. Most GUI controls do not support transparent background and more or less do not have the…
This is used to tweak the memory usage for your computer, it is used for servers more so than workstations but just be careful editing registry settings as it may cause irreversible results. I hold no responsibility for anything you do to the regist…
In this video I am going to show you how to back up and restore Office 365 mailboxes using CodeTwo Backup for Office 365. Learn more about the tool used in this video here: http://www.codetwo.com/backup-for-office-365/ (http://www.codetwo.com/ba…

863 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

26 Experts available now in Live!

Get 1:1 Help Now