  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 1796

How to get dynamic web page content in C#?

I'm programming a Windows Forms-based web crawler that should do the following:

1) Start from a URL defined by the user (for example www.microsoft.com)
2) Download the content of that page and scrape specific data (strings)
3) After going through the page content, add all the data found to an existing database
4) Find all the links on the page, select 5 of them and create new crawlers to crawl each of them.
5) Each new crawler starts again from step 1 with its own URL.

Now the problem is that when I download a web page using the WebClient, HttpWebRequest, or WebResponse classes (all in System.Net), I only get the static content of the page. Most websites contain scripts, PHP code and other dynamic content, and I can't see the output of those with these classes.

Simply:
Let's say I have a PHP page www.example.com/page.php, and when it is shown in a web browser it gets 100 names from a database and prints them on the page. I want to be able to read that dynamic content in my Windows Forms application using C#.

This is just an example; the page could just as well be ASP.NET and contain news headlines or something like that. I can't control which URLs the users will scrape, so I really have to be able to read static and dynamic content from any URL.

NOTICE! The problem is clearly stated above, I don't need help in implementing any other features listed on top of this question :)

Thanks!
Asked by: SubsonicDesignOfficial
1 Solution
 
Dave Baldwin (Fixer of Problems) Commented:
The content in your example is actually 'static' from the crawler's point of view: the server runs the PHP, and the result is usually included in the original page as HTML. What you will have a problem with is content loaded by JavaScript/AJAX methods. You will have to find a way to read the JavaScript and make the same requests that it does.
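As a rough first step toward reading the JavaScript, the sketch below lists the external script files a downloaded page refers to, so they can be fetched and inspected for the requests they make. It assumes the HTML has already been downloaded as a string (for example with WebClient.DownloadString). The names ScriptFinder and ExtractScriptUrls are made up for this illustration, and the regular expression is only a heuristic, not a real HTML parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of System.Net or any library.
public static class ScriptFinder
{
    // Heuristic: matches <script ... src="..."> or src='...'.
    // Inline scripts (no src attribute) are not matched.
    private static readonly Regex ScriptSrc = new Regex(
        @"<script\b[^>]*?\bsrc\s*=\s*[""']([^""']+)[""']",
        RegexOptions.IgnoreCase);

    // Returns the absolute URLs of external scripts referenced by the page.
    public static List<Uri> ExtractScriptUrls(string html, Uri pageUrl)
    {
        var urls = new List<Uri>();
        foreach (Match m in ScriptSrc.Matches(html))
        {
            // Resolve relative paths such as "/js/app.js" against the page URL.
            if (Uri.TryCreate(pageUrl, m.Groups[1].Value, out Uri resolved))
            {
                urls.Add(resolved);
            }
        }
        return urls;
    }
}
```

For a page at http://www.example.com/page.php containing `<script src="/js/app.js"></script>`, this returns a single entry, http://www.example.com/js/app.js. Finding the actual AJAX calls inside those scripts still has to be done by hand or with a different approach; this only tells you where to look.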
 
SubsonicDesignOfficial (Author) Commented:
Thanks for your answer! Thinking back, the only unprocessed content I saw was some JavaScript regions (they were shown as code). I also remember now that PHP and ASP.NET are translated to HTML on the server side (?), so I should have thought of that!
 
Dave Baldwin (Fixer of Problems) Commented:
Thanks for the points.  The other thing that will be a problem is Flash, of course, since it often loads its own content.