Solved

How to get dynamic web page content in C#?

Posted on 2011-03-24
Last Modified: 2012-05-11
I'm writing a Windows Forms-based web crawler that should do the following:

1) Start from a URL defined by the user (for example www.microsoft.com)
2) Download the content of that page and scrape specific data (strings)
3) After going through the page content, add all the data found to an existing database
4) Find all the links on the page, select 5 of them, and create a new crawler for each one
5) Each new crawler starts again from step 1 with its own URL
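
Step 4 above can be sketched roughly as follows. This is only an illustration under stated assumptions: the names `LinkPicker` and `PickLinks` are hypothetical, "select 5" is interpreted as "the first five distinct http/https links" (the post does not say how the links are chosen), and a regular expression is used as a rough stand-in for a real HTML parser, so unusual markup will be missed.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkPicker
{
    // Matches href="..." or href='...' inside <a> tags.
    // A regex is only an approximation of HTML parsing (assumption for brevity).
    private static readonly Regex HrefPattern = new Regex(
        @"<a\b[^>]*?\bhref\s*=\s*[""']([^""']+)[""']",
        RegexOptions.IgnoreCase);

    // Returns up to maxLinks distinct absolute http/https links found in html,
    // resolved against the URL of the page they came from.
    public static List<Uri> PickLinks(string html, Uri pageUri, int maxLinks)
    {
        var seen = new HashSet<string>(StringComparer.Ordinal);
        var picked = new List<Uri>();
        if (maxLinks <= 0) return picked;

        foreach (Match match in HrefPattern.Matches(html))
        {
            string href = match.Groups[1].Value.Trim();
            if (href.Length == 0 || href.StartsWith("#")) continue; // same-page anchor

            Uri absolute;
            try { absolute = new Uri(pageUri, href); }   // resolves relative links
            catch (UriFormatException) { continue; }     // skip malformed hrefs

            // Skip mailto:, javascript:, ftp: and other non-web schemes.
            if (absolute.Scheme != Uri.UriSchemeHttp &&
                absolute.Scheme != Uri.UriSchemeHttps) continue;

            // Drop the #fragment so the same page is not queued twice.
            string key = absolute.GetLeftPart(UriPartial.Query);
            if (!seen.Add(key)) continue;

            picked.Add(new Uri(key));
            if (picked.Count == maxLinks) break;
        }
        return picked;
    }
}
```

Each returned `Uri` would then be handed to a new crawler, for example `LinkPicker.PickLinks(html, new Uri("http://www.example.com/page.php"), 5)`.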

The problem is that when I download a web page using the WebClient, HttpWebRequest, or WebResponse classes (all in System.Net), I only get the static content of the page. Most websites contain scripts, PHP code, and other dynamic content, and I can't see the output of those with these classes.

Simply:
Let's say I have a PHP page, www.example.com/page.php, and when it is shown in a web browser it gets 100 names from a database and prints them on the page. I want to be able to read that dynamic content in my Windows Forms application using C#.

This is just an example; the page could just as well be ASP.NET and contain news headlines or something similar. I can't control which URLs the users will scrape, so I really have to be able to read both static and dynamic content from any URL.

NOTICE! The problem is clearly stated above; I don't need help implementing any of the other features listed at the top of this question :)

Thanks!
Accepted Solution

by:
Dave Baldwin earned 500 total points
ID: 35211654
Your example actually produces 'static' content, in the sense that the names are usually generated on the server and included in the original page as HTML, so your download will contain them. What you will have a problem with is content loaded by JavaScript/AJAX methods after the page arrives. For that, you will have to find a way to read the JavaScript and make the same requests that it does.
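
As a rough illustration of "reading the JavaScript", the sketch below scans downloaded HTML for external script files and for quoted strings inside inline scripts that look like paths or URLs, since those are candidates for the requests a script might make. It is a heuristic only, not a way to execute scripts: `ScriptScanner` and `FindCandidateRequests` are hypothetical names, the regular expressions are assumptions that will miss URLs built at runtime, and every candidate still has to be checked by hand.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ScriptScanner
{
    // Captures the attribute text and the body of each <script> element.
    private static readonly Regex ScriptTag = new Regex(
        @"<script\b([^>]*)>(.*?)</script>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Captures the value of a src="..." attribute.
    private static readonly Regex SrcAttribute = new Regex(
        @"\bsrc\s*=\s*[""']([^""']+)[""']",
        RegexOptions.IgnoreCase);

    // Captures quoted literals that start with "/", "http:/" or "https:/".
    // Purely a heuristic (assumption): it cannot see URLs assembled in code.
    private static readonly Regex UrlLiteral = new Regex(
        @"[""']((?:https?:)?/[^""'\s]+)[""']",
        RegexOptions.IgnoreCase);

    // Returns script file references and URL-like literals in document order.
    public static List<string> FindCandidateRequests(string html)
    {
        var found = new List<string>();
        foreach (Match script in ScriptTag.Matches(html))
        {
            // External script: its source file may contain the AJAX calls.
            Match src = SrcAttribute.Match(script.Groups[1].Value);
            if (src.Success) found.Add(src.Groups[1].Value);

            // Inline script: look for string literals that resemble URLs.
            foreach (Match literal in UrlLiteral.Matches(script.Groups[2].Value))
                found.Add(literal.Groups[1].Value);
        }
        return found;
    }
}
```

The candidates can then be resolved against the page URL and downloaded with the same System.Net classes used for the page itself.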

Author Closing Comment

by:SubsonicDesignOfficial
ID: 35211687
Thanks for your answer! Now that I think back, the only thing I saw unrendered was some JavaScript regions (they were shown as code). I also remember now that PHP and ASP.NET are translated to HTML on the server side (?). I should have thought of that!
Expert Comment

by:Dave Baldwin
ID: 35211707
Thanks for the points. The other thing that will be a problem is Flash, of course, since it often loads its own content.
