Solved

HTML Screen Scraping

Posted on 2008-11-10
Last Modified: 2012-08-13
Hi all,

Please help me with this problem. I want to write an application that can retrieve the HTML of any web site, for example "http://search.techrepublic.com.com/search/screen-scraper.html" (that is, what you see when you view the source of a page), and then search that HTML for specific data. Any tutorial, code, or tool that can search for that data would help me a lot.

Thanks in advance  
Question by:ASINGH1974
4 Comments
 
LVL 7

Accepted Solution

by:aherps (earned 1000 total points)
ID: 22920160

using System;
using System.IO;
using System.Net;

namespace WebHelper
{
    public class webpage
    {
        // Raw HTML of the page, exactly as returned by the initial request.
        public string results;

        public webpage(string address)
        {
            WebRequest objRequest = WebRequest.Create(address);

            // The using blocks dispose the response and the reader,
            // which closes the stream and releases the connection.
            using (WebResponse objResponse = objRequest.GetResponse())
            using (StreamReader sr = new StreamReader(objResponse.GetResponseStream()))
            {
                this.results = sr.ReadToEnd();
            }
        }
    }
}

 
LVL 7

Expert Comment

by:aherps
ID: 22920173
Just be warned about the above: this won't work with web pages that use AJAX, because the result is taken from the initial page load and does not include data loaded by subsequent requests.
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 22926672
If you have access to PHP, it's very easy.  Best, ~Ray
<?php
$html = file_get_contents('http://yoursite.org/page.asp');
echo htmlentities($html);
?>

 
LVL 6

Expert Comment

by:Neeraj Soni
ID: 22938349
The code from aherps is a good starting point.
All you need to do is write a custom HTML parser and identify your landmark tags in the HTML source. From these tags you can read the inner HTML or text, attributes, and other values.
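
The landmark-tag idea can be sketched roughly as follows. This is only an illustration in Python using the standard library's html.parser module, not code from anyone in this thread; the class name LandmarkParser, the helper extract, and the sample HTML are all made up for the example. It assumes the landmark is a container tag with a closing tag (such as a, div, or td), not a void tag like img.

```python
from html.parser import HTMLParser


class LandmarkParser(HTMLParser):
    """Collects the attributes and inner text of every tag with a given name."""

    def __init__(self, landmark_tag):
        super().__init__()
        self.landmark_tag = landmark_tag
        self.matches = []   # one dict per landmark tag found
        self._depth = 0     # greater than 0 while inside a landmark tag

    def handle_starttag(self, tag, attrs):
        if tag == self.landmark_tag:
            self.matches.append({"attrs": dict(attrs), "text": ""})
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == self.landmark_tag and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        # Text is credited to the most recently opened landmark tag.
        if self._depth > 0:
            self.matches[-1]["text"] += data


def extract(html, landmark_tag):
    """Return a list of {"attrs": ..., "text": ...} dicts for the landmark tag."""
    parser = LandmarkParser(landmark_tag)
    parser.feed(html)
    parser.close()
    return parser.matches


# Hypothetical fragment standing in for the downloaded page source.
sample = (
    '<ul><li><a href="/a.html" class="hit">First <b>result</b></a></li>'
    '<li><a href="/b.html">Second</a></li></ul>'
)
for match in extract(sample, "a"):
    print(match["attrs"].get("href"), match["text"].strip())
```

In practice the string passed to extract would be the HTML returned by the download step, and the landmark tag would be whichever element surrounds the data you are after.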
You can even handle AJAX calls by identifying their URLs and downloading the partial data from those URLs directly.
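
As a rough sketch of that last point: once you have found the URL a page requests in the background (for example with the browser's network tools), you can request it yourself. The snippet below is Python using only the standard library. The function name fetch_json is made up for this example, and it assumes the endpoint returns UTF-8 encoded JSON, which will not be true of every site. A data: URL stands in for a real endpoint so that the example runs without network access.

```python
import json
from urllib.request import urlopen


def fetch_json(url, timeout=10):
    """Download a URL and decode the body as JSON (assumes a UTF-8 body)."""
    with urlopen(url, timeout=timeout) as response:
        body = response.read().decode("utf-8")
    return json.loads(body)


# Stand-in for a real AJAX endpoint: a data: URL carrying {"items":[1,2]}.
# Replace it with the URL the page actually calls.
demo_url = "data:application/json,%7B%22items%22%3A%5B1%2C2%5D%7D"
print(fetch_json(demo_url))
```

If the endpoint returns an HTML fragment instead of JSON, read the body the same way and pass it to your HTML parser in place of json.loads.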