Solved

ASP.NET 2.0 Screen Scraper

Posted on 2008-06-17
Last Modified: 2013-11-07
Hello,
I am trying to develop a screen scraper for a client, but I am having some issues. Before I buy one, I thought I should post the question here. I need this solution as soon as possible, so any help would be appreciated! I need to go to various court websites, scrape the case data, and write it to a CSV file, and I will need to keep appending data to the same file. Can anyone help? On the web I was told this is an easy process, but all I am getting is a copy of the entire screen along with the data, when I just need the data written to a CSV file. The solution needs to be in ASP.NET/C#.

Thank you in advance!
Miracle By Design
public partial class _Default : System.Web.UI.Page
    {
        String r;
 
        protected void Page_Load(object sender, EventArgs e)
        {
            string str = "http://www.clerk-alachua-fl.org/pa/pa.urd/pamw2000*o_case_sum?83518636"; 
            home.Text = screenscrape(str);
        }
 
        private string screenscrape(string url)
        {
            WebResponse obj;
            WebRequest obj1 = System.Net.HttpWebRequest.Create(url);
            obj = obj1.GetResponse();
            using (StreamReader sr = new StreamReader(obj.GetResponseStream()))
            {
                r = sr.ReadToEnd();
                sr.Close();
            }
            return r;
            
            gvResults.DataSource = r;
            // binds the databind
            gvResults.DataBind();
 
            // The following lines of code writes the extracted Urls to the file named test.txt
            StreamWriter sw = new StreamWriter(Server.MapPath("AlachuaCoFLCircuitCourt.csv"));
            sw.Write(r);
            sw.Close(); 
 
 
        }
        
    }
}

Question by:MiracleByDesign
11 Comments
 
LVL 3

Expert Comment

by:BitRunner303
ID: 21809059
Not sure what gvResults is, since it's a partial class.

Anyway, the code you posted looks like it is meant to go out, get the HTML source of the page, read it into the string "r", and then write it to a CSV file (note that the statements after "return r;" never run).

The problem is that HTML source is not the same as CSV. If you want CSV, you'll have to parse the HTML source (e.g. with regular expressions), or go a different way and iterate through the HTML Document Object Model (DOM) for the page.
 

Author Comment

by:MiracleByDesign
ID: 21809627
gvResults is a datagrid that I thought I could write the data to and then maybe export to a CSV file. Can you give me a coding example of how to solve this problem?
 
LVL 3

Expert Comment

by:BitRunner303
ID: 21810459
Does the information actually populate the DataGridView? (I imagine not, but I might as well confirm.) If it does, I can get you from there to the file.

Author Comment

by:MiracleByDesign
ID: 21819149
No, it does not populate into the grid.  That is my main problem.  I have exported data from a grid to Excel before but I am new to scraping a website for data only.
 
LVL 3

Expert Comment

by:BitRunner303
ID: 21824831
I'll be able to give you some help on this, but it'll take some work since I'll have to parse the HTML. I'll post something to show pretty soon.
 
LVL 3

Accepted Solution

by:
BitRunner303 earned 2000 total points
ID: 21825473
Here we go. I did this as a WinForms app, but you should be able to tailor it to your needs easily.

Take a look at the two files in the attached code: Form1.Designer.cs (the designer code) and Form1.cs (the actual code).

Form1.Designer.cs is essentially a standard form with a DataGridView and a DataSet. Note that I don't set the DataMember at design time (I had problems when trying to do that).

Form1.cs is where the magic happens. In this particular case we're EXTREMELY fortunate: the designers of the website put in HTML INPUT HIDDEN tags that carry the data elements shown on the page. That means we can simply parse for those hidden tags and pull out the data we need, without the kind of headache you'd normally see in a web page scrape (it's a shame more sites don't do this). So I've created a regular expression that searches for those tags and grabs the Name and Value of each element. Then I dynamically put them into a DataSet, bind it to my DataGridView, and voila.

Now, here are some caveats to watch out for. You've only given me one example case. If you're going to do this in batch, the question is how to deal with cases where the elements differ, say when there is more than one defendant. That could take considerably more work. The best way I can see is to go through all the cases in the batch, collect every unique field name, and build one big table with all of them. That's a bit of a manual process but not too extreme, and the method I'm describing here is not the only way to tackle the problem.
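The batch caveat above could be handled roughly along these lines. This is only a sketch under assumptions: CaseTable and AddCase are hypothetical names that do not appear in the attached code, and the "_2", "_3" suffix for a field name repeated within one case is just one possible convention. It adds a column the first time a field name is seen and leaves cells empty (DBNull) for cases that lack that field.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

// Hypothetical helper (not part of the attached code): collects one row per
// case and grows the column set as new field names turn up.
public static class CaseTable
{
    public static void AddCase(DataTable table, IList<string> names, IList<string> values)
    {
        // First pass: pick a column name for every field and make sure it exists.
        List<string> cols = new List<string>();
        Dictionary<string, bool> used =
            new Dictionary<string, bool>(StringComparer.OrdinalIgnoreCase);
        for (int i = 0; i < names.Count; i++)
        {
            string col = names[i];
            int n = 2;
            // A name repeated within one case (say, a second defendant)
            // gets a numeric suffix so it does not overwrite the first value.
            while (used.ContainsKey(col))
            {
                col = names[i] + "_" + n;
                n++;
            }
            used[col] = true;
            cols.Add(col);
            if (!table.Columns.Contains(col))
            {
                table.Columns.Add(col, typeof(string));
            }
        }

        // Second pass: fill the row. Columns this case lacks stay DBNull.
        DataRow row = table.NewRow();
        for (int i = 0; i < cols.Count; i++)
        {
            row[cols[i]] = values[i];
        }
        table.Rows.Add(row);
    }
}
```

Calling AddCase once per scraped page with the name and value lists from the regular expression matches would leave a single table whose columns are the union of all field names seen so far.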
-----Form1.Designer.cs---------------
namespace RegexTest
{
    partial class Form1
    {
        /// <summary>
        /// Required designer variable.
        /// </summary>
        private System.ComponentModel.IContainer components = null;
 
        /// <summary>
        /// Clean up any resources being used.
        /// </summary>
        /// <param name="disposing">true if managed resources should be disposed; otherwise, false.</param>
        protected override void Dispose(bool disposing)
        {
            if (disposing && (components != null))
            {
                components.Dispose();
            }
            base.Dispose(disposing);
        }
 
        #region Windows Form Designer generated code
 
        /// <summary>
        /// Required method for Designer support - do not modify
        /// the contents of this method with the code editor.
        /// </summary>
        private void InitializeComponent()
        {
            this.dataSet1 = new System.Data.DataSet();
            this.dataGridView1 = new System.Windows.Forms.DataGridView();
            ((System.ComponentModel.ISupportInitialize)(this.dataSet1)).BeginInit();
            ((System.ComponentModel.ISupportInitialize)(this.dataGridView1)).BeginInit();
            this.SuspendLayout();
            // 
            // dataSet1
            // 
            this.dataSet1.DataSetName = "NewDataSet";
            // 
            // dataGridView1
            // 
            this.dataGridView1.Anchor = ((System.Windows.Forms.AnchorStyles)((((System.Windows.Forms.AnchorStyles.Top | System.Windows.Forms.AnchorStyles.Bottom)
                        | System.Windows.Forms.AnchorStyles.Left)
                        | System.Windows.Forms.AnchorStyles.Right)));
            this.dataGridView1.ColumnHeadersHeightSizeMode = System.Windows.Forms.DataGridViewColumnHeadersHeightSizeMode.AutoSize;
            this.dataGridView1.Location = new System.Drawing.Point(12, 12);
            this.dataGridView1.Name = "dataGridView1";
            this.dataGridView1.Size = new System.Drawing.Size(616, 242);
            this.dataGridView1.TabIndex = 1;
            // 
            // Form1
            // 
            this.AutoScaleDimensions = new System.Drawing.SizeF(6F, 13F);
            this.AutoScaleMode = System.Windows.Forms.AutoScaleMode.Font;
            this.ClientSize = new System.Drawing.Size(640, 266);
            this.Controls.Add(this.dataGridView1);
            this.Name = "Form1";
            this.Text = "Form1";
            this.Load += new System.EventHandler(this.Form1_Load);
            ((System.ComponentModel.ISupportInitialize)(this.dataSet1)).EndInit();
            ((System.ComponentModel.ISupportInitialize)(this.dataGridView1)).EndInit();
            this.ResumeLayout(false);
 
        }
 
        #endregion
 
        private System.Data.DataSet dataSet1;
        private System.Windows.Forms.DataGridView dataGridView1;
 
    }
}
 
-----Form1.cs---------------------
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Text;
using System.Windows.Forms;
using System.Net;
using System.Text.RegularExpressions;
using System.IO;
 
namespace RegexTest
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }
 
        private void Form1_Load(object sender, EventArgs e)
        {
            string sourceFile = GetSourceForPage("http://www.clerk-alachua-fl.org/pa/pa.urd/pamw2000*o_case_sum?83518636");
            GetFieldsInPage(sourceFile);
        }
 
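        private string GetSourceForPage(string url)
        {
            // Download the raw HTML source for the given URL.
            HttpWebRequest myreq = (HttpWebRequest)WebRequest.Create(url);
            // The using blocks release the response and reader even if reading throws.
            using (WebResponse resp = myreq.GetResponse())
            using (StreamReader r = new StreamReader(resp.GetResponseStream()))
            {
                return r.ReadToEnd();
            }
        }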
 
        private void GetFieldsInPage(string htmlSource)
        {
            this.dataSet1.Clear();
            DataTable myTable = new DataTable();
 
            List<string> NameList = new List<string>();
            List<string> ValueList = new List<string>();
 
            Regex myReg = new Regex(@"\<INPUT\ TYPE\=HIDDEN\ NAME=\""(?<VariableName>[^\""]+)\""\ VALUE=\""(?<VariableValue>[^\""]+)\""[^\>]*\>",RegexOptions.IgnoreCase | RegexOptions.Multiline);
            MatchCollection myMatches = myReg.Matches(htmlSource);
            foreach (Match myMatch in myMatches)
            {
                string varName = myMatch.Groups["VariableName"].Value;
                string varValue = myMatch.Groups["VariableValue"].Value;
                varName = varName.Replace('.', '_');
 
                NameList.Add(varName);
                ValueList.Add(varValue);
            }
 
            myTable.TableName = "ScrapeData";
 
            for (int z = 0; z < NameList.Count; z++)
            {
                DataColumn newCol = myTable.Columns.Add();
                newCol.AllowDBNull = true;
                newCol.DataType = typeof(string);
                newCol.ColumnName = NameList[z];
                newCol.MaxLength = 32767;
            }
 
            DataRow myRow = myTable.NewRow();
 
            for (int z = 0; z < ValueList.Count; z++)
            {
                myRow[z] = ValueList[z];
            }
 
            myTable.Rows.Add(myRow);
 
            this.dataSet1.Tables.Add(myTable);
            this.dataGridView1.DataSource = this.dataSet1.Tables[0].DefaultView;
 
        }
    }
}
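
The original question also asked for the data to be appended to a CSV file, which the code above stops short of. A minimal sketch of that last step follows, assuming a hypothetical CsvExport helper (the class and its method names are not part of the attached code) and RFC 4180-style quoting. It writes a header row only when the file is new or empty, then appends one line per DataRow.

```csharp
using System;
using System.Data;
using System.IO;
using System.Text;

// Hypothetical helper (not part of the attached code).
public static class CsvExport
{
    // Quote a field only if it contains a comma, a quote, or a line break.
    public static string Escape(string field)
    {
        if (field == null) return "";
        if (field.IndexOfAny(new char[] { ',', '"', '\r', '\n' }) < 0) return field;
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    // Build one CSV line from a row's values (DBNull becomes an empty field).
    public static string ToLine(object[] values)
    {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.Length; i++)
        {
            if (i > 0) sb.Append(',');
            sb.Append(Escape(Convert.ToString(values[i])));
        }
        return sb.ToString();
    }

    // Append every row of the table to the file at the given path.
    public static void Append(DataTable table, string path)
    {
        bool writeHeader = !File.Exists(path) || new FileInfo(path).Length == 0;
        using (StreamWriter sw = new StreamWriter(path, true, Encoding.UTF8))
        {
            if (writeHeader)
            {
                object[] names = new object[table.Columns.Count];
                for (int i = 0; i < names.Length; i++)
                {
                    names[i] = table.Columns[i].ColumnName;
                }
                sw.WriteLine(ToLine(names));
            }
            foreach (DataRow row in table.Rows)
            {
                sw.WriteLine(ToLine(row.ItemArray));
            }
        }
    }
}
```

In the ASP.NET page from the question, the path would presumably come from Server.MapPath("AlachuaCoFLCircuitCourt.csv"). The sketch assumes every appended table has the same columns in the same order; if the field set varies between cases, the header written on the first call will no longer match later rows.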

 

Author Comment

by:MiracleByDesign
ID: 21825948
I will add this code to my project tonight and let you know what happens.  Thank you very much for your help with this project.
 
LVL 3

Expert Comment

by:BitRunner303
ID: 21826208
No prob let me know how it goes.
 

Author Comment

by:MiracleByDesign
ID: 21912123
BitRunner303, the solution works, but the client has another request that I am not sure you can help me with. I need to save the actual scraper to an XML file so it can be used in another program. Do you have any idea how I would do this?

thanks,
MiracleByDesign
 
LVL 3

Expert Comment

by:BitRunner303
ID: 21921599
Simple. Here's a tutorial on using the XML writer features in .NET: http://www.c-sharpcorner.com/UploadFile/mahesh/writexmlusingXmlWriter11132005233450PM/writexmlusingXmlWriter.aspx

You'll simply iterate through the rows of your final DataSet, writing out the elements of each record. I would probably structure the output something like the snippet below.

Basically, you'd write a root element for the file, which I've called CaseFile here. Then, for each Case, write out all of its XML elements. If you need more help with this, let me know, but this should get you going.
<CaseFile>
  <Case>
      <DefendantName>John Doe</DefendantName>
      <ProsecutingAttorney>Jack McCoy</ProsecutingAttorney>
  </Case>
</CaseFile>
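
Producing that layout from the scraped table might look roughly like the following. It is a sketch, not the code from the linked tutorial: CaseXml and Write are hypothetical names, and it assumes each DataTable column should become one child element of Case, with empty (DBNull) cells skipped.

```csharp
using System;
using System.Data;
using System.IO;
using System.Xml;

// Hypothetical helper (not part of the attached code).
public static class CaseXml
{
    // Writes <CaseFile><Case>...</Case>...</CaseFile>, one Case per row.
    public static void Write(DataTable table, TextWriter output)
    {
        XmlWriterSettings settings = new XmlWriterSettings();
        settings.Indent = true;
        using (XmlWriter w = XmlWriter.Create(output, settings))
        {
            w.WriteStartElement("CaseFile");
            foreach (DataRow row in table.Rows)
            {
                w.WriteStartElement("Case");
                foreach (DataColumn col in table.Columns)
                {
                    if (row.IsNull(col)) continue; // skip fields this case lacks
                    // Column names can contain characters that are not legal
                    // in XML element names, so encode them first.
                    string name = XmlConvert.EncodeLocalName(col.ColumnName);
                    w.WriteElementString(name, Convert.ToString(row[col]));
                }
                w.WriteEndElement(); // Case
            }
            w.WriteEndElement(); // CaseFile
        }
    }
}
```

To write to disk, pass a StreamWriter opened on the target path. If the exact element layout does not matter, the built-in DataSet.WriteXml method is a shorter alternative, though its output follows the DataSet's own naming rather than the CaseFile/Case structure shown here.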

 

Author Closing Comment

by:MiracleByDesign
ID: 31468193
Thank you very much!!!!


Question has a verified solution.
