Solved

ASP.NET 2.0 Screen Scraper

Posted on 2008-06-17
Last Modified: 2013-11-07
Hello,
I am trying to develop a screen scraper for a client, but I am having some issues. Before I buy one, I thought I should post the question here. I need this solution as soon as possible, so any help would be appreciated! I need to go to various court websites and scrape the case data, then write the data to a CSV file. I will need to keep appending data to the same file. Can anyone help? I was told on the web that this is an easy process, but all I am getting is a copy of the screen along with the data. I just need the data written to a CSV file. The solution needs to be in ASP.NET/C#.

Thank you in advance!
Miracle By Design
public partial class _Default : System.Web.UI.Page
    {
        String r;

        protected void Page_Load(object sender, EventArgs e)
        {
            string str = "http://www.clerk-alachua-fl.org/pa/pa.urd/pamw2000*o_case_sum?83518636";
            home.Text = screenscrape(str);
        }

        private string screenscrape(string url)
        {
            WebResponse obj;
            WebRequest obj1 = System.Net.HttpWebRequest.Create(url);
            obj = obj1.GetResponse();
            using (StreamReader sr = new StreamReader(obj.GetResponseStream()))
            {
                r = sr.ReadToEnd();
                sr.Close();
            }
            return r;

            gvResults.DataSource = r;
            // binds the databind
            gvResults.DataBind();

            // The following lines of code writes the extracted Urls to the file named test.txt
            StreamWriter sw = new StreamWriter(Server.MapPath("AlachuaCoFLCircuitCourt.csv"));
            sw.Write(r);
            sw.Close();
        }
    }
}

Question by:MiracleByDesign
 
LVL 3

Expert Comment

by:BitRunner303
ID: 21809059
I'm not sure what gvResults is, since this is a partial class and the rest of it isn't shown.

Anyway, the code you posted goes out and gets the HTML source of the page, reads it into the string "r", and is meant to then write it to a CSV file. (As posted, the lines after "return r;" never execute, so the grid binding and the file write are skipped.)

The problem, though, is that HTML source is not the same as CSV. If you want CSV, you're going to have to parse the HTML source (e.g. with regular expressions), or do it a different way and iterate through the HTML Document Object Model (DOM) for the page.
 

Author Comment

by:MiracleByDesign
ID: 21809627
gvResults is a datagrid that I thought I could write the data to and then maybe export to a CSV file. Can you give me a coding example of how to solve this problem?
 
LVL 3

Expert Comment

by:BitRunner303
ID: 21810459
Does the information actually populate the DataGridView? I imagine not, but I might as well confirm. If it does, I can get you from there to the file.
 

Author Comment

by:MiracleByDesign
ID: 21819149
No, it does not populate into the grid.  That is my main problem.  I have exported data from a grid to Excel before but I am new to scraping a website for data only.
 
LVL 3

Expert Comment

by:BitRunner303
ID: 21824831
I'll be able to give you some help on this but it'll take some work since I'll have to do parsing on the html.  I'll send up something to show pretty soon.

 
LVL 3

Accepted Solution

by:BitRunner303 (earned 500 total points)
ID: 21825473
Here we go. I did this as a WinForms app, but you should be able to tailor it to your needs easily.

Take a look at the two files in the attached code: Form1.Designer.cs (the designer code) and Form1.cs (the actual code).

Form1.Designer.cs is essentially just a standard form with a DataGridView and a DataSet. Note that I don't set the DataMember at design time (I had problems when trying to do that).

Form1.cs is where the magic happens. In this particular case we're EXTREMELY fortunate: the designers of the website put in HTML INPUT TYPE=HIDDEN tags that carry the data elements shown on the page. That means we can simply parse for those hidden tags and grab the data we need, without the kind of headache you'd normally see in a web page scrape (it's a shame more sites don't do this). So I've created a regular expression that searches for those tags and grabs the Name and Value of each element. Then I dynamically put them into a DataSet, bind it to my DataGridView, and voila.

Now, here are some caveats to watch out for. You've only given me one example case. If you're trying to do this in batch, the question is how you deal with cases where the elements differ, say when there is more than one defendant. That might take considerably more work. The best way I can see is to go through all the cases that match the batch criteria, find all the unique field names that occur, and create one big table of all field names. That's a bit of a manual process, but not too extreme, and the method I'm describing here is not the only way to tackle the problem.
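
The "one big table of all field names" idea could be sketched roughly as below. This is only an illustration under some assumptions: the class name CaseTableBuilder and the use of one Dictionary per scraped case are hypothetical, and it assumes field names within a single page are already unique (a page with two defendants would need its repeated names made distinct first).

```csharp
using System;
using System.Collections.Generic;
using System.Data;

// Hypothetical helper, not part of the attached code.
public static class CaseTableBuilder
{
    // Each dictionary holds the name/value pairs scraped from one case page.
    public static DataTable Build(IList<Dictionary<string, string>> cases)
    {
        DataTable table = new DataTable("ScrapeData");

        // Pass 1: add a column for every field name seen in any case,
        // in the order the names are encountered.
        foreach (Dictionary<string, string> fields in cases)
        {
            foreach (string name in fields.Keys)
            {
                if (!table.Columns.Contains(name))
                    table.Columns.Add(name, typeof(string));
            }
        }

        // Pass 2: one row per case; fields a case does not have stay DBNull.
        foreach (Dictionary<string, string> fields in cases)
        {
            DataRow row = table.NewRow();
            foreach (KeyValuePair<string, string> pair in fields)
                row[pair.Key] = pair.Value;
            table.Rows.Add(row);
        }

        return table;
    }
}
```

The two passes trade a little extra work for not having to maintain the column list by hand. Note that DataTable column names are matched case-insensitively, so "Name" and "name" would land in the same column.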
-----Form1.Designer.cs---------------

namespace RegexTest
{
    partial class Form1
    {
        /// <summary>
        /// Required designer variable.
        /// </summary>
        private System.ComponentModel.IContainer components = null;

        /// <summary>
        /// Clean up any resources being used.
        /// </summary>
        /// <param name="disposing">true if managed resources should be disposed; otherwise, false.</param>
        protected override void Dispose(bool disposing)
        {
            if (disposing && (components != null))
            {
                components.Dispose();
            }
            base.Dispose(disposing);
        }

        #region Windows Form Designer generated code

        /// <summary>
        /// Required method for Designer support - do not modify
        /// the contents of this method with the code editor.
        /// </summary>
        private void InitializeComponent()
        {
            this.dataSet1 = new System.Data.DataSet();
            this.dataGridView1 = new System.Windows.Forms.DataGridView();
            ((System.ComponentModel.ISupportInitialize)(this.dataSet1)).BeginInit();
            ((System.ComponentModel.ISupportInitialize)(this.dataGridView1)).BeginInit();
            this.SuspendLayout();
            // 
            // dataSet1
            // 
            this.dataSet1.DataSetName = "NewDataSet";
            // 
            // dataGridView1
            // 
            this.dataGridView1.Anchor = ((System.Windows.Forms.AnchorStyles)((((System.Windows.Forms.AnchorStyles.Top | System.Windows.Forms.AnchorStyles.Bottom)
                        | System.Windows.Forms.AnchorStyles.Left)
                        | System.Windows.Forms.AnchorStyles.Right)));
            this.dataGridView1.ColumnHeadersHeightSizeMode = System.Windows.Forms.DataGridViewColumnHeadersHeightSizeMode.AutoSize;
            this.dataGridView1.Location = new System.Drawing.Point(12, 12);
            this.dataGridView1.Name = "dataGridView1";
            this.dataGridView1.Size = new System.Drawing.Size(616, 242);
            this.dataGridView1.TabIndex = 1;
            // 
            // Form1
            // 
            this.AutoScaleDimensions = new System.Drawing.SizeF(6F, 13F);
            this.AutoScaleMode = System.Windows.Forms.AutoScaleMode.Font;
            this.ClientSize = new System.Drawing.Size(640, 266);
            this.Controls.Add(this.dataGridView1);
            this.Name = "Form1";
            this.Text = "Form1";
            this.Load += new System.EventHandler(this.Form1_Load);
            ((System.ComponentModel.ISupportInitialize)(this.dataSet1)).EndInit();
            ((System.ComponentModel.ISupportInitialize)(this.dataGridView1)).EndInit();
            this.ResumeLayout(false);
        }

        #endregion

        private System.Data.DataSet dataSet1;
        private System.Windows.Forms.DataGridView dataGridView1;
    }
}
 

-----Form1.cs---------------------

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Text;
using System.Windows.Forms;
using System.Net;
using System.Text.RegularExpressions;
using System.IO;

namespace RegexTest
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        private void Form1_Load(object sender, EventArgs e)
        {
            string sourceFile = GetSourceForPage("http://www.clerk-alachua-fl.org/pa/pa.urd/pamw2000*o_case_sum?83518636");
            GetFieldsInPage(sourceFile);
        }

        // Downloads the page and returns its raw HTML source.
        private string GetSourceForPage(string url)
        {
            HttpWebRequest myreq = (HttpWebRequest)WebRequest.Create(url);

            // The using blocks release the response and reader even if the read throws.
            using (WebResponse resp = myreq.GetResponse())
            using (StreamReader r = new StreamReader(resp.GetResponseStream()))
            {
                return r.ReadToEnd();
            }
        }

        // Pulls every hidden INPUT name/value pair out of the HTML and shows them as one grid row.
        private void GetFieldsInPage(string htmlSource)
        {
            // Drop tables left over from an earlier call. DataSet.Clear() only empties rows,
            // and adding a second table named "ScrapeData" would throw.
            this.dataSet1.Tables.Clear();
            DataTable myTable = new DataTable();

            List<string> NameList = new List<string>();
            List<string> ValueList = new List<string>();

            // Matches <INPUT TYPE=HIDDEN NAME="..." VALUE="...">.
            // Note: [^\""]+ requires at least one character, so inputs with an empty VALUE are skipped.
            Regex myReg = new Regex(@"\<INPUT\ TYPE\=HIDDEN\ NAME=\""(?<VariableName>[^\""]+)\""\ VALUE=\""(?<VariableValue>[^\""]+)\""[^\>]*\>",RegexOptions.IgnoreCase | RegexOptions.Multiline);
            MatchCollection myMatches = myReg.Matches(htmlSource);
            foreach (Match myMatch in myMatches)
            {
                string varName = myMatch.Groups["VariableName"].Value;
                string varValue = myMatch.Groups["VariableValue"].Value;
                // Replace dots in the field name with underscores for use as a column name.
                varName = varName.Replace('.', '_');

                NameList.Add(varName);
                ValueList.Add(varValue);
            }

            myTable.TableName = "ScrapeData";

            // One string column per field. Column names must be unique,
            // so a field name that repeats on the page would throw here.
            for (int z = 0; z < NameList.Count; z++)
            {
                DataColumn newCol = myTable.Columns.Add();
                newCol.AllowDBNull = true;
                newCol.DataType = typeof(string);
                newCol.ColumnName = NameList[z];
                newCol.MaxLength = 32767;
            }

            // A single row holding the value of each field, in the same order as the columns.
            DataRow myRow = myTable.NewRow();
            for (int z = 0; z < ValueList.Count; z++)
            {
                myRow[z] = ValueList[z];
            }
            myTable.Rows.Add(myRow);

            this.dataSet1.Tables.Add(myTable);
            this.dataGridView1.DataSource = this.dataSet1.Tables[0].DefaultView;
        }
    }
}
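
The code above stops at the grid, while the original question asked for a CSV file that keeps growing. One possible way to get from the "ScrapeData" table to such a file is sketched below. The class name CsvAppender and its methods are hypothetical, and the sketch assumes every call passes a table with the same columns in the same order (otherwise the header written on the first call will not match later rows).

```csharp
using System;
using System.Data;
using System.IO;
using System.Text;

// Hypothetical helper, not part of the attached code.
public static class CsvAppender
{
    // Quote a field when it contains a comma, a quote, or a line break;
    // embedded quotes are doubled.
    public static string Escape(string field)
    {
        if (field == null) return "";
        if (field.IndexOfAny(new char[] { ',', '"', '\r', '\n' }) < 0) return field;
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    // Build one CSV line from the values of a row.
    public static string ToCsvLine(DataRow row)
    {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < row.Table.Columns.Count; i++)
        {
            if (i > 0) sb.Append(',');
            sb.Append(Escape(Convert.ToString(row[i])));
        }
        return sb.ToString();
    }

    // Append every row of the table; write the header only when the file is new or empty.
    public static void Append(DataTable table, string path)
    {
        bool writeHeader = !File.Exists(path) || new FileInfo(path).Length == 0;

        // The second argument (true) opens the file in append mode.
        using (StreamWriter sw = new StreamWriter(path, true, Encoding.UTF8))
        {
            if (writeHeader)
            {
                string[] names = new string[table.Columns.Count];
                for (int i = 0; i < names.Length; i++)
                    names[i] = Escape(table.Columns[i].ColumnName);
                sw.WriteLine(string.Join(",", names));
            }

            foreach (DataRow row in table.Rows)
                sw.WriteLine(ToCsvLine(row));
        }
    }
}
```

In an ASP.NET page the path would typically come from Server.MapPath, as in the question's code, and the call would be something like CsvAppender.Append(myTable, Server.MapPath("AlachuaCoFLCircuitCourt.csv")).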

 

Author Comment

by:MiracleByDesign
ID: 21825948
I will add this code to my project tonight and let you know what happens.  Thank you very much for your help with this project.
 
LVL 3

Expert Comment

by:BitRunner303
ID: 21826208
No prob let me know how it goes.
 

Author Comment

by:MiracleByDesign
ID: 21912123
BitRunner303-The solution works but the client has another request that I am not sure you can help me with. I need to save the actual scraper to an XML file so it can be used in another program.  Do you have any idea how I would do this?

thanks,
MiracleByDesign
 
LVL 3

Expert Comment

by:BitRunner303
ID: 21921599
Simple.  Here's a tutorial on using the XML writer features in .NET: http://www.c-sharpcorner.com/UploadFile/mahesh/writexmlusingXmlWriter11132005233450PM/writexmlusingXmlWriter.aspx

You'll simply iterate through the rows of your final DataSet, writing out the elements of each record. I would probably structure the output something like the snippet.

Basically you'd write a root element for the file, that I've called CaseFile here.  Then for each Case write out all the XmlElements.  If you need some more help with this let me know but this should get you going.
<CaseFile>
  <Case>
      <DefendantName>John Doe</DefendantName>
      <ProsecutingAttorney>Jack McCoy</ProsecutingAttorney>
  </Case>
</CaseFile>
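
A rough sketch of producing that layout with XmlWriter follows. It is an illustration only: the class name CaseXmlWriter is hypothetical, and it assumes the scraped data sits in a DataTable with one row per case and one column per field, as in the accepted solution.

```csharp
using System;
using System.Data;
using System.IO;
using System.Xml;

// Hypothetical helper, not part of the attached code.
public static class CaseXmlWriter
{
    // Write one <Case> element per row, with one child element per column.
    public static void Write(DataTable table, TextWriter output)
    {
        XmlWriterSettings settings = new XmlWriterSettings();
        settings.Indent = true;

        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            writer.WriteStartDocument();
            writer.WriteStartElement("CaseFile");

            foreach (DataRow row in table.Rows)
            {
                writer.WriteStartElement("Case");
                foreach (DataColumn col in table.Columns)
                {
                    // Skip fields this case does not have.
                    if (row.IsNull(col)) continue;

                    // EncodeLocalName turns the column name into a legal XML element name.
                    writer.WriteElementString(
                        XmlConvert.EncodeLocalName(col.ColumnName),
                        Convert.ToString(row[col]));
                }
                writer.WriteEndElement(); // </Case>
            }

            writer.WriteEndElement(); // </CaseFile>
            writer.WriteEndDocument();
        }
    }
}
```

To write to disk, pass a StreamWriter opened on the target path. If the exact element layout does not matter, the built-in DataTable.WriteXml method is a shorter alternative.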

 

Author Closing Comment

by:MiracleByDesign
ID: 31468193
Thank you very much!!!!