Solved

How to create an ASP Web Scraping Application

Posted on 2009-06-28
Last Modified: 2012-06-27
Hello everyone. I am attempting to create a web-based application that will mine a website for data to store in a database. I have successfully scraped the web page of my target. The problem is that the first page is a login page. How can I interact with the scraped page to make it log in and then go on to the data mining procedure? Any and all advice will be greatly appreciated.
Question by:MatrixUnleashed
15 Comments
 
LVL 8

Expert Comment

by:lharrispv
ID: 24737562
How are you doing the scraping?  Are you sending an HTTP request?  If so can you put the log in information in the body of the request?
 

Author Comment

by:MatrixUnleashed
ID: 24739716
Yeah, I have been reading about the HTTP request, but I'm not sure it would help me. I want to create an application that scrapes data from the target website at midnight every night. The problem is this: in VB I can use a web browser control, enter the necessary data into the login page, and invoke a button click to log me in. But the data is spread across multiple pages. From what I have read, a scrape gives you a hard copy of the information on that one page. I haven't come across a way to interact with the page after it has been scraped.

VB would do what I need, but I need this to be available over the internet for multiple people to access.

Detailed information:
I am using a third-party web application to track what work my technicians have completed. I need to scrape the data so that I can score them, generate reports, and run payroll.

Page 1: Log In Page
Page 2: List of Technician's work in a combo box
Page 3: Select job details which are spread out in multiple combo boxes

I hope this helps explain what I am looking to do. I am not looking for the answer, just the path to the answer. I'm not afraid of research; I just want to research the right thing. lol

Thanks
 

Author Comment

by:MatrixUnleashed
ID: 24739760
Also, it's an https:// site.
 
LVL 8

Accepted Solution

by:
lharrispv earned 2000 total points
ID: 24739910
Personally, I would use the HTTP request. I think it will be the easiest way to go. You should be able to POST a request, building out the body and headers as needed. The response will contain the text of the page, and you can parse the results out into an Excel spreadsheet, email, report, etc.

You can always use Windows Task Scheduler to call it at midnight, or whatever the Unix equivalent is.
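
A minimal sketch of the kind of POST described here, in VB.NET since that is what the thread uses. It has not been run against any particular site. The names PostSketch and PostForm are made up for illustration, and the URL and body are whatever the target form actually expects.

```vbnet
Imports System
Imports System.IO
Imports System.Net
Imports System.Text

Module PostSketch
    ' Hypothetical helper: sends a form-encoded POST and returns the response text.
    Function PostForm(ByVal url As String, ByVal formBody As String) As String
        ' WebRequest.Create returns an HttpWebRequest for http:// and https:// URIs.
        Dim request As HttpWebRequest = DirectCast(WebRequest.Create(url), HttpWebRequest)
        request.Method = "POST"
        request.ContentType = "application/x-www-form-urlencoded"

        ' The body is the name=value&name=value string a browser would send.
        Dim bodyBytes As Byte() = Encoding.UTF8.GetBytes(formBody)
        request.ContentLength = bodyBytes.Length

        Using requestStream As Stream = request.GetRequestStream()
            requestStream.Write(bodyBytes, 0, bodyBytes.Length)
        End Using

        ' Read the whole response as text so it can be parsed afterwards.
        Using response As WebResponse = request.GetResponse()
            Using reader As New StreamReader(response.GetResponseStream())
                Return reader.ReadToEnd()
            End Using
        End Using
    End Function
End Module
```

The returned string is the raw HTML of the response page, which is what you would then parse for the data you want.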
 

Author Comment

by:MatrixUnleashed
ID: 24740294
Will using the HTTP request allow me to invoke click member functions and go to other pages? If so, do you know where I can find a tutorial on HTTPRequest? I have been searching Google but haven't found a good tutorial.
 
LVL 8

Expert Comment

by:lharrispv
ID: 24740345
Well, yes and no. You cannot actually click with it. Keep in mind that clicking on a link or submitting a form is nothing more than an HTTP request being sent (a GET for a link, usually a POST for a form). So if you know what page you would be going to after the click, you can post to that page with the information that would be sent by the click.

Try googling it. There are tons of results out there. That is what I have been doing. Search for: http request scrape content
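
To illustrate what "the information that would be sent after the click" looks like, here is a rough sketch that turns form fields into a POST body. The field names (username, password, btnLogin) and the helper BuildFormBody are hypothetical; the real names come from the input elements in the login page's HTML source, and many forms also carry hidden fields that have to be sent back too.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text

Module FormBodySketch
    ' Hypothetical helper: joins form fields into a name=value&name=value string.
    Function BuildFormBody(ByVal fields As IDictionary(Of String, String)) As String
        Dim body As New StringBuilder()
        For Each field As KeyValuePair(Of String, String) In fields
            If body.Length > 0 Then body.Append("&")
            ' Escape both sides so characters like & and = cannot break the body.
            body.Append(Uri.EscapeDataString(field.Key))
            body.Append("=")
            body.Append(Uri.EscapeDataString(field.Value))
        Next
        Return body.ToString()
    End Function

    Sub Main()
        ' Assumed field names; read the real ones from the page source.
        Dim fields As New Dictionary(Of String, String)()
        fields.Add("username", "tech01")
        fields.Add("password", "secret")
        fields.Add("btnLogin", "Log In")
        Console.WriteLine(BuildFormBody(fields))
    End Sub
End Module
```

The resulting string is what goes into the body of the POST to the page the form submits to (the form's action URL).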
 

Author Comment

by:MatrixUnleashed
ID: 24740580
I'm going through all of them, but none start from the beginning, for those of us who know nothing about HTTP requests. Any suggestions?
 
LVL 8

Expert Comment

by:lharrispv
ID: 24740688
 

Author Comment

by:MatrixUnleashed
ID: 24740788
I don't think the HTTP request works for https sites. I included a sample I found. I ran it with the sample site, then tried mine.
Original Code:    
 
Protected Sub Page_Load(ByVal sender As Object, ByVal e As System.EventArgs) Handles Me.Load
        Try
            ' http://weblogs.asp.net/farazshahkhan is used as an example;
            ' it can be a different domain with a different filename and extension.
            Dim targetURI As New Uri("http://weblogs.asp.net/farazshahkhan")
            Dim fr As System.Net.HttpWebRequest = DirectCast(System.Net.WebRequest.Create(targetURI), System.Net.HttpWebRequest)
            ' Get the response once and let Using close it when done.
            ' ContentLength is not checked: it is -1 when the server sends no Content-Length header.
            Using resp As System.Net.WebResponse = fr.GetResponse()
                Using str As New System.IO.StreamReader(resp.GetResponseStream())
                    Response.Write(str.ReadToEnd())
                End Using
            End Using
        Catch ex As System.Net.WebException
            ' Show the actual failure instead of assuming the file is missing.
            Response.Write("Request failed: " & Server.HtmlEncode(ex.Message))
        End Try
    End Sub

Changed Code:
    
Protected Sub Page_Load(ByVal sender As Object, ByVal e As System.EventArgs) Handles Me.Load
        Try
            ' Same code as above, pointed at the https:// target.
            Dim targetURI As New Uri("https://technet.csgsystems.com/hou1/tn/technet.htm?Id=1")
            Dim fr As System.Net.HttpWebRequest = DirectCast(System.Net.WebRequest.Create(targetURI), System.Net.HttpWebRequest)
            ' Get the response once and let Using close it when done.
            ' ContentLength is not checked: it is -1 when the server sends no Content-Length header.
            Using resp As System.Net.WebResponse = fr.GetResponse()
                Using str As New System.IO.StreamReader(resp.GetResponseStream())
                    Response.Write(str.ReadToEnd())
                End Using
            End Using
        Catch ex As System.Net.WebException
            ' Show the actual failure instead of assuming the file is missing.
            Response.Write("Request failed: " & Server.HtmlEncode(ex.Message))
        End Try
    End Sub

 
LVL 8

Expert Comment

by:lharrispv
ID: 24745184
From MSDN
http://msdn.microsoft.com/en-us/library/system.net.httpwebrequest.aspx

Do not use the HttpWebRequest constructor. Use the WebRequest.Create method to initialize new HttpWebRequest objects. If the scheme for the Uniform Resource Identifier (URI) is http:// or https://, Create returns an HttpWebRequest object.
 
LVL 17

Expert Comment

by:selvol
ID: 24761011
Here is the answer. And not some plug trying to sell you something.

I only describe what I have found myself while using the software over the last 6 years.

Offline Explorer Enterprise.
Free 30 Day Full Featured Trial.

If you need to scrub your harvest.
OE will integrate with "TEXTPIPE" and clean/format the files.
NO FULL TRIAL for Textpipe.


Now OE Enterprise is very powerful.
NOT SOME 1/2 Butted software for KIDS to get myspace profiles.
But it will do that  


#1 Info ripper/ harvester I have come across.

At first you will not see the full potential.
Don't get discouraged.
This software can do what you want. You just have to learn how to teach the software.
Enterprise can download millions of URLs with the push of a button.

The scripts/commands will emulate PHP like URL rewriting.
You can
Tell it to start at
http://joessite.xxx/1{:000000..999999}.php

And it will download 1 million pages from joessite.
Add filters like keywords, text, dir, etc.

I can go on for a while.

Oh yea it will login too.
At Midnight and Not get Copies of the page you already have.

If you are serious about your harvest.
I suggest you get OE enterprise

This app will do more than you need ...

100,000,000 pages downloaded with it  myself...

No I don't get paid to promote this company.
http://www.metaproducts.com/mp/Offline_Explorer_Enterprise.htm
 

Author Comment

by:MatrixUnleashed
ID: 24762092
Thanks, but I would like to do this project myself, not just for completion but for the learning as well. I really feel the other comments are putting me on the right track. I successfully scraped ....
 

Author Comment

by:MatrixUnleashed
ID: 24762115
(cont)... my target, but I keep reading and reading and still can't understand how to log in when the targeted page is not an ASP page.
 
LVL 8

Expert Comment

by:lharrispv
ID: 24762363
Matrix,

Glad to hear my advice is working so far. Here is some more info for you. It is C#, but it should give you a place to start. I skimmed it rather than reading it fully, but it looks like they are using http.webclient to log in. It might be that you have to log in first and then send the request. Anyway, here is the link. Check it out and let me know what you think.

http://forums.asp.net/p/1441206/3270169.aspx
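
The "log in first, then send the request" idea can be sketched in VB.NET as below. This is only an outline under assumptions: it presumes the site tracks the login with cookies, and the example.com URLs, the field names, and the names SessionSketch and Fetch are all placeholders rather than anything taken from the linked post or the real target site.

```vbnet
Imports System
Imports System.IO
Imports System.Net
Imports System.Text

Module SessionSketch
    ' One cookie jar shared by every request. Cookies set by the login
    ' response are stored here and sent back on later requests.
    Private ReadOnly cookieJar As New CookieContainer()

    ' Hypothetical helper: GETs the URL, or POSTs formBody to it when one is supplied.
    Function Fetch(ByVal url As String, Optional ByVal formBody As String = Nothing) As String
        Dim request As HttpWebRequest = DirectCast(WebRequest.Create(url), HttpWebRequest)
        request.CookieContainer = cookieJar

        If formBody IsNot Nothing Then
            Dim bodyBytes As Byte() = Encoding.UTF8.GetBytes(formBody)
            request.Method = "POST"
            request.ContentType = "application/x-www-form-urlencoded"
            request.ContentLength = bodyBytes.Length
            Using requestStream As Stream = request.GetRequestStream()
                requestStream.Write(bodyBytes, 0, bodyBytes.Length)
            End Using
        End If

        Using response As WebResponse = request.GetResponse()
            Using reader As New StreamReader(response.GetResponseStream())
                Return reader.ReadToEnd()
            End Using
        End Using
    End Function

    Sub Main()
        ' Placeholder URLs and field names; replace them with the real ones.
        ' Step 1: post the login form so the session cookie lands in cookieJar.
        Fetch("https://example.com/login", "username=tech01&password=secret")
        ' Step 2: request a page behind the login using the same cookie jar.
        Dim jobsHtml As String = Fetch("https://example.com/jobs")
        Console.WriteLine(jobsHtml.Length)
    End Sub
End Module
```

Sharing one CookieContainer is the design choice that matters here: without it each request starts a fresh session and the second page would most likely return the login form again.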
 
LVL 8

Expert Comment

by:lharrispv
ID: 24839376
Matrix, how is this going?

Question has a verified solution.
