• Status: Solved
  • Priority: Medium
  • Security: Public

Need to programmatically capture information from a website.

I am trying to gather information from a website (that I don't own).

I log in manually, and then using EFGrabber and WinBatch, I have created a macro that scrolls through the records I have searched for and extracts the information I am looking for into an Excel file (without my having to type each record one by one).

The challenge is this -- because it's a click-location-based macro, if the Next button is even slightly off, the macro hangs.

I am wondering if there is a better way to do this. Can I use XML, or write a script, or something like that?

Any direction would be greatly appreciated. I'm kind of at a loss on how to proceed.

FYI: I am proficient in ASP and .NET, VB, and VBScript, but could use another language :sigh: if absolutely necessary.

Do you have permission?
There's a Microsoft component called MSXML 4.0.
You can download it free from Microsoft.

I've copied this tidbit from the SDK docs:
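var xmlhttp = new ActiveXObject("Msxml2.XMLHTTP.4.0");
xmlhttp.open("GET", "http://myserver/save.asp", false);
xmlhttp.send();  // open() only prepares the request; send() issues it
var page = xmlhttp.responseText;  // the body of the response as a string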

or this, using the server version of the control:
var srvXmlHttp = Server.CreateObject("Msxml2.ServerXMLHTTP.4.0");
srvXmlHttp.open("GET", "http://myserver/myresponse.asp", false);
srvXmlHttp.send();  // the response is not available until send() returns
var newsElement = srvXmlHttp.responseXML.selectSingleNode("/news/story1");

<p>Top News Story<p>

Basically, these controls let you open a web page and work with the results. If the page has consistent formatting (i.e., let's say the data you need is in a table that always has the same ID), you can use that to pull out the specific pieces of the page you need.
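
To make the "table that always has the same ID" idea concrete, here is a rough sketch in Python rather than the MSXML controls mentioned above. It uses only the standard library's html.parser. The table ID "details" and the sample markup in the usage note are made up for illustration, so treat this as one possible approach under those assumptions, not as the way the target site has to be handled.

```python
from html.parser import HTMLParser


class TableCellExtractor(HTMLParser):
    """Collect the cell text of the table whose id matches table_id."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.depth = 0        # > 0 while inside the target table (counts nesting)
        self.in_cell = False  # True while inside a <td> or <th>
        self.rows = []        # list of rows, each a list of cell strings

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.depth > 0:
                self.depth += 1  # nested table inside the target
            elif dict(attrs).get("id") == self.table_id:
                self.depth = 1   # entering the target table
        elif self.depth > 0 and tag == "tr":
            self.rows.append([])
        elif self.depth > 0 and tag in ("td", "th"):
            if not self.rows:    # tolerate a cell that appears before any <tr>
                self.rows.append([])
            self.rows[-1].append("")
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.depth > 0 and self.in_cell:
            self.rows[-1][-1] += data


def extract_table(html, table_id):
    """Return the rows of the table with the given id as lists of stripped strings."""
    parser = TableCellExtractor(table_id)
    parser.feed(html)
    return [[cell.strip() for cell in row] for row in parser.rows]
```

Calling extract_table(page_source, "details") on a page containing a table with id="details" returns one list per row. Other tables on the page are ignored.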

- Jack

VeeVanAuthor Commented:
Can I use the XML to jump from one page to the next?

Here's the scenario:
1.  I manually do a search for the information that I want.
2.  It then comes up in a list format that has partial information.
3.  I click on one of the detail records and pull out a couple of fields into Excel using EFGrabber.
4.  Then, I click next to go to the next detail page.
5.  I repeat steps 3 and 4 until I have all the info that I want.

BTW: The stuff I'm searching is in public domain (County Property Appraiser's Website) I'm just trying to avoid the manual process of copying, pasting, or typing. Takes a really long time.

So what I need is a method to both grab the information (for which it seems XML should work nicely) and also move forward to the next page programmatically -- and I don't know if that's possible.

Thanks again for all your assistance.


>BTW: The stuff I'm searching is in public domain (County Property Appraiser's Website) I'm just trying to avoid the manual process of copying, pasting, or typing. Takes a really long time.

Thank you.
Using the XMLHTTP object, you can emulate a web browser, which means you can use both the POST and GET methods.  The XML part of the name isn't really relevant to you here; it is the HTTP part of the name you need.
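
As a rough illustration of the POST side, here is a minimal Python sketch using the standard library's urllib instead of XMLHTTP. The URL and the form field name in the usage note are hypothetical; the real search form's field names would have to be read from the page source.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_search_request(url, form_fields):
    """Build a POST request whose body is form_fields, URL-encoded like a form submit."""
    body = urlencode(form_fields).encode("ascii")
    return Request(
        url,
        data=body,  # supplying data makes urllib send this as a POST
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
```

For example, build_search_request("http://www.domain.com/search.asp", {"owner": "Doe"}) produces a request that urllib.request.urlopen() would submit as a form POST. Nothing is sent until urlopen() is called.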

What you want to concentrate on is the links to the detail records.

If you can predict the links for the properties you need the details for, you can write a script to fetch those pages using the XMLHTTP object as the go-between.
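
Here is a minimal Python sketch of the "predictable links" idea, again using the standard library rather than XMLHTTP. The base URL is the placeholder used elsewhere in this thread, and the query parameter name "id" is purely an assumption; the real parameter has to be read off the site's detail links. If the site requires a login, the script would also need to carry the session cookie, which this sketch does not attempt.

```python
import time
from urllib.parse import urlencode
from urllib.request import urlopen

# Placeholder address; the real detail-page URL comes from the site itself.
BASE_URL = "http://www.domain.com/propertydetails.asp"


def detail_url(record_id):
    """Build the detail-page URL for one record (the "id" parameter is hypothetical)."""
    return BASE_URL + "?" + urlencode({"id": record_id})


def fetch_text(url):
    """Download a page and return its body as a string."""
    with urlopen(url, timeout=30) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def fetch_details(record_ids, fetch=fetch_text, delay=1.0):
    """Fetch each detail page in turn, pausing between requests to go easy on the server."""
    pages = {}
    for record_id in record_ids:
        pages[record_id] = fetch(detail_url(record_id))
        time.sleep(delay)
    return pages
```

Because the loop works from URLs instead of screen coordinates, it does not care where the Next button is drawn on the page.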

The XMLHTTP object will return the content of the page, and you can use string manipulation functions (Left, Right, Mid, InStr) to carve out the part you want.
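
The same carving can be sketched in Python, where str.find plays the role of InStr and slicing plays the role of Mid. The marker strings in the usage note are invented; on a real page you would pick whatever text reliably surrounds the field you need.

```python
def text_between(page, start_marker, end_marker):
    """Return the text between the first start_marker and the next end_marker, or None."""
    start = page.find(start_marker)
    if start == -1:
        return None  # start marker not on the page
    start += len(start_marker)
    end = page.find(end_marker, start)
    if end == -1:
        return None  # no end marker after the start marker
    return page[start:end].strip()
```

For instance, text_between(page, '<td class="owner">', '</td>') returns the contents of the first cell that opens with that exact tag, or None if the page layout changed and the marker is gone.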

Add that to the file system object and you can write the results to a CSV file on your machine which Excel will open as though it were native XLS.

<%@ Language = VBScript %>
<%
Response.Buffer = True
Dim objHTTP, myVariable

' Create an XMLHTTP object:
Set objHTTP = Server.CreateObject("Microsoft.XMLHTTP")

' Open only prepares the request; Send actually issues it.
objHTTP.Open "GET", "http://www.domain.com/propertydetails.asp?param1=val1&param2=val2", False
objHTTP.Send

myVariable = objHTTP.responseText
Set objHTTP = Nothing
%>

After the above code, myVariable has everything that was output on the web page in a plain old character string.  Now you can commence to strip out what you need.
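
Once the fields are stripped out, the CSV step mentioned earlier could look roughly like this in Python, with the csv module standing in for the file system object. The column names and output path in the usage note are placeholders.

```python
import csv


def write_csv(path, header, rows):
    """Write a header row followed by the data rows to a CSV file."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
```

A call such as write_csv("properties.csv", ["Owner", "Parcel"], rows) produces a file that Excel can open directly. The csv module quotes any value that contains a comma, so an owner name like "Doe, Jane" stays in one column.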

VeeVanAuthor Commented:
That's exactly what I was looking for. Thanks for your help.

I have dabbled in XML in the past, and had a sneaking suspicion that it would do what I wanted, but wasn't sure.

One last simple Q -- Do you know, can I use XML in .NET? I think I can. (I think I can, I think I can....)

I appreciate your help!

Okay, this is an object that happens to be able to pull a document via HTTP like a browser, which would include an XML document, but in this case you are only retrieving the HTML source of the page.  Not XML.

Just to clarify.

Yes, you can use it in .NET, PHP, classic ASP, JScript, JavaScript, etc.  It has become quite ubiquitous.

Have fun,