Solved

Extract text from webpage

Posted on 2007-12-07
Last Modified: 2012-02-15
Hi,

 I have a webpage that I need to extract text from (the URL of the page ends in .jsp, if that makes a difference). The text is in a table and I want to extract all the text on the page to an Excel or .txt file so I can work on it later. I do not want the HTML code of the page, just the text that is displayed. Is this possible?

Many thanks,

Dave.
Question by:wildarmsdave
4 Comments
 
LVL 23

Assisted Solution

by:Ashish Patel
Ashish Patel earned 50 total points
No, not directly. First you will have to get the HTML code of the page and then parse it. You cannot get just the text straight out of the page; you will have to write a function to remove all the HTML tags and other markup. Search Google for removing HTML tags from text.
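To give a rough idea of what such a tag-removing function can look like, here is a minimal sketch. It is written in Python rather than the VB used elsewhere in this thread, and it relies only on the standard library's html.parser module. The names TextExtractor and html_to_text are made up for this illustration, and the sketch assumes you already have the page's HTML in a string.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the text of an HTML string, one text fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

A real parser is used here instead of a plain search-and-replace so that script and style contents are dropped and entities are decoded, which a simple tag-stripping pattern would leave behind.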
 

Accepted Solution

by:tsmanyam
tsmanyam earned 250 total points
For this you need to understand the difference between static content and dynamic content. A static page does not change across visits; it shows the same page every time. A dynamic page has server-side scripting that is executed every time you ask for the page. The rendered output may look like plain HTML to you, but because it may come from other server components, such as a backend database or server-side objects, AT RUNTIME, you cannot simply open such a file and read its output. JSP is one such file type, used to deploy server-side dynamic content that is ultimately output as HTML and shown on webpages. Depending on what the server-side JSP is intended to do, the answer to your question is yes or no. Say that JSP page is supposed to show an account summary in tabular form: it needs a page prior to it which asks for the username, password, etc., and the profile for THAT USER is then shown on this page. If you just take a JSP and try to decipher it, it will not know WHICH USER's details to show you.

So, assuming that you want to read/copy/capture the static content portion of a dynamic page like a JSP, just to have pre-formatted text without the junk tags etc., you have 2 ways.

1. Rename the .jsp file to .htm and open it in IE. This will show only a partly neat file (because it can't put the dynamic content into its placeholders), but it should show all the HTML portions laid out as their tags direct.

2. If you are using a code editor for JSP (which may need a project to be configured with additional resources), you can see a “preview” or “test run” of the jsp in the IDE itself which gives perfect output for your copy/paste. But remember that even that IDE would ask for test inputs that trigger the dynamic content.
 
LVL 38

Assisted Solution

by:PaulHews
PaulHews earned 200 total points
Here's a simple sample to give you the idea.  You will need references to:
Microsoft VBScript Regular Expressions 5.5
Microsoft WinHTTP Services 5.1

Option Explicit

Private Sub Command1_Click()
    Dim HTTP As WinHttpRequest
    Dim RegEx As RegExp
    Dim strBody As String
    Dim hFile As Integer

    'Retrieve the HTML from the page (the first argument is the HTTP verb)
    Set HTTP = New WinHttpRequest
    HTTP.Open "GET", "http://www.theweathernetwork.com/weather/CAQC0363", False
    HTTP.Send
    strBody = HTTP.ResponseText

    'Eliminate the HTML tags
    Set RegEx = New RegExp
    RegEx.Pattern = "<[^>]*>"
    RegEx.Global = True
    strBody = RegEx.Replace(strBody, "")

    'Save the cleaned text to a file
    hFile = FreeFile
    Open "C:\pagetext.txt" For Output As #hFile
    Print #hFile, strBody
    Close #hFile
End Sub
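
Since the question is about text held in a table, one possible variation is to keep the row and column structure and write it out as CSV, which Excel opens directly. The following is only a sketch under stated assumptions: it is in Python rather than VB, the names TableRows and table_to_csv are invented for this example, and it assumes simple, non-nested tables whose rows are closed with </tr>.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Gathers the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text pieces of the cell being read, or None

    def _close_cell(self):
        # Store the finished cell with runs of whitespace collapsed.
        if self._cell is not None and self._row is not None:
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()  # tolerate a missing </td> before the next cell
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag == "tr" and self._row is not None:
            self._close_cell()
            self.rows.append(self._row)
            self._row = None


def table_to_csv(html):
    """Return the table cells found in an HTML string as CSV text."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()
```

The returned string can be written to a .csv file and opened in Excel; text outside table cells is ignored by this sketch.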

 

Author Closing Comment

by:wildarmsdave
Thanks guys. There was some good info in the answers, so I've decided to split the points. As the web page was a static page, I saved the page in .xls format, which opened up perfectly in Excel and enabled me to filter out the information I needed.