Solved

Get Visible Text from Displayed Web Page

Posted on 2006-11-01
Last Modified: 2009-07-29
I need to get just the text that is being displayed on a web page (nothing that is hidden).  The problem I'm having is that something dynamic is going on in the background (JavaScript, CSS, or whatever) that does not change the resultant HTML (apparently it is cached and only certain things are displayed as needed).

Is there a way to get just the text that is visible to the user?

Thanks!
Question by:scottmichael2
9 Comments
 
LVL 18

Expert Comment

by:JoseParrot
ID: 17849726
Hi,

The simplest solution is a parser that skips everything between "<" and ">" (except attribute values such as content="sometext", where it would show sometext) and outputs whatever is outside of the tags.

For example, in the html below:

___________________________________________________________________
<table class=fullWidth>
 <tr>
    <td class=questionHeader style="background-color:white;padding:0px;">
</td>
<td>
    </td>
</tr>
  <tr>
    <td class=questionBody colspan=2>

<br>
      <div id=intelliTxt style="padding-left:5px;padding-right:5px">
        I need to get just the text that is being displayed on a web page (nothing that is hidden). &nbsp;The problem I'm having is that something dynamic is going on in the background (JavaScript, CSS, or whatever) that does not change the resultant HTML (apparently it is cached and only certain things are displayed as needed).<br /><br />Is there a way to get just the text that is visible to the user?<br /><br />Thanks!</div>
      <br />
    </td>
  </tr>
  <tr>
<td class=boxTitle style="white-space: nowrap;">
<img src="/mW.gif" class=markerWide><img src="/b.gif" width=1 height=1 class=inline>
<a href="/emailFriend.jsp?qid=22045078">Send to a Friend</a>
      &nbsp;&nbsp;
      <img src="/mW.gif" class=markerWide><img src="/b.gif" width=1 height=1 class=inline>
<a href="/viewQuestionPrinterFriendly.jsp?qid=22045078" target="_blank">Printer Friendly</a>
    </td>
<td class=boxTitle style="white-space: nowrap;">
</td>
     </tr>
    </table>
___________________________________________________________________


The following text will "pass" through the filter:

___________________________________________________________________

        I need to get just the text that is being displayed on a web page (nothing that is hidden). &nbsp;The problem I'm having is that something dynamic is going on in the background (JavaScript, CSS, or whatever) that does not change the resultant HTML (apparently it is cached and only certain things are displayed as needed)
Is there a way to get just the text that is visible to the user?
Thanks!
Send to a Friend
Printer Friendly
___________________________________________________________________

You can extend it to handle certain tags for minimal formatting and to decode entities such as &nbsp;.
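The filter described above can be sketched in a few lines. This is only an illustration in Python using the standard library's html.parser; the names TextOnly and strip_tags are made up for this example, and the thread itself was about VB.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects the text found outside of tags and drops the tags themselves."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as &nbsp;
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Turn the non-breaking space produced by &nbsp; into a plain space,
        # then collapse runs of whitespace.
        text = " ".join(data.replace("\xa0", " ").split())
        if text:
            self.chunks.append(text)


def strip_tags(html):
    """Return the list of text fragments that sit between tags (hypothetical helper)."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return parser.chunks


sample = (
    '<td class=questionBody colspan=2><div id=intelliTxt>'
    'Is there a way to get just the text?<br /><br />Thanks!</div></td>'
    '<a href="/emailFriend.jsp?qid=22045078">Send to a Friend</a>'
)
print("\n".join(strip_tags(sample)))
```

Note that a filter like this knows nothing about CSS or scripts: text inside hidden elements, and even the contents of script and style blocks, passes through just like visible text.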

Jose
 
LVL 1

Author Comment

by:scottmichael2
ID: 17851844
I can already get the text and parse it and all.  The problem that I'm having is that I have 2 pages that have identical HTML, but are displaying different things to the user.  I need to find out which "screen" they are viewing.  Here is a more thorough explanation...

Page 1: Patient Address Info
Page 2: Patient Insurance Info

When you do a View Source, or look at the DOM or whatever, it returns identical HTML for both pages.  Something is going on behind the scenes that causes the user display to change (but not the HTML).

Is there a way to grab just the text that is displayed?

Or, another solution for me would be to mimic the search that IE does when you do an Edit->Find (on This Page) from the IE menu.  Any ideas on how to mimic this in VB?

Thanks!
 
LVL 18

Expert Comment

by:JoseParrot
ID: 17852462
If two pages have identical HTML then they are identical.

I'm not sure I understood you correctly. I see two options:
1. Your HTML is so long that the user has to scroll to see Address or Insurance.
    (This would be easy to solve.)
2. Or the page has frames and what you see is the main HTML code,
    but the content of one of the frames is dynamically loaded as the result of a query.

I will assume the second option.

Suppose you have the below MyPage.html:

<frameset cols="110,600" > 
  <frameset rows="80,*" > 
    <frame name="topFrame1" src="logo.html" >
    <frame name="leftFrame"  src="menuLeft.html">
  </frameset>
  <frameset rows="80,*"  >
    <frame name="topFrame2"  src="header.html" >
    <frame name="mainFrame" src="information.html">
  </frameset>
</frameset>

This makes a page divided into two columns, each one subdivided into two rows.

Suppose you click the "Patient Address" option in menuLeft.html (leftFrame); the server side then creates a new information.html and reloads it in mainFrame.
Suppose another user clicks "Patient Insurance"; the server side then loads that query result into information.html (mainFrame).

In this situation, the same top-level HTML code can display two different contents.
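If frames are the cause, the text of interest lives in the frame documents rather than in the frameset page, so a first step is to find out which documents the frames point to. A minimal sketch in Python, assuming you already have the frameset source as a string (FrameLister and list_frames are names invented for this example):

```python
from html.parser import HTMLParser


class FrameLister(HTMLParser):
    """Records the name and src of every <frame> or <iframe> in a page."""

    def __init__(self):
        super().__init__()
        self.frames = {}

    def handle_starttag(self, tag, attrs):
        if tag in ("frame", "iframe"):
            attributes = dict(attrs)
            # Frames without a name are stored under an empty key.
            self.frames[attributes.get("name", "")] = attributes.get("src", "")


def list_frames(html):
    """Return {frame name: src} for a frameset page (hypothetical helper)."""
    lister = FrameLister()
    lister.feed(html)
    lister.close()
    return lister.frames


my_page = """
<frameset cols="110,600" >
  <frameset rows="80,*" >
    <frame name="topFrame1" src="logo.html" >
    <frame name="leftFrame"  src="menuLeft.html">
  </frameset>
  <frameset rows="80,*"  >
    <frame name="topFrame2"  src="header.html" >
    <frame name="mainFrame" src="information.html">
  </frameset>
</frameset>
"""
for name, src in list_frames(my_page).items():
    print(name, "->", src)
```

Each src would then have to be fetched and inspected separately, since the frameset page itself stays the same no matter what mainFrame currently shows.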

We need more information on how the information is stored, whether you have direct access to the server holding the data, and so on. Also, does the web server require a login, some other user identification, or cookies? Are your users and the web server part of the same organization? Can you make an arrangement with the webadmin of that site?

Depending on how well the information is protected, we may have no way of circumventing it.

Jose
 
LVL 3

Expert Comment

by:jay_gadhavi
ID: 17855753
Pass the appropriate text dynamically at runtime from the code-behind to the HTML tag where you want to show it.
For example, from your Patient Address Info page, pass the text from the code-behind: "Your Text Here"
Example:

For ASP.NET use:

' HtmlGenericControl transfers HTML dynamically from the code-behind to the client side
Dim gc As New HtmlGenericControl

And pass the text like this:
    gc.InnerHtml &= "<td align='center' style='font-family:Arial;font-size:13.5px;font-weight:bold'>Your Text Here</td>"
    TRfordynamicText.Controls.Add(gc)

(Note: "TRfordynamicText" is my TableRow in the HTML.)




 
LVL 1

Author Comment

by:scottmichael2
ID: 17867971
JoseParrot -
That isn't exactly true.  One example is when you have JavaScript that sets CSS properties to hide and show different things.  In this case, the HTML (and JavaScript, CSS, etc.) is identical, but the resultant page is different (meaning that the visible portion is different to the user).  I need a way to get just the visible portion of the page.
 
LVL 18

Expert Comment

by:JoseParrot
ID: 17871363
Well, I don't claim to own the truth. I can be wrong, of course; I'm just trying to help.
But you'll agree that we don't have much information about the web site, and the most we can do is speculate.

For example:
It seems to be a web site that belongs to an organization other than yours.
It also seems that the information on that site is there for their customers' queries, as a service to their clients.
It also seems you are trying to extract third-party information in an automated way.
In conclusion, it seems that you are trying to run queries on third-party servers to present their information as your service.

Another example:
It seems the entire web team of your company was fired because they lost all the code, and your almost impossible mission is to keep the web site running to avoid serious lawsuits from your customers. All you can do is encapsulate the old site's information inside temporary code while you rebuild the web site.

Let me suggest that you post the URL of the site, so we can navigate there, probably understand the problem better, and be more direct and objective in supporting you.

Jose
 
LVL 1

Author Comment

by:scottmichael2
ID: 17874631
Interesting analysis.  You are very close on your first guess.  Unfortunately, we don't have access to their database.  And, I'm not sure that it would matter that much because of the type of application we are running.  

The site is an internal Intranet site belonging to one of our clients.  We are extracting data for one of our applications.  And, part of our process is to keep track of what page the user is viewing.  This is impossible if the HTML is rendered identically.  Unfortunately, it is.  It seems like the only thing that changes is what the user is viewing.  So, I'm stuck trying to figure out a way to extract exactly what they are viewing.  I have tried several things like going through the DOM, but all seem to return identical HTML (and JavaScript, CSS, and what not).  My next attempt will be to just mimic how Microsoft does Edit->Search (which I think I might have figured out).  Any other suggestions are welcome.
 
LVL 18

Accepted Solution

by:
JoseParrot earned 500 total points
ID: 17875349
I understand your problem. Without a direct look at the site and at exactly how the dynamic code is generated, it is really difficult to solve; it is not a trivial one. Unfortunately, no other ideas occur to me. I hope someone has solved a similar situation and can give a hint.
Jose
 
LVL 1

Author Comment

by:scottmichael2
ID: 17995212
JoseParrot - thanks for your efforts on this.  I figured out the issue.  The application was using CSS to hide/show content.  I simply looked at the Visibility and Display CSS properties to find out which elements were being shown/hidden.
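For anyone landing here with the same problem, the idea can be sketched roughly as follows. This is not the code used above (that was VB against the IE DOM); it is a small Python illustration that assumes you have a snapshot of the live DOM in which the current state shows up as inline style attributes. It only looks at inline display and visibility values, does not evaluate stylesheet rules or computed styles, and ignores a child that resets visibility to visible. The names is_hidden, VisibleText, and visible_text and the sample markup are made up for this example.

```python
from html.parser import HTMLParser

# Elements that never have content or a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def is_hidden(style):
    """True if an inline style string sets display:none or visibility:hidden."""
    declarations = {}
    for part in style.split(";"):
        if ":" in part:
            name, value = part.split(":", 1)
            declarations[name.strip().lower()] = value.strip().lower()
    return (declarations.get("display") == "none"
            or declarations.get("visibility") == "hidden")


class VisibleText(HTMLParser):
    """Collects text that is not inside an element hidden by an inline style."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []   # (tag, hidden) per open element; hidden is inherited
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        parent_hidden = self.stack[-1][1] if self.stack else False
        style = dict(attrs).get("style") or ""
        hidden = parent_hidden or tag in ("script", "style") or is_hidden(style)
        self.stack.append((tag, hidden))

    def handle_startendtag(self, tag, attrs):
        pass  # self-closing tags contain no text

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        hidden = self.stack[-1][1] if self.stack else False
        text = " ".join(data.split())
        if text and not hidden:
            self.chunks.append(text)


def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return parser.chunks


snapshot = ('<div id="addr" style="display:block">Patient Address Info</div>'
            '<div id="ins" style="display:none">Patient Insurance Info</div>')
print(visible_text(snapshot))
```

Running the same function on a snapshot where the two style values are swapped returns the insurance heading instead, which is enough to tell the two "screens" apart.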
Question has a verified solution.
