scottmichael2

asked on

Get Visible Text from Displayed Web Page

I need to get just the text that is being displayed on a web page (nothing that is hidden).  The problem I'm having is that something dynamic is going on in the background (JavaScript, CSS, or whatever) that does not change the resultant HTML (apparently it is cached and only certain things are displayed as needed).

Is there a way to get just the text that is visible to the user?

Thanks!
Jose Parrot

Hi,

The simplest solution is a parser that skips everything between "<" and ">" (except an attribute such as content="sometext", where sometext should still be shown) and outputs whatever is outside the tags.

For example, in the html below:

___________________________________________________________________
<table class=fullWidth>
 <tr>
    <td class=questionHeader style="background-color:white;padding:0px;">
</td>
<td>
    </td>
</tr>
  <tr>
    <td class=questionBody colspan=2>

<br>
      <div id=intelliTxt style="padding-left:5px;padding-right:5px">
        I need to get just the text that is being displayed on a web page (nothing that is hidden). &nbsp;The problem I'm having is that something dynamic is going on in the background (JavaScript, CSS, or whatever) that does not change the resultant HTML (apparently it is cached and only certain things are displayed as needed).<br /><br />Is there a way to get just the text that is visible to the user?<br /><br />Thanks!</div>
      <br />
    </td>
  </tr>
  <tr>
<td class=boxTitle style="white-space: nowrap;">
<img src="/mW.gif" class=markerWide><img src="/b.gif" width=1 height=1 class=inline>
<a href="/emailFriend.jsp?qid=22045078">Send to a Friend</a>
      &nbsp;&nbsp;
      <img src="/mW.gif" class=markerWide><img src="/b.gif" width=1 height=1 class=inline>
<a href="/viewQuestionPrinterFriendly.jsp?qid=22045078" target="_blank">Printer Friendly</a>
    </td>
<td class=boxTitle style="white-space: nowrap;">
</td>
     </tr>
    </table>
___________________________________________________________________


The following text will "pass" through the filter:

___________________________________________________________________

        I need to get just the text that is being displayed on a web page (nothing that is hidden). &nbsp;The problem I'm having is that something dynamic is going on in the background (JavaScript, CSS, or whatever) that does not change the resultant HTML (apparently it is cached and only certain things are displayed as needed)
Is there a way to get just the text that is visible to the user?
Thanks!
Send to a Friend
Printer Friendly
___________________________________________________________________

You can extend the parser to handle certain tags for minimal formatting and to convert entities such as &nbsp;, for example.
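
A minimal sketch of such a filter, written in Python purely for illustration (the thread itself is about VB, and the names TextOnly and strip_tags are hypothetical). It relies on the standard library's html.parser, additionally skips the bodies of script and style elements, and decodes entities such as &nbsp;. It does not handle the content="sometext" exception mentioned above.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect the text found outside of tags, ignoring script/style bodies."""

    SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &nbsp; into characters
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Collapse runs of whitespace (including non-breaking spaces)
        text = " ".join(data.split())
        if text and not self._skip_depth:
            self.chunks.append(text)


def strip_tags(html):
    """Return the list of text fragments that survive the tag filter."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

Calling strip_tags on the table above should yield the question text followed by "Send to a Friend" and "Printer Friendly". Note that this only removes markup; it knows nothing about what the browser actually displays.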

Jose
scottmichael2

ASKER

I can already get the text and parse it and all.  The problem I'm having is that I have 2 pages that have identical HTML, but are displaying different things to the user.  I need to find out which "screen" they are viewing.  Here is a more thorough explanation...

Page 1: Patient Address Info
Page 2: Patient Insurance Info

When you do a View Source, or look at the DOM or whatever, it returns identical HTML for both pages.  Something is going on behind the scenes that causes the user display to change (but not the HTML).

Is there a way to grab just the text that is displayed?

Or, another solution for me would be to mimic the search that IE does when you do an Edit->Find (on This Page) from the IE menu.  Any ideas on how to mimic this in VB?

Thanks!
If two pages have identical HTML then they are identical.

I'm not sure I understood correctly. I see two options:
1. Your HTML is so long that the user has to scroll to see Address or Insurance.
    (That would be easy to solve.)
2. Or the page has frames and what you see is the main HTML code,
    but the content of one of the frames is dynamically loaded as the result of a query.

I will assume the second option.

Suppose you have the MyPage.html below:

<frameset cols="110,600" > 
  <frameset rows="80,*" > 
    <frame name="topFrame1" src="logo.html" >
    <frame name="leftFrame"  src="menuLeft.html">
  </frameset>
  <frameset rows="80,*"  >
    <frame name="topFrame2"  src="header.html" >
    <frame name="mainFrame" src="information.html">
  </frameset>
</frameset>

This makes a page divided into two columns, each one subdivided into two rows.

Suppose you click the "Patient Address" option in menuLeft.html (leftFrame); the server side then creates a new information.html and reloads it in the mainFrame.
Suppose another user clicks "Patient Insurance"; the server side then loads that query result into information.html (mainFrame).

This situation can produce two different contents under the same HTML code.
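
If frames are the cause, the differing text lives in the frame documents rather than in the outer page. As a rough sketch (in Python for illustration only; FrameLister and list_frames are hypothetical names), you could list each frame and its source so that every frame document can be fetched and inspected separately:

```python
from html.parser import HTMLParser


class FrameLister(HTMLParser):
    """Record the name and src of every <frame> or <iframe> in a page."""

    def __init__(self):
        super().__init__()
        self.frames = {}

    def handle_starttag(self, tag, attrs):
        if tag in ("frame", "iframe"):
            attributes = dict(attrs)
            src = attributes.get("src")
            if src:
                # Fall back to the src itself when the frame has no name
                self.frames[attributes.get("name") or src] = src


def list_frames(html):
    """Return a {frame name: src} mapping for the given page source."""
    parser = FrameLister()
    parser.feed(html)
    parser.close()
    return parser.frames
```

For the MyPage.html above, this would map mainFrame to information.html, which is the document whose content changes between the two screens in this scenario.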

We need more information on how the data is stored, whether you have direct access to the server holding it, and so on. We also need to know whether the web server requires a login, some user identification, or cookies. Are your users and the web server from the same organization? Can you make an arrangement with the web admin of that site?

Depending on how heavily the information is protected, there may be no way to get around it.

Jose
Pass the appropriate text dynamically at run time from the code-behind to the HTML tag where you want to show it.
For example, from your Patient Address Info page, pass the text from the code-behind: "Your Text Here"
Example:

For ASP.NET, use:

' HtmlGenericControl transfers HTML dynamically from the code-behind to the client side
Dim gc As New HtmlGenericControl

And Pass the text like:
    gc.InnerHtml &= "<td align='center' style='font-family:Arial;font-size:13.5px;font-weight:bold'>Your Text Here</td>"
    TRfordynamicText.Controls.Add(gc)

(Note: "TRfordynamicText" is my TableRow in the HTML.)




JoseParrot -
That isn't exactly true.  One example is when you have JavaScript that sets CSS properties to hide and show different things.  In this case, the HTML (and JavaScript, CSS, etc.) is identical, but the resultant page is different (meaning that the visible portion is different to the user).  I need a way to get just the visible portion of the page.
Well, I don't claim to own the truth. I may be wrong, of course; I'm just trying to help.
But you'll agree that we don't have much information about the web site, and the most we can do is speculate.

For example:
It seems to be a web site belonging to an organization other than yours.
It also seems that the information on that site is there to answer their customers' queries, as a service to their clients.
It also seems you are trying to extract third-party information automatically.
In conclusion, it seems that you are trying to run queries on third-party servers in order to present their information as your own service.

Another example:
It seems the entire web team of your company was fired, the source code was lost, and your almost impossible mission is to keep the web site running to avoid serious lawsuits from your customers. The only thing you can do is wrap the old site's information inside temporary code while you build a new web site.

Let me suggest that you post the URL of the site, so we can browse it, probably understand the problem better, and be more direct and objective in supporting you.

Jose
Interesting analysis.  You are very close on your first guess.  Unfortunately, we don't have access to their database.  And, I'm not sure that it would matter that much because of the type of application we are running.  

The site is an internal Intranet site belonging to one of our clients.  We are extracting data for one of our applications.  And, part of our process is to keep track of what page the user is viewing.  This is impossible if the HTML is rendered identical.  Unfortunately, it is.  It seems like the only thing that changes is what the user is viewing.  So, I'm stuck trying to figure out a way to extract exactly what they are viewing.  I have tried several things like going through the DOM, but all seem to return identical HTML (and JavaScript, CSS, and what not).  My next attempt will be to just mimic how Microsoft does Edit->Search (which I think I might have figured out).  Any other suggestions are welcome.
ASKER CERTIFIED SOLUTION
Jose Parrot
JoseParrot - thanks for your efforts on this.  I figured out the issue.  The application was using CSS to hide/show content.  I simply looked at the Visibility and Display CSS properties to find out which elements were being shown/hidden.
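
The asker did not post code for this. As a hedged sketch of the general idea (in Python for illustration, with hypothetical names is_hidden, VisibleText, and visible_text), the filter below keeps only text whose enclosing elements are not hidden by an inline display:none or visibility:hidden. It is a simplification that assumes well-formed markup and inline style attributes only: it ignores stylesheet rules and styles set by script (for those you would need the computed style that the browser exposes through the live DOM), and it does not model a child overriding an inherited visibility:hidden.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


def is_hidden(style):
    """True if an inline style string sets display:none or visibility:hidden."""
    props = {}
    for declaration in style.split(";"):
        name, sep, value = declaration.partition(":")
        if sep:
            props[name.strip().lower()] = value.strip().lower()
    return props.get("display") == "none" or props.get("visibility") == "hidden"


class VisibleText(HTMLParser):
    """Collect text whose enclosing elements are not hidden by inline CSS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._stack = []  # one flag per open element: hidden (itself or via an ancestor)?

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        parent_hidden = self._stack[-1] if self._stack else False
        style = dict(attrs).get("style") or ""
        self._stack.append(parent_hidden or is_hidden(style))

    def handle_endtag(self, tag):
        if tag not in VOID and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and not (self._stack and self._stack[-1]):
            self.chunks.append(text)


def visible_text(html):
    """Return the text fragments not hidden by inline display/visibility styles."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

With two sections such as Patient Address Info and Patient Insurance Info where one carries style="display:none", only the other section's text is returned, which is enough to tell the two "screens" apart.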