Solved

Get Visible Text from Displayed Web Page

Posted on 2006-11-01
Last Modified: 2009-07-29
I need to get just the text that is being displayed on a web page (nothing that is hidden).  The problem I'm having is that something dynamic is going on in the background (JavaScript, CSS, or whatever) that does not change the resultant HTML (apparently it is cached and only certain things are displayed as needed).

Is there a way to get just the text that is visible to the user?

Thanks!
Question by:scottmichael2
9 Comments
 
LVL 18

Expert Comment

by:JoseParrot
ID: 17849726
Hi,

The simplest solution is a parser that skips everything between "<" and ">" (except an attribute such as content="sometext", for which it shows sometext) and outputs whatever is outside the tags.

For example, in the html below:

___________________________________________________________________
<table class=fullWidth>
 <tr>
    <td class=questionHeader style="background-color:white;padding:0px;">
</td>
<td>
    </td>
</tr>
  <tr>
    <td class=questionBody colspan=2>

<br>
      <div id=intelliTxt style="padding-left:5px;padding-right:5px">
        I need to get just the text that is being displayed on a web page (nothing that is hidden). &nbsp;The problem I'm having is that something dynamic is going on in the background (JavaScript, CSS, or whatever) that does not change the resultant HTML (apparently it is cached and only certain things are displayed as needed).<br /><br />Is there a way to get just the text that is visible to the user?<br /><br />Thanks!</div>
      <br />
    </td>
  </tr>
  <tr>
<td class=boxTitle style="white-space: nowrap;">
<img src="/mW.gif" class=markerWide><img src="/b.gif" width=1 height=1 class=inline>
<a href="/emailFriend.jsp?qid=22045078">Send to a Friend</a>
      &nbsp;&nbsp;
      <img src="/mW.gif" class=markerWide><img src="/b.gif" width=1 height=1 class=inline>
<a href="/viewQuestionPrinterFriendly.jsp?qid=22045078" target="_blank">Printer Friendly</a>
    </td>
<td class=boxTitle style="white-space: nowrap;">
</td>
     </tr>
    </table>
___________________________________________________________________


The following text will "pass" through the filter:

___________________________________________________________________

        I need to get just the text that is being displayed on a web page (nothing that is hidden). &nbsp;The problem I'm having is that something dynamic is going on in the background (JavaScript, CSS, or whatever) that does not change the resultant HTML (apparently it is cached and only certain things are displayed as needed)
Is there a way to get just the text that is visible to the user?
Thanks!
Send to a Friend
Printer Friendly
___________________________________________________________________

You can extend this to interpret some tags for minimal formatting and to decode entities such as &nbsp;.
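A minimal sketch of the tag-skipping filter described above, written in Python rather than the VB mentioned in this thread. It relies on the standard library's html.parser module; the names TextOnly and strip_tags are made up for illustration. It decodes entities such as &nbsp; and collapses whitespace, but it knows nothing about CSS, so hidden text still passes through.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects the character data of a document and drops every tag."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &nbsp; before
        # the text reaches handle_data.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # str.split() treats the decoded non-breaking space as whitespace,
        # so this collapses runs of spaces, newlines and &nbsp; alike.
        text = " ".join(data.split())
        if text:
            self.chunks.append(text)


def strip_tags(html):
    """Return the text chunks found outside of tags (hypothetical helper)."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return parser.chunks


if __name__ == "__main__":
    sample = (
        '<div id=intelliTxt>Is there a way?<br /><br />Thanks!</div>'
        '<a href="/emailFriend.jsp?qid=22045078">Send to a Friend</a>'
        '&nbsp;&nbsp;'
        '<a href="/viewQuestionPrinterFriendly.jsp?qid=22045078">Printer Friendly</a>'
    )
    for line in strip_tags(sample):
        print(line)
```

Note that this filter also emits the contents of script and style elements, since they are character data too; a real filter would probably want to skip them.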

Jose
 
LVL 1

Author Comment

by:scottmichael2
ID: 17851844
I can already get the text and parse it.  The problem I'm having is that I have 2 pages that have identical HTML, but are displaying different things to the user.  I need to find out which "screen" they are viewing.  Here is a more thorough explanation...

Page 1: Patient Address Info
Page 2: Patient Insurance Info

When you do a View Source, or look at the DOM or whatever, it returns identical HTML for both pages.  Something is going on behind the scenes that causes the user display to change (but not the HTML).

Is there a way to grab just the text that is displayed?

Or, another solution for me would be to mimic the search that IE does when you do an Edit->Find (on This Page) from the IE menu.  Any ideas on how to mimic this in VB?

Thanks!
 
LVL 18

Expert Comment

by:JoseParrot
ID: 17852462
If two pages have identical HTML then they are identical.

I'm not sure I understood correctly. I see two options:
1. Your HTML is so long that the user has to scroll to see Address or Insurance.
    (That would be easy to solve.)
2. Or the page has frames and what you see is the main HTML code,
    but the content of one of the frames is loaded dynamically as the result of a query.

I will assume the second option.

Suppose you have the below MyPage.html:

<frameset cols="110,600" > 
  <frameset rows="80,*" > 
    <frame name="topFrame1" src="logo.html" >
    <frame name="leftFrame"  src="menuLeft.html">
  </frameset>
  <frameset rows="80,*"  >
    <frame name="topFrame2"  src="header.html" >
    <frame name="mainFrame" src="information.html">
  </frameset>
</frameset>

This divides the page into two columns, each one subdivided into two rows.

Suppose you click the "Patient Address" option in menuLeft.html (leftFrame); the server side then creates a new information.html and reloads it in mainFrame.
Suppose another user clicks "Patient Insurance"; the server side then loads that query result into information.html (mainFrame).

This situation can make two different contents under the same html code.
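To check whether a page falls into this frames scenario, one could list the frames that the outer document declares and then inspect each frame's own document separately. The following is a small Python sketch using the standard html.parser module; FrameLister and list_frames are illustrative names, not part of any tool mentioned in this thread.

```python
from html.parser import HTMLParser


class FrameLister(HTMLParser):
    """Records (name, src) for every <frame> and <iframe> start tag."""

    def __init__(self):
        super().__init__()
        self.frames = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag in ("frame", "iframe"):
            attributes = dict(attrs)
            self.frames.append((attributes.get("name"), attributes.get("src")))


def list_frames(html):
    """Return the frames declared in the given HTML (hypothetical helper)."""
    parser = FrameLister()
    parser.feed(html)
    parser.close()
    return parser.frames


if __name__ == "__main__":
    my_page = """
    <frameset cols="110,600">
      <frameset rows="80,*">
        <frame name="topFrame1" src="logo.html">
        <frame name="leftFrame" src="menuLeft.html">
      </frameset>
      <frameset rows="80,*">
        <frame name="topFrame2" src="header.html">
        <frame name="mainFrame" src="information.html">
      </frameset>
    </frameset>
    """
    for name, src in list_frames(my_page):
        print(name, src)
```

If the outer HTML is identical between two screens but a frame such as mainFrame exists, the difference is likely to be in the frame's document rather than in the page that declares it.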

We need more information on how the data is stored, whether you have direct access to the server holding it, and so on. We also need to know whether the web server requires a login, some user identification, or cookies. Are your users and the web server from the same organization? Can you make an arrangement with the webadmin of the site?

Depending on how well the information is protected, there may be no way to get around it.

Jose
 
LVL 3

Expert Comment

by:jay_gadhavi
ID: 17855753
Pass the appropriate text dynamically at run time from the code-behind to the HTML element where you want to show it.
For example, from your Patient Address Info page, pass the text "Your Text Here" from the code-behind.
Example:

For ASP.NET use:

' HtmlGenericControl transfers HTML dynamically from the code-behind to the client side
Dim gc As New HtmlGenericControl

Then pass the text like this:
    gc.InnerHtml &= "<td align='center' style='font-family:Arial;font-size:13.5px;font-weight:bold'>Your Text Here</td>"
    TRfordynamicText.Controls.Add(gc)

(Note: "TRfordynamicText" is my TableRow in the HTML.)




 
LVL 1

Author Comment

by:scottmichael2
ID: 17867971
JoseParrot -
That isn't exactly true.  One example is when you have JavaScript that sets CSS properties to hide and show different things.  In this case, the HTML (and JavaScript, CSS, etc.) is identical, but the resultant page is different (meaning that the visible portion is different to the user).  I need a way to get just the visible portion of the page.
 
LVL 18

Expert Comment

by:JoseParrot
ID: 17871363
Well, I don't claim to own the truth. I can be wrong, of course; I'm just trying to help.
But you'll agree that we don't have much information about the web site, and the most we can do is speculate.

For example:
It seems to be a web site from outside your organization.
It also seems that the information on that site is there to answer their customers' queries, as a service to their clients.
It also seems you are trying to extract third-party information in an automated way.
In conclusion, it seems that you are trying to run queries on third-party servers to present their information as your service.

Another example:
It seems the web team of your company was fired in its entirety because they lost all the code, and your almost impossible mission is to keep the web site running to avoid serious lawsuits from your customers. The only thing you can do is wrap the old site's information inside temporary code while you build a new web site.

I suggest you post the URL of the site so we can navigate there, understand the problem better, and be more direct and objective in supporting you.

Jose
 
LVL 1

Author Comment

by:scottmichael2
ID: 17874631
Interesting analysis.  You are very close on your first guess.  Unfortunately, we don't have access to their database.  And, I'm not sure that it would matter that much because of the type of application we are running.  

The site is an internal intranet site belonging to one of our clients.  We are extracting data for one of our applications.  Part of our process is to keep track of what page the user is viewing.  This is impossible if the HTML is rendered identically.  Unfortunately, it is.  It seems like the only thing that changes is what the user is viewing.  So, I'm stuck trying to figure out a way to extract exactly what they are viewing.  I have tried several things like going through the DOM, but all seem to return identical HTML (and JavaScript, CSS, and so on).  My next attempt will be to mimic how Microsoft does Edit->Find (which I think I might have figured out).  Any other suggestions are welcome.
 
LVL 18

Accepted Solution

by:
JoseParrot earned 500 total points
ID: 17875349
I understand your problem. Without a direct look at the site and at exactly how the dynamic content is generated, it is really difficult, not a trivial problem. Unfortunately no other ideas occur to me. I hope someone has solved a similar situation and can give a hint.
Jose
 
LVL 1

Author Comment

by:scottmichael2
ID: 17995212
JoseParrot - thanks for your efforts on this.  I figured out the issue.  The application was using CSS to hide/show content.  I simply looked at the Visibility and Display CSS properties to find out which elements were being shown/hidden.
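For readers with a similar problem, here is a rough Python sketch of that idea, not the code actually used above. It assumes you have a snapshot of the page's HTML in which the current display/visibility values appear as inline style attributes (for example, markup serialized from the live DOM after the scripts have run). It ignores stylesheet and class rules, and it treats visibility:hidden as applying to all descendants, which is a simplification of real CSS. The names hides, VisibleText and visible_text are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def hides(style):
    """True if an inline style string sets display:none or visibility:hidden."""
    for declaration in style.lower().split(";"):
        prop, _, value = declaration.partition(":")
        value = value.replace("!important", "").strip()
        if (prop.strip(), value) in (("display", "none"),
                                     ("visibility", "hidden")):
            return True
    return False


class VisibleText(HTMLParser):
    """Collects text that is not inside a hidden element (inline styles only)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []    # one (tag, hidden) entry per open element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        parent_hidden = self.stack[-1][1] if self.stack else False
        style = dict(attrs).get("style") or ""
        # Script and style contents are never displayed as page text.
        hidden = parent_hidden or tag in ("script", "style") or hides(style)
        self.stack.append((tag, hidden))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.stack and self.stack[-1][1]:
            return
        text = " ".join(data.split())
        if text:
            self.chunks.append(text)


def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

With two "screens" kept in the same HTML, the element whose style is not hidden tells you which one the user is viewing. For instance, given a div with style="display: none" containing "Patient Address Info" and a second div with style="display: block" containing "Patient Insurance Info", visible_text returns only the insurance text.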