Solved

Get Visible Text from Displayed Web Page

Posted on 2006-11-01
Last Modified: 2009-07-29
I need to get just the text that is being displayed on a web page (nothing that is hidden).  The problem I'm having is that something dynamic is going on in the background (JavaScript, CSS, or whatever) that does not change the resultant HTML (apparently it is cached and only certain things are displayed as needed).

Is there a way to get just the text that is visible to the user?

Thanks!
Question by:scottmichael2
9 Comments
 
Expert Comment

by:JoseParrot
Hi,

The simplest solution is a parser that skips everything between "<" and ">" and outputs whatever lies outside the tags. (One exception: for an attribute such as content="sometext", it should output sometext.)

For example, in the html below:

___________________________________________________________________
<table class=fullWidth>
 <tr>
    <td class=questionHeader style="background-color:white;padding:0px;">
</td>
<td>
    </td>
</tr>
  <tr>
    <td class=questionBody colspan=2>

<br>
      <div id=intelliTxt style="padding-left:5px;padding-right:5px">
        I need to get just the text that is being displayed on a web page (nothing that is hidden). &nbsp;The problem I'm having is that something dynamic is going on in the background (JavaScript, CSS, or whatever) that does not change the resultant HTML (apparently it is cached and only certain things are displayed as needed).<br /><br />Is there a way to get just the text that is visible to the user?<br /><br />Thanks!</div>
      <br />
    </td>
  </tr>
  <tr>
<td class=boxTitle style="white-space: nowrap;">
<img src="/mW.gif" class=markerWide><img src="/b.gif" width=1 height=1 class=inline>
<a href="/emailFriend.jsp?qid=22045078">Send to a Friend</a>
      &nbsp;&nbsp;
      <img src="/mW.gif" class=markerWide><img src="/b.gif" width=1 height=1 class=inline>
<a href="/viewQuestionPrinterFriendly.jsp?qid=22045078" target="_blank">Printer Friendly</a>
    </td>
<td class=boxTitle style="white-space: nowrap;">
</td>
     </tr>
    </table>
___________________________________________________________________


The following text will "pass" through the filter:

___________________________________________________________________

        I need to get just the text that is being displayed on a web page (nothing that is hidden). &nbsp;The problem I'm having is that something dynamic is going on in the background (JavaScript, CSS, or whatever) that does not change the resultant HTML (apparently it is cached and only certain things are displayed as needed).
Is there a way to get just the text that is visible to the user?
Thanks!
Send to a Friend
Printer Friendly
___________________________________________________________________

You can extend it to handle certain tags for minimal formatting and to decode entities such as &nbsp;.
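As a rough illustration of the filter described above, here is a minimal sketch in Python using the standard library's html.parser module. The names TextExtractor and strip_tags are made up for this example, and skipping script and style bodies is an added assumption (their contents sit outside "<" and ">" but are never displayed). It does not handle the content="sometext" exception, and it knows nothing about CSS, so it still returns text that the page hides.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text that lies outside tags, skipping script and style bodies."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as &nbsp;
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Collapse runs of whitespace, including decoded non-breaking spaces.
            text = " ".join(data.split())
            if text:
                self.chunks.append(text)


def strip_tags(html):
    """Returns the list of text runs found between tags (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

Calling strip_tags on the table above should give one entry per text run, such as the question text, "Send to a Friend" and "Printer Friendly".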

Jose

Author Comment

by:scottmichael2
I can already get the text and parse it.  The problem is that I have 2 pages with identical HTML that display different things to the user.  I need to find out which "screen" the user is viewing.  Here is a more thorough explanation...

Page 1: Patient Address Info
Page 2: Patient Insurance Info

When you do a View Source, or look at the DOM or whatever, it returns identical HTML for both pages.  Something is going on behind the scenes that causes the user display to change (but not the HTML).

Is there a way to grab just the text that is displayed?

Or, another solution for me would be to mimic the search that IE does when you do an Edit->Find (on This Page) from the IE menu.  Any ideas on how to mimic this in VB?

Thanks!

Expert Comment

by:JoseParrot
If two pages have identical HTML then they are identical.

I'm not sure I understood correctly. I see two possibilities:
1. Your HTML is so long that the user has to scroll to see Address or Insurance.
    (That would be easy to solve.)
2. Or the page has frames, and what you see is the main HTML code,
    while the content of one of the frames is loaded dynamically as the result of a query.

I will assume the second option.

Suppose you have the following MyPage.html:

<frameset cols="110,600" >
  <frameset rows="80,*" >
    <frame name="topFrame1" src="logo.html" >
    <frame name="leftFrame"  src="menuLeft.html">
  </frameset>
  <frameset rows="80,*"  >
    <frame name="topFrame2"  src="header.html" >
    <frame name="mainFrame" src="information.html">
  </frameset>
</frameset>

This divides the page into two columns, each one subdivided into two rows.

Suppose you click the "Patient Address" option in menuLeft.html (leftFrame): the server generates a new information.html and reloads it in mainFrame.
Suppose another user clicks "Patient Insurance": the server loads that query result into information.html (mainFrame) instead.

In this way, the same top-level HTML code can show two different contents.
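If frames are the cause, the text to inspect lives in the framed documents rather than in the top-level page. A minimal sketch, assuming Python and its standard html.parser module, that lists the frames a page declares so each one can be examined separately. FrameLister and list_frames are names invented for this example.

```python
from html.parser import HTMLParser


class FrameLister(HTMLParser):
    """Records the name and src of every <frame> and <iframe> element."""

    def __init__(self):
        super().__init__()
        self.frames = []

    def handle_starttag(self, tag, attrs):
        if tag in ("frame", "iframe"):
            attributes = dict(attrs)
            self.frames.append((attributes.get("name"), attributes.get("src")))


def list_frames(html):
    """Returns (name, src) pairs in document order (hypothetical helper)."""
    parser = FrameLister()
    parser.feed(html)
    parser.close()
    return parser.frames
```

For the MyPage.html frameset, this should report four frames, with mainFrame pointing at information.html. The sketch only reads the src declared in the markup; it cannot tell what the server has since loaded into a frame.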

We need more information: how the information is stored, whether you have direct access to the server holding the data, and so on. Also, does the web server require a login, some other user identification, or cookies? Are your users and the web server part of the same organization? Can you make an arrangement with the webadmin of the site?

Depending on how well the information is protected, there may be no way to circumvent that.

Jose

Expert Comment

by:jay_gadhavi
Pass the appropriate text dynamically at runtime from the code-behind to the HTML tag where you want to show it.
For example, from your Patient Address Info page, pass the text "Your Text Here" from the code-behind.
Example:

For ASP.NET, use:

' HtmlGenericControl transfers HTML dynamically from the code-behind to the client side
Dim gc As New HtmlGenericControl

And pass the text like this:
    gc.InnerHtml &= "<td align='center' style='font-family:Arial;font-size:13.5px;font-weight:bold'>Your Text Here</td>"
    TRfordynamicText.Controls.Add(gc)

(Note: "TRfordynamicText" is my TableRow in the HTML.)





Author Comment

by:scottmichael2
JoseParrot -
That isn't exactly true.  One example is when you have JavaScript that sets CSS properties to hide and show different things.  In this case, the HTML (and JavaScript, CSS, etc.) is identical, but the resultant page is different (meaning that the visible portion is different to the user).  I need a way to get just the visible portion of the page.

Expert Comment

by:JoseParrot
Well, I don't claim to own the truth. I could be wrong, of course; I'm just trying to help.
But you'll agree that we don't have much information about the web site, and the most we can do is speculate.

For example:
It seems to be a web site from outside your organization.
It also seems that the information on the site is there to answer their customers' queries, as a service to their clients.
It also seems you are trying to extract third-party information automatically.
In short, it seems you are running queries against third-party servers to present their information as your own service.

Another example:
It seems your company's entire web team was fired after losing all the source code, and your almost impossible mission is to keep the web site running to avoid serious lawsuits from your customers. The only thing you can do is wrap the old site's information in temporary code while you build a new web site.

I suggest you post the URL of the site, so we can browse it, probably understand the problem better, and be more direct and objective in helping you.

Jose

Author Comment

by:scottmichael2
Interesting analysis.  You are very close on your first guess.  Unfortunately, we don't have access to their database.  And, I'm not sure that it would matter that much because of the type of application we are running.  

The site is an internal intranet site belonging to one of our clients.  We are extracting data for one of our applications, and part of our process is to keep track of which page the user is viewing.  This is impossible if the HTML is rendered identically, and unfortunately it is.  It seems the only thing that changes is what the user sees, so I'm stuck trying to figure out a way to extract exactly that.  I have tried several things, like going through the DOM, but all of them return identical HTML (and JavaScript, CSS, and so on).  My next attempt will be to mimic how Microsoft does Edit->Find (which I think I might have figured out).  Any other suggestions are welcome.

Accepted Solution

by:JoseParrot (earned 500 total points)
I understand your problem, but without a direct look at the site and at exactly how the dynamic content is generated, it is really difficult. It is not a trivial problem. Unfortunately, no other ideas occur to me. I hope someone has solved a similar situation and can give a hint.
Jose

Author Comment

by:scottmichael2
JoseParrot - thanks for your efforts on this.  I figured out the issue.  The application was using CSS to hide/show content.  I simply looked at the Visibility and Display CSS properties to find out which elements were being shown/hidden.
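For anyone landing here later, the idea can be sketched as follows. This is only an illustration in Python, not the code actually used: it assumes you have a snapshot of the live DOM in which the current display and visibility values appear as inline style attributes. Rules coming from stylesheets or class names are not handled; for those you would need the browser's computed style. All names, ids and sample strings below are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def parse_inline_style(style):
    """Turns 'display: none; color: red' into {'display': 'none', 'color': 'red'}."""
    properties = {}
    for declaration in (style or "").split(";"):
        name, separator, value = declaration.partition(":")
        if separator:
            properties[name.strip().lower()] = value.strip().lower()
    return properties


class VisibleTextExtractor(HTMLParser):
    """Collects text not hidden by inline display/visibility declarations."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        # Each entry: (tag, inside a display:none subtree, visibility is hidden)
        self.stack = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        parent_removed = self.stack[-1][1] if self.stack else False
        parent_hidden = self.stack[-1][2] if self.stack else False
        style = parse_inline_style(dict(attrs).get("style"))
        # display:none removes the whole subtree; children cannot undo it.
        removed = (parent_removed or style.get("display") == "none"
                   or tag in ("script", "style"))
        # visibility is inherited, but a child may set it back to visible.
        visibility = style.get("visibility")
        if visibility in ("hidden", "collapse"):
            hidden = True
        elif visibility == "visible":
            hidden = False
        else:
            hidden = parent_hidden
        self.stack.append((tag, removed, hidden))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        for index in range(len(self.stack) - 1, -1, -1):
            if self.stack[index][0] == tag:
                del self.stack[index:]
                break

    def handle_data(self, data):
        removed = self.stack[-1][1] if self.stack else False
        hidden = self.stack[-1][2] if self.stack else False
        text = " ".join(data.split())
        if text and not removed and not hidden:
            self.chunks.append(text)


def visible_text(html):
    """Returns the text runs that are not hidden by inline styles."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

With two snapshots that differ only in which block carries display:none, visible_text returns different text for each, which is enough to tell the Address screen from the Insurance screen.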