Solved

HTML parsing - page segmentation

Posted on 2004-10-17
Last Modified: 2011-10-03
Hi Experts,

This is more of a group brainstorm than a question, and any ideas would be appreciated. I am about to develop a program in Java to parse an HTML document into a tree structure based on its tags. I then want to segment that tree into smaller trees that represent segments of a web page based on layout styles, i.e. one tree to represent the menu elements, one to represent a section of text, etc. I want to know if any of you have experience with any of the following:

- Theoretical ideas on how humans instinctively segment a web page document into sections.

- What can constitute vertical splits in HTML other than table and div tags?

- Horizontal splits in pages are sometimes made with long, thin images. Is it possible (using Java) to evaluate the height and width of an image (if not specified in the HTML) using its URL as an argument?

- Can anyone think of any other obvious ways of segmenting a page besides background colour, background images and borders?

- Some web pages use JavaScript to change the layout of a page depending on the browser, screen resolution, etc. Is it possible to write a program (in Java) that forces the JavaScript in a web document to be executed server-side, so that only HTML code is returned? If not, can you think of any other way around this?

I know it's a lot, but any help would be appreciated.
Thanks, Chris.
Question by:icb01co2
14 Comments
 
LVL 9

Expert Comment

by:DrWarezz
ID: 12333665
Hi,
Check out www.javaalmanac.com, search for "HTML" and look through some of the results.
You could perhaps use the java.util.* package to tokenize the document.
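
As an alternative to tokenizing by hand, the JDK itself ships an HTML parser in the Swing packages. The following is only a minimal sketch of how it could be used to recover the tag nesting; the class name TagOutline and its of method are made up for illustration and are not something suggested in this thread.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: lists every tag the parser reports, indented by nesting depth.
final class TagOutline {

    static List<String> of(String html) throws IOException {
        final List<String> lines = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private int depth = 0;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                lines.add(indent(depth) + tag);
                depth++; // everything until the matching end tag is nested inside this tag
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (depth > 0) {
                    depth--;
                }
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                lines.add(indent(depth) + tag); // tags without content, such as <br> or <img>
            }
        };
        // The last argument tells the parser to ignore any charset declared in the page.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return lines;
    }

    private static String indent(int depth) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            sb.append("  ");
        }
        return sb.toString();
    }
}
```

The same callbacks could create tree nodes instead of strings. Note that this parser may also report tags it infers itself (for example an enclosing html or body), so the outline can contain more entries than the source text.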

>"is it possible (using Java) to evaluate the height and width of an image (if not specified in the HTML) using its URL as an argument?"
Sure, use the code here to download an image:
http://www.javaalmanac.com/egs/java.net/GetImage.html
Then draw the Image into a BufferedImage, and use the code here to get the resolution of the image:
http://www.javaalmanac.com/egs/java.awt.image/ImagePixel.html
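
For reference, here is a rough sketch of the same idea using the standard javax.imageio API instead of the linked examples. It asks an ImageReader for the dimensions, which usually avoids decoding the whole picture. The class name ImageSize and its of method are hypothetical names chosen for this sketch.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

// Hypothetical helper: looks up the pixel size of the image behind a URL.
final class ImageSize {

    /** Returns {width, height}, or null if no installed ImageIO reader understands the data. */
    static int[] of(URL url) throws IOException {
        try (InputStream in = url.openStream();
             ImageInputStream iis = ImageIO.createImageInputStream(in)) {
            Iterator<ImageReader> readers = ImageIO.getImageReaders(iis);
            if (!readers.hasNext()) {
                return null; // unknown or unsupported image format
            }
            ImageReader reader = readers.next();
            try {
                reader.setInput(iis, true, true); // forward-only, skip metadata
                return new int[] { reader.getWidth(0), reader.getHeight(0) };
            } finally {
                reader.dispose();
            }
        }
    }
}
```

A simpler variant is ImageIO.read(url), which returns a BufferedImage (or null for unsupported formats) with getWidth() and getHeight(), at the cost of decoding the full image. A spacer image could then be flagged with a test such as width much larger than height; what ratio counts as "thin" is left open here.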

I'll have a read over some of the other problems, and see what I can do ....

HTH :)
[r.D]
 
LVL 1

Author Comment

by:icb01co2
ID: 12333679
Wow, thanks for your help and quick response. I thought I'd just rambled on and no one would understand what I was going on about.
 
LVL 9

Expert Comment

by:DrWarezz
ID: 12333693
lol -- I always assume the same when I post a Q.

anyways, back to the Q...

>"Theoretical ideas on how humans instinctively segment a web page document into sections."
So, going by what you said before that quote, you basically want to read a document and, for each page, create sections: one with content/text, one with links and one with imagery, for example?

Could you elaborate a little more on what you want to do exactly, so that I can get a clearer 'picture', and hopefully help a bit more :)

[r.D]
 
LVL 9

Accepted Solution

by:
DrWarezz earned 325 total points
ID: 12333715
also..
>"- What can constitute vertical splits in HTML other than table and div tags?"
I don't personally know of any other method.

and
>"- Some web pages use JavaScript to change the layout of a page depending on the browser, screen resolution, etc. Is it possible to write a program (in Java) that forces the JavaScript in a web document to be executed server-side, so that only HTML code is returned? If not, can you think of any other way around this?"
As far as I know, you can't make the JavaScript run server-side. However, don't most web pages that do this redirect you depending on the screen resolution? In that case, the redirection should prevent you from getting any JavaScript, and you will only get the appropriate page.
Or you could simply extract (and ignore) all JavaScript, i.e. remove everything between the <script> and </script> tags?
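
The "remove everything between the script tags" idea could look roughly like the sketch below. It is regex-based and only an approximation: a closing script tag that appears inside a JavaScript string would cut the match short, and anything the script would have written into the page is lost. ScriptStripper is a made-up name for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper: drops <script> elements from an HTML string.
final class ScriptStripper {

    // Case-insensitive, and DOTALL so that '.' also matches the newlines inside a script body.
    private static final Pattern SCRIPT = Pattern.compile(
            "<script\\b[^>]*>.*?</script\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static String strip(String html) {
        return SCRIPT.matcher(html).replaceAll("");
    }
}
```

For example, ScriptStripper.strip("<p>a</p><script>var x = 1;</script><p>b</p>") leaves only the two paragraphs.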
 
LVL 1

Author Comment

by:icb01co2
ID: 12333767
Sorry, I'm having a few com problems. I'll get back to you ASAP.
 
LVL 9

Expert Comment

by:DrWarezz
ID: 12333772
Sure
 
LVL 1

Author Comment

by:icb01co2
ID: 12333944
OK, sorry about that. I'll try to explain what I want to do, but I may not be able to. I'll have four main objects:


TagData: tag type, position, attributes, text within the tag, etc.
TagNode: made up of a TagData and three other TagNodes (parent, sibling, child)

TagTree: basic data structure for holding a tag's frequency and attribute info, used to hold the tag info as a tree data structure.

TreeTree: data structure that holds the trees after segmentation.
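
A minimal sketch of the first two objects, assuming that "sibling" means the next sibling and "child" means the first child. The field names, the constructors and the addChild and size helpers are guesses added for illustration; the post does not specify them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Data carried by one tag: type, position in the source, attributes and enclosed text.
final class TagData {
    final String tagType;
    final int position;
    final Map<String, String> attributes = new LinkedHashMap<>();
    String text = "";

    TagData(String tagType, int position) {
        this.tagType = tagType;
        this.position = position;
    }
}

// One tree node: its TagData plus links to parent, next sibling and first child.
final class TagNode {
    final TagData data;
    TagNode parent;
    TagNode sibling; // assumption: the NEXT sibling
    TagNode child;   // assumption: the FIRST child

    TagNode(TagData data) {
        this.data = data;
    }

    /** Appends node as the last child of this node and returns it. */
    TagNode addChild(TagNode node) {
        node.parent = this;
        if (child == null) {
            child = node;
        } else {
            TagNode last = child;
            while (last.sibling != null) {
                last = last.sibling;
            }
            last.sibling = node;
        }
        return node;
    }

    /** Number of nodes in the subtree rooted at this node. */
    int size() {
        int n = 1;
        for (TagNode c = child; c != null; c = c.sibling) {
            n += c.size();
        }
        return n;
    }
}
```

With three links per node, a whole TagTree can be walked without any extra collections: go to child to descend, to sibling to move along a row, and to parent to climb back up.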

Hmm, if that doesn't make sense, then consider this:


<table border="1" width="400" height="400">
         <tr>
             <td> SECTION A </td>

             <td> <table border="1" width="100%" height="100%">
                             <tr><td> SECTION B </td></tr>
                             <tr><td> SECTION C </td></tr>
                     </table>
             </td>
         </tr>
</table>

So here there will be one TagTree contained in the TreeTree object, which contains all the TagNodes with their TagData set.

The TagTree will be traversed, and at the first indication of segmentation the tree will be split into two separate TagTree objects, which are stored as TreeTree elements.

Jesus, even I don't understand this. Good luck.




 
LVL 1

Author Comment

by:icb01co2
ID: 12333982
...more detail

TagTree 1


<table border="1" width="400" height="400">
         <tr>
             <td> SECTION A </td>

             <td> <table border="1" width="100%" height="100%">
                             <tr><td> SECTION B </td></tr>
                             <tr><td> SECTION C </td></tr>
                     </table>
             </td>
         </tr>
</table>

.....Becomes


TagTree 2


<table border="1" width = "400" height = "400">
         <tr>
             <td> SECTION A </td>


-----------------------------------------------------------------------------------

TagTree 3

           
             <td> <table border="1"  width = "100%" height = "100%">
                             <tr><td > SECTION B </td></tr>
                             <tr><td> SECTION C </td></tr>
                     </table>
</table>



So the TreeTree object looks like this:

                                  TagTree1
                                  /            \
                                /                \
      TagTree 2(Section A)           TagTree 3 (Section B and C)

and after next split:

                                   TagTree1
                                  /            \
                                /                \
    TagTree 2(Section A)           TagTree 3 (Section B and C)
                                                /            \
                                              /                \
                    TagTree 4(Section B)           TagTree 5(Section C)
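
The repeated splitting shown in the diagrams above can be sketched as a small recursion. The thread has not settled what counts as a split, so this sketch uses a deliberately naive stand-in rule: walk down while a node has exactly one child, and split at the first node that has two or more. LayoutNode and Segment are hypothetical names, and a compact child-list node is used here only to keep the example short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Compact stand-in for a parsed tag with its child tags.
final class LayoutNode {
    final String tag;
    final List<LayoutNode> children = new ArrayList<>();

    LayoutNode(String tag, LayoutNode... kids) {
        this.tag = tag;
        children.addAll(Arrays.asList(kids));
    }
}

// One element of the segment tree: the subtree it covers plus the segments it was split into.
final class Segment {
    final LayoutNode root;
    final List<Segment> parts = new ArrayList<>();

    Segment(LayoutNode root) {
        this.root = root;
    }

    /** Builds the segment tree for root using the naive "first branching node" rule. */
    static Segment split(LayoutNode root) {
        Segment segment = new Segment(root);
        LayoutNode n = root;
        while (n.children.size() == 1) {
            n = n.children.get(0); // skip wrappers such as a table with a single row
        }
        for (LayoutNode child : n.children) {
            segment.parts.add(split(child)); // no children at all means an unsplit leaf segment
        }
        return segment;
    }

    /** Total number of segments in this segment tree, including this one. */
    int count() {
        int c = 1;
        for (Segment s : parts) {
            c += s.count();
        }
        return c;
    }
}
```

Applied to the nested table from the example, this rule yields the same shape as the diagram: one top-level segment, two parts for the two cells, and two further parts for the rows holding Section B and Section C. A real splitter would replace the branching test with a check for visible boundaries.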








 
LVL 9

Expert Comment

by:DrWarezz
ID: 12333998
Okay - I'm going to need to read over all of this a few more times lol -- I'll get back to you as soon as I've got an idea :)
[r.D]
 
LVL 1

Author Comment

by:icb01co2
ID: 12334011
Anyway, I think this is too much info. I just really need to know what constitutes a split in a page. In my HTML code the segmentation is apparent from the table splits, with the table having a border="1" attribute.

Gasp, I think I've said all I can. If you got any of that then great; if not, no problem. I think it helped to put something down in writing.

Thanks, Chris.
 
LVL 9

Expert Comment

by:DrWarezz
ID: 12334105
hmm :o\ That's really complicated. lol.

>"In my HTML code the segmentation is apparent from the table splits, with the table having a border="1" attribute."
What do you mean exactly by "table splits"?

ta,
[r.D]
 
LVL 1

Author Comment

by:icb01co2
ID: 12334147
lol, no problem. What I mean is that the </td><td> tags split the table vertically. This in itself doesn't constitute a logical layout segmentation; by this I mean that as humans we have the ability to look at a site and logically segment the page into sections.

What I'm trying to identify is the ways in which a web page can be split so that humans would recognize it. In my example the table splits at the </td><td> tags, but the table also has a border, so the split is apparent. Another trick would be different background colours, background images, etc.
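
The cues listed so far (a visible border, a background colour, a background image) could be checked per tag with something like the sketch below. It assumes the attributes have already been parsed into a map with lower-case keys. The hr tag and the inline style check are extra guesses that were not raised in this thread, and SplitCues is a made-up name.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical heuristic: does this tag draw a boundary that a human reader would notice?
final class SplitCues {

    static boolean isVisibleBoundary(String tag, Map<String, String> attrs) {
        if ("hr".equalsIgnoreCase(tag)) {
            return true; // assumption: a horizontal rule always reads as a separator
        }
        String border = attrs.get("border");
        if (border != null && !border.trim().equals("0")) {
            return true; // e.g. <table border="1">
        }
        if (attrs.containsKey("bgcolor") || attrs.containsKey("background")) {
            return true; // background colour or background image
        }
        String style = attrs.get("style");
        if (style != null) {
            String s = style.toLowerCase(Locale.ROOT);
            return s.contains("border") || s.contains("background"); // crude inline CSS check
        }
        return false;
    }
}
```

This only looks at one tag in isolation. Comparing a cell's background with that of its neighbours or its parent would be a natural next step, since a colour only separates two regions when it differs between them.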
 
LVL 1

Author Comment

by:icb01co2
ID: 12334173
Anyway, you have been more than patient and helped me no end. I'll give you these points before I forget. Thanks for everything.

Chris.
 
LVL 9

Expert Comment

by:DrWarezz
ID: 12338760
Thanks for the points Chris.
I think I know what you mean now. However, I'm not too confident about exactly how to do this :o\ so I doubt I can help you any further anyway.

Best of luck with it,
[r.D]