icb01co2

asked on

HTML parsing - page segmentation

Hi Experts,

This is more of a group brainstorm than a question, and any ideas would be appreciated. I am about to develop a program in Java to parse an HTML document into a tree structure based upon its tags. I then want to segment that tree into smaller trees that represent segments of the web page based upon layout styles, i.e. one tree to represent the menu elements, one to represent a section of text, etc. I want to know if any of you have any experience with any of the following:

- Theoretical ideas on how humans instinctively segment a web page document into sections.

- What can constitute vertical splits in HTML other than table and div tags?

- Horizontal splits in pages are sometimes done using long, thin images. Is it possible (using Java) to evaluate the height and width of an image (if not specified in the HTML) using its URL as an argument?

- Can anyone think of any other obvious means of page segmentation besides background colour, background images and borders?

- Some web pages use JavaScript to change the layout of a page depending on the browser, screen resolution, etc. Is it possible to write a program (in Java) that forces the JavaScript in a web document to be executed server-side, so that only HTML code is returned? If not, can you think of any other way around this?

I know it's a lot, but any help would be appreciated.
Thanks, Chris.
DrWarezz

Hi,
check out: www.javaalmanac.com  and search for "HTML". And check out some of the results.
Use the java.util.* package to tokenize the document perhaps.
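
As an alternative to hand-tokenizing, the JDK itself ships a lenient HTML parser (javax.swing.text.html.parser.ParserDelegator) that reports tags through callbacks. The sketch below is only an illustration of that route: TagLister and listTags are made-up names, and all it does is record each start tag indented by its nesting depth. The parser may also report implied tags such as html and body even when the document leaves them out.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper (not from the thread): lists start tags by nesting depth.
final class TagLister {

    static List<String> listTags(String html) throws IOException {
        final List<String> out = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private int depth = 0;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                out.add(indent(depth) + tag); // HTML.Tag.toString() is the tag name
                depth++;
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                depth = Math.max(0, depth - 1);
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                out.add(indent(depth) + tag); // tags without content, e.g. img or br
            }
        };
        try (Reader reader = new StringReader(html)) {
            // true = ignore any charset declared inside the document
            new ParserDelegator().parse(reader, callback, true);
        }
        return out;
    }

    private static String indent(int depth) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            sb.append("  ");
        }
        return sb.toString();
    }
}
```

The same callbacks could build real tree nodes instead of strings: push a node in handleStartTag and pop it in handleEndTag.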

>"is it possible (using Java) to evaluate the height and width of an image (if not specified in the HTML) using its URL as an argument?"
Sure, use the code here to download an image:
http://www.javaalmanac.com/egs/java.net/GetImage.html
Then convert the Image to a BufferedImage, and use the code here as a starting point for reading the image's resolution (its width and height):
http://www.javaalmanac.com/egs/java.awt.image/ImagePixel.html
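
A shorter route, if the standard javax.imageio package is available, is to ask an ImageReader for the dimensions directly, which usually avoids decoding the whole picture. This is a minimal sketch rather than a tested solution for every format: ImageSize and of are made-up names, and it returns null when no installed reader recognises the image.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

// Hypothetical helper (not from the thread): reads image dimensions from a URL.
final class ImageSize {

    /** Returns {width, height}, or null if the format is not recognised. */
    static int[] of(URL url) throws IOException {
        try (InputStream in = url.openStream();
             ImageInputStream stream = ImageIO.createImageInputStream(in)) {
            if (stream == null) {
                return null;
            }
            Iterator<ImageReader> readers = ImageIO.getImageReaders(stream);
            if (!readers.hasNext()) {
                return null;
            }
            ImageReader reader = readers.next();
            try {
                // seekForwardOnly = true, ignoreMetadata = true: only the size is needed
                reader.setInput(stream, true, true);
                return new int[] { reader.getWidth(0), reader.getHeight(0) };
            } finally {
                reader.dispose();
            }
        }
    }
}
```

With the size in hand, a very wide and very short image (the exact ratio would be a judgment call) could be treated as a candidate horizontal divider.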

I'll have a read over some of the other problems, and see what I can do ....

HTH :)
[r.D]
icb01co2

ASKER

Wow, thanks for your help and quick response. I thought I'd just rambled on and no one would understand what I was going on about.
lol -- I always assume the same when I post a Q.

anyways, back to the Q...

>"Theoretical ideas on how humans instinctively segment a web page document into sections."
So, also going by what you've said prior to that quote, you basically want to read a document, and for each page, create sections. One with "content/text" one with "links" and one with imagery, for example???

Could you elaborate a little more on what you want to do exactly, so that I can get a clearer 'picture', and hopefully help a bit more :)

[r.D]
Sorry, I'm having a few com problems. I'll get back to you ASAP.
Sure
OK, sorry about that. I'll try to explain what I want to do, but I may not be able to. I'll have four main objects:


TagData:  tag type, position, attributes, text within the tag, etc.
TagNode:  made up of TagData and three other TagNodes (parent, sibling, child).
TagTree:  basic data structure for holding a tag's frequency and attribute info;
          to be used to hold tag info as a tree data structure.
TreeTree: data structure that holds the trees after segmentation.
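
One possible reading of the first two objects, written out in Java. This is only a sketch of the description above: the field and method names (firstChild, nextSibling, addChild, childCount) are assumptions, not an agreed design, and the three TagNode links are taken to mean parent, first child and next sibling.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Data carried by one tag: type, position in the source, attributes, contained text.
final class TagData {
    final String name;       // tag type, e.g. "table"
    final int position;      // offset or sequence number in the document
    final Map<String, String> attributes = new LinkedHashMap<>();
    final StringBuilder text = new StringBuilder();

    TagData(String name, int position) {
        this.name = name;
        this.position = position;
    }
}

// A node linked to three other nodes: parent, first child, next sibling.
final class TagNode {
    final TagData data;
    TagNode parent;
    TagNode firstChild;
    TagNode nextSibling;

    TagNode(TagData data) {
        this.data = data;
    }

    /** Appends a child after any existing children and returns the child. */
    TagNode addChild(TagNode child) {
        child.parent = this;
        if (firstChild == null) {
            firstChild = child;
        } else {
            TagNode last = firstChild;
            while (last.nextSibling != null) {
                last = last.nextSibling;
            }
            last.nextSibling = child;
        }
        return child;
    }

    /** Counts direct children by walking the sibling chain. */
    int childCount() {
        int count = 0;
        for (TagNode c = firstChild; c != null; c = c.nextSibling) {
            count++;
        }
        return count;
    }
}
```

A TagTree would then be little more than a reference to a root TagNode, and a TreeTree a tree whose nodes each hold one TagTree.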

Mmm, if that doesn't make sense, then consider this:


<table border="1" width="400" height="400">
         <tr>
             <td> SECTION A </td>

             <td> <table border="1" width="100%" height="100%">
                             <tr><td> SECTION B </td></tr>
                             <tr><td> SECTION C </td></tr>
                     </table>
             </td>
         </tr>
</table>

So here there will be one TagTree contained in the TreeTree object, and it contains all the TagNodes with their TagData set.

The TagTree will be traversed, and at the first indication of segmentation the tree will be split into two separate TagTree objects, which are stored as TreeTree elements.

Jesus, even I don't understand this. Good luck.




...more detail

TagTree 1


<table border="1" width="400" height="400">
         <tr>
             <td> SECTION A </td>

             <td> <table border="1" width="100%" height="100%">
                             <tr><td> SECTION B </td></tr>
                             <tr><td> SECTION C </td></tr>
                     </table>
             </td>
         </tr>
</table>

.....Becomes


TagTree 2


<table border="1" width = "400" height = "400">
         <tr>
             <td> SECTION A </td>


-----------------------------------------------------------------------------------

TagTree 3

           
             <td> <table border="1" width="100%" height="100%">
                             <tr><td> SECTION B </td></tr>
                             <tr><td> SECTION C </td></tr>
                     </table>
             </td>
         </tr>
</table>



So the TreeTree object looks like this:

                                  TagTree1
                                  /            \
                                /                \
      TagTree 2(Section A)           TagTree 3 (Section B and C)

and after next split:

                                   TagTree1
                                  /            \
                                /                \
    TagTree 2(Section A)           TagTree 3 (Section B and C)
                                                /            \
                                              /                \
                    TagTree 4(Section B)           TagTree 5(Section C)
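
The repeated splitting in the diagrams above could be sketched roughly as follows. Everything here is an assumption made for illustration: Elem, Segment and Segmenter are made-up names, a "split" is taken to mean "the cells of the first table with a non-zero border", and only that first bordered table in each segment is considered.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical minimal element: tag name, attributes, text, ordered children.
final class Elem {
    final String name;
    final Map<String, String> attrs = new HashMap<>();
    final List<Elem> children = new ArrayList<>();
    String text = "";

    Elem(String name) { this.name = name; }
    Elem attr(String key, String value) { attrs.put(key, value); return this; }
    Elem text(String t) { text = t; return this; }
    Elem add(Elem child) { children.add(child); return this; }
}

// One node of the "tree of trees": a segment root plus the segments it splits into.
final class Segment {
    final Elem root;
    final List<Segment> parts = new ArrayList<>();

    Segment(Elem root) { this.root = root; }
}

final class Segmenter {

    /** Splits recursively: each cell of the first bordered table becomes a sub-segment. */
    static Segment segment(Elem root) {
        Segment seg = new Segment(root);
        Elem table = firstBorderedTable(root);
        if (table == null) {
            return seg; // no visible split found: this segment is a leaf
        }
        List<Elem> cells = new ArrayList<>();
        collectCells(table, cells);
        if (cells.size() == 1) {
            // A single cell is not a split; look for splits inside it instead.
            seg.parts.addAll(segment(cells.get(0)).parts);
        } else {
            for (Elem cell : cells) {
                seg.parts.add(segment(cell));
            }
        }
        return seg;
    }

    // Depth-first search for the first table whose border attribute is present and not "0".
    private static Elem firstBorderedTable(Elem e) {
        String border = e.attrs.get("border");
        if (e.name.equals("table") && border != null && !border.trim().equals("0")) {
            return e;
        }
        for (Elem child : e.children) {
            Elem hit = firstBorderedTable(child);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }

    // Collects the table's own cells without descending into nested tables.
    private static void collectCells(Elem container, List<Elem> out) {
        for (Elem child : container.children) {
            if (child.name.equals("td") || child.name.equals("th")) {
                out.add(child);
            } else if (child.name.equals("tr") || child.name.equals("tbody")) {
                collectCells(child, out);
            }
        }
    }
}
```

For the example table this gives a top segment with two parts (Section A, and the cell holding B and C), and the second part splits again into B and C, which matches the second diagram.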








Okay - I'm going to need to read over all of this a few more times lol -- I'll get back to you as soon as I've got an idea :)
[r.D]
Anyway, I think this is too much info. I just really need to know what constitutes a split in a page. In my HTML code the segmentation is apparent from the table splits, with the table having a border="1" attribute.

Gasp, I think I've said all I can. If you got any of that then great; if not then no problem. I think it helped to put something down in writing.

Thanks, Chris.
hmm :o\ That's really complicated. lol.

>"In my HTML code the segmentation is apparent from the table splits, with the table having a border="1" attribute."
What do you mean exactly by "table splits"?

ta,
[r.D]
lol, no problem. What I mean is that the </td><td> tags split the table vertically. This in itself doesn't constitute a logical layout segmentation; by this I mean that as humans we have the ability to look at a site and logically segment the page into sections.

What I'm trying to identify is the ways in which a web page can be split so that humans would recognize it. In my example the table splits at the </td><td> tags, but the table also has a border, so the split is apparent. Other tricks would be different background colours, background images, etc.
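
The cues mentioned so far (border, background colour, background image) could be folded into one crude test per element. This is a sketch under loud assumptions: BoundaryCues and hasVisibleBoundary are made-up names, attribute names are assumed to be lower-cased already, and the style check is a plain substring match that would also fire on something like "border: none".

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical heuristic (not from the thread): does this element look visually set apart?
final class BoundaryCues {

    static boolean hasVisibleBoundary(Map<String, String> attrs, String parentBgColor) {
        // 1. An explicit border, e.g. border="1"; border="0" does not count.
        String border = attrs.get("border");
        if (border != null && !border.trim().equals("0")) {
            return true;
        }
        // 2. A background colour that differs from the parent's.
        String bg = attrs.get("bgcolor");
        String parentBg = parentBgColor == null ? "" : parentBgColor.trim();
        if (bg != null && !bg.trim().equalsIgnoreCase(parentBg)) {
            return true;
        }
        // 3. A background image.
        if (attrs.containsKey("background")) {
            return true;
        }
        // 4. Inline CSS that mentions a border or a background (very rough).
        String style = attrs.getOrDefault("style", "").toLowerCase(Locale.ROOT);
        return style.contains("border") || style.contains("background");
    }
}
```

Colours are compared as raw strings here, so "#fff" and "#ffffff" would wrongly count as different; a real version would need to normalise them first.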
Anyway, you have been more than patient and have helped me no end. I'll give you these points before I forget. Thanks for everything.

Chris.
Thanks for the points Chris.
I think I know what you mean now, however, I'm not too confident on exactly how to do this. :o\ So, I doubt I have the ability to help you any further anyways.

Best of luck with it,
[r.D]