Solved

HTML parsing - page segmentation

Posted on 2004-10-17
Last Modified: 2011-10-03
Hi Experts,

This is more of a group brainstorm than a question, and any ideas would be appreciated. I am about to develop a program in Java to parse an HTML document into a tree structure based on its tags. I then want to segment that tree into smaller trees that represent segments of a web page based on layout styles, i.e. one tree to represent the menu elements, one to represent a section of text, etc. I want to know if any of you have experience with any of the following:

- Theoretical ideas on how humans instinctively segment a web page document into sections.

- What can constitute vertical splits in HTML other than table and div tags?

- Horizontal splits in pages can sometimes be done using long, thin images. Is it possible (using Java) to evaluate the height and width of an image (if not specified in the HTML) using its URL as an argument?

- Can anyone think of any other obvious signs of page segmentation besides background colour, background images and borders?

- Some web pages use JavaScript to change the layout of a page depending on
 browser, screen resolution, etc. Is it possible to make (using Java) a program that forces the
 JavaScript in a web document to be performed server-side, hence returning only HTML code? If not, can you think of any other way around this?

I know it's a lot, but any help would be appreciated.
Thanks, Chris.
Question by:icb01co2
14 Comments
 
LVL 9

Expert Comment

by:DrWarezz
Hi,
check out: www.javaalmanac.com  and search for "HTML". And check out some of the results.
Use the java.util.* package to tokenize the document perhaps.
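
As an alternative to hand-rolled tokenizing, here is a minimal sketch that lists the tags of a document using the HTML parser that ships with the JDK (javax.swing.text.html). The class name TagLister and the method startTags are made up for illustration; only the javax.swing.text.html calls are real API.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of any library.
class TagLister {
    /** Returns the names of the start tags (and empty tags such as img) in document order. */
    static List<String> startTags(Reader html) throws IOException {
        final List<String> names = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                names.add(tag.toString());   // attrs holds the tag's attributes, pos its offset
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                names.add(tag.toString());   // tags without an end tag, e.g. <img> or <br>
            }
        };
        // true = ignore any charset declared inside the document itself
        new ParserDelegator().parse(html, callback, true);
        return names;
    }
}
```

Calling TagLister.startTags(new StringReader(html)) gives the tag names in order. Note that this parser may also report tags it infers (such as html and body) even when the source omits them, so a tree builder on top of it would need to decide whether to keep those.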

>"is it possible (using Java) to evaluate the height and width of an image (if not specified in the HTML) using its URL as an argument?"
Sure, use the code here to download an image:
http://www.javaalmanac.com/egs/java.net/GetImage.html
Then draw the Image into a BufferedImage, and use the code here to get the resolution of the image:
http://www.javaalmanac.com/egs/java.awt.image/ImagePixel.html
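
A minimal sketch of the same idea using the standard javax.imageio API instead of the almanac code linked above. ImageSize and looksLikeDivider are hypothetical names, and the thresholds used for "long and thin" are pure assumptions that would need tuning.

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import javax.imageio.ImageIO;

// Hypothetical helper, not part of any library.
class ImageSize {
    /** Returns {width, height} in pixels, or null if no installed reader understands the data. */
    static int[] of(InputStream in) throws IOException {
        BufferedImage image = ImageIO.read(in);
        return image == null ? null : new int[] { image.getWidth(), image.getHeight() };
    }

    /** Downloads the image behind the URL and measures it. */
    static int[] of(URL url) throws IOException {
        try (InputStream in = url.openStream()) {
            return of(in);
        }
    }

    /** Assumed heuristic: at most 5 pixels high and at least 20 times wider than high. */
    static boolean looksLikeDivider(int[] size) {
        return size != null && size[1] <= 5 && size[0] >= 20 * size[1];
    }
}
```

With this, ImageSize.of(new URL(src)) gives the dimensions of an img whose width and height attributes are missing. Decoding the whole image just to read its size is wasteful for large files, but it keeps the sketch short.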

I'll have a read over some of the other problems, and see what I can do ....

HTH :)
[r.D]
 
LVL 1

Author Comment

by:icb01co2
Wow, thanks for your help and quick response. I thought I'd just rambled on and no one would understand what I was going on about.
 
LVL 9

Expert Comment

by:DrWarezz
lol -- I always assume the same when I post a Q.

anyways, back to the Q...

>"Theoretical ideas on how humans instinctively segment a web page document into sections."
So, also going by what you've said prior to that quote, you basically want to read a document, and for each page, create sections. One with "content/text" one with "links" and one with imagery, for example???

Could you elaborate a little more on what you want to do exactly, so that I can get a clearer 'picture', and hopefully help a bit more :)

[r.D]
 
LVL 9

Accepted Solution

by:
DrWarezz earned 325 total points
also..
>"- What can constitute vertical splits in HTML other than table and div tags?"
I don't personally know of any other method.

and
>"- Some web pages use JavaScript to change the layout of a page depending on
 browser, screen resolution, etc. Is it possible to make (using Java) a program that forces the
 JavaScript in a web document to be performed server-side, hence returning only HTML code? If not, can you think of any other way around this?"
As far as I know you can't easily make the JavaScript run server-side (you would need to embed a JavaScript engine and give it a DOM to work on). However, don't most web pages that do this redirect you, depending on the screen resolution? In which case, the redirection should prevent you from getting any JavaScript, and will only display the appropriate page.
Or, you could simply extract (and ignore) all JavaScript. (Simply remove everything between the "<script></script>" tags)...?
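
A minimal sketch of the "remove everything between the script tags" option, using a regular expression. ScriptStripper is a hypothetical name. A regex like this is only an approximation: it can be fooled by the text "</script>" appearing inside a string literal, and any markup the script would have written into the page is lost.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class ScriptStripper {
    // (?i) ignores case, (?s) lets '.' match newlines; '.*?' stops at the first closing tag.
    private static final Pattern SCRIPT =
            Pattern.compile("(?is)<script\\b[^>]*>.*?</script\\s*>");

    /** Returns the HTML with every script element (tags and body) removed. */
    static String strip(String html) {
        return SCRIPT.matcher(html).replaceAll("");
    }
}
```

Running this before parsing keeps script bodies from showing up as text nodes in the tag tree.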
 
LVL 1

Author Comment

by:icb01co2
Sorry, I'm having a few computer problems. I'll get back to you ASAP.
 
LVL 9

Expert Comment

by:DrWarezz
Sure
 
LVL 1

Author Comment

by:icb01co2
OK, sorry about that. I'll try to explain what I want to do, but I may not be able to. I'll have four main objects:


TagData:  tag type, position, attributes, text within tag etc
TagNode: made up of tag data and three other TagNodes (parent, sibling, child)


TagTree:  basic data structure for holding a tag's frequency and attribute info,
               to be used to hold tag info as a tree data structure.


TreeTree: data structure that holds trees after segmentation.
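
A rough sketch of how the first two objects could look in Java. The field names and the addChild and size helpers are guesses made for illustration, not a settled design; the parent/sibling/child links follow the description above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: names and fields are assumptions based on the description above.
class TagData {
    final String type;                       // e.g. "table", "td"
    final int position;                      // offset of the tag in the source
    final Map<String, String> attributes = new LinkedHashMap<>();
    String text = "";                        // text found directly inside the tag

    TagData(String type, int position) {
        this.type = type;
        this.position = position;
    }
}

class TagNode {
    final TagData data;
    TagNode parent;
    TagNode sibling;                         // next sibling
    TagNode child;                           // first child

    TagNode(TagData data) {
        this.data = data;
    }

    /** Appends a new node at the end of this node's child list and returns it. */
    TagNode addChild(TagData childData) {
        TagNode node = new TagNode(childData);
        node.parent = this;
        if (child == null) {
            child = node;
        } else {
            TagNode last = child;
            while (last.sibling != null) {
                last = last.sibling;
            }
            last.sibling = node;
        }
        return node;
    }

    /** Counts the nodes in the subtree rooted here, this node included. */
    int size() {
        int n = 1;
        for (TagNode c = child; c != null; c = c.sibling) {
            n += c.size();
        }
        return n;
    }
}
```

Keeping only a first-child and next-sibling link (instead of a list of children) matches the three-pointer TagNode described above and still allows any number of children per tag.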

Hmm, if that doesn't make sense then consider this:


<table border="1" width = "400" height = "400">
         <tr>
             <td> SECTION A </td>
           
             <td> <table border="1"  width = "100%" height = "100%">
                             <tr><td > SECTION B </td></tr>
                             <tr><td> SECTION C </td></tr>
                     </table>
</table>

So here there will be one TagTree contained in the TreeTree object which contains all TagNodes with set TagData.

The TagTree will be traversed, and at the first indication of segmentation the tree will be split into two separate TagTree objects, which are stored as TreeTree elements.

Jesus, even I don't understand this. Good luck.





 
LVL 1

Author Comment

by:icb01co2
...more detail

TagTree 1


<table border="1" width = "400" height = "400">
         <tr>
             <td> SECTION A </td>
           
             <td> <table border="1"  width = "100%" height = "100%">
                             <tr><td > SECTION B </td></tr>
                             <tr><td> SECTION C </td></tr>
                     </table>
</table>

.....Becomes


TagTree 2


<table border="1" width = "400" height = "400">
         <tr>
             <td> SECTION A </td>


-----------------------------------------------------------------------------------

TagTree 3

           
             <td> <table border="1"  width = "100%" height = "100%">
                             <tr><td > SECTION B </td></tr>
                             <tr><td> SECTION C </td></tr>
                     </table>
</table>



So the TreeTree object looks like this:

                                  TagTree1
                                  /            \
                                /                \
      TagTree 2(Section A)           TagTree 3 (Section B and C)

and after next split:

                                   TagTree1
                                  /            \
                                /                \
    TagTree 2(Section A)           TagTree 3 (Section B and C)
                                                /            \
                                              /                \
                    TagTree 4(Section B)           TagTree 5(Section C)
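
The repeated splitting pictured above can be sketched as a small recursive routine. This is only one possible reading: it treats each cell of the first bordered, multi-cell table as its own segment, rather than cutting the markup into fragments. PageTag, PageSegment and Segmenter are hypothetical stand-ins for TagNode, TagTree and TreeTree.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal stand-in for a parsed tag: a name, a border flag, and child tags. */
class PageTag {
    final String name;
    final boolean bordered;
    final List<PageTag> children = new ArrayList<>();

    PageTag(String name, boolean bordered, PageTag... kids) {
        this.name = name;
        this.bordered = bordered;
        for (PageTag kid : kids) {
            children.add(kid);
        }
    }
}

/** One node of the "tree of trees": a segment plus the segments it was split into. */
class PageSegment {
    final PageTag root;
    final List<PageSegment> parts = new ArrayList<>();

    PageSegment(PageTag root) {
        this.root = root;
    }
}

class Segmenter {
    /** Splits at the first bordered table with more than one cell, then recurses into each cell. */
    static PageSegment split(PageTag root) {
        PageSegment segment = new PageSegment(root);
        PageTag table = firstSplittableTable(root);
        if (table != null) {
            for (PageTag cell : cellsOf(table)) {
                segment.parts.add(split(cell));
            }
        }
        return segment;
    }

    /** Depth-first search for a table whose border makes its cells visibly distinct. */
    static PageTag firstSplittableTable(PageTag tag) {
        if (tag.name.equals("table") && tag.bordered && cellsOf(tag).size() > 1) {
            return tag;
        }
        for (PageTag child : tag.children) {
            PageTag found = firstSplittableTable(child);
            if (found != null) {
                return found;
            }
        }
        return null;
    }

    /** Collects the td cells that sit directly in the table's rows. */
    static List<PageTag> cellsOf(PageTag table) {
        List<PageTag> cells = new ArrayList<>();
        for (PageTag row : table.children) {
            if (!row.name.equals("tr")) {
                continue;
            }
            for (PageTag cell : row.children) {
                if (cell.name.equals("td")) {
                    cells.add(cell);
                }
            }
        }
        return cells;
    }
}
```

For the example table, split returns a root segment with two parts (Section A, and the cell holding Sections B and C), and the second part is itself split into two, which is the same shape as the diagram.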








 
LVL 9

Expert Comment

by:DrWarezz
Okay - I'm going to need to read over all of this a few more times lol -- I'll get back to you as soon as I've got an idea :)
[r.D]
 
LVL 1

Author Comment

by:icb01co2
Anyway, I think this is too much info; I just really need to know what constitutes a split in a page. In my HTML code the segmentation is apparent from the table splits, with the table having a border="1" attribute.

Gasp, I think I've said all I can. If you got any of that then great; if not, no problem. I think it helped to put something down in writing.

Thanks, Chris.
 
LVL 9

Expert Comment

by:DrWarezz
hmm :o\ That's really complicated. lol.

>"In my HTML code the segmentation is apparent from the table splits, with the table having a border="1" attribute."
What do you mean exactly by "table splits"?

ta,
[r.D]
 
LVL 1

Author Comment

by:icb01co2
lol, no problem. What I mean is that the </td><td> tags split the table vertically. This in itself doesn't constitute a logical layout segmentation; by that I mean the kind of segmentation we do as humans when we look at a site and logically divide the page into sections.

What I'm trying to identify is the ways in which a web page can be split so that humans would recognize it. In my example the table splits at the </td><td> tags, but the table also has a border, so the split is apparent. Another trick would be different background colours, background images, etc.
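
To make that concrete, here is a minimal sketch of a check for the cues mentioned so far (border, background colour, background image). VisualCues and hasBoundaryCue are hypothetical names, attribute names are assumed to be lower-cased already, and the style test is a crude substring match that would also fire on something like "border: none".

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of any library.
class VisualCues {
    /** True if the tag's attributes suggest a boundary a human reader would see. */
    static boolean hasBoundaryCue(Map<String, String> attrs) {
        String border = attrs.get("border");
        if (border != null && !border.trim().equals("0")) {
            return true;                                   // visible table or image border
        }
        if (attrs.containsKey("bgcolor") || attrs.containsKey("background")) {
            return true;                                   // own background colour or image
        }
        String style = attrs.get("style");
        if (style == null) {
            return false;
        }
        String css = style.toLowerCase(Locale.ROOT);
        return css.contains("background") || css.contains("border");   // inline CSS, crude match
    }
}
```

A split between two cells would then only count as a segment boundary when the table or one of the cells passes this check.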
 
LVL 1

Author Comment

by:icb01co2
Anyway, you have been more than patient and helped me no end. I'll give you these points before I forget. Thanks for everything.

Chris.
 
LVL 9

Expert Comment

by:DrWarezz
Thanks for the points Chris.
I think I know what you mean now, however, I'm not too confident on exactly how to do this. :o\ So, I doubt I have the ability to help you any further anyways.

Best of luck with it,
[r.D]