HTML parsing - page segmentation
Posted on 2004-10-17
This is more of a group brainstorm than a question, and any ideas would be appreciated. I am about to develop a program in Java to parse an HTML document into a tree structure based on its tags. I then want to segment that tree into smaller trees that represent segments of the web page based on layout styles, e.g. one tree to represent the menu elements, one to represent a section of text, etc. I want to know if any of you have experience with any of the following:
- Theoretical ideas on how humans instinctively segment a web page document into sections.
- What can constitute vertical splits in HTML other than table and div tags?
- Horizontal splits in pages are sometimes done using long, thin images. Is it possible (using Java) to determine the height and width of an image (if not specified in the HTML) using its URL as an argument?
- Can anyone think of any other obvious cues for page segmentation besides background colour, background images, and borders?
browser/screen resolution etc. Is it possible to make (using Java) a program that forces the
I know it's a lot, but any help would be appreciated.
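
On the image size question, here is a rough sketch of the kind of thing I have in mind, in case it helps the discussion. It assumes the standard javax.imageio API can read the image format in question, and it only reads the header rather than decoding the whole image. The class name ImageSizeProbe and the looksLikeHorizontalRule thresholds are made up for illustration and are not from any library.

```java
import java.awt.Dimension;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URL;
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

// Hypothetical helper class; the name is an assumption for this sketch.
public class ImageSizeProbe {

    /** Reads width and height from the image header without decoding pixels. */
    public static Dimension readDimensions(InputStream in) throws IOException {
        try (ImageInputStream iis = ImageIO.createImageInputStream(in)) {
            if (iis == null) {
                throw new IOException("Could not open an image input stream");
            }
            Iterator<ImageReader> readers = ImageIO.getImageReaders(iis);
            if (!readers.hasNext()) {
                throw new IOException("No ImageIO reader for this image format");
            }
            ImageReader reader = readers.next();
            try {
                // seekForwardOnly = true, ignoreMetadata = true: only the size is needed
                reader.setInput(iis, true, true);
                return new Dimension(reader.getWidth(0), reader.getHeight(0));
            } finally {
                reader.dispose();
            }
        }
    }

    /** Convenience overload that fetches the image from its URL. */
    public static Dimension readDimensions(URL url) throws IOException {
        try (InputStream in = url.openStream()) {
            return readDimensions(in);
        }
    }

    /**
     * Guess whether an image acts as a horizontal separator.
     * The thresholds (at most 5 px tall, at least 20 times wider than tall)
     * are arbitrary assumptions and would need tuning on real pages.
     */
    public static boolean looksLikeHorizontalRule(Dimension d) {
        return d.height <= 5 && d.width >= 20 * d.height;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ImageSizeProbe <image-url>
        Dimension d = readDimensions(URI.create(args[0]).toURL());
        System.out.println(d.width + "x" + d.height
                + " separator? " + looksLikeHorizontalRule(d));
    }
}
```

Relative src attributes would have to be resolved against the page URL before calling this, and fetching every image on a page could be slow, so it probably only makes sense for images whose width and height are missing from the HTML.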