icb01co2
asked on
HTML parsing - page segmentation
Hi Experts,
This is more of a group brainstorm than a question, and any ideas would be appreciated. I am about to develop a program in Java to parse an HTML document into a tree structure based on its tags. I then want to segment that tree into smaller trees that represent sections of the web page based on layout styles, e.g. one tree to represent the menu elements, one to represent a section of text, etc. I want to know if any of you have experience with any of the following:
- Theoretical ideas on how humans instinctively segment a web page document into sections.
- What can constitute vertical splits in HTML other than table and div tags?
- Horizontal splits in pages can sometimes be done using thin, long images. Is it possible (using Java) to evaluate the height and width of an image (if not specified in the HTML) using its URL as an argument?
- Can anyone think of any other obvious cues for page segmentation besides background colour, background images and borders?
- Some web pages use JavaScript to change the layout of a page depending on the browser, screen resolution, etc. Is it possible to write a program (using Java) that forces the JavaScript in a web document to be performed server-side, hence returning only HTML code? If not, can you think of any other way around this?
I know it's a lot, but any help would be appreciated.
Thanks, Chris.
ASKER
Wow, thanks for your help and quick response. I thought I'd just rambled on and no one would understand what I was going on about.
lol -- I always assume the same when I post a Q.
anyways, back to the Q...
>"Theoretical ideas on how humans instinctively segment a web page document into sections."
So, also going by what you've said prior to that quote, you basically want to read a document, and for each page, create sections. One with "content/text" one with "links" and one with imagery, for example???
Could you elaborate a little more on what you want to do exactly, so that I can get a clearer 'picture', and hopefully help a bit more :)
[r.D]
ASKER
Sorry, I'm having a few com problems. I'll get back to you ASAP.
Sure
ASKER
OK, sorry about that. I'll try to explain what I want to do, but I may not be able to. I'll have four main objects:
TagData: tag type, position, attributes, text within the tag, etc.
TagNode: made up of tag data and three other TagNodes (parent, sibling, child)
TagTree: basic data structure for holding a tag's frequency and attribute info; used to hold tag info as a tree data structure.
TreeTree: data structure that holds the trees after segmentation.
Mmm, if that doesn't make sense then consider this:
<table border="1" width = "400" height = "400">
<tr>
<td> SECTION A </td>
<td> <table border="1" width = "100%" height = "100%">
<tr><td > SECTION B </td></tr>
<tr><td> SECTION C </td></tr>
</table></td>
</tr>
</table>
So here there will be one TagTree contained in the TreeTree object, which contains all the TagNodes with their TagData set.
The TagTree will be traversed, and at the first indication of segmentation the tree will be split into two separate TagTree objects and stored as TreeTree elements.
Jesus, even I don't understand this. Good luck.
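For what it's worth, the four objects described above could look roughly like this in Java. This is only a sketch: the field names, the `addChild` helper and the first-child/next-sibling reading of "parent, sibling, child" are assumptions, not something settled in this thread.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Data carried by one tag: its type, attributes and the text directly inside it. */
final class TagData {
    final String tagType;                                         // e.g. "table", "td"
    final Map<String, String> attributes = new LinkedHashMap<>();  // e.g. border -> 1
    String text = "";

    TagData(String tagType) { this.tagType = tagType; }
}

/** A node linked to its parent, its first child and its next sibling. */
final class TagNode {
    final TagData data;
    TagNode parent, child, sibling;

    TagNode(TagData data) { this.data = data; }

    /** Appends newChild as the last child of this node (hypothetical helper). */
    TagNode addChild(TagNode newChild) {
        newChild.parent = this;
        if (child == null) {
            child = newChild;
        } else {
            TagNode last = child;
            while (last.sibling != null) last = last.sibling;
            last.sibling = newChild;
        }
        return newChild;
    }
}

/** A tree of TagNodes that also records how often each tag type occurs. */
final class TagTree {
    final TagNode root;
    final Map<String, Integer> tagFrequency = new LinkedHashMap<>();

    TagTree(TagNode root) {
        this.root = root;
        count(root);
    }

    // Counts the node itself, then recurses into each of its children.
    private void count(TagNode node) {
        if (node == null) return;
        tagFrequency.merge(node.data.tagType, 1, Integer::sum);
        for (TagNode c = node.child; c != null; c = c.sibling) count(c);
    }
}

/** Holds the TagTrees produced by segmentation, themselves arranged as a tree. */
final class TreeTree {
    final TagTree segment;
    final List<TreeTree> parts = new ArrayList<>();

    TreeTree(TagTree segment) { this.segment = segment; }
}
```

Keeping the frequency map on TagTree means it is computed once when the tree is built; it would need recomputing for each new TagTree created by a split.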
ASKER
...more detail
TagTree 1
<table border="1" width = "400" height = "400">
<tr>
<td> SECTION A </td>
<td> <table border="1" width = "100%" height = "100%">
<tr><td > SECTION B </td></tr>
<tr><td> SECTION C </td></tr>
</table></td>
</tr>
</table>
.....Becomes
TagTree 2
<table border="1" width = "400" height = "400">
<tr>
<td> SECTION A </td>
--------------------------
TagTree 3
<td> <table border="1" width = "100%" height = "100%">
<tr><td > SECTION B </td></tr>
<tr><td> SECTION C </td></tr>
</table></td>
</tr>
</table>
So the TreeTree object looks like this:
                    TagTree 1
                    /       \
                   /         \
TagTree 2 (Section A)   TagTree 3 (Section B and C)
and after the next split:
                    TagTree 1
                    /       \
                   /         \
TagTree 2 (Section A)   TagTree 3 (Section B and C)
                            /       \
                           /         \
        TagTree 4 (Section B)   TagTree 5 (Section C)
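The two splits drawn above can be sketched as a recursive pass over the tag tree. Everything here is hypothetical: `Node`, `Segment` and `Segmenter` are made-up names, and "a table with more than one cell" is just a stand-in for whatever split cues end up being used.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal stand-in for a parsed tag: a name, optional text and child tags. */
final class Node {
    final String tag;
    final String text;
    final List<Node> children = new ArrayList<>();

    Node(String tag, String text) { this.tag = tag; this.text = text; }

    Node add(Node child) { children.add(child); return this; }
}

/** One segment of the page plus the smaller segments it was split into. */
final class Segment {
    final Node root;
    final List<Segment> parts = new ArrayList<>();

    Segment(Node root) { this.root = root; }
}

final class Segmenter {
    /** Builds the segment tree for the subtree under root. */
    static Segment segment(Node root) {
        Segment seg = new Segment(root);
        // Split at the first point found, then keep splitting each piece.
        for (Node cell : firstSplit(root)) {
            seg.parts.add(segment(cell));
        }
        return seg;
    }

    /** Cells of the first table (depth-first) with two or more cells, else an empty list. */
    private static List<Node> firstSplit(Node node) {
        if (node.tag.equals("table")) {
            List<Node> cells = new ArrayList<>();
            for (Node row : node.children) {
                if (!row.tag.equals("tr")) continue;
                for (Node cell : row.children) {
                    if (cell.tag.equals("td")) cells.add(cell);
                }
            }
            if (cells.size() > 1) return cells;
        }
        for (Node child : node.children) {
            List<Node> cells = firstSplit(child);
            if (!cells.isEmpty()) return cells;
        }
        return new ArrayList<>();
    }
}
```

Run on the example table, `Segmenter.segment` should give a top segment with two parts (Section A, and the cell holding B and C), and the second part should itself have two parts (Section B and Section C), which is the shape of the TreeTree drawn above.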
Okay - I'm going to need to read over all of this a few more times lol -- I'll get back to you as soon as I've got an idea :)
[r.D]
ASKER
Anyway, I think this is too much info. I just really need to know what constitutes a split in a page. In my HTML code the segmentation is apparent with the table splits, with the table having a border="1" attribute.
Gasp, I think I've said all I can. If you got any of that then great; if not, no problem. I think it helped to put something down in writing.
Thanks, Chris.
hmm :o\ That's really complicated. lol.
>"In my HTML code the segmentation is apparent with the table splits, with the table having a border="1" attribute."
What do you mean exactly by "table splits"?
ta,
[r.D]
ASKER
lol, no problem. What I mean is that the </td><td> tags split the table vertically. This in itself doesn't constitute a logical layout segmentation; by that I mean that as humans we have the ability to look at a site and logically segment the page into sections.
What I'm trying to identify is the ways in which a web page can be split so that humans would recognize it. In my example the table splits at the </td><td> tags, but the table also has a border, so the split is apparent. Another trick would be different background colours, background images, etc.
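The cues named here (a border, a background colour, a background image) could be checked per tag with something like the sketch below. `SplitCues` is a hypothetical name, attribute names are assumed to have been lower-cased by the parser, and only the legacy HTML attributes `border`, `bgcolor` and `background` are looked at; styling done through CSS is not covered.

```java
import java.util.Map;

final class SplitCues {
    /** True if the tag's attributes suggest a boundary a reader would actually see. */
    static boolean isVisibleSplit(Map<String, String> attributes) {
        String border = attributes.get("border");
        if (border != null && !border.trim().equals("0")) {
            return true;                       // border="1" and the like
        }
        // A background colour or background image also sets an area apart.
        return attributes.containsKey("bgcolor") || attributes.containsKey("background");
    }
}
```

With the example table, the `border="1"` attribute makes `isVisibleSplit` return true, whereas a layout table with only `width` and `height` set would not count as a visible split.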
ASKER
Anyway, you have been more than patient and helped me no end. I'll give you these points before I forget. Thanks for everything.
Chris.
Thanks for the points Chris.
I think I know what you mean now, however, I'm not too confident on exactly how to do this. :o\ So, I doubt I have the ability to help you any further anyways.
Best of luck with it,
[r.D]
Check out www.javaalmanac.com, search for "HTML", and have a look through some of the results.
Perhaps use the java.util.* package to tokenize the document.
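As a rough idea of what tokenizing with java.util could look like, here is a small sketch using java.util.regex. `HtmlTokenizer` is a made-up name, and a regex like this is only an assumption that works for simple, tidy markup; comments, scripts and attribute values containing ">" would trip it up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HtmlTokenizer {
    // Matches either a tag such as <td> or </table>, or a run of text between tags.
    private static final Pattern TOKEN = Pattern.compile("<[^>]+>|[^<]+");

    /** Splits HTML into tag tokens and text tokens, dropping whitespace-only text. */
    static List<String> tokenize(String html) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            String token = m.group().trim();
            if (!token.isEmpty()) tokens.add(token);
        }
        return tokens;
    }
}
```

For example, `HtmlTokenizer.tokenize("<tr><td> SECTION A </td></tr>")` yields the five tokens `<tr>`, `<td>`, `SECTION A`, `</td>`, `</tr>`, which is enough to start building a tree from.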
>"is it possiable (using Java) to evaluate the height and width of an image(if not specified in the HTML) using its URL as an argument?"
Sure, use the code here to download an image:
http://www.javaalmanac.com/egs/java.net/GetImage.html
Then draw the Image onto a BufferedImage, and use the code here to get the resolution of the image:
http://www.javaalmanac.com/egs/java.awt.image/ImagePixel.html
I'll have a read over some of the other problems, and see what I can do ....
HTH :)
[r.D]
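For reference, a shorter route to the image size than the two linked examples is javax.imageio, which downloads and decodes in one call. This is a sketch rather than the code behind those links: `ImageSize` is a made-up name, and the ratio of 20 used to guess at a "thin long" separator image is an arbitrary assumption.

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.net.URL;
import javax.imageio.ImageIO;

final class ImageSize {
    /** Returns {width, height} in pixels for the image at the given URL. */
    static int[] of(URL url) throws IOException {
        BufferedImage image = ImageIO.read(url);   // fetches and decodes the image
        if (image == null) {
            // ImageIO.read returns null when no installed reader understands the data.
            throw new IOException("Not a supported image format: " + url);
        }
        return new int[] { image.getWidth(), image.getHeight() };
    }

    /** Rough guess at a horizontal separator image; the factor 20 is arbitrary. */
    static boolean looksLikeHorizontalRule(int width, int height) {
        return height > 0 && width >= 20 * height;
    }
}
```

Usage would be along the lines of `int[] size = ImageSize.of(new URL(src));` where `src` is the absolute URL taken from the img tag, followed by `looksLikeHorizontalRule(size[0], size[1])`.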