Solved

HTML parsing - page segmentation

Posted on 2004-10-17
Last Modified: 2011-10-03
Hi Experts,

This is more of a group brainstorm than a question, and any ideas would be appreciated. I am about to develop a program in Java to parse an HTML document into a tree structure based on its tags. I then want to segment that tree into smaller trees that represent segments of a web page based on layout styles, i.e. one tree to represent the menu elements, one to represent a section of text, etc. I want to know if any of you have any experience with any of the following:

- Theoretical ideas on how humans instinctively segment a web page document into sections.

- What can constitute vertical splits in HTML other than table and div tags?

- Horizontal splits in pages can sometimes be done using long, thin images. Is it possible (using Java) to evaluate the height and width of an image (if not specified in the HTML) using its URL as an argument?

- Can anyone think of any other obvious cues for page segmentation besides background colour, background images and borders?

- Some web pages use JavaScript to change the layout of a page depending on the browser, screen resolution, etc. Is it possible to make (using Java) a program that forces the JavaScript in a web document to be performed server-side, hence returning only HTML code? If not, can you think of any other way around this?

I know it's a lot, but any help would be appreciated.
Thanks, Chris.
Question by:icb01co2
14 Comments
 
LVL 9

Expert Comment

by:DrWarezz
ID: 12333665
Hi,
Check out www.javaalmanac.com and search for "HTML", then have a look at some of the results.
Use the java.util.* package to tokenize the document perhaps.
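
As an alternative to tokenizing by hand, the JDK ships an HTML parser in the Swing packages that reports tags through callbacks. The sketch below is only a starting point: the class name TagLister and the method startTags are made up for illustration, and the Swing parser is lenient and may report implied tags (such as html and body) that are not in the source.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: lists the start tags the Swing HTML parser reports.
class TagLister {
    /** Returns start-tag names in the order the parser reports them. */
    static List<String> startTags(String html) throws IOException {
        final List<String> tags = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                tags.add(tag.toString()); // e.g. "table", "tr", "td"
            }
        };
        // Third argument: ignore any charset declared inside the document.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return tags;
    }
}
```

A tree builder would keep a stack of open nodes, pushing in handleStartTag and popping in handleEndTag, instead of collecting names in a flat list.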

>"Is it possible (using Java) to evaluate the height and width of an image (if not specified in the HTML) using its URL as an argument?"
Sure, use the code here to download an image:
http://www.javaalmanac.com/egs/java.net/GetImage.html
Then draw the Image into a BufferedImage, and use the code here to get the resolution of the image:
http://www.javaalmanac.com/egs/java.awt.image/ImagePixel.html
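
For reference, a minimal sketch of the same idea using javax.imageio.ImageIO, which decodes straight into a BufferedImage. The class name ImageSize and the thresholds in looksLikeHorizontalRule are assumptions for illustration, not a tested rule for spotting divider images.

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.net.URI;
import java.net.URL;
import javax.imageio.ImageIO;

// Hypothetical helper: reads an image from a URL and reports its size.
class ImageSize {
    /** Returns {width, height} in pixels, or null if the data is not a readable image. */
    static int[] of(URL url) throws IOException {
        BufferedImage img = ImageIO.read(url); // downloads and decodes the whole image
        if (img == null) {
            return null; // no registered image reader understood the data
        }
        return new int[] { img.getWidth(), img.getHeight() };
    }

    /** Assumed rule of thumb: only a few pixels high and much wider than tall. */
    static boolean looksLikeHorizontalRule(int width, int height) {
        return height <= 5 && width >= 20 * height;
    }

    public static void main(String[] args) throws IOException {
        int[] size = of(URI.create(args[0]).toURL());
        System.out.println(size == null ? "not an image" : size[0] + " x " + size[1]);
    }
}
```

Note that this fetches the full image just to read two numbers, so a crawler would probably want to cache the result per URL.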

I'll have a read over some of the other problems, and see what I can do ....

HTH :)
[r.D]
 
LVL 1

Author Comment

by:icb01co2
ID: 12333679
Wow, thanks for your help and quick response. I thought I'd just rambled on and no one would understand what I was going on about.
 
LVL 9

Expert Comment

by:DrWarezz
ID: 12333693
lol -- I always assume the same when I post a Q.

anyways, back to the Q...

>"Theoretical ideas on how humans instinctively segment a web page document into sections."
So, going by what you said before that quote, you basically want to read a document and, for each page, create sections: one with "content/text", one with "links" and one with imagery, for example?

Could you elaborate a little more on what you want to do exactly, so that I can get a clearer 'picture', and hopefully help a bit more :)

[r.D]
 
LVL 9

Accepted Solution

by:
DrWarezz earned 325 total points
ID: 12333715
also..
>"- What can constitute vertical splits in HTML other than table and div tags?"
I don't personally know of any other method.

and
>"- Some web pages use JavaScript to change the layout of a page depending on the browser, screen resolution, etc. Is it possible to make (using Java) a program that forces the JavaScript in a web document to be performed server-side, hence returning only HTML code? If not, can you think of any other way around this?"
You can't make the JavaScript run server-side. However, don't most web pages that do this redirect you depending on the screen resolution? In which case, the redirection should prevent you from getting any JavaScript, and will only display the appropriate page.
Or, you could simply extract (and ignore) all JavaScript. (Simply remove everything between the "<script></script>" tags)...?
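
A minimal sketch of that "remove everything between the script tags" idea, using a regular expression. The class name ScriptStripper is hypothetical, and a regex is only a rough tool here: it can be fooled by a literal "</script>" inside a JavaScript string, and it leaves inline handlers such as onclick untouched.

```java
import java.util.regex.Pattern;

// Hypothetical helper: drops <script>...</script> blocks from an HTML string.
class ScriptStripper {
    // (?i) ignores case, (?s) lets '.' match line breaks, '.*?' stops at the first closing tag.
    private static final Pattern SCRIPT =
            Pattern.compile("(?is)<script\\b[^>]*>.*?</script\\s*>");

    static String strip(String html) {
        return SCRIPT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("<p>a</p><script>alert(1);</script><p>b</p>"));
    }
}
```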
 
LVL 1

Author Comment

by:icb01co2
ID: 12333767
Sorry, I'm having a few com problems. I'll get back to you ASAP.
 
LVL 9

Expert Comment

by:DrWarezz
ID: 12333772
Sure
 
LVL 1

Author Comment

by:icb01co2
ID: 12333944
OK, sorry about that. I'll try to explain what I want to do, but I may not be able to. I'll have four main objects:


TagData:  tag type, position, attributes, text within tag etc
TagNode: made up of tag data and three other TagNodes (parent, sibling, child)


TagTree:  basic data structure for holding a tag's frequency and attribute info,
               used to hold tag info as a tree data structure.


TreeTree: data structure that holds trees after segmentation.
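
The first two objects above could be sketched roughly as follows. This is only one possible reading of the description: the field names (position, firstChild, nextSibling) and the helpers addChild and childCount are assumptions for illustration, not part of the original design.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Data for one tag: type, position, attributes and contained text. */
class TagData {
    final String type;       // tag name, e.g. "table" or "td"
    final int position;      // assumed: character offset of the tag in the source
    final Map<String, String> attributes = new LinkedHashMap<>();
    String text = "";        // text found directly inside the tag

    TagData(String type, int position) {
        this.type = type;
        this.position = position;
    }
}

/** A node made up of tag data plus parent, sibling and child links. */
class TagNode {
    final TagData data;
    TagNode parent;
    TagNode firstChild;      // assumed name for the "child" link
    TagNode nextSibling;     // assumed name for the "sibling" link

    TagNode(TagData data) {
        this.data = data;
    }

    /** Appends a node as the last child of this node (hypothetical helper). */
    void addChild(TagNode child) {
        child.parent = this;
        if (firstChild == null) {
            firstChild = child;
            return;
        }
        TagNode last = firstChild;
        while (last.nextSibling != null) {
            last = last.nextSibling;
        }
        last.nextSibling = child;
    }

    /** Counts the direct children by walking the sibling chain. */
    int childCount() {
        int n = 0;
        for (TagNode c = firstChild; c != null; c = c.nextSibling) {
            n++;
        }
        return n;
    }
}
```

Keeping one child link and one sibling link per node lets a tag have any number of children without a list per node, at the cost of a walk along the sibling chain when appending.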

Hmm, if that doesn't make sense, then consider this:


<table border="1" width = "400" height = "400">
         <tr>
             <td> SECTION A </td>
           
             <td> <table border="1"  width = "100%" height = "100%">
                             <tr><td > SECTION B </td></tr>
                             <tr><td> SECTION C </td></tr>
                     </table>
</table>

So here there will be one TagTree contained in the TreeTree object which contains all TagNodes with set TagData.

The TagTree will be traversed and, at the first indication of segmentation, the tree would be split into two separate TagTree objects and stored as TreeTree elements.

Jesus, even I don't understand this. Good luck.
 
LVL 1

Author Comment

by:icb01co2
ID: 12333982
...more detail

TagTree 1


<table border="1" width = "400" height = "400">
         <tr>
             <td> SECTION A </td>
           
             <td> <table border="1"  width = "100%" height = "100%">
                             <tr><td > SECTION B </td></tr>
                             <tr><td> SECTION C </td></tr>
                     </table>
</table>

.....Becomes


TagTree 2


<table border="1" width = "400" height = "400">
         <tr>
             <td> SECTION A </td>


-----------------------------------------------------------------------------------

TagTree 3

           
             <td> <table border="1"  width = "100%" height = "100%">
                             <tr><td > SECTION B </td></tr>
                             <tr><td> SECTION C </td></tr>
                     </table>
</table>



So the TreeTree object looks like this:

                                  TagTree1
                                  /            \
                                /                \
      TagTree 2(Section A)           TagTree 3 (Section B and C)

and after next split:

                                   TagTree1
                                  /            \
                                /                \
    TagTree 2(Section A)           TagTree 3 (Section B and C)
                                                /            \
                                              /                \
                    TagTree 4(Section B)           TagTree 5(Section C)
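
The repeated splitting in the diagrams above could be sketched as a recursive pass. This is a simplified illustration under stated assumptions: HtmlNode, Segment and Segmenter are hypothetical names, a split is taken wherever a node has more than one tr or td child (ignoring borders and other visual cues), and each part holds only the row or cell subtree rather than a copy of the enclosing table tags.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, stripped-down stand-in for a TagNode.
class HtmlNode {
    final String tag;
    final String text;
    final List<HtmlNode> children = new ArrayList<>();

    HtmlNode(String tag, String text) {
        this.tag = tag;
        this.text = text;
    }

    HtmlNode add(HtmlNode child) {
        children.add(child);
        return this; // allows chained calls on the same parent
    }
}

// One element of the "TreeTree": a subtree plus the segments it was split into.
class Segment {
    final HtmlNode root;
    final List<Segment> parts = new ArrayList<>();

    Segment(HtmlNode root) {
        this.root = root;
    }
}

class Segmenter {
    /** Pre-order search for the first node with more than one row or cell child. */
    static HtmlNode firstSplitPoint(HtmlNode n) {
        int cells = 0;
        for (HtmlNode c : n.children) {
            if (c.tag.equals("td") || c.tag.equals("tr")) {
                cells++;
            }
        }
        if (cells > 1) {
            return n;
        }
        for (HtmlNode c : n.children) {
            HtmlNode found = firstSplitPoint(c);
            if (found != null) {
                return found;
            }
        }
        return null;
    }

    /** Splits at the first split point, then segments each resulting part in turn. */
    static Segment segment(HtmlNode root) {
        Segment s = new Segment(root);
        HtmlNode split = firstSplitPoint(root);
        if (split != null) {
            for (HtmlNode c : split.children) {
                s.parts.add(segment(c));
            }
        }
        return s;
    }
}
```

For the example table, the outer row gives two parts (the Section A cell and the cell holding the inner table), and the second part splits again into the Section B and Section C rows.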








 
LVL 9

Expert Comment

by:DrWarezz
ID: 12333998
Okay - I'm going to need to read over all of this a few more times lol -- I'll get back to you as soon as I've got an idea :)
[r.D]
 
LVL 1

Author Comment

by:icb01co2
ID: 12334011
Anyway, I think this is too much info. I just really need to know what constitutes a split in a page. In my HTML code the segmentation is apparent from the table splits, with the table having a border="1" attribute.

Gasp, I think I've said all I can. If you got any of that then great; if not, no problem. I think it helped to put something down in writing.

Thanks, Chris.
 
LVL 9

Expert Comment

by:DrWarezz
ID: 12334105
hmm :o\ That's really complicated. lol.

>"In my HTML code the segmentation is apparent from the table splits, with the table having a border="1" attribute."
What do you mean exactly by "table splits"?

ta,
[r.D]
 
LVL 1

Author Comment

by:icb01co2
ID: 12334147
lol, no problem. What I mean is that the </td><td> tags split the table vertically. This in itself doesn't constitute a logical layout segmentation. By that I mean that, as humans, we have the ability to look at a site and logically segment the page into sections.

What I'm trying to identify is the ways in which a web page can be split so that humans would recognize it. In my example the table splits at the </td><td> tags, but the table also has a border, so the split is apparent. Another trick would be different background colours, background images, etc.
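
A rough sketch of how those cues (border, background colour, background image) might be checked on a single element. The class name VisualCues and the method hasVisibleBoundary are hypothetical, the attribute names are assumed to have been lower-cased by the parser, and the inline-style test is a crude substring match rather than real CSS parsing.

```java
import java.util.Map;

// Hypothetical helper: guesses whether an element draws a boundary a human would notice.
class VisualCues {
    static boolean hasVisibleBoundary(Map<String, String> attrs) {
        String border = attrs.get("border");
        if (border != null && !border.trim().equals("0")) {
            return true;                       // e.g. <table border="1">
        }
        if (attrs.containsKey("bgcolor") || attrs.containsKey("background")) {
            return true;                       // own background colour or image
        }
        String style = attrs.get("style");     // crude check of inline CSS
        return style != null
                && (style.toLowerCase().contains("background")
                    || style.toLowerCase().contains("border"));
    }
}
```

A cell split would then only count as a segment boundary when the table or one of the cells involved passes a check like this. Rules in external stylesheets are not covered at all.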
 
LVL 1

Author Comment

by:icb01co2
ID: 12334173
Anyway, you have been more than patient and helped me no end. I'll give you these points before I forget. Thanks for everything.

Chris.
 
LVL 9

Expert Comment

by:DrWarezz
ID: 12338760
Thanks for the points Chris.
I think I know what you mean now, however, I'm not too confident on exactly how to do this. :o\ So, I doubt I have the ability to help you any further anyways.

Best of luck with it,
[r.D]