Solved

HTML Parser

Posted on 2002-05-10
Last Modified: 2011-04-14
Hi @ all :-)

All I need is a procedure that parses HTML files and fills a linked list with the tags, comments, text, etc. I have some problems with TIDY, because I can't find a real output routine, only FILE output (stdout or a normal file). Maybe someone has already adapted TIDY to do what I want.

summary:
- parsing html file
- splitting into a list
- maybe using TIDY

Bye, TDS.
Question by:Mathias
28 Comments
 
LVL 32

Expert Comment

by:jhance
ID: 7001018
Parsing HTML is not a trivial task due to the complexity of the format.  The good news is that there are any number of HTML parsers out there, so you don't have to invent your own.

http://www.w3.org/MarkUp/implementations.html
 
LVL 5

Expert Comment

by:proskig
ID: 7001596
I also needed an HTML parser at some point in the past. I stuck with SP. http://www.jclark.com/sp/index.htm

I can also recommend it for your task.
 
LVL 3

Author Comment

by:Mathias
ID: 7001858
proskig: I tested the SP program, but I can't compile it myself under DOS with DJGPP GCC :-( Make says that the command line is too long.
I don't want to put too much work into rewriting or developing a parser. I simply want a simple parser that splits the file into a B-Tree with tags and attributes.

jhance: On these pages the whole html language is explained but I can't find a real parser even in the libwww. It's all too complicated. I need to extract things out of the library and that's too much work.

I could post my Pascal code to everybody who wants to help me. I "only" need a translation to c/c++.

Bye, TDS.
 
LVL 3

Author Comment

by:Mathias
ID: 7003049
proskig: Can you give me a working example for SGML (SP) which works under GCC ???

Bye, TDS.
 
LVL 32

Expert Comment

by:jhance
ID: 7003054
>>It's all too complicated

Sorry, but HTML is complicated to parse.  If you want simple solutions, pick a simple problem.

If you can't use any of these (complex) solutions, buy a 3rd party parsing library that needs no building and comes with tech support handholding.

Unfortunately, not all problems have simple solutions.
 
LVL 3

Author Comment

by:Mathias
ID: 7003095
jhance: Give me your email and I will send you some sources in Pascal. You will see that I started a browser for a project two years ago and everything works well. It's not difficult if you have the overview.
I simply need the Parser I wrote in Pascal for C/C++. That's all. The rest I will translate on my own. But I have big difficulties with strings in C/C++. In Pascal all things are easy :-)

Bye, TDS.
 
LVL 4

Expert Comment

by:AssafLavie
ID: 7004076
I wrote a C++ stream based SGML parser a while ago and it wasn't all that complicated. You just have to plan it carefully before you start hacking away.
Anyway, I can't give it away because I'm not working at that company any more... But the point is that it's very possible to write one in a few days. I wrote mine as a state machine that reads and processes one char at a time from a stream. The major pain was the JavaScript and the ill-formed pages...

Of course, if you can find an open source implementation that suits your needs you should use it - I couldn't find any.
You can also look at the Perl implementation. It's really easy to use and you can integrate it with C++ code.
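
For readers who want to see the shape of such a state machine, here is a minimal C++ sketch. It is not the parser described above (that code was never posted); the names Token, TokenKind and tokenize are made up for illustration. It only separates text, tags and comments, and it deliberately ignores the hard cases (script contents, '>' inside quoted attribute values, unterminated markup).

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Kinds of tokens this sketch distinguishes (hypothetical names).
enum class TokenKind { Text, Tag, Comment };

struct Token {
    TokenKind kind;
    std::string value;  // text, or the tag/comment body without its delimiters
};

// Read one character at a time and switch between three states.
std::vector<Token> tokenize(std::istream& in) {
    enum class State { InText, InTag, InComment };
    State state = State::InText;
    std::vector<Token> tokens;
    std::string buf;
    char c;

    while (in.get(c)) {
        switch (state) {
        case State::InText:
            if (c == '<') {
                if (!buf.empty()) tokens.push_back({TokenKind::Text, buf});
                buf.clear();
                state = State::InTag;
            } else {
                buf += c;
            }
            break;
        case State::InTag:
            buf += c;
            if (buf == "!--") {             // "<!--" opens a comment
                buf.clear();
                state = State::InComment;
            } else if (c == '>') {
                buf.pop_back();             // drop the closing '>'
                tokens.push_back({TokenKind::Tag, buf});
                buf.clear();
                state = State::InText;
            }
            break;
        case State::InComment:
            buf += c;
            if (buf.size() >= 3 && buf.compare(buf.size() - 3, 3, "-->") == 0) {
                buf.erase(buf.size() - 3);  // drop the closing "-->"
                tokens.push_back({TokenKind::Comment, buf});
                buf.clear();
                state = State::InText;
            }
            break;
        }
    }
    // Flush trailing text; an unterminated tag or comment is silently dropped.
    if (state == State::InText && !buf.empty())
        tokens.push_back({TokenKind::Text, buf});
    return tokens;
}
```

Any std::istream can be passed in, for example a std::istringstream for testing or a std::ifstream for a file. A closing tag comes back with its leading slash, so "</b>" yields the value "/b".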
 
LVL 3

Author Comment

by:Mathias
ID: 7004142
The problem with parsing incorrect pages isn't that difficult. I run TIDY before parsing, and I think everything is in the right order after that. Could someone give me a small example of reading from a stream into a doubly linked list? I will program my own parser if nothing can be found on the web.
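
A small sketch of what such an example could look like in C++, assuming the standard library is acceptable: std::list is a doubly linked list, so no hand-written prev/next pointers are needed. The function names read_words and join_reversed are made up for this illustration, and the "items" here are just whitespace-separated words rather than HTML tokens.

```cpp
#include <istream>
#include <list>
#include <sstream>
#include <string>

// Read whitespace-separated words from any input stream into a doubly linked list.
std::list<std::string> read_words(std::istream& in) {
    std::list<std::string> words;
    std::string word;
    while (in >> word) {
        words.push_back(word);  // append a new node at the tail
    }
    return words;
}

// Walk the list backwards, from the last node to the first, joining with spaces.
std::string join_reversed(const std::list<std::string>& words) {
    std::string out;
    for (auto it = words.rbegin(); it != words.rend(); ++it) {
        if (!out.empty()) out += ' ';
        out += *it;
    }
    return out;
}
```

Because std::string grows on its own, most of the string handling that is painful in plain C disappears; the same functions work with a std::ifstream or a std::istringstream.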

Bye, TDS.
 
LVL 49

Expert Comment

by:DanRollins
ID: 7004753
What is TIDY?  Where can I find it?

If it is already outputting to a stream, then just intercept each of the output calls and instead put the data into a linked list.

-- Dan
 
LVL 3

Author Comment

by:Mathias
ID: 7004800
TIDY is an open-source project that fixes errors in HTML code. That's good for a parser which can't accept errors.
http://tidy.sourceforge.net/
The problem is that TIDY puts its output into a file or stdout. How can I intercept this output?

Bye, TDS.
 
LVL 49

Expert Comment

by:DanRollins
ID: 7005359
Just as I thought.  That program ALREADY generates a hierarchical linked list that defines the HTML document as it parses it.  It then walks the tree to output to a text file.  That would be the only logical way to do such a thing.

So, all of your work is done!  Just examine the function named PrintTree in the file pprint.c.  That will show you how to access the nodes of the hierarchical document.  Notice that it makes recursive calls to work its way into deeper and deeper levels.

-- Dan
 
LVL 3

Author Comment

by:Mathias
ID: 7006735
Yes, and that's my problem. This level system is too complicated for me. I'm just a beginner in C/C++. If you can give me an example of accessing the tree without a recursive procedure, you can have the points :-)
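
One way to walk such a tree without recursion is to keep the pending nodes on an explicit stack. The sketch below assumes a simplified, hypothetical Node with a first-child pointer (content) and a next-sibling pointer (next); it is not TIDY's actual struct, and visit_in_document_order is a made-up name.

```cpp
#include <stack>
#include <string>
#include <vector>

// Hypothetical node type for illustration only (not TIDY's real struct).
struct Node {
    std::string element;      // tag name
    Node* content = nullptr;  // first child
    Node* next = nullptr;     // next sibling
};

// Visit every node in document order using an explicit stack instead of recursion.
std::vector<std::string> visit_in_document_order(Node* root) {
    std::vector<std::string> order;
    std::stack<Node*> pending;
    if (root) pending.push(root);

    while (!pending.empty()) {
        Node* node = pending.top();
        pending.pop();
        order.push_back(node->element);

        // Push the sibling first so that the child is popped (visited) before it.
        if (node->next) pending.push(node->next);
        if (node->content) pending.push(node->content);
    }
    return order;
}
```

The stack does the bookkeeping that the call stack would otherwise do for a recursive function, so the visiting order is the same as with recursion.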

bye, TDS.
 
LVL 49

Expert Comment

by:DanRollins
ID: 7007231
Think about how HTML is arranged:

<body>
      <table>
              <tr>
                     <td>leaf</td>
                     <td>
                              <table>
                                       ...entire table in this td
                              </table>
                     </td>
              </tr>
              <tr>
                      <td>another leaf</td>
                      <td>another</td>
              </tr>
      </table>
</body>

Notice that many tags contain other tags.   And those tags contain other tags.  And those tags contain other tags.  A hierarchical tree structure is the inherent layout of the data and the obvious way to work with such a layout is to use a recursive algorithm.

In the above example, the same function that processes the outer <table> can be used to process the <table> that is embedded in the <td> of the outer table.
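
To make that concrete, here is a minimal recursive sketch. The Node struct is a simplified, hypothetical stand-in (first-child pointer content, next-sibling pointer next), not TIDY's real definition, and outline is a made-up function that just builds an indented listing of tag names.

```cpp
#include <cstddef>
#include <string>

// Hypothetical node type for illustration only (not TIDY's real struct).
struct Node {
    std::string element;      // tag name
    Node* content = nullptr;  // first child
    Node* next = nullptr;     // next sibling
};

// Append one indented line per node, then recurse into its children one level deeper.
void outline(const Node* node, int depth, std::string& out) {
    for (; node != nullptr; node = node->next) {
        out.append(static_cast<std::size_t>(depth) * 2, ' ');
        out += node->element;
        out += '\n';
        outline(node->content, depth + 1, out);  // same function handles nested tables
    }
}
```

The loop handles siblings and the recursive call handles children, so a <table> nested inside a <td> is processed by exactly the same code as the outer <table>, only with a larger depth.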

I suppose there is a way to process a tree structure without using recursion, but I can't think of a reason for doing it.  Is this an assignment for school?

I don't really understand your true goal here (and I'm pretty sure that YOU don't know what you want to do!).  You need to explore the tree structure to get a feel for how it works.  Use a debugger and trace through the steps that occur during the print generation.  You will see each node -- its text and attributes -- by putting the variable named node into the Watch window.

-- Dan
 
LVL 3

Author Comment

by:Mathias
ID: 7007681
First of all, it's not for school. Sorry, but I'm a good programmer in Pascal. There is a GUI for Pascal from Arsene v. Wyss. I coded some stuff for it, e.g. the Browser. Now I want to port this stuff to another GUI written in C/C++. I don't need a description of how HTML works; I think I have already solved many problems with my Pascal code. My only problem is the translation to C/C++.

Bye, TDS.
 
LVL 49

Expert Comment

by:DanRollins
ID: 7007756
>>There is a GUI for Pascal from Arsene v. Wyss. I coded some stuff for it, e.g. the Browser.

Are you trying to write a C++ GUI frontend for TIDY?

Is this question about how to capture the final text?  The text that is generated by TIDY after it has done all of the cleanup and indenting?

If so, the answer is easy.

-- Dan
 
LVL 2

Expert Comment

by:Andrey_Kulik
ID: 7008623
Hi TDS,

If you have well-formed HTML (1. every start-tag has an end-tag, 2. tags are closed in stack-like order), then you could use any XML parser...
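
A rough sketch of how those two conditions could be checked, assuming the tags have already been extracted as plain names ("b" for a start-tag, "/b" for an end-tag). The function name tags_are_balanced is made up, and the sketch ignores real-world details such as empty elements like <br>.

```cpp
#include <string>
#include <vector>

// Return true if every start-tag is closed by a matching end-tag in stack-like order.
// Tags are given as names: "b" is a start-tag, "/b" is the matching end-tag.
bool tags_are_balanced(const std::vector<std::string>& tags) {
    std::vector<std::string> open;  // used as a stack of currently open tags
    for (const std::string& tag : tags) {
        if (!tag.empty() && tag[0] == '/') {
            // An end-tag must match the most recently opened tag.
            if (open.empty() || open.back() != tag.substr(1)) return false;
            open.pop_back();
        } else {
            open.push_back(tag);
        }
    }
    return open.empty();  // nothing may be left open at the end
}
```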

Hope this helps,
Andrey
 
LVL 3

Author Comment

by:Mathias
ID: 7009711
No, I don't want to write a C++ GUI for TIDY. I simply need an HTML parser that puts its output into a linked list. The output will be like a browser output, a real HTML view. I will try to recode the TIDY output. Maybe I can solve the problem on my own.

Bye, TDS.
 
LVL 49

Expert Comment

by:DanRollins
ID: 7009757
>>The output will be like a browser output, a real html view.

Your goal is not becoming clear to me:

*  Browsers don't have output (other than on-screen display)
*  TIDY already outputs real HTML
*  HTML is not naturally a linked list; it is a stream of text.
*  An HTML view can be obtained by using a Browser.

I'm sure that you have a clear understanding of what you want to do.  But you simply have not explained it in sensible terms.  

If you take a few minutes to explain your needs, there is every chance that somebody here can help you.

-- Dan
 
LVL 3

Author Comment

by:Mathias
ID: 7010292
I think you can't or won't understand me. Browsers don't have an output? Hmm, I'm writing a browser for a GUI and there is an output. TIDY puts its output into a FILE and I don't understand how to rewrite the PrintTree function so that I simply get the linked list, which I can then process. HTML is of course a simple file, but if I parse that file, fill a linked list and output the formatted text to the screen, it will be a visual HTML view.
I hope you understand my goal: I simply want an example of using the TIDY "PrintTree" function without an output to a file. It should be a stream instead.

Bye, TDS.
 
LVL 49

Expert Comment

by:DanRollins
ID: 7010534
So all you want to do is display TIDY's cleaned up HTML as text?  Just as if you viewed it in Notepad (but without having to save it to a file first)?

Or are you actually writing a browser, like Netscape or Internet Explorer?  And if so, why on earth would you want to do that?

-- Dan
 
LVL 3

Author Comment

by:Mathias
ID: 7010622
Your second idea is right. Why? Hmm, I'm a programmer in a company and need something else to do in my spare time, so I'm developing my own OS with some others. This OS should have an HTML browser, too. That is my part...

Bye, TDS.
 
LVL 49

Accepted Solution

by:
DanRollins earned 300 total points
ID: 7011678
I fully understand now.  Thanks for the clarification.  Before I answer your question, could you help me with one thing?... Some writer friends and I are currently working on an Encyclopedia.  My task is M-Z.  What I need to know is this:  Is "Maaaaa" a word?  What is the correct spelling?  Does it have five A's or six? If you have any time, can you also look into "Maaaab" for me, too?

-=-==-=-=-=-=-
In the TIDY code, you will find a function named cout.  Every character that gets output goes through that function, so replace that function with your own and you have a stream of data.  

But your best points to intercept are these functions (all in the pprint.c file):

      PPrintTag(...)
      PPrintEndTag(...)
      PPrintComment(...)
      PPrintAttrs(...)
      PPrintAttribute(...)
      PPrintAttrValue(...)

etc.  For instance, in PPrintTag you get a Node*; the value of node->element is a char* to the tag text.  For instance, if it is "B" or "I", it indicates that you need to display the following text in Bold or Italics, respectively.  If it is "IMG", then node->attributes will be a pointer to a linked list of attributes (_attval structures).  One of them will have "SRC" as the 'attribute' and (e.g.) "../images/somepic.jpg" as the 'value'.
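
A minimal sketch of that attribute lookup. The Node and AttVal structs below are simplified stand-ins that only copy the field names mentioned above (element, attributes, attribute, value) plus an assumed next link; TIDY's real structs have more fields, so check the headers before relying on this. find_attribute and same_name are hypothetical helpers, not TIDY functions. The comparison ignores case, since this sketch does not assume whether names are stored as "SRC" or "src".

```cpp
#include <cctype>

// Simplified, hypothetical stand-ins for the structs described above.
struct AttVal {
    AttVal* next;           // next attribute in the linked list (assumed field)
    const char* attribute;  // attribute name, e.g. "SRC"
    const char* value;      // attribute value, e.g. "../images/somepic.jpg"
};

struct Node {
    const char* element;    // tag name, e.g. "IMG"
    AttVal* attributes;     // head of the attribute list, or nullptr
};

// Case-insensitive comparison of two C strings.
bool same_name(const char* a, const char* b) {
    for (; *a && *b; ++a, ++b) {
        if (std::tolower(static_cast<unsigned char>(*a)) !=
            std::tolower(static_cast<unsigned char>(*b))) return false;
    }
    return *a == *b;  // true only if both strings ended together
}

// Return the value of the named attribute, or nullptr if the node does not have it.
const char* find_attribute(const Node* node, const char* name) {
    for (const AttVal* av = node->attributes; av != nullptr; av = av->next) {
        if (av->attribute && same_name(av->attribute, name)) return av->value;
    }
    return nullptr;
}
```

With something like this, an intercepted PPrintTag could ask for the "src" of an "IMG" node and hand the value to the image loader instead of printing it.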

It will all be very easy.  I estimate that you can have a complete HTML browser up and running in a few short decades -- about the time that Internet access is accomplished mainly through Mental Telepathy.

-- Dan
 
LVL 3

Author Comment

by:Mathias
ID: 7011898
Hmm, okay, I've searched the knowledge databases around the web for the words and I couldn't find a definition :-( "Maaaaa" is something like the sound a sheep makes :-)

Bye, TDS.
 
LVL 49

Expert Comment

by:DanRollins
ID: 7022945
Hi TDS,
Do you have any additional questions?  Do any comments need clarification?

-- Dan
 
LVL 3

Author Comment

by:Mathias
ID: 7023211
Hmm, I'm currently trying to rewrite the code, but I'm very confused :-( In my eyes it's very difficult.

Bye, TDS.
 
LVL 49

Expert Comment

by:DanRollins
ID: 7023307
Yes, it is difficult.  

It is kind of like writing the M-Z section of an encyclopedia when you're not even sure if there is a word that is spelled 'Maaaaa' so there is no place to get started.

The good news is that the TIDY program already does most of the hard part for you.  You need to get the program running, and prepare a very simple HTML file for it to read.  Execute TIDY in the debugger and put a breakpoint in PPrintTag and then when execution breaks, single-step through the code.  Do this many times.  Use the Variables window to examine the variables.  Study each of the structures so that you know what is in each.

-- Dan
 
LVL 3

Author Comment

by:Mathias
ID: 7024160
Yeah, that's clear in my eyes :-)
I hope I will finish my work after the A-Levels.

PS: What to do with the points?

Bye, TDS.
 
LVL 49

Expert Comment

by:DanRollins
ID: 7025485
>>What to do with the points?

I suggest that you award them to me, as I have answered your original question quite completely.

I'll listen to differing opinions...

-- Dan
