HTML Parser

Hi @ all :-)

All I need is a procedure for parsing HTML files and filling a linked list with the tags, comments, text, etc. I have some problems with TIDY because I can't find a real output routine, only one for FILE output (stdout or a normal file). Maybe someone has already adapted TIDY to do what I want.

summary:
- parsing html file
- splitting into a list
- maybe using TIDY

Bye, TDS.
MathiasIT SpecialistAsked:
 
DanRollinsCommented:
I fully understand now.  Thanks for the clarification.  Before I answer your question, could you help me with one thing?... Some writer friends and I are currently working on an Encyclopedia.  My task is M-Z.  What I need to know is this:  Is "Maaaaa" a word?  What is the correct spelling?  Does it have five A's or six? If you have any time, can you also look into "Maaaab" for me, too?

-=-==-=-=-=-=-
In the TIDY code, you will find a function named cout.  Every character that gets output goes through that function, so replace that function with your own and you have a stream of data.  
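
As a rough illustration of "replace that function with your own", here is a minimal C sketch of a character sink that collects output in memory instead of writing to a FILE. The names CharSink, sink_putc and sink_free are made up for this example and are not part of TIDY; the idea is only that the replacement output function would call something like sink_putc for every character.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical growable buffer that stands in for the FILE* output. */
typedef struct {
    char  *data;   /* NUL-terminated text collected so far */
    size_t len;    /* number of characters stored */
    size_t cap;    /* allocated size of data */
} CharSink;

/* Append one character; returns 0 on success, -1 if memory runs out. */
int sink_putc(CharSink *sink, int c)
{
    assert(sink != NULL);
    if (sink->len + 2 > sink->cap) {          /* room for c plus the NUL */
        size_t new_cap = sink->cap ? sink->cap * 2 : 64;
        char *p = realloc(sink->data, new_cap);
        if (p == NULL)
            return -1;
        sink->data = p;
        sink->cap = new_cap;
    }
    sink->data[sink->len++] = (char)c;
    sink->data[sink->len] = '\0';
    return 0;
}

/* Release the buffer and reset the sink so it can be reused. */
void sink_free(CharSink *sink)
{
    assert(sink != NULL);
    free(sink->data);
    sink->data = NULL;
    sink->len = 0;
    sink->cap = 0;
}
```

Start with a zero-initialized CharSink, call sink_putc once per output character, and read the result from the data field when TIDY has finished.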

But your best points to intercept are these functions (all in the pprint.c file):

      PPrintTag(...)
      PPrintEndTag(...)
      PPrintComment(...)
      PPrintAttrs(...)
      PPrintAttribute(...)
      PPrintAttrValue(...)

etc.  For instance, in PPrintTag you get a Node*, and the value of node->element is a char* pointing to the tag text.  If it is "B" or "I", it indicates that you need to display the following text in bold or italics, respectively.  If it is "IMG", then node->attributes will be a pointer to a linked list of attributes (_attval structures).  One of them will have "SRC" as the 'attribute' and (e.g.) "../images/somepic.jpg" as the 'value'.
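
A minimal sketch of that attribute lookup in C. The Node and AttVal types below are simplified stand-ins written for this example, not copied from the TIDY sources, and the helper name find_attr is hypothetical. Only the field names element, attributes, attribute and value come from the description above; the next pointer is an assumption about how the list is linked.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Simplified stand-ins for the structures described above (assumed layout). */
typedef struct AttVal {
    struct AttVal *next;        /* next attribute of the same tag, or NULL */
    const char    *attribute;   /* attribute name, e.g. "SRC" */
    const char    *value;       /* attribute value, e.g. a file path */
} AttVal;

typedef struct Node {
    const char *element;        /* tag text, e.g. "IMG" */
    AttVal     *attributes;     /* head of the attribute list, or NULL */
} Node;

/* Compare two names without regard to case; returns 1 if they match. */
static int same_name(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;            /* both strings must end at the same place */
}

/* Return the value of the named attribute, or NULL if the tag lacks it. */
const char *find_attr(const Node *node, const char *name)
{
    const AttVal *av;
    assert(node != NULL && name != NULL);
    for (av = node->attributes; av != NULL; av = av->next) {
        if (av->attribute != NULL && same_name(av->attribute, name))
            return av->value;
    }
    return NULL;
}
```

For an IMG node, find_attr(node, "src") would then return the image path, whichever case the page author used for the attribute name.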

It will all be very easy.  I estimate that you can have a complete HTML browser up and running in a few short decades -- about the time that Internet access is accomplished mainly through Mental Telepathy.

-- Dan
 
jhanceCommented:
Parsing HTML is not a trivial task due to the complexity of the format.  The good news is that there are any number of HTML parsers out there, so you don't have to invent your own.

http://www.w3.org/MarkUp/implementations.html
 
proskigCommented:
I also needed an HTML parser at some point in the past. I stuck with SP. http://www.jclark.com/sp/index.htm

I can also recommend it for your task.
 
MathiasIT SpecialistAuthor Commented:
proskig: I tested the SP program, but I can't compile it myself under DOS with DJGPP GCC :-( Make says that the command line is too long.
I don't want to spend too much work on rewriting or developing a parser. I simply want a simple parser that splits the file into a B-tree of tags and attributes.

jhance: Those pages explain the whole HTML language, but I can't find a real parser, not even in libwww. It's all too complicated. I would need to extract things out of the library, and that's too much work.

I could post my Pascal code to everybody who wants to help me. I "only" need a translation to c/c++.

Bye, TDS.
 
MathiasIT SpecialistAuthor Commented:
proskig: Can you give me a working example for SGML (SP) which works under GCC ???

Bye, TDS.
 
jhanceCommented:
>>It's all too complicated

Sorry, but HTML is complicated to parse.  If you want simple solutions, pick a simple problem.

If you can't use any of these (complex) solutions, buy a 3rd party parsing library that needs no building and comes with tech support handholding.

Unfortunately, not all problems have simple solutions.
 
MathiasIT SpecialistAuthor Commented:
jhance: Give me your email and I will send you some sources in Pascal. You will see that I started a browser for a project two years ago and everything works well. It's not difficult if you have the overview.
I simply need the parser I wrote in Pascal translated to C/C++. That's all. The rest I will translate on my own. But I have big difficulties with strings in C/C++. In Pascal all things are easy :-)

Bye, TDS.
 
AssafLavieCommented:
I wrote a C++ stream based SGML parser a while ago and it wasn't all that complicated. You just have to plan it carefully before you start hacking away.
Anyway, I can't give it away because I'm not working at that company any more... But the point is that it's very possible to write one in a few days. I wrote mine as a state machine that reads and processes one char at a time from a stream. The major headaches were the JavaScript and the ill-formed pages...
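
To make the state-machine idea concrete, here is a small C sketch. It is not the parser described above (which was not released), just one possible shape for it. All names (Tokenizer, tok_feed, tok_finish, tok_run, count_tokens) are made up for this example. It knows only two states, "in text" and "between < and >", treats <!-- ... --> as a comment, and deliberately ignores the hard cases mentioned above, such as script content and a > inside a quoted attribute value. Tokens longer than the fixed buffer are silently truncated.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum { TOK_TEXT, TOK_TAG, TOK_COMMENT };

/* Called once per token; text excludes the surrounding '<' and '>'. */
typedef void (*EmitFn)(int kind, const char *text, void *user);

typedef struct {
    int    in_markup;     /* state: 0 = in text, 1 = between '<' and '>' */
    char   buf[1024];     /* characters of the token being collected */
    size_t len;
    EmitFn emit;
    void  *user;          /* passed through to emit unchanged */
} Tokenizer;

/* Hand the collected characters to the callback and start a new token. */
static void flush(Tokenizer *t, int kind)
{
    if (t->len > 0) {
        t->buf[t->len] = '\0';
        t->emit(kind, t->buf, t->user);
        t->len = 0;
    }
}

/* Process one character from the input stream. */
void tok_feed(Tokenizer *t, int c)
{
    assert(t != NULL && t->emit != NULL);
    if (!t->in_markup) {
        if (c == '<') { flush(t, TOK_TEXT); t->in_markup = 1; return; }
    } else if (c == '>') {
        int comment = t->len >= 3 && strncmp(t->buf, "!--", 3) == 0;
        if (!comment) { flush(t, TOK_TAG); t->in_markup = 0; return; }
        /* A comment ends only at "-->", so a lone '>' stays inside it. */
        if (t->len >= 5 && strncmp(t->buf + t->len - 2, "--", 2) == 0) {
            flush(t, TOK_COMMENT); t->in_markup = 0; return;
        }
    }
    if (t->len + 1 < sizeof t->buf)
        t->buf[t->len++] = (char)c;
}

/* Call at end of input: emits trailing text, drops an unfinished tag. */
void tok_finish(Tokenizer *t)
{
    if (!t->in_markup)
        flush(t, TOK_TEXT);
    t->len = 0;
}

/* Read a whole stream one character at a time. */
void tok_run(Tokenizer *t, FILE *in)
{
    int c;
    while ((c = fgetc(in)) != EOF)
        tok_feed(t, c);
    tok_finish(t);
}

/* Example callback: counts tokens per kind in an int[3] passed as user. */
void count_tokens(int kind, const char *text, void *user)
{
    (void)text;
    ((int *)user)[kind]++;
}
```

A real callback would append each token to a list or tree instead of counting. Keeping the tokenizer independent of where the tokens go is what makes it easy to swap a FILE for any other sink.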

Of course, if you can find an open source implementation that suits your needs you should use it - I couldn't find any.
You can also look at the perl implementation. It's really easy to use and you can integrate it with C++ code.
 
MathiasIT SpecialistAuthor Commented:
Parsing incorrect pages isn't that difficult a problem. I'm running TIDY before parsing, and I think everything is in the right order after that. Could someone give me a small example of reading from a stream into a doubly linked list? I will program my own parser if nothing can be found on the web.

Bye, TDS.
 
DanRollinsCommented:
What is TIDY?  Where can I find it?

If it is already outputting to a stream, then just intercept each of the output calls and instead put the data into a linked list.
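
A minimal C sketch of the "put the data into a linked list" half of that suggestion, using a doubly linked list as asked for earlier in the thread. The types Item and ItemList and the functions list_append and list_free are invented for this example; the intercepted output calls would call list_append with whatever tag, text or comment they were about to print.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One entry of the list: a tag, a run of text, a comment, ... */
typedef struct Item {
    struct Item *prev;
    struct Item *next;
    int          kind;    /* caller-defined, e.g. 0 = tag, 1 = text */
    char        *text;    /* private copy of the item's text */
} Item;

typedef struct {
    Item  *head;
    Item  *tail;
    size_t count;
} ItemList;

/* Append a copy of text to the end of the list; NULL if out of memory. */
Item *list_append(ItemList *list, int kind, const char *text)
{
    Item  *it;
    size_t n;

    assert(list != NULL && text != NULL);
    it = malloc(sizeof *it);
    if (it == NULL)
        return NULL;
    n = strlen(text) + 1;
    it->text = malloc(n);
    if (it->text == NULL) {
        free(it);
        return NULL;
    }
    memcpy(it->text, text, n);
    it->kind = kind;
    it->next = NULL;
    it->prev = list->tail;          /* link backwards to the old tail */
    if (list->tail != NULL)
        list->tail->next = it;      /* link the old tail forwards */
    else
        list->head = it;            /* first item in an empty list */
    list->tail = it;
    list->count++;
    return it;
}

/* Free every item and leave the list empty. */
void list_free(ItemList *list)
{
    Item *it;
    assert(list != NULL);
    it = list->head;
    while (it != NULL) {
        Item *next = it->next;
        free(it->text);
        free(it);
        it = next;
    }
    list->head = NULL;
    list->tail = NULL;
    list->count = 0;
}
```

Because each item carries both prev and next, a renderer can walk the list forwards to draw the page and backwards to find, say, the tag that opened the current run of text.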

-- Dan
 
MathiasIT SpecialistAuthor Commented:
TIDY is an open-source project for fixing errors in HTML code. That's good for a parser which can't accept errors.
http://tidy.sourceforge.net/
The problem is that TIDY puts its output into a file or stdout. How can I intercept this output?

Bye, TDS.
 
DanRollinsCommented:
Just as I thought.  That program ALREADY generates a hierarchical linked list that defines the HTML document as it parses it.  It then walks the tree to output to a text file.  That would be the only logical way to do such a thing.

So, all of your work is done!  Just examine the function named PrintTree in the file pprint.c.  That will show you how to access the nodes of the hierarchical document.  Notice that it makes recursive calls to work its way into deeper and deeper levels.
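
The recursive pattern can be sketched in C as follows. This is not TIDY's PrintTree; it only shows the general shape of such a walk. The Node layout is a simplified assumption (first-child pointer called content, sibling pointer called next), and walk, print_node and note_depth are names invented for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Simplified stand-in for a document node (assumed layout). */
typedef struct Node {
    struct Node *parent;    /* enclosing element, NULL at the top */
    struct Node *next;      /* next sibling, or NULL */
    struct Node *content;   /* first child, or NULL for a leaf */
    const char  *element;   /* tag name, NULL for a text node */
} Node;

typedef void (*VisitFn)(const Node *node, int depth, void *user);

/* Visit node, its children, then its siblings, in document order. */
void walk(const Node *node, int depth, VisitFn visit, void *user)
{
    assert(visit != NULL);
    for (; node != NULL; node = node->next) {
        visit(node, depth, user);
        walk(node->content, depth + 1, visit, user);  /* one level deeper */
    }
}

/* Example visitor: print the tag name indented by depth to a FILE*. */
void print_node(const Node *node, int depth, void *user)
{
    fprintf((FILE *)user, "%*s%s\n", depth * 2, "",
            node->element != NULL ? node->element : "(text)");
}

/* Example visitor: remember the deepest level seen in an int. */
void note_depth(const Node *node, int depth, void *user)
{
    int *deepest = user;
    (void)node;
    if (depth > *deepest)
        *deepest = depth;
}
```

Calling walk(root, 0, print_node, stdout) prints an indented outline of the document; swapping in a different visitor is how the same walk can fill a list or draw to the screen instead of printing.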

-- Dan
 
MathiasIT SpecialistAuthor Commented:
Yes, and that's my problem. This level system is too complicated for me. I'm just a beginner in c/c++. If you can give me an example of accessing the tree without a recursive procedure, you can have the points :-)

bye, TDS.
 
DanRollinsCommented:
Think about how HTML is arranged:

<body>
      <table>
              <tr>
                     <td>leaf</td>
                     <td>
                              <table>
                                       ...entire table in this td
                              </table>
                     </td>
              </tr>
              <tr>
                      <td>another leaf</td>
                      <td>another</td>
              </tr>
      </table>
</body>

Notice that many tags contain other tags.   And those tags contain other tags.  And those tags contain other tags.  A hierarchical tree structure is the inherent layout of the data, and the obvious way to work with such a layout is to use a recursive algorithm.

In the above example, the same function that processes the outer <table> can be used to process the <table> that is embedded in the <td> of the outer table.

I suppose there is a way to process a tree structure without using recursion, but I can't think of a reason for doing it.  Is this an assignment for school?
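
For what it's worth, a loop-based walk is possible when every node keeps a pointer to its parent as well as to its first child and next sibling. The C sketch below assumes exactly that layout; the Node type is a simplified stand-in and the function names are invented for this example, so treat it as an illustration of the technique and not as TIDY code.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a document node (assumed layout). */
typedef struct Node {
    struct Node *parent;    /* enclosing element, NULL at the top */
    struct Node *next;      /* next sibling, or NULL */
    struct Node *content;   /* first child, or NULL for a leaf */
    const char  *element;   /* tag name */
} Node;

/* Return the node that follows 'node' in document order within the
   subtree rooted at 'root', or NULL when the subtree is exhausted. */
const Node *next_in_document_order(const Node *root, const Node *node)
{
    assert(root != NULL && node != NULL);
    if (node->content != NULL)
        return node->content;           /* go down first */
    while (node != NULL && node != root) {
        if (node->next != NULL)
            return node->next;          /* then sideways */
        node = node->parent;            /* otherwise climb back up */
    }
    return NULL;
}

/* Example use: count every node in a subtree with a plain loop. */
int count_nodes(const Node *root)
{
    int n = 0;
    const Node *node;
    for (node = root; node != NULL;
         node = next_in_document_order(root, node))
        n++;
    return n;
}
```

The parent pointer does the job that the call stack does in the recursive version: it remembers where to continue once a nested element has been finished.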

I don't really understand your true goal here (and I'm pretty sure that YOU don't know what you want to do!).  You need to explore the tree structure to get a feel for how it works.  Use a debugger and trace through the steps that occur during the print generation.  You will see each node -- its text and attributes -- by putting the variable named node into the Watch window.

-- Dan
 
MathiasIT SpecialistAuthor Commented:
First of all, it's not for school. Sorry, but I'm a good programmer in Pascal. There is a GUI for Pascal from Arsene v. Wyss. I coded some stuff for it, e.g. the Browser. Now I want to port this stuff to another GUI written in c/c++. I don't need a description of how HTML works; I think I have already solved many problems with my Pascal code. My only problem is the translation to c/c++.

Bye, TDS.
 
DanRollinsCommented:
>>There is a GUI for Pascal from Arsene v. Wyss. I coded some stuff for it, e.g. the Browser.

Are you trying to write a C++ GUI frontend for TIDY?

Is this question about how to capture the final text?  The text that is generated by TIDY after it has done all of the cleanup and indenting?

If so, the answer is easy.

-- Dan
 
Andrey_KulikCommented:
Hi TDS,

if you have well-formed HTML (1. every start tag has an end tag; 2. tags are nested in stack-like order), then you could use any XML parser...

Hope this helps,
Andrey
 
MathiasIT SpecialistAuthor Commented:
No, I don't want to write a C++ GUI for TIDY. I simply need an HTML parser which puts its output into a linked list. The output will be like a browser output, a real HTML view. I will try to recode the TIDY output. Maybe I can solve the problem on my own.

Bye, TDS.
 
DanRollinsCommented:
>>The output will be like a browser output, a real html view.

Your goal is not becoming clear to me:

*  Browsers don't have output (other than on-screen display)
*  TIDY already outputs real HTML
*  HTML is not naturally a linked list; it is a stream of text.
*  An HTML view can be obtained by using a Browser.

I'm sure that you have a clear understanding of what you want to do.  But you simply have not explained it in sensible terms.  

If you take a few minutes to explain your needs, there is every chance that somebody here can help you.

-- Dan
 
MathiasIT SpecialistAuthor Commented:
I think you can't or won't understand me. Browsers don't have an output? Hmm, I'm writing a browser for a GUI and there is an output. TIDY puts its output into a FILE, and I don't understand how to rewrite the PrintTree function so that I simply get the linked list, which I can then process. HTML is of course a simple file, but if I parse that file, fill a linked list and output the formatted text to the screen, it will be a visual HTML view.
I hope you understand my goal: I simply want an example of using the TIDY "PrintTree" function without an output to a file. It should be a stream instead.

Bye, TDS.
 
DanRollinsCommented:
So all you want to do is display TIDY's cleaned up HTML as text?  Just as if you viewed it in Notepad (but without having to save it to a file first)?

Or are you actually writing a browser, like Netscape or Internet Explorer?  And if so, why on earth would you want to do that?

-- Dan
 
MathiasIT SpecialistAuthor Commented:
Your second idea is right. Why? Hmm, I'm a programmer in a company and need something different in my spare time, so I'm developing my own OS with some others. This OS should have an HTML browser, too. That is my part...

Bye, TDS.
 
MathiasIT SpecialistAuthor Commented:
Hmm, okay, I've searched the knowledge databases around the web for the words and I couldn't find an explanation :-( "Maaaaa" is something like the speech of a sheep :-)

Bye, TDS.
 
DanRollinsCommented:
Hi TDS,
Do you have any additional questions?  Do any comments need clarification?

-- Dan
 
MathiasIT SpecialistAuthor Commented:
Hmm, I'm currently trying to rewrite the code, but I'm very confused :-( In my eyes it's very difficult.

Bye, TDS.
 
DanRollinsCommented:
Yes, it is difficult.  

It is kind of like writing the M-Z section of an encyclopedia when you're not even sure if there is a word that is spelled 'Maaaaa' so there is no place to get started.

The good news is that the TIDY program already does most of the hard part for you.  You need to get the program running, and prepare a very simple HTML file for it to read.  Execute TIDY in the debugger and put a breakpoint in PPrintTag and then when execution breaks, single-step through the code.  Do this many times.  Use the Variables window to examine the variables.  Study each of the structures so that you know what is in each.

-- Dan
 
MathiasIT SpecialistAuthor Commented:
Yeah, that's clear in my eyes :-)
I hope I will finish my work after the A-Levels.

PS: What to do with the points?

Bye, TDS.
 
DanRollinsCommented:
>>What to do with the points?

I suggest that you award them to me, as I have answered your original question quite completely.

I'll listen to differing opinions...

-- Dan