Solved

HTML Parser

Posted on 2002-05-10
Last Modified: 2011-04-14
Hi @ all :-)

All I need is a procedure that parses HTML files and fills a linked list with the tags, comments, text, etc. I have some problems with TIDY because I can't find an output routine other than the one for FILE output (stdout or a normal file). Maybe someone has already adapted TIDY to do what I want.

summary:
- parsing html file
- splitting into a list
- maybe using TIDY

Bye, TDS.
Question by:Mathias
 
LVL 32

Expert Comment

by:jhance
Parsing HTML is not a trivial task due to the complexity of the format.  The good news is that there are any number of HTML parsers out there, so you don't have to invent your own.

http://www.w3.org/MarkUp/implementations.html
 
LVL 5

Expert Comment

by:proskig
I also needed an HTML parser at some point in the past. I stuck with SP: http://www.jclark.com/sp/index.htm

I can also recommend it for your task.
 
LVL 3

Author Comment

by:Mathias
proskig: I tested the SP program, but I can't compile it myself under DOS with DJGPP GCC :-( Make says that the command line is too long.
I don't want to put a lot of work into rewriting or developing a parser. I simply want a simple parser which splits the file into a B-Tree of tags and attributes.

jhance: Those pages explain the whole HTML language, but I can't find a real parser, even in libwww. It's all too complicated. I would need to extract things out of the library, and that's too much work.

I could post my Pascal code to everybody who wants to help me. I "only" need a translation to c/c++.

Bye, TDS.
 
LVL 3

Author Comment

by:Mathias
proskig: Can you give me a working example for SGML (SP) which works under GCC ???

Bye, TDS.
 
LVL 32

Expert Comment

by:jhance
>>It's all too complicated

Sorry, but HTML is complicated to parse.  If you want simple solutions, pick a simple problem.

If you can't use any of these (complex) solutions, buy a 3rd party parsing library that needs no building and comes with tech support handholding.

Unfortunately, not all problems have simple solutions.
 
LVL 3

Author Comment

by:Mathias
jhance: Give me your email and I will send you some sources in Pascal. You will see that I started a browser for a project two years ago and everything works well. It's not difficult once you have the overview.
I simply need the Parser I wrote in Pascal for C/C++. That's all. The rest I will translate on my own. But I have big difficulties with strings in C/C++. In Pascal all things are easy :-)

Bye, TDS.
 
LVL 4

Expert Comment

by:AssafLavie
I wrote a C++ stream-based SGML parser a while ago and it wasn't all that complicated. You just have to plan it carefully before you start hacking away.
Anyway, I can't give it away because I'm not working at that company any more... But the point is that it's very possible to write one in a few days. I wrote mine as a state machine that reads and processes one character at a time from a stream. The biggest headaches were the JavaScript and the ill-formed pages...

Of course, if you can find an open source implementation that suits your needs you should use it. I couldn't find any.
You can also look at the Perl implementation. It's really easy to use and you can integrate it with C++ code.
 
LVL 3

Author Comment

by:Mathias
Parsing incorrect pages isn't that difficult a problem. I run TIDY before parsing, and I think everything is in the right order after that. Could someone give me a small example of reading from a stream into a doubly linked list? I will write my own parser if nothing can be found on the web.

Bye, TDS.
 
LVL 49

Expert Comment

by:DanRollins
What is TIDY?  Where can I find it?

If it is already outputting to a stream, then just intercept each of the output calls and put the data into a linked list instead.

-- Dan
 
LVL 3

Author Comment

by:Mathias
TIDY is an open-source project for fixing errors in HTML code. That's good for a parser which can't accept errors.
http://tidy.sourceforge.net/
The problem is that TIDY writes its output to a file or stdout. How can I intercept this output?

Bye, TDS.
 
LVL 49

Expert Comment

by:DanRollins
Just as I thought.  That program ALREADY generates a hierarchical linked list that defines the HTML document as it parses it.  It then walks the tree to output to a text file.  That would be the only logical way to do such a thing.

So, all of your work is done!  Just examine the function named PrintTree in the file pprint.c.  That will show you how to access the nodes of the hierarchical document.  Notice that it makes recursive calls to work its way into deeper and deeper levels.
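
To make the recursive walk concrete, here is a minimal sketch. It does not use TIDY's real declarations: the `Node` struct below is a simplified stand-in with only a first-child pointer (`content`) and a next-sibling pointer (`next`), and `Entry` and `flatten` are hypothetical names. The function visits each node, recurses into its children, then moves on to the next sibling, and records what it sees in a flat `std::list` instead of writing to a file.

```cpp
#include <list>
#include <string>
#include <utility>

// Simplified stand-in for a parsed-document node (NOT TIDY's real struct).
struct Node {
    std::string element;      // tag name, or the text of a leaf
    Node* content = nullptr;  // first child
    Node* next = nullptr;     // next sibling
    explicit Node(std::string name) : element(std::move(name)) {}
};

// One flattened entry: the element plus how deep it sits in the tree.
struct Entry {
    std::string element;
    int depth;
};

// Records the node, then its children (one level deeper), then each
// following sibling at the same level.
void flatten(const Node* node, int depth, std::list<Entry>& out) {
    for (; node != nullptr; node = node->next) {
        out.push_back({node->element, depth});
        flatten(node->content, depth + 1, out);  // recursive call for children
    }
}
```

Calling `flatten(root, 0, list)` leaves the list in document order, with the depth available for anything that needs to know the nesting level.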

-- Dan
 
LVL 3

Author Comment

by:Mathias
Yes, and that's my problem. This level system is too complicated for me. I'm just a beginner in C/C++. If you can give me an example of accessing the tree without a recursive procedure, you can have the points :-)

bye, TDS.
 
LVL 49

Expert Comment

by:DanRollins
Think about how HTML is arranged:

<body>
    <table>
        <tr>
            <td>leaf</td>
            <td>
                <table>
                    ...entire table in this td
                </table>
            </td>
        </tr>
        <tr>
            <td>another leaf</td>
            <td>another</td>
        </tr>
    </table>
</body>

Notice that many tags contain other tags.   And those tags contain other tags.  And those tags contain other tags.  A hierarchical tree structure is the inherent layout of the data, and the obvious way to work with such a layout is to use a recursive algorithm.

In the above example, the same function that processes the outer <table> can be used to process the <table> that is embedded in the <td> of the outer table.

I suppose there is a way to process a tree structure without using recursion, but I can't think of a reason for doing it.  Is this an assignment for school?
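
For what it's worth, a tree can be walked without recursion by keeping the nodes still to be visited on an explicit stack. The sketch below assumes a simplified `Node` with a first-child pointer (`content`) and a next-sibling pointer (`next`); the struct and the name `visit_all` are illustrative and are not TIDY's own declarations.

```cpp
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for a document node (hypothetical, for illustration).
struct Node {
    std::string element;      // tag name
    Node* content = nullptr;  // first child
    Node* next = nullptr;     // next sibling
    explicit Node(std::string name) : element(std::move(name)) {}
};

// Visits every node in document order using an explicit stack
// instead of recursive calls.
std::vector<std::string> visit_all(const Node* root) {
    std::vector<std::string> order;
    std::vector<const Node*> pending;  // used as a stack
    if (root != nullptr) pending.push_back(root);

    while (!pending.empty()) {
        const Node* node = pending.back();
        pending.pop_back();
        order.push_back(node->element);
        // Push the sibling first so that the child is popped, and therefore
        // visited, before the sibling.
        if (node->next != nullptr) pending.push_back(node->next);
        if (node->content != nullptr) pending.push_back(node->content);
    }
    return order;
}
```

The stack does the bookkeeping that the call stack does in the recursive version, so the visiting order is the same; the recursive form is usually shorter to write.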

I don't really understand your true goal here (and I'm pretty sure that YOU don't know what you want to do!).  You need to explore the tree structure to get a feel for how it works.  Use a debugger and trace through the steps that occur during the print generation.  You will see each node -- its text and attributes -- by putting the variable named node into the Watch window.

-- Dan
 
LVL 3

Author Comment

by:Mathias
First of all, it's not for school. Sorry, but I'm a good programmer in Pascal. There is a GUI for Pascal from Arsene v. Wyss. I coded some stuff for it, e.g. the Browser. Now I want to port this stuff to another GUI written in C/C++. I don't need a description of how HTML works; I think I have already solved many problems with my Pascal code. My only problem is the translation to C/C++.

Bye, TDS.
 
LVL 49

Expert Comment

by:DanRollins
>>There is a GUI for Pascal from Arsene v. Wyss. I coded some stuff for it, e.g. the Browser.

Are you trying to write a C++ GUI frontend for TIDY?

Is this question about how to capture the final text?  The text that is generated by TIDY after it has done all of the cleanup and indenting?

If so, the answer is easy.

-- Dan
 
LVL 2

Expert Comment

by:Andrey_Kulik
Hi TDS,

If you have well-formed HTML (1. every start tag has an end tag; 2. tags are nested in stack-like order), then you could use any XML parser...

Hope this helps,
Andrey
 
LVL 3

Author Comment

by:Mathias
No, I don't want to write a C++ GUI for TIDY. I simply need an HTML parser which puts its output into a linked list. The output will be like a browser output, a real HTML view. I will try to recode TIDY's output. Maybe I can solve the problem on my own.

Bye, TDS.
 
LVL 49

Expert Comment

by:DanRollins
>>The output will be like a browser output, a real html view.

Your goal is not becoming clear to me:

*  Browsers don't have output (other than on-screen display)
*  TIDY already outputs real HTML
*  HTML is not naturally a linked list; it is a stream of text.
*  An HTML view can be obtained by using a Browser.

I'm sure that you have a clear understanding of what you want to do.  But you simply have not explained it in sensible terms.  

If you take a few minutes to explain your needs, there is every chance that somebody here can help you.

-- Dan
 
LVL 3

Author Comment

by:Mathias
I think you can't or won't understand me. Browsers don't have an output? Hmm, I'm writing a browser for a GUI and there is an output. TIDY puts its output into a FILE, and I don't understand how to rewrite the PrintTree function so that I simply get the linked list, which I can then process. HTML is of course a simple file, but if I parse that file, fill a linked list, and output the formatted text to the screen, it will be a visual HTML view.
I hope you understand my goal: I simply want an example of using the TIDY "PrintTree" function without an output to a file. It should be a stream instead.

Bye, TDS.
 
LVL 49

Expert Comment

by:DanRollins
So all you want to do is display TIDY's cleaned up HTML as text?  Just as if you viewed it in Notepad (but without having to save it to a file first)?

Or are you actually writing a browser, like Netscape or Internet Explorer?  And if so, why on earth would you want to do that?

-- Dan
 
LVL 3

Author Comment

by:Mathias
Your second idea is right. Why? Hmm, I'm a programmer in a company and want something different in my spare time, so I'm developing our own OS together with some others. This OS should have an HTML browser, too. That is my part...

Bye, TDS.
 
LVL 49

Accepted Solution

by:
DanRollins earned 300 total points
I fully understand now.  Thanks for the clarification.  Before I answer your question, could you help me with one thing?... Some writer friends and I are currently working on an Encyclopedia.  My task is M-Z.  What I need to know is this:  Is "Maaaaa" a word?  What is the correct spelling?  Does it have five A's or six? If you have any time, can you also look into "Maaaab" for me, too?

-=-==-=-=-=-=-
In the TIDY code, you will find a function named cout.  Every character that gets output goes through that function, so replace that function with your own and you have a stream of data.  
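
A minimal sketch of that idea, assuming the per-character output routine can be pointed at an in-memory buffer instead of a `FILE*`. The names `MemoryOut`, `put_char`, and `print_tag` are invented for illustration and do not match TIDY's actual declarations; the sketch also treats every character as a single byte.

```cpp
#include <string>

// Hypothetical in-memory target; stands in for the FILE* that the
// pretty-printer normally writes to.
struct MemoryOut {
    std::string text;
};

// Replacement for a per-character output routine. The name and signature
// are assumptions for this example, not TIDY's real declaration.
// Only single-byte characters are handled.
void put_char(unsigned int c, MemoryOut& out) {
    out.text.push_back(static_cast<char>(c));
}

// Stand-in for printer code that emits its output one character at a time.
void print_tag(const std::string& name, MemoryOut& out) {
    put_char('<', out);
    for (char ch : name) put_char(static_cast<unsigned char>(ch), out);
    put_char('>', out);
}
```

Once every character goes through one such function, the printing code itself does not need to change: whatever used to land in the file ends up in `out.text`, ready to be handed to the next stage.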

But your best points to intercept are these functions (all in the pprint.c file):

      PPrintTag(...)
      PPrintEndTag(...)
      PPrintComment(...)
      PPrintAttrs(...)
      PPrintAttribute(...)
      PPrintAttrValue(...)

etc.  For instance, in PPrintTag, you get a Node*; the value of node->element is a char* to the tag text.  If it is "B" or "I", it indicates that you need to display the following text in Bold or Italics, respectively.  If it is "IMG", then node->attributes will be a pointer to a linked list of attributes (_attval structures).  One of them will have "SRC" as the 'attribute' and (e.g.) "../images/somepic.jpg" as the 'value'.  
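
The attribute lookup described above might look roughly like this. The `Node` and `AttVal` structs are cut-down stand-ins that keep only the fields mentioned here (`element`, `attributes`, `attribute`, `value`) plus an assumed `next` link, and they use `std::string` for brevity where the comment above describes `char*`. `find_attribute` is a hypothetical helper, not a TIDY function.

```cpp
#include <string>
#include <utility>

// Cut-down stand-in for one entry in a node's attribute list
// (hypothetical; the real structure has more members).
struct AttVal {
    std::string attribute;   // e.g. "SRC"
    std::string value;       // e.g. "../images/somepic.jpg"
    AttVal* next = nullptr;  // next attribute of the same tag (assumed link)
    AttVal(std::string a, std::string v)
        : attribute(std::move(a)), value(std::move(v)) {}
};

// Cut-down stand-in for a tag node.
struct Node {
    std::string element;           // e.g. "IMG"
    AttVal* attributes = nullptr;  // head of the attribute list
    explicit Node(std::string name) : element(std::move(name)) {}
};

// Walks the attribute list and returns a pointer to the value of the named
// attribute, or nullptr if the tag has no such attribute.
// The comparison is case-sensitive to keep the sketch short.
const std::string* find_attribute(const Node& node, const std::string& name) {
    for (const AttVal* a = node.attributes; a != nullptr; a = a->next) {
        if (a->attribute == name) return &a->value;
    }
    return nullptr;
}
```

A renderer could call `find_attribute(node, "SRC")` when `node.element` is `"IMG"` and skip the image if the result is null.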

It will all be very easy.  I estimate that you can have a complete HTML browser up and running in a few short decades, about the time that Internet access is accomplished mainly through Mental Telepathy.

-- Dan
 
LVL 3

Author Comment

by:Mathias
Hmm, okay, I've searched the knowledge databases around the web for the words and I couldn't find a definition :-( "Maaaaa" is something like the sound a sheep makes :-)

Bye, TDS.
 
LVL 49

Expert Comment

by:DanRollins
Hi TDS,
Do you have any additional questions?  Do any comments need clarification?

-- Dan
 
LVL 3

Author Comment

by:Mathias
Hmm, I'm currently trying to rewrite the code, but I'm very confused :-( In my eyes it's very difficult.

Bye, TDS.
 
LVL 49

Expert Comment

by:DanRollins
Yes, it is difficult.  

It is kind of like writing the M-Z section of an encyclopedia when you're not even sure if there is a word that is spelled 'Maaaaa' so there is no place to get started.

The good news is that the TIDY program already does most of the hard part for you.  You need to get the program running, and prepare a very simple HTML file for it to read.  Execute TIDY in the debugger and put a breakpoint in PPrintTag, and then when execution breaks, single-step through the code.  Do this many times.  Use the Variables window to examine the variables.  Study each of the structures so that you know what is in each.

-- Dan
 
LVL 3

Author Comment

by:Mathias
Yeah, that's clear in my eyes :-)
I hope I will finish my work after the A-Levels.

PS: What to do with the points?

Bye, TDS.
 
LVL 49

Expert Comment

by:DanRollins
>>What to do with the points?

I suggest that you award them to me, as I have answered your original question quite completely.

I'll listen to differing opinions...

-- Dan