Solved

HTML Parser

Posted on 2002-05-10
Last Modified: 2011-04-14
Hi @ all :-)

All I need is a procedure for parsing HTML files and filling a linked list with the tags, comments, text, etc. I have some problems with TIDY, because I can't find a real output routine, only one for FILE output (stdout or a normal file). Maybe someone has already adapted TIDY to do what I want.

summary:
- parsing html file
- splitting into a list
- maybe using TIDY

Bye, TDS.
Question by:Mathias
28 Comments
 
LVL 32

Expert Comment

by:jhance
ID: 7001018
Parsing HTML is not a trivial task due to the complexity of the format.  The good news is that there are any number of HTML parsers out there, so you don't have to invent your own.

http://www.w3.org/MarkUp/implementations.html
 
LVL 5

Expert Comment

by:proskig
ID: 7001596
I also needed an HTML parser at some point in the past. I stuck with SP: http://www.jclark.com/sp/index.htm

I can also recommend it for your task.
 
LVL 3

Author Comment

by:Mathias
ID: 7001858
proskig: I tested the SP program, but I can't compile it myself under DOS with DJGPP GCC :-( The Make application says that the command line is too long.
I don't want to have too much work rewriting or developing a parser. I simply want a simple parser which splits the file into a B-Tree with tags and attributes.

jhance: On these pages the whole HTML language is explained, but I can't find a real parser, even in libwww. It's all too complicated. I would need to extract things out of the library and that's too much work.

I could post my Pascal code for anybody who wants to help me. I "only" need a translation to C/C++.

Bye, TDS.
 
LVL 3

Author Comment

by:Mathias
ID: 7003049
proskig: Can you give me a working example for SGML (SP) which works under GCC ???

Bye, TDS.
 
LVL 32

Expert Comment

by:jhance
ID: 7003054
>>It's all too complicated

Sorry, but HTML is complicated to parse.  If you want simple solutions, pick a simple problem.

If you can't use any of these (complex) solutions, buy a 3rd party parsing library that needs no building and comes with tech support handholding.

Unfortunately, not all problems have simple solutions.
 
LVL 3

Author Comment

by:Mathias
ID: 7003095
jhance: Give me your email and I will send you some sources in Pascal. You will see that I started a browser for a project two years ago and everything works well. It's not difficult if you have the overview.
I simply need the parser I wrote in Pascal translated to C/C++. That's all. The rest I will translate on my own. But I have big difficulties with strings in C/C++. In Pascal all things are easy :-)

Bye, TDS.
 
LVL 4

Expert Comment

by:AssafLavie
ID: 7004076
I wrote a C++ stream based SGML parser a while ago and it wasn't all that complicated. You just have to plan it carefully before you start hacking away.
Anyway, I can't give it away because I'm not working at that company any more... But the point is that it's very possible to write one in a few days. I wrote mine as a state machine, reading and processing char by char from a stream. The major pain was the JavaScript and the ill-formed pages...
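
To give a rough idea of what such a state machine can look like, here is a minimal sketch in C. It is not the parser described above: the names `tokenize`, `TokenFn`, `TokenKind`, `Counts` and `count_token` are made up for illustration, and it assumes only two states (inside a tag, outside a tag). Comments, scripts, quoted '>' characters and entities are deliberately ignored.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical token kinds; a real HTML parser needs many more. */
enum TokenKind { TOK_TEXT, TOK_TAG };

/* Called once per token; `text` is NUL-terminated and only valid during the call. */
typedef void (*TokenFn)(enum TokenKind kind, const char *text, void *user);

/* Reads `in` one character at a time and switches between two states:
   IN_TEXT (collecting plain text) and IN_TAG (collecting up to '>'). */
void tokenize(FILE *in, TokenFn emit, void *user)
{
    enum { IN_TEXT, IN_TAG } state = IN_TEXT;
    char buf[1024];
    size_t len = 0;
    int c;

    assert(in != NULL && emit != NULL);
    while ((c = fgetc(in)) != EOF) {
        if (state == IN_TEXT && c == '<') {
            if (len > 0) { buf[len] = '\0'; emit(TOK_TEXT, buf, user); }
            len = 0;
            state = IN_TAG;
        } else if (state == IN_TAG && c == '>') {
            buf[len] = '\0';
            emit(TOK_TAG, buf, user);       /* tag text without the angle brackets */
            len = 0;
            state = IN_TEXT;
        } else if (len < sizeof buf - 1) {
            buf[len++] = (char)c;           /* overlong tokens are truncated in this sketch */
        }
    }
    /* Trailing text is emitted; an unterminated tag at end of input is dropped. */
    if (state == IN_TEXT && len > 0) { buf[len] = '\0'; emit(TOK_TEXT, buf, user); }
}

/* Example callback: counts how many tags and text runs were seen. */
struct Counts { int tags; int texts; };

void count_token(enum TokenKind kind, const char *text, void *user)
{
    struct Counts *counts = user;
    (void)text;
    if (kind == TOK_TAG) counts->tags++; else counts->texts++;
}
```

Calling `tokenize(file, count_token, &counts)` on an open file fills in the counters; a real callback would append each token to whatever list or tree the caller maintains.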

Of course, if you can find an open source implementation that suits your needs you should use it - I couldn't find any.
You can also look at the perl implementation. It's really easy to use and you can integrate it with C++ code.
 
LVL 3

Author Comment

by:Mathias
ID: 7004142
The problem with parsing incorrect pages isn't that difficult. I'm using TIDY before parsing and I think everything is in the right order after doing that. Could someone give me a small example of reading from a stream and filling a doubly linked list? I will program my own parser if nothing can be found on the web.

Bye, TDS.
 
LVL 49

Expert Comment

by:DanRollins
ID: 7004753
What is TIDY?  Where can I find it?

If it is already outputting to a stream, then just intercept each of the output calls and instead put the data into a linked list.
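
A minimal sketch of the linked-list side of that idea in C. The names `Item`, `ItemList`, `list_append` and `list_free` are made up for illustration and are not part of TIDY; the assumption is that each intercepted output call hands over one NUL-terminated piece of text (a tag, a comment, a run of text).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One piece of parser output, linked in both directions. */
typedef struct Item {
    char *text;
    struct Item *prev, *next;
} Item;

typedef struct {
    Item *head, *tail;
    size_t count;
} ItemList;

/* Copies `text` into a new node at the tail of the list.
   Returns 0 on success, -1 if memory could not be allocated. */
int list_append(ItemList *list, const char *text)
{
    Item *item;
    size_t size;

    assert(list != NULL && text != NULL);
    item = malloc(sizeof *item);
    if (item == NULL) return -1;
    size = strlen(text) + 1;
    item->text = malloc(size);
    if (item->text == NULL) { free(item); return -1; }
    memcpy(item->text, text, size);

    item->next = NULL;
    item->prev = list->tail;
    if (list->tail != NULL) list->tail->next = item;
    else list->head = item;                 /* first node in the list */
    list->tail = item;
    list->count++;
    return 0;
}

/* Releases every node and resets the list to empty. */
void list_free(ItemList *list)
{
    Item *item = list->head;
    while (item != NULL) {
        Item *next = item->next;
        free(item->text);
        free(item);
        item = next;
    }
    list->head = list->tail = NULL;
    list->count = 0;
}
```

Start with `ItemList list = {NULL, NULL, 0};` and call `list_append(&list, text)` from wherever the output would otherwise be written to the file.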

-- Dan
 
LVL 3

Author Comment

by:Mathias
ID: 7004800
TIDY is an open-source project that fixes errors in HTML code. That's good for a parser which can't accept errors.
http://tidy.sourceforge.net/
The problem is that TIDY puts its output into a file or stdout. How can I intercept this output?

Bye, TDS.
 
LVL 49

Expert Comment

by:DanRollins
ID: 7005359
Just as I thought.  That program ALREADY generates a hierarchical linked list that defines the HTML document as it parses it.  It then walks the tree to output to a text file.  That would be the only logical way to do such a thing.

So, all of your work is done!  Just examine the function named PrintTree in the file pprint.c.  That will show you how to access the nodes of the hierarchical document.  Notice that it makes recursive calls to work its way into deeper and deeper levels.
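
As a rough illustration of that recursive pattern (not TIDY's actual code), here is a sketch in C. `SimpleNode` is a simplified stand-in for TIDY's node structure, and the field names `element`, `parent`, `content` and `next`, as well as `walk_recursive`, `WalkStats` and `record_stats`, are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a document node (NOT TIDY's real Node). */
typedef struct SimpleNode {
    const char *element;          /* tag name, or NULL for a text node */
    struct SimpleNode *parent;
    struct SimpleNode *content;   /* first child */
    struct SimpleNode *next;      /* next sibling */
} SimpleNode;

typedef void (*VisitFn)(const SimpleNode *node, int depth, void *user);

/* Visits `node`, its children, and its following siblings in document order.
   The recursive call handles one level deeper each time. */
void walk_recursive(const SimpleNode *node, int depth, VisitFn visit, void *user)
{
    assert(visit != NULL);
    for (; node != NULL; node = node->next) {
        visit(node, depth, user);
        walk_recursive(node->content, depth + 1, visit, user);
    }
}

/* Example callback: counts nodes and remembers the deepest level seen. */
typedef struct { int count; int max_depth; } WalkStats;

void record_stats(const SimpleNode *node, int depth, void *user)
{
    WalkStats *stats = user;
    (void)node;
    stats->count++;
    if (depth > stats->max_depth) stats->max_depth = depth;
}
```

A callback that draws text or appends to a list could be passed in place of `record_stats`; the walking code stays the same.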

-- Dan
 
LVL 3

Author Comment

by:Mathias
ID: 7006735
Yes, and that's my problem. This level system is too complicated for me. I'm just a beginner in C/C++. If you can give me an example of accessing the tree without a recursive procedure, you can have the points :-)

bye, TDS.
 
LVL 49

Expert Comment

by:DanRollins
ID: 7007231
Think about how HTML is arranged:

<body>
      <table>
              <tr>
                     <td>leaf</td>
                     <td>
                              <table>
                                       ...entire table in this td
                              </table>
                     </td>
              </tr>
              <tr>
                      <td>another leaf</td>
                      <td>another</td>
              </tr>
      </table>
</body>

Notice that many tags contain other tags.   And those tags contain other tags.  And those tags contain other tags.  A hierarchical tree structure is the inherent layout of the data and the obvious way to work with such a layout is to use a recursive algorithm.

In the above example, the same function that processes the outer <table> can be used to process the <table> that is embedded in the <td> of the outer table.

I suppose there is a way to process a tree structure without using recursion, but I can't think of a reason for doing it.  Is this an assignment for school?
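
For reference, one non-recursive way is sketched below in C, under the assumption that every node stores a pointer to its parent. `SimpleNode`, `next_in_document` and `count_nodes` are hypothetical names for this example, not TIDY's own structures or functions.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a document node (NOT TIDY's real Node). */
typedef struct SimpleNode {
    const char *element;          /* tag name, or NULL for a text node */
    struct SimpleNode *parent;
    struct SimpleNode *content;   /* first child */
    struct SimpleNode *next;      /* next sibling */
} SimpleNode;

/* Returns the node that follows `node` in document order, staying inside the
   subtree rooted at `root`; returns NULL when the subtree is exhausted. */
const SimpleNode *next_in_document(const SimpleNode *node, const SimpleNode *root)
{
    assert(node != NULL && root != NULL);
    if (node->content != NULL) return node->content;   /* go down first */
    while (node != root) {
        if (node->next != NULL) return node->next;     /* then sideways */
        node = node->parent;                           /* otherwise climb */
        if (node == NULL) return NULL;                 /* root was not an ancestor */
    }
    return NULL;
}

/* Plain loop over the whole subtree, no recursion and no explicit stack. */
size_t count_nodes(const SimpleNode *root)
{
    size_t n = 0;
    const SimpleNode *p;
    for (p = root; p != NULL; p = next_in_document(p, root)) n++;
    return n;
}
```

The parent pointer does the job that the call stack does in the recursive version, so the loop body sees the nodes in the same order.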

I don't really understand your true goal here (and I'm pretty sure that YOU don't know what you want to do!).  You need to explore the tree structure to get a feel for how it works.  Use a debugger and trace through the steps that occur during the print generation.  You will see each node -- its text and attributes -- by putting the variable named node into the Watch window.

-- Dan
 
LVL 3

Author Comment

by:Mathias
ID: 7007681
First of all it's not for school. Sorry, but I'm a good programmer in Pascal. There is a GUI for Pascal from Arsene v. Wyss. I coded some stuff for it, e.g. the Browser. Now I want to port this stuff to another GUI written in C/C++. I don't need a description of how HTML works; I think I have already solved many problems with my Pascal code. My only problem is the translation to C/C++.

Bye, TDS.
 
LVL 49

Expert Comment

by:DanRollins
ID: 7007756
>>There is a GUI for Pascal from Arsene v. Wyss. I coded some stuff for it, e.g. the Browser.

Are you trying to write a C++ GUI frontend for TIDY?

Is this question about how to capture the final text?  The text that is generated by TIDY after it has done all of the cleanup and indenting?

If so, the answer is easy.

-- Dan
 
LVL 2

Expert Comment

by:Andrey_Kulik
ID: 7008623
Hi TDS,

If you have well-formed HTML (1. every start-tag has an end-tag, 2. tags are closed in stack-like order), then you could use any XML parser...
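
A small sketch in C of how those two conditions could be checked with a stack. The function name `tags_are_nested` and the input format (an array of tag names, with end tags written with a leading '/') are assumptions for this example; void elements such as BR or IMG are not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_DEPTH 64   /* arbitrary nesting limit for this sketch */

/* Returns 1 if every start tag has a matching end tag and the tags close in
   stack-like (last opened, first closed) order; returns 0 otherwise. */
int tags_are_nested(const char *const *tags, size_t n)
{
    const char *stack[MAX_DEPTH];
    size_t depth = 0, i;

    assert(tags != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (tags[i][0] == '/') {
            /* End tag: must match the most recently opened tag. */
            if (depth == 0 || strcmp(stack[depth - 1], tags[i] + 1) != 0) return 0;
            depth--;
        } else {
            if (depth == MAX_DEPTH) return 0;   /* nested too deeply for this sketch */
            stack[depth++] = tags[i];
        }
    }
    return depth == 0;   /* anything left on the stack was never closed */
}
```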

Hope this helps,
Andrey
 
LVL 3

Author Comment

by:Mathias
ID: 7009711
No, I don't want to write a C++ GUI for TIDY. I simply need an HTML parser which puts its output into a linked list. The output will be like a browser output, a real html view. I will try to recode the TIDY output. Maybe I can solve the problem on my own.

Bye, TDS.
 
LVL 49

Expert Comment

by:DanRollins
ID: 7009757
>>The output will be like a browser output, a real html view.

Your goal is not becoming clear to me:

*  Browsers don't have output (other than on-screen display)
*  TIDY already outputs real HTML
*  HTML is not naturally a linked list; it is a stream of text.
*  An HTML view can be obtained by using a Browser.

I'm sure that you have a clear understanding of what you want to do.  But you simply have not explained it in sensible terms.  

If you take a few minutes to explain your needs, there is every chance that somebody here can help you.

-- Dan
 
LVL 3

Author Comment

by:Mathias
ID: 7010292
I think you can't or won't understand me. Browsers don't have an output? Hmm, I'm writing a browser for a GUI and there is an output. TIDY puts its output into a FILE and I don't understand how to rewrite the PrintTree function so that I simply have the linked list, which I can then process. HTML is of course a simple file, but if I parse that file, fill a linked list and output the formatted text to the screen, it will be a visual HTML view.
I hope you understand my goal: I simply want an example of using the TIDY "PrintTree" function without output to a file. It should go to a stream instead.

Bye, TDS.
 
LVL 49

Expert Comment

by:DanRollins
ID: 7010534
So all you want to do is display TIDY's cleaned up HTML as text?  Just as if you viewed it in Notepad (but without having to save it to a file first)?

Or are you actually writing a browser, like Netscape or Internet Explorer?  And if so, why on earth would you want to do that?

-- Dan
 
LVL 3

Author Comment

by:Mathias
ID: 7010622
Your second idea is right. Why? Hmm, I'm a programmer in a company and need some other things in my spare time, so I'm developing my own OS with some others. This OS should have an HTML browser, too. That is my part...

Bye, TDS.
 
LVL 49

Accepted Solution

by:
DanRollins earned 300 total points
ID: 7011678
I fully understand now.  Thanks for the clarification.  Before I answer your question, could you help me with one thing?... Some writer friends and I are currently working on an Encyclopedia.  My task is M-Z.  What I need to know is this:  Is "Maaaaa" a word?  What is the correct spelling?  Does it have five A's or six? If you have any time, can you also look into "Maaaab" for me, too?

-=-==-=-=-=-=-
In the TIDY code, you will find a function named cout.  Every character that gets output goes through that function, so replace that function with your own and you have a stream of data.  
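
As a sketch of what such a replacement could write into, here is a growable in-memory character buffer in C. `CharBuf` and `buf_putc` are hypothetical names, not TIDY functions; the assumption is that the replaced output function receives one character per call and simply forwards it.

```c
#include <assert.h>
#include <stdlib.h>

/* Growable, always NUL-terminated character buffer. */
typedef struct {
    char *data;
    size_t len;   /* number of characters stored, excluding the NUL */
    size_t cap;   /* allocated size in bytes */
} CharBuf;

/* Appends one character. Returns 0 on success, -1 if memory runs out
   (the existing contents stay valid in that case). */
int buf_putc(CharBuf *buf, int c)
{
    assert(buf != NULL);
    if (buf->len + 1 >= buf->cap) {             /* need room for c and the NUL */
        size_t new_cap = buf->cap ? buf->cap * 2 : 64;
        char *p = realloc(buf->data, new_cap);
        if (p == NULL) return -1;
        buf->data = p;
        buf->cap = new_cap;
    }
    buf->data[buf->len++] = (char)c;
    buf->data[buf->len] = '\0';
    return 0;
}
```

Start with `CharBuf buf = {NULL, 0, 0};`, call `buf_putc(&buf, c)` for each character, and release the memory with `free(buf.data)` when the text is no longer needed.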

But your best points to intercept are these functions (all in the pprint.c file):

      PPrintTag(...)
      PPrintEndTag(...)
      PPrintComment(...)
      PPrintAttrs(...)
      PPrintAttribute(...)
      PPrintAttrValue(...)

etc.  For instance, in PPrintTag you get a Node*, and the value of node->element is a char* to the tag text.  If it is "B" or "I", it indicates that you need to display the following text in Bold or Italics, respectively.  If it is "IMG", then node->attributes will be a pointer to a linked list of attributes (_attval structures).  One of them will have "SRC" as the 'attribute' and (e.g.) "../images/somepic.jpg" as the 'value'.
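
A minimal sketch in C of looking up one attribute in such a list. `Attr` is a simplified stand-in for the real attribute structure, and `find_attr` and `same_name` are hypothetical helpers; only the member names 'attribute' and 'value' come from the description above, while the `next` link and everything else are assumptions.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Simplified stand-in for one entry in a tag's attribute list. */
typedef struct Attr {
    struct Attr *next;
    const char *attribute;   /* e.g. "SRC" */
    const char *value;       /* e.g. "../images/somepic.jpg" */
} Attr;

/* Case-insensitive comparison, since HTML attribute names ignore case. */
static int same_name(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b)) return 0;
        a++;
        b++;
    }
    return *a == *b;   /* equal only if both strings ended together */
}

/* Returns the value of the first attribute called `name`, or NULL if absent. */
const char *find_attr(const Attr *attrs, const char *name)
{
    assert(name != NULL);
    for (; attrs != NULL; attrs = attrs->next) {
        if (attrs->attribute != NULL && same_name(attrs->attribute, name))
            return attrs->value;
    }
    return NULL;
}
```

For an IMG tag, a call such as `find_attr(attrs, "src")` would then give the image path to load.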

It will all be very easy.  I estimate that you can have a complete HTML browser up and running in a few short decades -- about the time that Internet access is accomplished mainly through Mental Telepathy.

-- Dan
 
LVL 3

Author Comment

by:Mathias
ID: 7011898
Hmm, okay, I've searched the knowledge databases around the web for the words and I couldn't find a definition :-( "Maaaaa" is something like the sound a sheep makes :-)

Bye, TDS.
 
LVL 49

Expert Comment

by:DanRollins
ID: 7022945
Hi TDS,
Do you have any additional questions?  Do any comments need clarification?

-- Dan
 
LVL 3

Author Comment

by:Mathias
ID: 7023211
Hmm, I'm currently trying to rewrite the code but I'm very confused :-( In my eyes it's very difficult.

Bye, TDS.
 
LVL 49

Expert Comment

by:DanRollins
ID: 7023307
Yes, it is difficult.  

It is kind of like writing the M-Z section of an encyclopedia when you're not even sure if there is a word that is spelled 'Maaaaa' so there is no place to get started.

The good news is that the TIDY program already does most of the hard part for you.  You need to get the program running, and prepare a very simple HTML file for it to read.  Execute TIDY in the debugger and put a breakpoint in PPrintTag and then when execution breaks, single-step through the code.  Do this many times.  Use the Variables window to examine the variables.  Study each of the structures so that you know what is in each.

-- Dan
 
LVL 3

Author Comment

by:Mathias
ID: 7024160
Yeah, that's clear in my eyes :-)
I hope I will finish my work after the A-Levels.

PS: What to do with the points?

Bye, TDS.
 
LVL 49

Expert Comment

by:DanRollins
ID: 7025485
>>What to do with the points?

I suggest that you award them to me, as I have answered your original question quite completely.

I'll listen to differing opinions...

-- Dan