Solved

Save html as text

Posted on 1998-09-29
Last Modified: 2010-04-06
How can I convert an HTML file to a text file?
I think this is possible with the HTML control, but I need something that doesn't require that many resources.

Any help is greatly appreciated.
Question by:friberg
9 Comments
 
LVL 4

Expert Comment

by:BoRiS
ID: 1341244
friberg

Do you need the HTML tags as well, or just the text in the HTML?

If you need everything in the HTML file, here is a simple way to do it:

procedure TForm1.Button1Click(Sender: TObject);
begin
 Memo1.Lines.LoadFromFile('c:\temp\test.htm');
end;

procedure TForm1.Button2Click(Sender: TObject);
begin
 Memo1.Lines.SaveToFile('c:\temp\test.txt');
end;

If you need to strip all the tags out of the file, you will need to search through the HTML page for the < and > characters that delimit each tag and drop everything between them. You can also load the page into memory if you don't want to use a memo.
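
As a rough illustration of the in-memory approach, the tags can be dropped from a string in a single pass. This is a minimal sketch, assuming Delphi (or Free Pascal in Delphi mode); the name StripTags is made up for this example, and it makes no attempt to handle comments, scripts, or a '>' that appears outside a tag.

```pascal
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

{ Hypothetical helper: returns a copy of Html with everything
  between '<' and '>' removed. No entity decoding is done. }
function StripTags(const Html: string): string;
var
  i: Integer;
  InsideTag: Boolean;
begin
  Result := '';
  InsideTag := False;
  for i := 1 to Length(Html) do
    if Html[i] = '<' then
      InsideTag := True
    else if Html[i] = '>' then
      InsideTag := False
    else if not InsideTag then
      Result := Result + Html[i];
end;
```

With the memo from the example above, it could be called as Memo1.Lines.Text := StripTags(Memo1.Lines.Text); before saving.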

Later
BoRiS
0
 
LVL 5

Expert Comment

by:scrapdog
ID: 1341245
procedure HTMLToText(const HTMLFileName :string;
                     const TextFileName :string);
var
  HTMLFile :Text;
  TextFile :Text;
  InsideTag :boolean;
  c :char;   { current character read from the HTML file }
begin
  Assign(HTMLFile, HTMLFileName);
  Reset(HTMLFile);
  Assign(TextFile, TextFileName);
  Rewrite(TextFile);
  InsideTag := false;
  while not eof(HTMLFile) do begin
    read(HTMLFile, c);
    if c = '<' then InsideTag := true
    else if c = '>' then InsideTag := false
    else if not InsideTag then write(TextFile, c);  {<--writes to text file}
  end;
  Close(TextFile);
  Close(HTMLFile);
end;

------------------------

The above procedure accepts two filenames.  All of the tags are stripped from the HTML file, and only the raw text is written to the text file.  That is all this does.  Note that no formatting is done to the text.

Also, to keep it simple, I didn't include anything in this procedure to substitute the entities that follow the & character (such as &lt;).  If the procedure encounters &lt;, it is written to the text file as &lt; rather than <.  You can add this by altering the line that I marked as {<--writes to text file}.  Just test if c = '&' and then read the characters up to the terminating ';'.
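
The entity substitution described above could look roughly like the following. This is a sketch under assumptions rather than part of the procedure: it works on a string instead of a file, the name DecodeEntities is made up for this example, and only a handful of common named entities are handled. Anything it does not recognize is copied through unchanged.

```pascal
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

{ Hypothetical helper: replaces a few common named entities with the
  characters they stand for. Unknown entities and stray '&' characters
  are copied through unchanged. }
function DecodeEntities(const S: string): string;
var
  i, SemiPos: Integer;
  EntityName, Replacement: string;
begin
  Result := '';
  i := 1;
  while i <= Length(S) do
  begin
    if S[i] = '&' then
    begin
      { Look for the terminating ';' within a few characters of the '&'. }
      SemiPos := i + 1;
      while (SemiPos <= Length(S)) and (SemiPos <= i + 6)
            and (S[SemiPos] <> ';') do
        Inc(SemiPos);
      if (SemiPos <= Length(S)) and (S[SemiPos] = ';') then
      begin
        EntityName := Copy(S, i + 1, SemiPos - i - 1);
        if EntityName = 'lt' then Replacement := '<'
        else if EntityName = 'gt' then Replacement := '>'
        else if EntityName = 'amp' then Replacement := '&'
        else if EntityName = 'quot' then Replacement := '"'
        else if EntityName = 'nbsp' then Replacement := ' '
        else Replacement := '';
        if Replacement <> '' then
        begin
          Result := Result + Replacement;
          i := SemiPos + 1;   { skip past the whole entity }
          Continue;
        end;
      end;
    end;
    { Ordinary character, or an '&' that did not start a known entity. }
    Result := Result + S[i];
    Inc(i);
  end;
end;
```

Decoding should happen after the tags have been stripped; otherwise a decoded '<' would be mistaken for the start of a tag.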
0
 

Author Comment

by:friberg
ID: 1341246
What I'm looking for is an easy way to strip all the tags from the HTML file, the way the 'Save as Text' function works in Netscape and IE. For example, if the HTML file contains a table, I'd like to have the columns separated by space characters.

Maybe there is a freeware component for this?
0
 
LVL 1

Expert Comment

by:duke_n
ID: 1341247
Do you want a text file with tags or without tags?
0
 

LVL 4

Expert Comment

by:BoRiS
ID: 1341249
friberg

Sorry, the answer was supposed to be sent as a comment until you told me whether you needed it with tags or without.

Later
BoRiS
0
 

Author Comment

by:friberg
ID: 1341250
scrapdog,

Your solution works very well with simple HTML files; it is very fast and reliable. But how can I modify it so that HTML files containing tables are also readable? I'd like a space character between columns and a new line for each row in the table.

Thanks.
0
 
LVL 5

Accepted Solution

by:
scrapdog earned 100 total points
ID: 1341251
I just wrote this and haven't tested it, but hopefully you can see how the logic works.  It does the same thing as the last piece of code I gave you, plus table support.

When this program senses a table, it determines the number of rows and columns from the number of TR and TD tags found.  The text in the cells is lined up and left-justified.

Again, this is simple, as it doesn't check the tags for errors.  It only knows where the table begins and ends, where the rows begin and end, and where the cells begin and end.  The row with the largest number of columns determines the width of the table, and the largest cell in the whole table determines the width of all the cells (which are padded with spaces).  Spaces are inserted between columns, and a new line is started at the end of a row.
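
To make the padding rule concrete, here is a small sketch of how one row could be laid out once the widest cell is known. The helper names PadRight and FormatRow are made up for this illustration and do not appear in the code below; the sketch assumes Delphi (or Free Pascal in Delphi mode).

```pascal
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

{ Hypothetical helper: pads S on the right with spaces until it is
  Width characters long. Longer strings are returned unchanged. }
function PadRight(const S: string; Width: Integer): string;
begin
  Result := S;
  while Length(Result) < Width do
    Result := Result + ' ';
end;

{ Hypothetical helper: formats one table row. Every cell is padded to
  Width (the length of the widest cell in the table) and followed by
  a single space as the column separator. }
function FormatRow(const Cells: array of string; Width: Integer): string;
var
  i: Integer;
begin
  Result := '';
  for i := Low(Cells) to High(Cells) do
    Result := Result + PadRight(Cells[i], Width) + ' ';
end;
```

For example, with a widest cell of 5 characters, FormatRow(['Name', 'Qty'], 5) pads both cells to 5 characters, so the second column always starts at the same position in every row.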

Since I didn't test it, there might be syntax errors.  If it is so bad that you can't fix it, let me know.

Here it is:
-----------

const
  TABLE_BEGIN = 1;
  TABLE_END   = 2;
  CELL_BEGIN  = 3;
  CELL_END    = 4;
  ROW_BEGIN   = 5;
  ROW_END     = 6;

  MAXTABLEROWS = 100;
  MAXTABLECOLS = 100;

type
   THTMLTable = record
                  Row, Col :integer;
                  Data  :array[0..MAXTABLEROWS, 0..MAXTABLECOLS] of string;
                end;



{ Reads the rest of a tag (the '<' has already been consumed) and
  returns its table-related code, or 0 for any other tag. }
function GetTag(var HTMLFile :Text) :integer;
var t :string;
    InTag :boolean;
    i, x :integer;
    c :char;
begin
  t := '';
  InTag := true;
  while (not eof(HTMLFile)) and InTag do begin
    read(HTMLFile, c);
    if c = '>' then InTag := False
    else t := t + c;
  end;
  { compare case-insensitively; attributes after the tag name are ignored }
  for i := 1 to length(t) do t[i] := upcase(t[i]);
  if copy(t,1,5) = 'TABLE' then x := TABLE_BEGIN
  else if copy(t,1,6) = '/TABLE' then x := TABLE_END
  else if copy(t,1,2) = 'TR' then x := ROW_BEGIN
  else if copy(t,1,3) = '/TR' then x := ROW_END
  else if copy(t,1,2) = 'TD' then x := CELL_BEGIN
  else if copy(t,1,3) = '/TD' then x := CELL_END
  else x := 0;
  Result := x;
end;

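{ Collects the text of one cell up to the closing /TD tag.
  Any other tags inside the cell are skipped. }
procedure GetCell(var HTMLFile :Text; var Cell :string);
var InCell :boolean;
    HTag :integer;   { GetTag returns an integer code, not a boolean }
    c  :char;
begin
  Cell := '';
  InCell := true;
  while (not eof(HTMLFile)) and InCell do begin
    read(HTMLFile, c);
    if c = '<' then begin
      HTag := GetTag(HTMLFile);
      if (HTag = CELL_END) then InCell := false;
    end
    else Cell := Cell + c;
  end;
end;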

procedure GetTable(var HTMLFile :Text; var Table :THTMLTable);
var
  MaxCol, TableMax :integer;
  InTable :boolean;
  c  :char;
  Cell :string;
  HTag :integer;
  i,j,k,x :integer;
begin
  Table.Col := 0;
  Table.Row := 0;
  MaxCol := 0;
  TableMax := 0;
  InTable := true;
  while not(eof(HTMLFile)) and InTable do begin
    read(HTMLFile, c);
    if c = '<' then begin
      HTag := GetTag(HTMLFile);
      case HTag of
        TABLE_END:  InTable := false;
        ROW_BEGIN: with Table do
                     if Row < MAXTABLEROWS then begin
                       Row := Row + 1;
                       Col := 0;
                       { clear the new row so short rows hold no stale cells }
                       for i := 0 to MAXTABLECOLS do Data[Row, i] := '';
                     end
                     else Col := MAXTABLECOLS;  { table full: drop further cells }
        CELL_BEGIN: begin
                      GetCell(HTMLFile, Cell);
                      with Table do
                        if Col < MAXTABLECOLS then begin
                          Col := Col + 1;
                          Data[Row, Col] := Cell;
                          if Col > MaxCol then MaxCol := Col;
                        end;
                    end;
        { ROW_END needs no action: each row was cleared at ROW_BEGIN }
      end;
    end;
  end;
  Table.Col := MaxCol;
  with Table do
    begin
      { find the widest cell, then pad every cell to that width }
      for i := 1 to Row do
        for j := 1 to Col do
          if Length(Data[i,j]) > TableMax then TableMax := Length(Data[i,j]);
      for i := 1 to Row do
        for j := 1 to Col do begin
          x := TableMax-Length(Data[i,j]);
          for k := 1 to x do Data[i,j] := Data[i,j] + ' ';
        end;
    end;
end;

procedure WriteTable(var TextFile :Text;
                     var Table  :THTMLTable);
var
  i, j :integer;
begin
  { one line per row, one space between the padded cells }
  for i := 1 to Table.Row do begin
    for j := 1 to Table.Col do
      write(TextFile, Table.Data[i,j],' ');
    writeln(TextFile);
  end;
end;






procedure HTMLToText(const HTMLFileName :string;
                     const TextFileName :string);
var
  HTMLFile :Text;
  TextFile :Text;
  Table :THTMLTable;
  c   :Char;
  HTag :integer;
begin
  Assign(HTMLFile, HTMLFileName);
  Reset(HTMLFile);
  Assign(TextFile, TextFileName);
  Rewrite(TextFile);
  while not eof(HTMLFile) do begin
    read(HTMLFile, c);
    if c = '<' then begin
      HTag := GetTag(HTMLFile);  { consumes the tag up to and including '>' }
      if HTag = TABLE_BEGIN then begin
        writeln(TextFile);
        GetTable(HTMLFile, Table);
        WriteTable(TextFile, Table);
        writeln(TextFile);
      end;
    end
    else write(TextFile, c);  {<--writes to text file}
  end;
  Close(TextFile);
  Close(HTMLFile);
end;

----------------------

Scrapdog
0
 

Author Comment

by:friberg
ID: 1341252
Thanks, just what I needed!
0
