Solved

Save html as text

Posted on 1998-09-29
Last Modified: 2010-04-06
How can I convert an html file to a text file?
I think this is possible with the HTML Control, but I need something that doesn't require that many resources.

Any help is greatly appreciated.
Question by:friberg
 
LVL 4

Expert Comment

by:BoRiS
ID: 1341244
friberg

Do you need the HTML tags as well, or just the text in the HTML?

If you require everything in the HTML file, here is a simple way to do it:

procedure TForm1.Button1Click(Sender: TObject);
begin
 Memo1.Lines.LoadFromFile('c:\temp\test.htm');
end;

procedure TForm1.Button2Click(Sender: TObject);
begin
 Memo1.Lines.SaveToFile('c:\temp\test.txt');
end;

If you need to strip all the tags out of the file, you will need to search through the htm/html page and look for the '<' and '>' characters that delimit each tag. You can also load the htm/html page into memory if you don't want to use memos.

Later
BoRiS
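
BoRiS's suggestion of scanning the page in memory for tag delimiters could look roughly like the sketch below. It is not from the thread: the function name StripTags is made up for illustration, and it assumes Delphi (or Free Pascal in a Delphi-compatible mode, since it uses Result). It treats every '<' as the start of a tag and every '>' as the end of one, so a stray '>' in ordinary text is dropped as well.

```pascal
{ Hypothetical helper, not from the thread: removes everything between
  '<' and '>' from a string that is already in memory. }
function StripTags(const S: string): string;
var
  i: Integer;
  InsideTag: Boolean;
begin
  Result := '';
  InsideTag := False;
  for i := 1 to Length(S) do
    if S[i] = '<' then
      InsideTag := True          { a tag starts here }
    else if S[i] = '>' then
      InsideTag := False         { the tag ends here }
    else if not InsideTag then
      Result := Result + S[i];   { ordinary text is kept }
end;
```

With the memo from the example above, something like Memo1.Text := StripTags(Memo1.Text); before saving should leave only the text.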
 
LVL 5

Expert Comment

by:scrapdog
ID: 1341245
procedure HTMLToText(const HTMLFileName :string;
                     const TextFileName :string);
var
  HTMLFile :Text;
  TextFile :Text;
  InsideTag :boolean;
  c :char;   { current character read from the HTML file }
begin
  Assign(HTMLFile, HTMLFileName);
  Reset(HTMLFile);
  Assign(TextFile, TextFileName);
  Rewrite(TextFile);
  InsideTag := false;
  while not eof(HTMLFile) do begin
    read(HTMLFile, c);
    if c = '<' then InsideTag := true
    else if c = '>' then InsideTag := false
    else if not InsideTag then write(TextFile, c);  {<--writes to text file}
  end;
  Close(TextFile);
  Close(HTMLFile);
end;

------------------------

The above procedure accepts two filenames.  All of the tags are stripped from the HTML file, and only the raw text is written to the text file.  That is all it does.  Note that no formatting is applied to the text.

Also, to keep it simple, I didn't include anything in this procedure to substitute entities, i.e. anything following the & character (such as &lt;).  If the procedure encounters &lt;, it is written to the text file as &lt; rather than <.  You can easily change this by altering the line that I marked as {<--writes to text file}.  Just test if c = '&' and then read the characters that follow (for &lt;, the two letters and the closing ';').
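
A possible way to handle the entity substitution is sketched below. This is not scrapdog's code: the function name DecodeEntities and the short list of entities it knows are assumptions for illustration, and it works on a string in memory rather than character by character on the file. It assumes Delphi or Free Pascal in a Delphi-compatible mode. Entities it does not recognise are copied through unchanged.

```pascal
{ Hypothetical helper, not from the thread: replaces a few common
  entities (&lt; &gt; &amp; &quot; &nbsp;) in a string. }
function DecodeEntities(const S: string): string;
var
  i, j: Integer;
  Ent, Rep: string;
begin
  Result := '';
  i := 1;
  while i <= Length(S) do
  begin
    if S[i] = '&' then
    begin
      { look for the closing ';' within the next few characters }
      j := i + 1;
      while (j <= Length(S)) and (j - i <= 6) and (S[j] <> ';') do
        Inc(j);
      if (j <= Length(S)) and (S[j] = ';') then
      begin
        Ent := Copy(S, i + 1, j - i - 1);   { the name between '&' and ';' }
        Rep := '';
        if Ent = 'lt' then Rep := '<'
        else if Ent = 'gt' then Rep := '>'
        else if Ent = 'amp' then Rep := '&'
        else if Ent = 'quot' then Rep := '"'
        else if Ent = 'nbsp' then Rep := ' ';
        if Rep <> '' then
        begin
          Result := Result + Rep;
          i := j + 1;                       { skip past the whole entity }
          Continue;
        end;
      end;
    end;
    Result := Result + S[i];                { plain character or unknown entity }
    Inc(i);
  end;
end;
```

Because the string is scanned only once, text such as &amp;lt; comes out as &lt; and is not decoded a second time. To use it with the procedure above, the stripped text would have to be collected into a string (per line, say) before being written out.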
 

Author Comment

by:friberg
ID: 1341246
What I'm looking for is an easy way to strip all the tags from the html file, like the 'save as text' function works in Netscape and IE. For example, if the html file contains a table, I'd like to have the columns separated by space characters.

Maybe there is a freeware component for this?

 
LVL 1

Expert Comment

by:duke_n
ID: 1341247
Do you want the text file with tags or without tags?
 
LVL 4

Expert Comment

by:BoRiS
ID: 1341249
friberg

Sorry, the answer was supposed to be sent as a comment until you told me whether you needed it with tags or without.

Later
BoRiS
 

Author Comment

by:friberg
ID: 1341250
scrapdog,

Your solution works very well with simple html files; it is very fast and reliable. But how can I modify it so that html files containing tables are also readable? I'd like a space character between columns and a new line for each row in the table.

Thanks.
 
LVL 5

Accepted Solution

by:
scrapdog earned 100 total points
ID: 1341251
I just wrote this and haven't tested it, but hopefully you can see how the logic works.  It does the same thing as the last piece of code I gave you, with table support added.

When this program senses a table, it determines the number of rows and columns from the number of TR and TD tags found.  The text in the cells is lined up and left-justified.

Again, this is simple, as it doesn't check the tags for errors.  It only knows where the table begins and ends, where the rows begin and end, and where the cells begin and end.  The row with the largest number of columns becomes the width of the table, and the largest cell in the whole table becomes the width of all the cells (which are padded with spaces).  Spaces are inserted between columns, and a new line is started at the end of a row.

Since I didn't test it, there might be syntax errors.  If it is so bad that you can't fix it, let me know.

Here it is:
-----------

const
  TABLE_BEGIN = 1;
  TABLE_END   = 2;
  CELL_BEGIN  = 3;
  CELL_END    = 4;
  ROW_BEGIN   = 5;
  ROW_END     = 6;

  MAXTABLEROWS = 100;
  MAXTABLECOLS = 100;

type
   THTMLTable = record
                  Row, Col :integer;
                  Data  :array[0..MAXTABLEROWS, 0..MAXTABLECOLS] of string;
                end;



{ Reads the rest of a tag (the caller has already consumed the '<') and
  returns one of the constants above, or 0 for any other tag. }
function GetTag(var HTMLFile :Text) :integer;
var t :string;
    InTag :boolean;
    i, x :integer;
    c :char;
begin
  t := '';
  InTag := true;
  while (not eof(HTMLFile)) and InTag do begin
    read(HTMLFile, c);
    if c = '>' then InTag := false
    else t := t + c;
  end;
  { compare in upper case so that <td> and <TD> are treated alike }
  for i := 1 to length(t) do t[i] := upcase(t[i]);
  if copy(t,1,5) = 'TABLE' then x := TABLE_BEGIN
  else if copy(t,1,6) = '/TABLE' then x := TABLE_END
  else if copy(t,1,2) = 'TR' then x := ROW_BEGIN
  else if copy(t,1,3) = '/TR' then x := ROW_END
  else if copy(t,1,2) = 'TD' then x := CELL_BEGIN
  else if copy(t,1,3) = '/TD' then x := CELL_END
  else x := 0;
  Result := x;
end;

{ Collects the text of one cell up to the closing /TD tag.
  Any other tags inside the cell are skipped. }
procedure GetCell(var HTMLFile :Text; var Cell :string);
var InCell :boolean;
    HTag :integer;
    c  :char;
begin
  Cell := '';
  InCell := true;
  while (not eof(HTMLFile)) and InCell do begin
    read(HTMLFile, c);
    if c = '<' then begin
      HTag := GetTag(HTMLFile);
      if (HTag = CELL_END) then InCell := false;
    end
    else Cell := Cell + c;
  end;
end;

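{ Reads one table (the TABLE tag itself has already been consumed) and
  pads every cell to the width of the widest cell. }
procedure GetTable(var HTMLFile :Text; var Table :THTMLTable);
var
  MaxCol, TableMax :integer;
  InTable :boolean;
  c  :char;
  Cell :string;
  HTag :integer;
  i,j,k,x :integer;
begin
  { start with empty cells so that short rows and earlier tables leave nothing behind }
  for i := 0 to MAXTABLEROWS do
    for j := 0 to MAXTABLECOLS do
      Table.Data[i,j] := '';
  Table.Col := 0;
  Table.Row := 0;
  MaxCol := 0;
  TableMax := 0;
  InTable := true;
  while not(eof(HTMLFile)) and InTable do begin
    read(HTMLFile, c);
    if c = '<' then begin
      HTag := GetTag(HTMLFile);
      case HTag of
        TABLE_END:  InTable := false;
        ROW_BEGIN: begin
                     Table.Row := Table.Row + 1;
                     Table.Col := 0;   { each row starts again at the first column }
                   end;
        CELL_BEGIN: begin
                      GetCell(HTMLFile, Cell);
                      Table.Col := Table.Col + 1;
                      { cells beyond the array limits are dropped }
                      if (Table.Row <= MAXTABLEROWS) and (Table.Col <= MAXTABLECOLS) then begin
                        Table.Data[Table.Row, Table.Col] := Cell;
                        if Table.Col > MaxCol then MaxCol := Table.Col;
                      end;
                    end;
      end;
    end;
  end;
  if Table.Row > MAXTABLEROWS then Table.Row := MAXTABLEROWS;
  Table.Col := MaxCol;
  with Table do
    begin
      { find the widest cell in the whole table }
      for i := 1 to Row do
        for j := 1 to Col do
          if Length(Data[i,j]) > TableMax then TableMax := Length(Data[i,j]);
      { pad every cell with spaces to that width }
      for i := 1 to Row do
        for j := 1 to Col do begin
          x := TableMax-Length(Data[i,j]);
          for k := 1 to x do Data[i,j] := Data[i,j] + ' ';
        end;
    end;
end;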

procedure WriteTable(var TextFile :Text;
                     var Table  :THTMLTable);
var
  i, j :integer;
begin
  for i := 1 to Table.Row do begin
    for j := 1 to Table.Col do
      write(TextFile, Table.Data[i,j],' ');   { one space between columns }
    writeln(TextFile);                        { new line after each row }
  end;
end;






procedure HTMLToText(const HTMLFileName :string;
                     const TextFileName :string);
var
  HTMLFile :Text;
  TextFile :Text;
  Table :THTMLTable;
  c   :Char;
  HTag :integer;
begin
  Assign(HTMLFile, HTMLFileName);
  Reset(HTMLFile);
  Assign(TextFile, TextFileName);
  Rewrite(TextFile);
  while not eof(HTMLFile) do begin
    read(HTMLFile, c);
    if c = '<' then begin
      { GetTag consumes the whole tag; only a TABLE tag needs special handling }
      HTag := GetTag(HTMLFile);
      if HTag = TABLE_BEGIN then begin
        writeln(TextFile);
        GetTable(HTMLFile, Table);
        WriteTable(TextFile, Table);
        writeln(TextFile);
      end;
    end
    else write(TextFile, c);  {<--writes to text file}
  end;
  Close(TextFile);
  Close(HTMLFile);
end;

----------------------

Scrapdog
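
The padding step described in this answer (every cell is filled with spaces up to the width of the widest cell) can also be looked at in isolation. The helper below is only a sketch and not part of scrapdog's code: the name PadRight is made up for illustration, and it assumes Delphi or Free Pascal in a Delphi-compatible mode.

```pascal
{ Hypothetical helper (name not from the thread): pads S with trailing
  spaces up to Width; longer strings are returned unchanged. }
function PadRight(const S: string; Width: Integer): string;
begin
  Result := S;
  while Length(Result) < Width do
    Result := Result + ' ';
end;
```

Inside GetTable, a call such as Data[i,j] := PadRight(Data[i,j], TableMax); would do the same job as the inner "for k" loop.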
 

Author Comment

by:friberg
ID: 1341252
Thanks, just what I needed!