Solved

Save html as text

Posted on 1998-09-29
Last Modified: 2010-04-06
How can I convert an html file to a text file?
I think that this is possible with the HTML Control, but I need something that doesn't require that much resources.

Any help is greatly appreciated.
Question by:friberg
LVL 4

Expert Comment

by:BoRiS
ID: 1341244
friberg

do you need the html tags as well or just the text in the html...

if you require everything in the html file here is a simple way to do it...

procedure TForm1.Button1Click(Sender: TObject);
begin
 Memo1.Lines.LoadFromFile('c:\temp\test.htm');
end;

procedure TForm1.Button2Click(Sender: TObject);
begin
 Memo1.Lines.SaveToFile('c:\temp\test.txt');
end;

if you need to strip all the tags out of the file then you will need to scan through the htm/html page and skip everything between < and >. you can also load the htm/html page into memory if you don't want to use memos etc...
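
As a rough illustration of the in-memory approach, here is a minimal sketch. StripTags is a hypothetical helper name that does not come from this thread, and the code assumes Delphi (or Free Pascal in Delphi mode, so that Result is available). It simply drops everything between < and >.

```pascal
{ Sketch only: StripTags is a hypothetical helper, not part of the thread. }
function StripTags(const S: string): string;
var
  i: Integer;
  InsideTag: Boolean;
begin
  Result := '';
  InsideTag := False;
  for i := 1 to Length(S) do
    if S[i] = '<' then
      InsideTag := True            { start of a tag: stop copying }
    else if S[i] = '>' then
      InsideTag := False           { end of a tag: resume copying }
    else if not InsideTag then
      Result := Result + S[i];     { plain text outside any tag }
end;
```

With the memo from the example above, something like Memo1.Text := StripTags(Memo1.Text) between the load and the save should be enough for simple pages. No formatting is added, and a bare < in the text would be treated as the start of a tag.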

Later
BoRiS
 
LVL 5

Expert Comment

by:scrapdog
ID: 1341245
procedure HTMLToText(const HTMLFileName :string;
                     const TextFileName :string);
var
  HTMLFile :Text;
  TextFile :Text;
  InsideTag :boolean;
  c :char;  { current character read from the HTML file }
begin
  Assign(HTMLFile, HTMLFileName);
  Reset(HTMLFile);
  Assign(TextFile, TextFileName);
  Rewrite(TextFile);
  InsideTag := false;
  while not eof(HTMLFile) do begin
    read(HTMLFile, c);
    if c = '<' then InsideTag := true
    else if c = '>' then InsideTag := false
    else if not InsideTag then write(TextFile, c);  {<--writes to text file}
  end;
  Close(TextFile);
  Close(HTMLFile);
end;

------------------------

The above procedure accepts two filenames.  All of the tags are stripped from the HTML file, and only the raw text is written to the text file.  That is all this does.  Note that no formatting is done to the text.

Also, to keep it simple, I didn't include anything in this procedure to substitute the character entities that start with & (such as &lt;).  If the procedure encounters &lt;, it is written to the text file as &lt; rather than <.  You can easily add this by altering the line that I marked as {<--writes to text file}.  Just test if c = '&' and then read the characters that follow, up to the closing ';', to see which entity it is.
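
The entity substitution described above could look roughly like the sketch below. DecodeEntities is a hypothetical name that is not from this thread, it works on a string rather than on the file, and it only knows a handful of common entities (lt, gt, amp, quot, nbsp); anything else is copied through unchanged. It assumes Delphi (or Free Pascal in Delphi mode) with the default short-circuit boolean evaluation.

```pascal
{ Sketch only: DecodeEntities is a hypothetical helper, not part of the thread. }
function DecodeEntities(const S: string): string;
var
  i, j: Integer;
  EntityName, Repl: string;
begin
  Result := '';
  i := 1;
  while i <= Length(S) do begin
    Repl := '';
    if S[i] = '&' then begin
      { look for the closing ';' within the next few characters }
      j := i + 1;
      while (j <= Length(S)) and (j - i <= 6) and (S[j] <> ';') do
        Inc(j);
      if (j <= Length(S)) and (S[j] = ';') then begin
        EntityName := Copy(S, i + 1, j - i - 1);
        if EntityName = 'lt' then Repl := '<'
        else if EntityName = 'gt' then Repl := '>'
        else if EntityName = 'amp' then Repl := '&'
        else if EntityName = 'quot' then Repl := '"'
        else if EntityName = 'nbsp' then Repl := ' ';
      end;
    end;
    if Repl <> '' then begin
      Result := Result + Repl;   { known entity: emit its character }
      i := j + 1;                { and skip past the ';' }
    end else begin
      Result := Result + S[i];   { ordinary character or unknown entity }
      Inc(i);
    end;
  end;
end;
```

Decoding should happen after the tags have been stripped. Otherwise a decoded < would be mistaken for the start of a tag.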
 

Author Comment

by:friberg
ID: 1341246
What I'm looking for is an easy way to strip all the tags from the html file, like the 'save as text' function works in Netscape and IE. For example, if the html file contains a table, I'd like to have the columns separated by space characters.

Maybe there is a freeware component for this?
 
LVL 1

Expert Comment

by:duke_n
ID: 1341247
text file
with tags or w/o tags

 

 
LVL 4

Expert Comment

by:BoRiS
ID: 1341249
friberg

sorry, the answer was supposed to be sent as a comment until you told me what you needed, with tags or without...

Later
BoRiS
 

Author Comment

by:friberg
ID: 1341250
scrapdog,

Your solution works very well with simple html files, it is very fast and reliable. But how can I modify it so html files that contain tables also are readable? I'd like a space character between columns and a new line for each row in the table.

Thanks.
 
LVL 5

Accepted Solution

by:
scrapdog earned 100 total points
ID: 1341251
I just wrote this, but didn't test it, but hopefully you can see how the logic works.  This does the same thing as the last piece of code I gave you, with table support.

When this program senses a table, it determines the number of rows and columns by the number of TR and TD tags found.  The text in the cells is lined up and left justified.

Again, this is simple, as it doesn't check the tags for errors.  It only knows where the table begins and ends, where the rows begin and end, and where the cells begin and end.  The row with the largest number of columns becomes the width of the table, and the largest cell in the whole table becomes the width of all the cells (which are padded with spaces).  Spaces are inserted between columns, and a new line is started at the end of a row.
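
To illustrate the padding rule on its own, here is a minimal sketch. PadRight and WidestCell are hypothetical helper names that are not used in the code below; they assume Delphi (or Free Pascal in Delphi mode) and only show how every cell is padded with spaces to the width of the widest one.

```pascal
{ Sketch only: PadRight and WidestCell are hypothetical helpers. }

{ Returns S padded on the right with spaces up to Width characters. }
function PadRight(const S: string; Width: Integer): string;
begin
  Result := S;
  while Length(Result) < Width do
    Result := Result + ' ';
end;

{ Returns the length of the longest string in Cells. }
function WidestCell(const Cells: array of string): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := Low(Cells) to High(Cells) do
    if Length(Cells[i]) > Result then
      Result := Length(Cells[i]);
end;
```

For the cells 'Name', 'Qty' and 'Price' the widest cell is 5 characters, so 'Qty' is padded with two spaces. One width for the whole table keeps the logic simple, at the cost of wasted space when a single cell is much longer than the rest.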

Since I didn't test it, there might be syntactical errors.  If it is so bad that you can't fix it, let me know.

Here it is:
-----------

const
  TABLE_BEGIN = 1;
  TABLE_END   = 2;
  CELL_BEGIN  = 3;
  CELL_END    = 4;
  ROW_BEGIN   = 5;
  ROW_END     = 6;

  MAXTABLEROWS = 100;
  MAXTABLECOLS = 100;

type
   THTMLTable = record
                  Row, Col :integer;
                  Data  :array[0..MAXTABLEROWS, 0..MAXTABLECOLS] of string;
                end;



{ Reads the rest of a tag (the '<' has already been consumed) and returns its code, or 0. }
function GetTag(var HTMLFile :Text) :integer;
var t :string;
    InTag :boolean;
    i, x :integer;
    c :char;
begin
  t := '';
  InTag := true;
  while (not eof(HTMLFile)) and InTag do begin
    read(HTMLFile, c);
    if c = '>' then Intag := False
    else t := t + c;
  end;
  for i := 1 to length(t) do t[i] := upcase(t[i]);
  if copy(t,1,5) = 'TABLE' then x := TABLE_BEGIN
  else if copy(t,1,6) = '/TABLE' then x := TABLE_END
  else if copy(t,1,2) = 'TR' then x := ROW_BEGIN
  else if copy(t,1,3) = '/TR' then x := ROW_END
  else if copy(t,1,2) = 'TD' then x := CELL_BEGIN
  else if copy(t,1,3) = '/TD' then x := CELL_END
  else x := 0;
  Result := x;
end;

procedure GetCell(var HTMLFile :Text; var Cell :string);
var InCell :boolean;
    HTag :integer;  { GetTag returns an integer tag code, not a boolean }
    c  :char;
begin
  Cell := '';
  InCell := true;
  while (not eof(HTMLFile)) and InCell do begin
    read(HTMLFile, c);
    if c = '<' then begin
      HTag := GetTag(HTMLFile);
      if (HTag = CELL_END) then InCell := false;
    end
    else Cell := Cell + c;
  end;
end;

{ Table must be a var parameter, otherwise the caller never sees the result. }
procedure GetTable(var HTMLFile :Text; var Table :THTMLTable);
var
  MaxCol, TableMax, HTag :integer;
  InTable :boolean;
  Cell :string;
  c  :char;
  i,j,k,x :integer;
begin
  Table.Col := 0;
  Table.Row := 0;
  MaxCol := 0;
  TableMax := 0;
  InTable := true;
  while not(eof(HTMLFile)) and InTable do begin
    read(HTMLFile, c);
    if c = '<' then begin
      HTag := GetTag(HTMLFile);
      case HTag of
        TABLE_END:  InTable := false;
        ROW_BEGIN: begin
                     Table.Row := Table.Row + 1;
                     Table.Col := 0;  { each row starts again at the first column }
                   end;
        CELL_BEGIN: begin
                      GetCell(HTMLFile, Cell);
                      with Table do begin
                        Col := Col + 1;
                        Data[Row, Col] := Cell;
                        if Col > MaxCol then MaxCol := Col;
                      end;
                    end;
        ROW_END:  begin
                    { clear the unused cells so short rows do not keep stale text }
                    with Table do
                      for i := Col+1 to MAXTABLECOLS do
                        Data[Row, i] := '';
                  end;
      end;
    end;
  end;
  Table.Col := MaxCol;
  with Table do
    begin
      { find the widest cell in the whole table }
      for i := 1 to Row do
        for j := 1 to Col do
          if Length(Data[i,j]) > TableMax then TableMax := Length(Data[i,j]);
      { pad every cell with spaces to that width }
      for i := 1 to Row do
        for j := 1 to Col do begin
          x := TableMax-Length(Data[i,j]);
          for k := 1 to x do Data[i,j] := Data[i,j] + ' ';
        end;
    end;
end;

procedure WriteTable(var TextFile :Text;
                     var Table  :THTMLTable);
var
  i, j :integer;
begin
  for i := 1 to Table.Row do begin
    for j := 1 to Table.Col do
      write(TextFile, Table.Data[i,j],' ');
    writeln(TextFile);
  end;
end;






procedure HTMLToText(const HTMLFileName :string;
                     const TextFileName :string);
var
  HTMLFile :Text;
  TextFile :Text;
  Table :THTMLTable;
  c   :Char;
  HTag :integer;
begin
  Assign(HTMLFile, HTMLFileName);
  Reset(HTMLFile);
  Assign(TextFile, TextFileName);
  Rewrite(TextFile);
  while not eof(HTMLFile) do begin
    read(HTMLFile, c);
    if c = '<' then begin
      HTag := GetTag(HTMLFile);  { consumes the rest of the tag }
      if HTag = TABLE_BEGIN then begin
        writeln(TextFile);
        GetTable(HTMLFile, Table);
        WriteTable(TextFile, Table);
        writeln(TextFile);
      end;
    end
    else write(TextFile, c);  {<--writes to text file}
  end;
  Close(TextFile);
  Close(HTMLFile);
end;

----------------------

Scrapdog
 

Author Comment

by:friberg
ID: 1341252
Thanks, just what I needed!
