Solved

Save html as text

Posted on 1998-09-29
Last Modified: 2010-04-06
How can I convert an html file to a text file?
I think this is possible with the HTML control, but I need something that doesn't require that many resources.

Any help is greatly appreciated.
Question by:friberg
LVL 4

Expert Comment

by:BoRiS
ID: 1341244
friberg

do you need the html tags as well or just the text in the html...

if you require everything in the html file here is a simple way to do it...

procedure TForm1.Button1Click(Sender: TObject);
begin
 Memo1.Lines.LoadFromFile('c:\temp\test.htm');
end;

procedure TForm1.Button2Click(Sender: TObject);
begin
 Memo1.Lines.SaveToFile('c:\temp\test.txt');
end;

if you need to strip all the tags out of the file, then you will need to scan the htm/html page for the tag delimiters (<, >, /> etc.) and drop everything between them. you can also load the htm/html page into memory if you don't want to use memos etc...
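
As a rough sketch of the in-memory approach (not BoRiS's code): the helper below, hypothetically named StripTags, drops everything between '<' and '>' in a string. It assumes the page has already been loaded into a string, for example from the Text property of a TStringList after LoadFromFile. It makes no attempt to handle comments, scripts, or a '>' inside an attribute value.

```pascal
program StripTagsDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

{ Hypothetical helper, not from the thread: copies S to the result,
  skipping every character from a '<' up to the next '>'. }
function StripTags(const S: string): string;
var
  i: Integer;
  InsideTag: Boolean;
begin
  Result := '';
  InsideTag := False;
  for i := 1 to Length(S) do
    if S[i] = '<' then
      InsideTag := True
    else if S[i] = '>' then
      InsideTag := False
    else if not InsideTag then
      Result := Result + S[i];
end;

begin
  { prints: Hello, world! }
  WriteLn(StripTags('<p>Hello, <b>world</b>!</p>'));
end.
```

Working on a string keeps the form free of a TMemo, at the cost of holding the whole page in memory at once.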

Later
BoRiS
LVL 5

Expert Comment

by:scrapdog
ID: 1341245
procedure HTMLToText(const HTMLFileName :string;
                     const TextFileName :string);
var
  HTMLFile :Text;
  TextFile :Text;
  InsideTag :boolean;
  c :char;   {the character currently being examined}
begin
  Assign(HTMLFile, HTMLFileName);
  Reset(HTMLFile);
  Assign(TextFile, TextFileName);
  Rewrite(TextFile);
  InsideTag := false;
  while not eof(HTMLFile) do begin
    read(HTMLFile, c);
    if c = '<' then InsideTag := true        {a tag starts: stop copying}
    else if c = '>' then InsideTag := false  {the tag ends: resume copying}
    else if not InsideTag then write(TextFile, c);  {<--writes to text file}
  end;
  Close(TextFile);
  Close(HTMLFile);
end;

------------------------

The above procedure accepts two filenames.  All of the tags are stripped from the HTML file, and only the raw text is written to the text file.  That is all it does.  Note that no formatting is applied to the text.

Also, to keep it simple, I didn't include anything in this procedure to substitute character entities, i.e. anything following an & (such as &lt;).  If the procedure encounters &lt;, it is written to the text file as &lt; rather than <.  You can easily add this by altering the line that I marked as {<--writes to text file}.  Just test if c = '&' and then read the entity name that follows, up to the terminating ';' (for &lt; that is the two characters lt).
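
A minimal sketch of that entity substitution, working on a string rather than a file to keep it short. DecodeEntities and EntityToChar are hypothetical names, and the handful of entities listed is an assumption. Real pages use many more, plus numeric forms such as &#60;, which this sketch leaves untouched.

```pascal
program EntityDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

{ Hypothetical helper: maps a few entity names to their characters.
  Returns #0 for any name it does not know. }
function EntityToChar(const Name: string): Char;
begin
  if Name = 'lt' then Result := '<'
  else if Name = 'gt' then Result := '>'
  else if Name = 'amp' then Result := '&'
  else if Name = 'quot' then Result := '"'
  else if Name = 'nbsp' then Result := ' '
  else Result := #0;
end;

{ Replaces known entities such as &lt; with the character they stand for.
  Unknown or unterminated entities are copied through unchanged. }
function DecodeEntities(const S: string): string;
var
  i, j: Integer;
  C: Char;
begin
  Result := '';
  i := 1;
  while i <= Length(S) do
  begin
    if S[i] = '&' then
    begin
      { look for the terminating ';' within the next few characters }
      j := i + 1;
      while (j <= Length(S)) and (j - i <= 8) and (S[j] <> ';') do
        Inc(j);
      if (j <= Length(S)) and (S[j] = ';') then
        C := EntityToChar(Copy(S, i + 1, j - i - 1))
      else
        C := #0;
      if C <> #0 then
      begin
        Result := Result + C;
        i := j + 1;   { skip past the ';' }
        Continue;
      end;
    end;
    Result := Result + S[i];
    Inc(i);
  end;
end;

begin
  { prints: 1 < 2 && 3 > 2 }
  WriteLn(DecodeEntities('1 &lt; 2 &amp;&amp; 3 &gt; 2'));
end.
```

In the file-based procedure the same idea applies at the marked line: on '&', collect characters until ';', look the name up, and write the resulting character instead.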

Author Comment

by:friberg
ID: 1341246
What I'm looking for is an easy way to strip all the tags from the html file, like the 'save as text' function works in Netscape and IE. For example, if the html file contains a table, I'd like to have the columns separated by space characters.

Maybe there is a freeware component for this?
LVL 1

Expert Comment

by:duke_n
ID: 1341247
text file
with tags or w/o tags

Author Comment

by:friberg
ID: 1341248
What I'm looking for is an easy way to strip all the tags from the html file, like the 'save as text' function works in Netscape and IE. For example, if the html file contains a table, I'd like to have the columns separated by space characters.

Maybe there is a freeware component for this?
LVL 4

Expert Comment

by:BoRiS
ID: 1341249
friberg

sorry, that answer was supposed to be sent as a comment until you told me whether you needed the text with tags or without...

Later
BoRiS

Author Comment

by:friberg
ID: 1341250
scrapdog,

Your solution works very well with simple html files; it is very fast and reliable. But how can I modify it so that html files containing tables are also readable? I'd like a space character between columns and a new line for each row in the table.

Thanks.
LVL 5

Accepted Solution

by:
scrapdog earned 100 total points
ID: 1341251
I just wrote this and haven't tested it, but hopefully you can see how the logic works.  It does the same thing as the last piece of code I gave you, with table support added.

When this program encounters a table, it determines the number of rows and columns from the number of TR and TD tags found.  The text in the cells is lined up and left-justified.

Again, this is simple, as it doesn't check the tags for errors.  It only knows where the table begins and ends, where the rows begin and end, and where the cells begin and end.  The row with the largest number of columns determines the width of the table, and the largest cell in the whole table determines the width of all the cells (which are padded with spaces).  Spaces are inserted between columns, and a new line is started at the end of a row.

Since I didn't test it, there might be syntax errors.  If it is so bad that you can't fix it, let me know.

Here it is:
-----------

const
  TABLE_BEGIN = 1;
  TABLE_END   = 2;
  CELL_BEGIN  = 3;
  CELL_END    = 4;
  ROW_BEGIN   = 5;
  ROW_END     = 6;

  MAXTABLEROWS = 100;
  MAXTABLECOLS = 100;

type
   THTMLTable = record
                  Row, Col :integer;
                  Data  :array[0..MAXTABLEROWS, 0..MAXTABLECOLS] of string;
                end;

{Reads the rest of a tag (the '<' has already been consumed) and returns
 one of the constants above, or 0 for any other tag.}
function GetTag(var HTMLFile :Text) :integer;
var t :string;
    InTag :boolean;
    i :integer;
    c :char;
begin
  t := '';
  InTag := true;
  while (not eof(HTMLFile)) and InTag do begin
    read(HTMLFile, c);
    if c = '>' then InTag := false
    else t := t + c;
  end;
  for i := 1 to length(t) do t[i] := upcase(t[i]);
  if copy(t,1,5) = 'TABLE' then Result := TABLE_BEGIN
  else if copy(t,1,6) = '/TABLE' then Result := TABLE_END
  else if copy(t,1,2) = 'TR' then Result := ROW_BEGIN
  else if copy(t,1,3) = '/TR' then Result := ROW_END
  else if copy(t,1,2) = 'TD' then Result := CELL_BEGIN
  else if copy(t,1,3) = '/TD' then Result := CELL_END
  else Result := 0;
end;

{Collects the text of one cell, skipping any tags, until </TD> is found.}
procedure GetCell(var HTMLFile :Text; var Cell :string);
var InCell :boolean;
    HTag :integer;
    c  :char;
begin
  Cell := '';
  InCell := true;
  while (not eof(HTMLFile)) and InCell do begin
    read(HTMLFile, c);
    if c = '<' then begin
      HTag := GetTag(HTMLFile);
      if (HTag = CELL_END) then InCell := false;
    end
    else Cell := Cell + c;
  end;
end;

{Reads a whole table into Table and pads every cell to the same width.
 Table must be a var parameter, otherwise the caller never sees the data.}
procedure GetTable(var HTMLFile :Text; var Table :THTMLTable);
var
  MaxCol, TableMax :integer;
  InTable :boolean;
  c  :char;
  Cell :string;
  HTag :integer;
  i,j,k,x :integer;
begin
  Table.Col := 0;
  Table.Row := 0;
  MaxCol := 0;
  TableMax := 0;
  InTable := true;
  while (not eof(HTMLFile)) and InTable do begin
    read(HTMLFile, c);
    if c = '<' then begin
      HTag := GetTag(HTMLFile);
      case HTag of
        TABLE_END:  InTable := false;
        {start a new row: restart the column count and clear the row, so
         short rows end up with empty cells; rows beyond MAXTABLEROWS are
         folded into the last row}
        ROW_BEGIN:  if Table.Row < MAXTABLEROWS then begin
                      Table.Row := Table.Row + 1;
                      Table.Col := 0;
                      for i := 1 to MAXTABLECOLS do
                        Table.Data[Table.Row, i] := '';
                    end;
        CELL_BEGIN: begin
                      GetCell(HTMLFile, Cell);
                      with Table do
                        if Col < MAXTABLECOLS then begin
                          Col := Col + 1;
                          Data[Row, Col] := Cell;
                          if Col > MaxCol then MaxCol := Col;
                        end;
                    end;
      end;
    end;
  end;
  Table.Col := MaxCol;
  with Table do
    begin
      {find the longest cell in the whole table}
      for i := 1 to Row do
        for j := 1 to Col do
          if Length(Data[i,j]) > TableMax then TableMax := Length(Data[i,j]);
      {pad every cell with spaces up to that length}
      for i := 1 to Row do
        for j := 1 to Col do begin
          x := TableMax-Length(Data[i,j]);
          for k := 1 to x do Data[i,j] := Data[i,j] + ' ';
        end;
    end;
end;

procedure WriteTable(var TextFile :Text;
                     var Table  :THTMLTable);
var i, j :integer;
begin
  for i := 1 to Table.Row do begin
    for j := 1 to Table.Col do
      write(TextFile, Table.Data[i,j],' ');
    writeln(TextFile);
  end;
end;

procedure HTMLToText(const HTMLFileName :string;
                     const TextFileName :string);
var
  HTMLFile :Text;
  TextFile :Text;
  Table :THTMLTable;
  c   :Char;
  HTag :integer;
begin
  Assign(HTMLFile, HTMLFileName);
  Reset(HTMLFile);
  Assign(TextFile, TextFileName);
  Rewrite(TextFile);
  while not eof(HTMLFile) do begin
    read(HTMLFile, c);
    if c = '<' then begin
      HTag := GetTag(HTMLFile);   {consumes the tag; all other tags are dropped}
      if HTag = TABLE_BEGIN then begin
        writeln(TextFile);
        GetTable(HTMLFile, Table);
        WriteTable(TextFile, Table);
        writeln(TextFile);
      end;
    end
    else write(TextFile, c);  {<--writes to text file}
  end;
  Close(TextFile);
  Close(HTMLFile);
end;

----------------------
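
The column-alignment step can be tried on its own. The sketch below is a standalone illustration with made-up cell data and a hypothetical PadRight helper. It finds the widest cell in the table and pads every cell to that width, as described above.

```pascal
program PadDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

const
  ROWS = 2;
  COLS = 3;

{ Hypothetical helper: pads S on the right with spaces up to Width. }
function PadRight(const S: string; Width: Integer): string;
begin
  Result := S;
  while Length(Result) < Width do
    Result := Result + ' ';
end;

var
  Data: array[1..ROWS, 1..COLS] of string;
  i, j, Widest: Integer;
begin
  { made-up sample data; the last cell is left empty, like a short row }
  Data[1,1] := 'Name';   Data[1,2] := 'Qty'; Data[1,3] := 'Price';
  Data[2,1] := 'Apples'; Data[2,2] := '12';  Data[2,3] := '';

  { the widest cell in the whole table sets the width of every column }
  Widest := 0;
  for i := 1 to ROWS do
    for j := 1 to COLS do
      if Length(Data[i,j]) > Widest then
        Widest := Length(Data[i,j]);

  { one line per row, one space between the padded columns }
  for i := 1 to ROWS do
  begin
    for j := 1 to COLS do
      Write(PadRight(Data[i,j], Widest), ' ');
    WriteLn;
  end;
end.
```

Using a single width for all columns is simple but wastes space when one cell is much longer than the rest; a per-column width would be tighter.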

Scrapdog

Author Comment

by:friberg
ID: 1341252
Thanks, just what I needed!