  • Status: Solved

Save html as text

How can I convert an html file to a text file?
I think this is possible with the HTML Control, but I need something that doesn't require that many resources.

Any help is greatly appreciated.
Asked by: friberg
1 Solution
 
BoRiS Commented:
friberg

Do you need the HTML tags as well, or just the text in the HTML?

If you require everything in the HTML file, here is a simple way to do it:

procedure TForm1.Button1Click(Sender: TObject);
begin
 Memo1.Lines.LoadFromFile('c:\temp\test.htm');
end;

procedure TForm1.Button2Click(Sender: TObject);
begin
 Memo1.Lines.SaveToFile('c:\temp\test.txt');
end;

If you need to strip all the tags out of the file, then you will need to search through the htm/html page for the < and > characters and drop everything between them. You can also load the htm/html page into memory if you don't want to use memos etc.
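
As a rough illustration of the in-memory approach, here is a minimal sketch that strips everything between < and > from a string. The function name StripTags is hypothetical (it is not part of any post in this thread), and the sketch assumes the page has already been loaded into a string. Like any simple scanner of this kind, it does not handle a > inside an attribute value or inside comments.

```pascal
program StripTagsDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

{ Hypothetical helper, not from the thread: removes everything between
  '<' and '>' from a string that is already in memory. }
function StripTags(const S: string): string;
var
  i: Integer;
  InsideTag: Boolean;
begin
  Result := '';
  InsideTag := False;
  for i := 1 to Length(S) do
    if S[i] = '<' then
      InsideTag := True            { a tag starts: stop copying }
    else if S[i] = '>' then
      InsideTag := False           { the tag ends: resume copying }
    else if not InsideTag then
      Result := Result + S[i];     { plain text outside any tag }
end;

begin
  WriteLn(StripTags('<p>Hello <b>world</b></p>'));  { prints: Hello world }
end.
```

The text of a memo could be passed the same way, for example StripTags(Memo1.Lines.Text), before saving it.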

Later
BoRiS
 
scrapdog Commented:
procedure HTMLToText(const HTMLFileName :string;
                     const TextFileName :string);
var
  HTMLFile :Text;
  TextFile :Text;
  InsideTag :boolean;
  c :char;  {current character read from the HTML file}
begin
  Assign(HTMLFile, HTMLFileName);
  Reset(HTMLFile);
  Assign(TextFile, TextFileName);
  Rewrite(TextFile);
  InsideTag := false;
  while not eof(HTMLFile) do begin
    read(HTMLFile, c);
    if c = '<' then InsideTag := true
    else if c = '>' then InsideTag := false
    else if not InsideTag then write(TextFile, c);  {<--writes to text file}
  end;
  Close(TextFile);
  Close(HTMLFile);
end;

------------------------

The above procedure accepts two filenames.  All of the tags are stripped from the HTML file, and only the raw text is written to the text file.  That is all it does.  Note that no formatting is done to the text.

Also, to keep it simple, I didn't include anything in this procedure to substitute the character entities that follow the & character (such as &lt;).  If the procedure encounters &lt;, it is written to the text file as &lt; rather than <.  You can easily add this by altering the line that I marked as {<--writes to text file}.  Just test if c = '&' and then read the following characters up to the terminating ';'.
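
A minimal sketch of that entity substitution, working on a string rather than character by character from the file. The names EntityToText, DecodeEntities and MaxNameLen are hypothetical, and only a handful of common entities are mapped; anything unrecognised is passed through unchanged.

```pascal
program DecodeEntitiesDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

{ Hypothetical helper: maps a few common entity names to characters.
  Unknown names are returned unchanged, including the '&' and ';'. }
function EntityToText(const Name: string): string;
begin
  if Name = 'lt' then Result := '<'
  else if Name = 'gt' then Result := '>'
  else if Name = 'amp' then Result := '&'
  else if Name = 'quot' then Result := '"'
  else if Name = 'nbsp' then Result := ' '
  else Result := '&' + Name + ';';
end;

{ Replaces entities such as &lt; in S. A '&' with no ';' within
  MaxNameLen characters is copied through as-is. }
function DecodeEntities(const S: string): string;
const
  MaxNameLen = 8;
var
  i, j: Integer;
begin
  Result := '';
  i := 1;
  while i <= Length(S) do begin
    if S[i] = '&' then begin
      { look ahead for the terminating ';' }
      j := i + 1;
      while (j <= Length(S)) and (j - i <= MaxNameLen) and (S[j] <> ';') do
        Inc(j);
      if (j <= Length(S)) and (S[j] = ';') then begin
        Result := Result + EntityToText(Copy(S, i + 1, j - i - 1));
        i := j + 1;   { skip past the whole entity }
        Continue;
      end;
    end;
    Result := Result + S[i];
    Inc(i);
  end;
end;

begin
  WriteLn(DecodeEntities('a &lt; b &amp;&amp; c &gt; d'));  { prints: a < b && c > d }
end.
```

Decoding should run after the tags are stripped, otherwise a decoded < would be mistaken for the start of a tag.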
 
friberg (Author) Commented:
What I'm looking for is an easy way to strip all the tags from the html file, like the 'save as text' function works in Netscape and IE. For example, if the html file contains a table, I'd like to have the columns separated by space characters.

Maybe there is a freeware component for this?
 
duke_n Commented:
A text file with tags or without tags?
 
BoRiS Commented:
friberg

Sorry, the answer was supposed to be sent as a comment until you told me whether you needed it with tags or without.

Later
BoRiS
 
friberg (Author) Commented:
scrapdog,

Your solution works very well with simple html files; it is very fast and reliable. But how can I modify it so that html files containing tables are also readable? I'd like a space character between columns and a new line for each row in the table.

Thanks.
 
scrapdog Commented:
I just wrote this and didn't test it, but hopefully you can see how the logic works.  This does the same thing as the last piece of code I gave you, with table support.

When this program senses a table, it determines the number of rows and columns by the number of TR and TD tags found.  The text in the cells is lined up and left-justified.

Again, this is simple, as it doesn't check the tags for errors.  It only knows where the table begins and ends, where the rows begin and end, and where the cells begin and end.  The row with the largest number of columns becomes the width of the table, and the largest cell in the whole table becomes the width of all the cells (which are padded with spaces).  Spaces are inserted between columns, and a new line is started at the end of a row.

Since I didn't test it, there might be syntactical errors.  If it is so bad that you can't fix it, let me know.

Here it is:
-----------

const
  TABLE_BEGIN = 1;
  TABLE_END   = 2;
  CELL_BEGIN  = 3;
  CELL_END    = 4;
  ROW_BEGIN   = 5;
  ROW_END     = 6;

  MAXTABLEROWS = 100;
  MAXTABLECOLS = 100;

type
   THTMLTable = record
                  Row, Col :integer;
                  Data  :array[0..MAXTABLEROWS, 0..MAXTABLECOLS] of string;
                end;



function GetTag(var HTMLFile :Text) :integer;
var t :string;
    InTag :boolean;
    i, x :integer;
    c :char;
begin
  t := '';
  InTag := true;
  while (not eof(HTMLFile)) and InTag do begin
    read(HTMLFile, c);
    if c = '>' then Intag := False
    else t := t + c;
  end;
  for i := 1 to length(t) do t[i] := upcase(t[i]);
  if copy(t,1,5) = 'TABLE' then x := TABLE_BEGIN
  else if copy(t,1,6) = '/TABLE' then x := TABLE_END
  else if copy(t,1,2) = 'TR' then x := ROW_BEGIN
  else if copy(t,1,3) = '/TR' then x := ROW_END
  else if copy(t,1,2) = 'TD' then x := CELL_BEGIN
  else if copy(t,1,3) = '/TD' then x := CELL_END
  else x := 0;
  Result := x;
end;

procedure GetCell(var HTMLFile :Text; var Cell :string);
var InCell :boolean;
    HTag :integer;  {GetTag returns one of the tag constants, not a boolean}
    c  :char;
begin
  Cell := '';
  InCell := true;
  while (not eof(HTMLFile)) and InCell do begin
    read(HTMLFile, c);
    if c = '<' then begin
      HTag := GetTag(HTMLFile);
      if (HTag = CELL_END) then InCell := false;
    end
    else Cell := Cell + c;
  end;
end;

procedure GetTable(var HTMLFile :Text; var Table :THTMLTable);
var
  MaxCol, TableMax, HTag :integer;
  InTable :boolean;
  c  :char;
  Cell :string;
  i,j,k,x :integer;
begin
  Table.Col := 0;
  Table.Row := 0;
  MaxCol := 0;
  TableMax := 0;
  InTable := true;
  while not(eof(HTMLFile)) and InTable do begin
    read(HTMLFile, c);
    if c = '<' then begin
      HTag := GetTag(HTMLFile);
      case HTag of
        TABLE_END:  InTable := false;
        ROW_BEGIN: if Table.Row < MAXTABLEROWS then begin
                     {start a new row: reset the column and clear stale cells}
                     Table.Row := Table.Row + 1;
                     Table.Col := 0;
                     for i := 0 to MAXTABLECOLS do
                       Table.Data[Table.Row, i] := '';
                   end;
        CELL_BEGIN: begin
                      GetCell(HTMLFile, Cell);
                      with Table do
                        {ignore cells outside a row or beyond the array bounds}
                        if (Row > 0) and (Col < MAXTABLECOLS) then begin
                          Col := Col + 1;
                          Data[Row, Col] := Cell;
                          if Col > MaxCol then MaxCol := Col;
                        end;
                    end;
      end;
    end;
  end;
  {the widest row sets the table width}
  Table.Col := MaxCol;
  with Table do
    begin
      {find the longest cell, then pad every cell to that length}
      for i := 1 to Row do
        for j := 1 to Col do
          if Length(Data[i,j]) > TableMax then TableMax := Length(Data[i,j]);
      for i := 1 to Row do
        for j := 1 to Col do begin
          x := TableMax-Length(Data[i,j]);
          for k := 1 to x do Data[i,j] := Data[i,j] + ' ';
        end;
    end;
end;

procedure WriteTable(var TextFile :Text;
                     var Table  :THTMLTable);
var
  i, j :integer;
begin
  for i := 1 to Table.Row do begin
    for j := 1 to Table.Col do
      write(TextFile, Table.Data[i,j],' ');
    writeln(TextFile);
  end;
end;






procedure HTMLToText(const HTMLFileName :string;
                     const TextFileName :string);
var
  HTMLFile :Text;
  TextFile :Text;
  Table :THTMLTable;
  c   :Char;
  HTag :integer;
begin
  Assign(HTMLFile, HTMLFileName);
  Reset(HTMLFile);
  Assign(TextFile, TextFileName);
  Rewrite(TextFile);
  while not eof(HTMLFile) do begin
    read(HTMLFile, c);
    if c = '<' then begin
      {GetTag consumes the rest of the tag, so all other tags are simply dropped}
      HTag := GetTag(HTMLFile);
      if HTag = TABLE_BEGIN then begin
        writeln(TextFile);
        GetTable(HTMLFile, Table);
        WriteTable(TextFile, Table);
        writeln(TextFile);
      end;
    end
    else write(TextFile, c);  {<--writes to text file}
  end;
  Close(TextFile);
  Close(HTMLFile);
end;

----------------------

Scrapdog
 
friberg (Author) Commented:
Thanks, just what I needed!