Solved

Extracting text only (500 points)

Posted on 2011-03-18
Last Modified: 2012-05-11
The user clicks on a file in a listbox and, if the file is human-readable (.txt, .doc, .htm, .html, .pdf), its text is extracted and displayed in a memo. Just the plain text, without anything else.

Either the code or a component.

I am using Delphi Starter XE.

Many thanks for your help.
Question by:rincewind666
 
LVL 9

Expert Comment

by:Mahdi78
ID: 35168724
Try this

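procedure TForm1.ListBox1Click(Sender: TObject);
begin
  // ItemIndex is -1 when nothing is selected
  if ListBox1.ItemIndex < 0 then
    Exit;
  // SameText compares case-insensitively, so '.TXT' also matches
  if SameText(ExtractFileExt(ListBox1.Items[ListBox1.ItemIndex]), '.txt') then
    Memo1.Lines.LoadFromFile(ListBox1.Items[ListBox1.ItemIndex]);
end;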
 
LVL 32

Expert Comment

by:ewangoya
ID: 35168770
This depends on what type of data you are loading.
If it's a simple .txt file, then it's very simple.
Text file:

var
  List: TStringList;
begin
  List := TStringList.Create;
  try
    // Load the selected file, then copy its lines into the memo
    List.LoadFromFile(ListBox.Items[ListBox.ItemIndex]);
    Memo1.Lines.Assign(List);
  finally
    List.Free;
  end;
end;

 
LVL 24

Expert Comment

by:jimyX
ID: 35168791
Of these formats, only .txt files will be human-readable when loaded into the Memo. .doc and .pdf files will not be, and .htm/.html files will show the HTML markup as well as the text.

You need special handling for each file type.
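
One way to picture that per-type handling is a small helper that maps the extension to a file kind, so the click handler can pick a different extraction routine for each kind. This is only a sketch: TFileKind and ClassifyFile are made-up names for illustration, not part of the RTL or of any code in this thread.

```pascal
uses
  SysUtils;

type
  // Hypothetical enumeration of the kinds of file the question mentions
  TFileKind = (fkPlainText, fkHtml, fkWord, fkPdf, fkUnsupported);

// Hypothetical helper: decides which kind of handling a file needs,
// based only on its extension (compared case-insensitively).
function ClassifyFile(const FileName: string): TFileKind;
var
  Ext: string;
begin
  Ext := LowerCase(ExtractFileExt(FileName));
  if Ext = '.txt' then
    Result := fkPlainText
  else if (Ext = '.htm') or (Ext = '.html') then
    Result := fkHtml
  else if Ext = '.doc' then
    Result := fkWord
  else if Ext = '.pdf' then
    Result := fkPdf
  else
    Result := fkUnsupported;
end;
```

In a ListBox1Click handler, a case statement on ClassifyFile(FileName) could then load fkPlainText files directly into the memo and route the other kinds to whatever converter or viewer is available for them.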
 
LVL 9

Expert Comment

by:Mahdi78
ID: 35168816
As jimyX said, only .txt files will read as plain text. If you want to see the raw HTML code or the raw binary content of a PDF, this code does that:

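procedure TForm1.ListBox1Click(Sender: TObject);
const
  MyExt = '.txt.doc.pdf.html.htm.'; // trailing dot so only whole extensions match
var
  Ext: string;
begin
  if ListBox1.ItemIndex < 0 then Exit; // nothing selected
  // Lower-case the extension so '.TXT' and '.txt' are treated the same
  Ext := LowerCase(ExtractFileExt(ListBox1.Items[ListBox1.ItemIndex]));
  if (Ext <> '') and (Pos(Ext + '.', MyExt) > 0) then
    Memo1.Lines.LoadFromFile(ListBox1.Items[ListBox1.ItemIndex]);
end;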
 
LVL 32

Expert Comment

by:ewangoya
ID: 35168821

For files like PDF or HTML, you have to know how to remove the tags, but this is not really worth it.
It's much better to use a viewer for each type of file, e.g. TWebBrowser, TOleContainer, TRichEdit, etc.

var
  Buffer: AnsiString; // one byte per character, so Length matches the file size
  Stream: TFileStream;
begin
  Stream := TFileStream.Create(ListBox.Items[ListBox.ItemIndex], fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Buffer, Stream.Size);
    // Read the whole file as raw bytes
    Stream.ReadBuffer(Pointer(Buffer)^, Length(Buffer));

    // here you would have to strip out the tags depending on what type of file you opened

    Memo1.Lines.Add(string(Buffer));
  finally
    FreeAndNil(Stream);
  end;
end;
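
For the HTML case, the "strip out the tags" step could be sketched very naively as below. StripTags is a made-up helper for illustration, not an RTL function. It simply drops everything between '<' and '>', so it does not decode entities such as &amp;amp;, it keeps the contents of title, script and style elements, and it joins the text of adjacent elements without a separator. It is nowhere near a real HTML parser, which is why a viewer or a proper conversion routine is the safer route.

```pascal
// Hypothetical helper: removes anything that looks like <...> from the input.
// A '>' that is not closing a tag is kept as ordinary text.
function StripTags(const Html: string): string;
var
  I: Integer;
  InTag: Boolean;
begin
  Result := '';
  InTag := False;
  for I := 1 to Length(Html) do
  begin
    if Html[I] = '<' then
      InTag := True              // start skipping at the opening bracket
    else if (Html[I] = '>') and InTag then
      InTag := False             // stop skipping after the closing bracket
    else if not InTag then
      Result := Result + Html[I];
  end;
end;
```

With this sketch, StripTags('<p><strong>to be extracted</strong></p>') yields 'to be extracted'.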
 

Author Comment

by:rincewind666
ID: 35168866
An example for a htm file:

I need "I want just this text to be extracted" in the memo, given the following HTML file (the same principle applies to .doc and .pdf files: text only, without any tags):

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

<head>
<meta content="en-gb" http-equiv="Content-Language" />
<meta content="text/html; charset=utf-8" http-equiv="Content-Type" />
<title>Untitled 1</title>
</head>

<body>

<p>I want just this text</p>
<p><strong>to be extracted</strong></p>

</body>

</html>
 
LVL 24

Expert Comment

by:jimyX
ID: 35168936
There is one component I know which can help you here:
http://atorg.net.ru/delphi/atviewer.htm

It reads all the mentioned formats and even more.
 
LVL 32

Expert Comment

by:ewangoya
ID: 35168960

As I said, this is not trivial. The tags in a PDF file are not the same as the tags in an HTML file, and MS Word files are binary, so you won't get much out of them.

You really need a viewer to do this. Just use a TOleContainer; that should be able to display all of these file types.

 
 
LVL 9

Accepted Solution

by:
Mahdi78 earned 500 total points
ID: 35168967
This function will convert HTML code to plain text


uses ActiveX, mshtml, ComObj;

{$R *.dfm}

function HtmltoText(HtmlText: string; TrimStr: Boolean = False ): string;
var
  IDoc:      IHTMLDocument2;
  Strl:      TStringList;
  v:         Variant;
begin
    Strl := TStringList.Create;
    try
      Strl.Text := HtmlText;
      Idoc:=CreateComObject(Class_HTMLDOcument) as IHTMLDocument2;
      try
        IDoc.designMode:='on';
        while IDoc.readyState<>'complete' do
          Application.ProcessMessages;
        v:=VarArrayCreate([0,0],VarVariant);
        v[0]:= Strl.Text;
        IDoc.write(PSafeArray(System.TVarData(v).VArray));
        IDoc.designMode:='off';
        while IDoc.readyState<>'complete' do
          Application.ProcessMessages;
        if Trimstr then result := Trim(IDoc.body.innerText)
        else Result := IDoc.body.innerText;
      finally
        IDoc := nil;
      end;
    finally
      Strl.Free;
    end;
end;

 

Author Closing Comment

by:rincewind666
ID: 35187894
Many thanks for your help.
 

Author Comment

by:rincewind666
ID: 35187919
Many thanks for your help.

A note to jimyX: Thanks for the link to the ATViewer component. It looks great and I would have used it, but it is not compatible with Delphi XE. A great shame. Anyway, thanks for your help.