Solved

Extracting text only (500 points)

Posted on 2011-03-18
Last Modified: 2012-05-11
The user clicks on a file in a listbox and, if it is a human-readable type (.txt, .doc, .htm, .html, .pdf), the text is extracted and displayed in a memo: just the plain text, without anything else.

Either the code or a component.

I am using Delphi Starter XE.

Many thanks for your help.
Question by:rincewind666
11 Comments
 
LVL 9

Expert Comment

by:Mahdi78
ID: 35168724
Try this

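procedure TForm1.ListBox1Click(Sender: TObject);
var
  FileName: string;
begin
  if ListBox1.ItemIndex < 0 then Exit; // nothing selected
  FileName := ListBox1.Items[ListBox1.ItemIndex];
  // Case-insensitive comparison, so '.TXT' also matches
  if SameText(ExtractFileExt(FileName), '.txt') then
    Memo1.Lines.LoadFromFile(FileName);
end;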
 
LVL 32

Expert Comment

by:ewangoya
ID: 35168770
This depends on what type of data you are loading.
If it's a simple .txt file, then it's very simple.
Text file:

var
  List: TStringList;
begin
  List := TStringList.Create;
  try
    List.LoadFromFile(ListBox.Items[ListBox.ItemIndex]);
    Memo1.Lines.Assign(List);
  finally
    List.Free;
  end;
end;

 
LVL 24

Expert Comment

by:jimyX
ID: 35168791
The only files that will be human-readable in the Memo are .txt files. The others (.doc, .pdf) will not be readable, and .htm/.html files will show the HTML markup as well.

You need special handling for each file type.
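One way to set up that per-type handling is to classify the selected file by its extension first and then branch on the result. The following is only a minimal sketch: TFileKind and ClassifyFile are hypothetical names invented for this illustration, not part of the RTL or VCL, and the extension list is simply the one from the question.

```pascal
uses SysUtils;

type
  // Hypothetical categories, named here only for illustration
  TFileKind = (fkPlainText, fkHtml, fkWord, fkPdf, fkUnknown);

// Maps a file name to a category based on its extension (case-insensitive).
function ClassifyFile(const FileName: string): TFileKind;
var
  Ext: string;
begin
  Ext := LowerCase(ExtractFileExt(FileName));
  if Ext = '.txt' then
    Result := fkPlainText
  else if (Ext = '.htm') or (Ext = '.html') then
    Result := fkHtml
  else if Ext = '.doc' then
    Result := fkWord
  else if Ext = '.pdf' then
    Result := fkPdf
  else
    Result := fkUnknown;
end;
```

The OnClick handler could then use a case statement on ClassifyFile(FileName), loading fkPlainText files straight into the memo and passing the other kinds to whatever extractor or viewer is chosen for them.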

 
LVL 9

Expert Comment

by:Mahdi78
ID: 35168816
As jimyX said, only .txt files will read as plain text. If you are happy to see the raw HTML code or the binary content of a PDF, this code does what you want:

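procedure TForm1.ListBox1Click(Sender: TObject);
const MyExt = '.txt.doc.pdf.html.htm.'; // trailing dot delimits the last entry
var
  FileName, Ext: string;
begin
  if ListBox1.ItemIndex < 0 then Exit; // nothing selected
  FileName := ListBox1.Items[ListBox1.ItemIndex];
  Ext := LowerCase(ExtractFileExt(FileName));
  // Appending '.' prevents partial matches such as '.ht' or '.tx'
  if (Ext <> '') and (Pos(Ext + '.', MyExt) > 0) then
    Memo1.Lines.LoadFromFile(FileName);
end;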
 
LVL 32

Expert Comment

by:ewangoya
ID: 35168821

For files like PDF or HTML, you have to know how to remove the tags, but this is not really worth it.
It's much better to use a viewer for each type of file, e.g. TWebBrowser, TOleContainer, TRichEdit, etc.

var
  Buffer: AnsiString; // one byte per character; string is UTF-16 in Delphi XE
  Stream: TFileStream;
begin
  Stream := TFileStream.Create(ListBox.Items[ListBox.ItemIndex],
    fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Buffer, Stream.Size);
    if Stream.Size > 0 then
      Stream.ReadBuffer(Pointer(Buffer)^, Stream.Size);

    // Here you would have to strip out the tags, depending on what type of file you opened

    Memo1.Lines.Add(string(Buffer));
  finally
    FreeAndNil(Stream);
  end;
end;
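To give an idea of what stripping out the tags involves for HTML, here is a deliberately naive sketch. StripTags is a hypothetical helper written for this illustration, not a library routine, and it is not a real HTML parser: it only drops everything between '<' and '>'.

```pascal
// Hypothetical helper: removes everything between '<' and '>' and keeps
// the remaining characters unchanged.
function StripTags(const Html: string): string;
var
  I: Integer;
  InTag: Boolean;
begin
  Result := '';
  InTag := False;
  for I := 1 to Length(Html) do
    if Html[I] = '<' then
      InTag := True            // start skipping at the opening bracket
    else if Html[I] = '>' then
      InTag := False           // resume copying after the closing bracket
    else if not InTag then
      Result := Result + Html[I];
end;
```

This shows why the approach is rarely worth it: it does not decode entities such as &amp;amp;, it keeps the text inside title, script, and style elements, and it does not turn paragraph breaks into line breaks, so adjacent paragraphs run together. PDF and Word files need completely different handling.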
 

Author Comment

by:rincewind666
ID: 35168866
An example for an .htm file:

I need "I want just this text to be extracted" in the memo, using the following HTML file (the same principle applies to .doc and .pdf files: the text without any tags):

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

<head>
<meta content="en-gb" http-equiv="Content-Language" />
<meta content="text/html; charset=utf-8" http-equiv="Content-Type" />
<title>Untitled 1</title>
</head>

<body>

<p>I want just this text</p>
<p><strong>to be extracted</strong></p>

</body>

</html>
 
LVL 24

Expert Comment

by:jimyX
ID: 35168936
There is one component I know which can help you here:
http://atorg.net.ru/delphi/atviewer.htm

It reads all the mentioned formats and even more.
 
LVL 32

Expert Comment

by:ewangoya
ID: 35168960

As I said, this is not trivial: the tags in a PDF file are not the same as the tags in an HTML file. MS Word files are binary, so you won't get much out of them.

You really need a viewer to do this. Just use a TOleContainer; that should be able to display all types of files.

 
LVL 9

Accepted Solution

by:
Mahdi78 earned 500 total points
ID: 35168967
This function will convert HTML code to plain text


uses ActiveX, mshtml, ComObj;

{$R *.dfm}

// Converts HTML markup to plain text by loading it into an MSHTML document
// and reading back the body's innerText.
function HtmlToText(HtmlText: string; TrimStr: Boolean = False): string;
var
  IDoc: IHTMLDocument2;
  Strl: TStringList;
  v:    Variant;
begin
  Strl := TStringList.Create;
  try
    Strl.Text := HtmlText;
    IDoc := CreateComObject(Class_HTMLDocument) as IHTMLDocument2;
    try
      IDoc.designMode := 'on';
      // Pump messages until the document reports that it is ready
      while IDoc.readyState <> 'complete' do
        Application.ProcessMessages;
      // IHTMLDocument2.write expects a SAFEARRAY of variants
      v := VarArrayCreate([0, 0], varVariant);
      v[0] := Strl.Text;
      IDoc.write(PSafeArray(System.TVarData(v).VArray));
      IDoc.designMode := 'off';
      while IDoc.readyState <> 'complete' do
        Application.ProcessMessages;
      if TrimStr then
        Result := Trim(IDoc.body.innerText)
      else
        Result := IDoc.body.innerText;
    finally
      IDoc := nil;
    end;
  finally
    Strl.Free;
  end;
end;
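A possible way to call the function from the list box is sketched below. It assumes the form has ListBox1 and Memo1 as in the earlier replies, that the list items are full file paths, and that the file is UTF-8 encoded (as the sample page in the question declares). None of this is confirmed by the thread, so adjust as needed.

```pascal
procedure TForm1.ListBox1Click(Sender: TObject);
var
  Source: TStringList;
  FileName: string;
begin
  if ListBox1.ItemIndex < 0 then
    Exit; // nothing selected
  FileName := ListBox1.Items[ListBox1.ItemIndex];
  Source := TStringList.Create;
  try
    // Assumption: the file is UTF-8, as declared in the sample page's meta tag
    Source.LoadFromFile(FileName, TEncoding.UTF8);
    // Pass True to trim leading and trailing whitespace from the result
    Memo1.Text := HtmlToText(Source.Text, True);
  finally
    Source.Free;
  end;
end;
```

This only makes sense for .htm and .html files, so the handler should check the extension before calling it.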

 

Author Closing Comment

by:rincewind666
ID: 35187894
Many thanks for your help.
 

Author Comment

by:rincewind666
ID: 35187919
Many thanks for your help.

A note to jimyX: Thanks for the link to the ATViewer component. It looks great and I would have used it, but it is not compatible with Delphi XE. A great shame. Anyway, thanks for your help.