Solved

Extracting text only (500 points)

Posted on 2011-03-18
Last Modified: 2012-05-11
The user clicks a file in a listbox and, if it is a human-readable type (.txt, .doc, .htm, .html, .pdf), its text is extracted and displayed in a memo. Just the plain text, without anything else.

Either the code or a component.

I am using Delphi Starter XE.

Many thanks for your help.
Question by:rincewind666
11 Comments
 
LVL 9

Expert Comment

by:Mahdi78
ID: 35168724
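Try this:

procedure TForm1.ListBox1Click(Sender: TObject);
var
  FileName: string;
begin
  if ListBox1.ItemIndex < 0 then Exit; // nothing selected
  FileName := ListBox1.Items[ListBox1.ItemIndex];
  // Compare the extension case-insensitively so '.TXT' also matches
  if SameText(ExtractFileExt(FileName), '.txt') then
    Memo1.Lines.LoadFromFile(FileName);
end;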
 
LVL 32

Expert Comment

by:ewangoya
ID: 35168770
This depends on what type of data you are loading.
If it's a plain .txt file, then it's very simple.
Text file:

var
  List: TStringList;
begin
  List := TStringList.Create;
  try
    List.LoadFromFile(ListBox.Items[ListBox.ItemIndex]);
    Memo1.Lines.Assign(List);
  finally
    List.Free;
  end;
end;

 
LVL 24

Expert Comment

by:jimyX
ID: 35168791
Only .txt files will be human-readable in the Memo. .doc and .pdf files will not be, and .htm/.html files will show the HTML markup along with the text.

You need special handling for each file type.
 
LVL 9

Expert Comment

by:Mahdi78
ID: 35168816
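As jimyX said, only .txt files will load as plain text. If you are happy to see the raw HTML source or the binary content of a PDF, this code will do that:

procedure TForm1.ListBox1Click(Sender: TObject);
const
  MyExt: array[0..4] of string = ('.txt', '.doc', '.pdf', '.html', '.htm');
var
  FileName: string;
begin
  if ListBox1.ItemIndex < 0 then Exit; // nothing selected
  FileName := ListBox1.Items[ListBox1.ItemIndex];
  // MatchText (unit StrUtils) is an exact, case-insensitive match; Pos() on a joined string would also accept '.ht'
  if MatchText(ExtractFileExt(FileName), MyExt) then
    Memo1.Lines.LoadFromFile(FileName);
end;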
 
LVL 32

Expert Comment

by:ewangoya
ID: 35168821

For files like PDF or HTML, you have to know how to remove the tags, but this is not really worth it.
It's much better to use a viewer for each type of file, e.g. TWebBrowser, TOleContainer, TRichEdit, etc.

var
  Buffer: AnsiString; // one byte per character; string is UTF-16 in Delphi 2009 and later
  Stream: TFileStream;
begin
  Stream := TFileStream.Create(ListBox.Items[ListBox.ItemIndex], fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Buffer, Stream.Size);
    if Stream.Size > 0 then
      Stream.ReadBuffer(Pointer(Buffer)^, Stream.Size);

    // Here you would have to strip out the tags, depending on what type of file you opened

    Memo1.Lines.Add(string(Buffer));
  finally
    FreeAndNil(Stream);
  end;
end;
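
To give an idea of what "strip out the tags" means for the HTML case, here is a minimal sketch. StripTags is a hypothetical helper, not part of the VCL, and it is deliberately naive: it drops everything between '<' and '>' and nothing else. It does not decode entities such as &amp;, it keeps the text of <title>, <script> and <style> elements, and it does nothing useful for .doc or .pdf files.

```pascal
// Hypothetical helper: removes anything between '<' and '>'.
// A rough illustration only, not a real HTML parser.
function StripTags(const S: string): string;
var
  I: Integer;
  InTag: Boolean;
begin
  Result := '';
  InTag := False;
  for I := 1 to Length(S) do
    case S[I] of
      '<': InTag := True;   // a tag starts, stop copying
      '>': InTag := False;  // the tag ends, resume copying
    else
      if not InTag then
        Result := Result + S[I];
    end;
end;
```

In the snippet above you would call it as Memo1.Lines.Text := StripTags(string(Buffer)); the result still contains the blank lines and the title text from the source file, which is one reason a viewer or a proper parser is the better route.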

Author Comment

by:rincewind666
ID: 35168866
An example for a htm file:

Given the following HTML file, I need only "I want just this text to be extracted" to appear in the memo (the same principle applies to .doc and .pdf files: the text, without any tags):

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

<head>
<meta content="en-gb" http-equiv="Content-Language" />
<meta content="text/html; charset=utf-8" http-equiv="Content-Type" />
<title>Untitled 1</title>
</head>

<body>

<p>I want just this text</p>
<p><strong>to be extracted</strong></p>

</body>

</html>
 
LVL 24

Expert Comment

by:jimyX
ID: 35168936
There is one component I know which can help you here:
http://atorg.net.ru/delphi/atviewer.htm

It reads all the mentioned formats and even more.
 
LVL 32

Expert Comment

by:ewangoya
ID: 35168960

As I said, this is not trivial. The tags in a PDF file are not the same as the tags in an HTML file, and MS Word files are binary, so you won't get much out of them by reading them as text.

You really need a viewer to do this. Just use a TOleContainer; that should be able to display these types of files, provided an application that can open them is installed.

 
 
LVL 9

Accepted Solution

by:
Mahdi78 earned 500 total points
ID: 35168967
This function will convert HTML code to plain text


uses ActiveX, mshtml, ComObj;

{$R *.dfm}

function HtmltoText(HtmlText: string; TrimStr: Boolean = False ): string;
var
  IDoc:      IHTMLDocument2;
  Strl:      TStringList;
  v:         Variant;
begin
    Strl := TStringList.Create;
    try
      Strl.Text := HtmlText;
      Idoc:=CreateComObject(Class_HTMLDOcument) as IHTMLDocument2;
      try
        IDoc.designMode:='on';
        while IDoc.readyState<>'complete' do
          Application.ProcessMessages;
        v:=VarArrayCreate([0,0],VarVariant);
        v[0]:= Strl.Text;
        IDoc.write(PSafeArray(System.TVarData(v).VArray));
        IDoc.designMode:='off';
        while IDoc.readyState<>'complete' do
          Application.ProcessMessages;
        if Trimstr then result := Trim(IDoc.body.innerText)
        else Result := IDoc.body.innerText;
      finally
        IDoc := nil;
      end;
    finally
      Strl.Free;
    end;
end;
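
For completeness, here is a rough sketch of how the function above might be wired to the listbox from the question. It assumes the component names ListBox1 and Memo1 used earlier in the thread, that each list item holds a full file path, and that the HTML files are UTF-8 encoded (as the sample file declares); none of that is confirmed by the question. It only handles .txt and .htm/.html, since HtmltoText does nothing for .doc or .pdf.

```pascal
// Sketch only: assumes ListBox1 items are full file paths and
// that HtmltoText is declared earlier in the same unit.
procedure TForm1.ListBox1Click(Sender: TObject);
var
  FileName, Ext: string;
  Html: TStringList;
begin
  if ListBox1.ItemIndex < 0 then Exit; // nothing selected
  FileName := ListBox1.Items[ListBox1.ItemIndex];
  Ext := ExtractFileExt(FileName);

  if SameText(Ext, '.htm') or SameText(Ext, '.html') then
  begin
    Html := TStringList.Create;
    try
      // Assumption: the file is UTF-8; change the encoding if yours differ
      Html.LoadFromFile(FileName, TEncoding.UTF8);
      Memo1.Lines.Text := HtmltoText(Html.Text, True);
    finally
      Html.Free;
    end;
  end
  else if SameText(Ext, '.txt') then
    Memo1.Lines.LoadFromFile(FileName)
  else
    Memo1.Clear; // .doc and .pdf need a different approach
end;
```

Because HtmltoText relies on MSHTML, this only works on Windows with Internet Explorer's HTML engine available.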

 

Author Closing Comment

by:rincewind666
ID: 35187894
Many thanks for your help.
 

Author Comment

by:rincewind666
ID: 35187919
Many thanks for your help.

A note to jimyX: Thanks for the link to the ATViewer component. It looks great and I would have used it, but it is not compatible with Delphi XE. A great shame. Anyway, thanks for your help.