Solved

Extracting text only (500 points)

Posted on 2011-03-18
Last Modified: 2012-05-11
The user clicks a file in a listbox and, if the file is human-readable (.txt, .doc, .htm, .html, .pdf), its text is extracted and displayed in a memo: just the plain text, without anything else.

Either the code or a component.

I am using Delphi Starter XE.

Many thanks for your help.
Question by:rincewind666
11 Comments
 
LVL 9

Expert Comment

by:Mahdi78
ID: 35168724
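Try this:

procedure TForm1.ListBox1Click(Sender: TObject);
var
  FileName: string;
begin
  if ListBox1.ItemIndex < 0 then Exit; // nothing selected
  FileName := ListBox1.Items[ListBox1.ItemIndex];
  // SameText makes the extension check case-insensitive
  if SameText(ExtractFileExt(FileName), '.txt') then
    Memo1.Lines.LoadFromFile(FileName);
end;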
 
LVL 32

Expert Comment

by:Ephraim Wangoya
ID: 35168770
This depends on what type of data you are loading.
If it's a simple .txt file, then it's very simple.
Text file:

var
  List: TStringList;
begin
  List := TStringList.Create;
  try
    List.LoadFromFile(ListBox.Items[ListBox.ItemIndex]);
    Memo1.Lines.Assign(List);
  finally
    List.Free;
  end;
end;

 
LVL 24

Expert Comment

by:jimyX
ID: 35168791
Of these, only .txt files are readable as-is in a memo. .doc and .pdf files will not be, and .htm/.html files will show the HTML markup as well.

You need special handling for each file type.
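
To make "special handling for each file type" concrete, here is a minimal sketch of the dispatch step only. TFileKind and ClassifyFile are hypothetical names introduced for illustration, not part of the VCL; the function just maps the extension to the kind of extraction the file would need.

```pascal
uses
  SysUtils;

type
  // Hypothetical names, introduced for this sketch only
  TFileKind = (fkPlainText, fkHtml, fkWord, fkPdf, fkUnsupported);

// Maps a file name to the kind of text extraction it would need.
function ClassifyFile(const FileName: string): TFileKind;
var
  Ext: string;
begin
  // Lower-case the extension so 'Report.PDF' and 'report.pdf' are treated alike
  Ext := LowerCase(ExtractFileExt(FileName));
  if Ext = '.txt' then
    Result := fkPlainText
  else if (Ext = '.htm') or (Ext = '.html') then
    Result := fkHtml
  else if Ext = '.doc' then
    Result := fkWord
  else if Ext = '.pdf' then
    Result := fkPdf
  else
    Result := fkUnsupported;
end;
```

In the listbox click handler, a case statement on ClassifyFile(FileName) could then call a separate routine per kind, with fkPlainText simply going to Memo1.Lines.LoadFromFile.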
 
LVL 9

Expert Comment

by:Mahdi78
ID: 35168816
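As jimyX said, only .txt files will load as plain text. If you are happy to see the raw HTML source or the binary content of a PDF, this code will do:

procedure TForm1.ListBox1Click(Sender: TObject);
const MyExt = '.txt.doc.pdf.html.htm.'; // trailing dot closes the last entry
var
  FileName, Ext: string;
begin
  if ListBox1.ItemIndex < 0 then Exit; // nothing selected
  FileName := ListBox1.Items[ListBox1.ItemIndex];
  Ext := LowerCase(ExtractFileExt(FileName));
  // Ext + '.' must match a whole entry, so '.ht' does not match '.htm'
  if (Ext <> '') and (Pos(Ext + '.', MyExt) > 0) then
    Memo1.Lines.LoadFromFile(FileName);
end;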
 
LVL 32

Expert Comment

by:Ephraim Wangoya
ID: 35168821

For files like PDF or HTML, you have to know how to remove the tags, but this is not really worth it.
It's much better to use a viewer for each type of file, e.g. TWebBrowser, TOleContainer, TRichEdit, etc.

var
  Buffer: AnsiString; // one byte per character; a Unicode string would misread the raw bytes
  Stream: TFileStream;
begin
  Stream := TFileStream.Create(ListBox.Items[ListBox.ItemIndex], fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Buffer, Stream.Size);
    if Length(Buffer) > 0 then
      Stream.ReadBuffer(Pointer(Buffer)^, Length(Buffer));

    // here you would have to strip out the tags, depending on what type of file you opened

    Memo1.Lines.Add(string(Buffer));
  finally
    FreeAndNil(Stream);
  end;
end;
 

Author Comment

by:rincewind666
ID: 35168866
An example for a htm file:

I need "I want just this text to be extracted" in the memo, given the following HTML file (the same principle applies to .doc and .pdf files: the text without any tags):

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

<head>
<meta content="en-gb" http-equiv="Content-Language" />
<meta content="text/html; charset=utf-8" http-equiv="Content-Type" />
<title>Untitled 1</title>
</head>

<body>

<p>I want just this text</p>
<p><strong>to be extracted</strong></p>

</body>

</html>
 
LVL 24

Expert Comment

by:jimyX
ID: 35168936
There is one component I know which can help you here:
http://atorg.net.ru/delphi/atviewer.htm

It reads all the mentioned formats and even more.
 
LVL 32

Expert Comment

by:Ephraim Wangoya
ID: 35168960

As I said, this is not trivial: the tags in a PDF file are not the same as the tags in an HTML file. MS Word files are binary, so you won't get much out of them by reading them directly.

You really need a viewer to do this. Just use a TOleContainer; it should be able to display any file type that has an OLE server registered on the machine.
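
If the goal is the text of a .doc file rather than a view of it, one option nobody in this thread has tested is to let Word itself read the file through OLE automation. The sketch below assumes Microsoft Word is installed on the machine and that the ComObj unit is available in your edition; DocToText is a hypothetical helper name, and FileName should be a full path.

```pascal
uses
  ComObj, Variants;

// Hypothetical helper: returns the plain text of a Word document.
// Assumes Microsoft Word is installed; CreateOleObject raises an
// exception if the 'Word.Application' server is not registered.
function DocToText(const FileName: string): string;
var
  Word, Doc: OleVariant;
begin
  Word := CreateOleObject('Word.Application');
  try
    Word.Visible := False;                 // keep Word in the background
    Doc := Word.Documents.Open(FileName);  // late-bound call into Word
    try
      Result := Doc.Content.Text;          // whole document as plain text
    finally
      Doc.Close(False);                    // False = do not save changes
    end;
  finally
    Word.Quit;
    Word := Unassigned;                    // release the COM reference
  end;
end;
```

Word separates paragraphs with a bare carriage return, so the result probably needs something like AdjustLineBreaks (SysUtils) before it is assigned to Memo1.Text. Starting Word is also slow, so this is better suited to an occasional click than to scanning a whole folder.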

 
 
LVL 9

Accepted Solution

by:
Mahdi78 earned 500 total points
ID: 35168967
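This function will convert HTML code to plain text:


// Application, TStringList and VarArrayCreate come from Forms, Classes and
// Variants, which a new form unit already lists in its interface uses clause.
uses ActiveX, mshtml, ComObj;

{$R *.dfm}

function HtmltoText(HtmlText: string; TrimStr: Boolean = False): string;
var
  IDoc: IHTMLDocument2;
  Strl: TStringList;
  v:    Variant;
begin
  Strl := TStringList.Create;
  try
    Strl.Text := HtmlText; // normalizes the line breaks of the input
    // Create an in-memory MSHTML document; no visible browser is needed
    IDoc := CreateComObject(CLASS_HTMLDocument) as IHTMLDocument2;
    try
      IDoc.designMode := 'on';
      while IDoc.readyState <> 'complete' do
        Application.ProcessMessages;
      // IHTMLDocument2.write expects a SAFEARRAY of variants
      v := VarArrayCreate([0, 0], varVariant);
      v[0] := Strl.Text;
      IDoc.write(PSafeArray(System.TVarData(v).VArray));
      IDoc.designMode := 'off';
      while IDoc.readyState <> 'complete' do
        Application.ProcessMessages;
      // innerText is the rendered text of the body, without tags
      if TrimStr then
        Result := Trim(IDoc.body.innerText)
      else
        Result := IDoc.body.innerText;
    finally
      IDoc := nil; // release the COM document
    end;
  finally
    Strl.Free;
  end;
end;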
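
A possible way to call the function from the listbox, as a sketch only: it assumes the form has ListBox1 and Memo1 as in the earlier comments, that the list items are full file paths, and that the HTML file is UTF-8 encoded as the sample's meta tag declares.

```pascal
procedure TForm1.ListBox1Click(Sender: TObject);
var
  FileName, Ext: string;
  Source: TStringList;
begin
  if ListBox1.ItemIndex < 0 then
    Exit; // nothing selected
  FileName := ListBox1.Items[ListBox1.ItemIndex];
  Ext := LowerCase(ExtractFileExt(FileName));
  if (Ext = '.htm') or (Ext = '.html') then
  begin
    Source := TStringList.Create;
    try
      // Assumption: the file is UTF-8; pick another TEncoding otherwise
      Source.LoadFromFile(FileName, TEncoding.UTF8);
      // Strip the markup and show only the body text
      Memo1.Text := HtmltoText(Source.Text, True);
    finally
      Source.Free;
    end;
  end
  else if Ext = '.txt' then
    Memo1.Lines.LoadFromFile(FileName); // plain text needs no conversion
end;
```

This covers only .txt and .htm/.html; .doc and .pdf would still need their own handling.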

 

Author Closing Comment

by:rincewind666
ID: 35187894
Many thanks for your help.
 

Author Comment

by:rincewind666
ID: 35187919
Many thanks for your help.

A note to jimyX: Thanks for the link to the ATViewer component. It looks great and I would have used it, but it is not compatible with Delphi XE. A great shame. Anyway, thanks for your help.