Solved

Extracting text only (500 points)

Posted on 2011-03-18
Last Modified: 2012-05-11
The user clicks on a file in a listbox and, if it is human-readable (.txt, .doc, .htm, .html, .pdf), the text is extracted and displayed in a memo. Just the plain text, without anything else.

Either the code or a component.

I am using Delphi Starter XE.

Many thanks for your help.
Question by:rincewind666
11 Comments
 
LVL 9

Expert Comment

by:Mahdi78
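Try this:

procedure TForm1.ListBox1Click(Sender: TObject);
var
  FileName: string;
begin
  if ListBox1.ItemIndex < 0 then Exit; // nothing selected
  FileName := ListBox1.Items[ListBox1.ItemIndex];
  // SameText makes the extension check case-insensitive (.TXT, .Txt)
  if SameText(ExtractFileExt(FileName), '.txt') then
    Memo1.Lines.LoadFromFile(FileName);
end;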
 
LVL 32

Expert Comment

by:ewangoya
This depends on what type of data you are loading.
If it's a plain .txt file, it's very simple:

var
  List: TStringList;
begin
  List := TStringList.Create;
  try
    // ListBox is the list box holding the file names
    List.LoadFromFile(ListBox.Items[ListBox.ItemIndex]);
    Memo1.Lines.Assign(List);
  finally
    List.Free;
  end;
end;

 
LVL 24

Expert Comment

by:jimyX
Of these formats, only .txt files will be human-readable in the memo. The .doc and .pdf files will not be, and .htm/.html files will show the HTML markup as well as the text.

You need special handling for each file type.
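
A minimal sketch of what per-type handling could look like, dispatching on the file extension. The names TFileKind and FileKindOf are hypothetical (not from this thread), the code assumes SysUtils is in the uses clause, and only the .txt branch actually extracts anything; the other branches are placeholders for a type-specific extractor.

```pascal
// Hypothetical helper: TFileKind and FileKindOf are illustrative names.
type
  TFileKind = (fkPlainText, fkHtml, fkWord, fkPdf, fkUnknown);

function FileKindOf(const FileName: string): TFileKind;
var
  Ext: string;
begin
  // Lower-case the extension so '.TXT' and '.txt' are treated alike
  Ext := LowerCase(ExtractFileExt(FileName));
  if Ext = '.txt' then
    Result := fkPlainText
  else if (Ext = '.htm') or (Ext = '.html') then
    Result := fkHtml
  else if Ext = '.doc' then
    Result := fkWord
  else if Ext = '.pdf' then
    Result := fkPdf
  else
    Result := fkUnknown;
end;

procedure TForm1.ListBox1Click(Sender: TObject);
var
  FileName: string;
begin
  if ListBox1.ItemIndex < 0 then
    Exit; // nothing selected
  FileName := ListBox1.Items[ListBox1.ItemIndex];
  case FileKindOf(FileName) of
    fkPlainText:
      Memo1.Lines.LoadFromFile(FileName);
    fkHtml, fkWord, fkPdf:
      // Placeholder: each of these formats needs its own extractor
      Memo1.Text := 'No text extractor available for this file type yet.';
  else
    Memo1.Clear; // unknown type: show nothing
  end;
end;
```

Keeping the extension test in one function means a new format only needs a new enum value and a new case branch.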
 
LVL 9

Expert Comment

by:Mahdi78
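As jimyX said, only .txt files will read as plain text. If you are happy to see the raw contents instead (the HTML source, or the binary data of a PDF), this code is what you want:

procedure TForm1.ListBox1Click(Sender: TObject);
const
  // every extension is followed by a dot so that only whole extensions match
  MyExt = '.txt.doc.pdf.html.htm.';
var
  Ext: string;
begin
  if ListBox1.ItemIndex < 0 then Exit; // nothing selected
  Ext := LowerCase(ExtractFileExt(ListBox1.Items[ListBox1.ItemIndex]));
  if (Ext <> '') and (Pos(Ext + '.', MyExt) > 0) then
    Memo1.Lines.LoadFromFile(ListBox1.Items[ListBox1.ItemIndex]);
end;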
 
LVL 32

Expert Comment

by:ewangoya

For files like PDF or HTML, you have to know how to remove the tags, but this is not really worth it.
It's much better to use a viewer for each type of file, e.g. TWebBrowser, TOleContainer, TRichEdit, etc.

var
  Buffer: AnsiString; // byte-sized characters; string is UTF-16 in Delphi 2009 and later
  Stream: TFileStream;
begin
  Stream := TFileStream.Create(ListBox.Items[ListBox.ItemIndex], fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Buffer, Stream.Size);
    if Length(Buffer) > 0 then
      Stream.ReadBuffer(Pointer(Buffer)^, Length(Buffer));

    // here you would have to strip out the tags, depending on what type of file you opened

    Memo1.Lines.Add(string(Buffer));
  finally
    FreeAndNil(Stream);
  end;
end;
 

Author Comment

by:rincewind666
An example for an .htm file:

Given the following HTML file, I need "I want just this text to be extracted" in the memo (the same principle applies to .doc and .pdf files: the text only, without any tags):

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

<head>
<meta content="en-gb" http-equiv="Content-Language" />
<meta content="text/html; charset=utf-8" http-equiv="Content-Type" />
<title>Untitled 1</title>
</head>

<body>

<p>I want just this text</p>
<p><strong>to be extracted</strong></p>

</body>

</html>
 
LVL 24

Expert Comment

by:jimyX
There is one component I know of that can help you here:
http://atorg.net.ru/delphi/atviewer.htm

It reads all the mentioned formats and even more.
 
LVL 32

Expert Comment

by:ewangoya

As I said, this is not trivial. The tags in a PDF file are not the same as the tags in an HTML file, and MS Word files are binary, so you won't get much out of them.

You really need a viewer to do this. Just use a TOleContainer; that should be able to display all types of files.
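
For .doc files specifically, one possible alternative to a viewer is to let Word itself hand over the text through OLE automation. The sketch below is not from this thread: WordDocToText is a hypothetical name, and it assumes Microsoft Word is installed on the machine and that FileName is a full path (Word resolves relative paths against its own working directory).

```pascal
uses ComObj, Variants;

// Hypothetical helper; requires an installed copy of Microsoft Word.
function WordDocToText(const FileName: string): string;
var
  Word, Doc: OleVariant;
begin
  Result := '';
  // Raises an exception if Word is not registered on this machine
  Word := CreateOleObject('Word.Application');
  try
    Word.Visible := False;
    // Positional arguments: FileName, ConfirmConversions, ReadOnly
    Doc := Word.Documents.Open(FileName, False, True);
    try
      Result := Doc.Content.Text;
    finally
      Doc.Close(False); // close without saving changes
    end;
  finally
    Word.Quit;
    Word := Unassigned; // release the automation object
  end;
end;
```

Word separates paragraphs with a bare carriage return, so passing the result through AdjustLineBreaks before assigning it to Memo1.Text is probably needed for the lines to display properly. Starting Word for every click is slow; this is a sketch of the idea rather than something tuned for browsing many files.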

 
 
LVL 9

Accepted Solution

by:
Mahdi78 earned 500 total points
This function will convert HTML code to plain text


uses ActiveX, mshtml, ComObj;

{$R *.dfm}

function HtmltoText(HtmlText: string; TrimStr: Boolean = False ): string;
var
  IDoc:      IHTMLDocument2;
  Strl:      TStringList;
  v:         Variant;
begin
  Strl := TStringList.Create;
  try
    Strl.Text := HtmlText;
    // create an in-memory MSHTML document; no visible browser is needed
    IDoc := CreateComObject(Class_HTMLDocument) as IHTMLDocument2;
    try
      IDoc.designMode := 'on';
      // pump messages until the document reports that it is ready
      while IDoc.readyState <> 'complete' do
        Application.ProcessMessages;
      // IHTMLDocument2.write expects a SAFEARRAY of variants
      v := VarArrayCreate([0, 0], varVariant);
      v[0] := Strl.Text;
      IDoc.write(PSafeArray(System.TVarData(v).VArray));
      IDoc.designMode := 'off';
      while IDoc.readyState <> 'complete' do
        Application.ProcessMessages;
      // innerText is the text of the body without the markup
      if TrimStr then
        Result := Trim(IDoc.body.innerText)
      else
        Result := IDoc.body.innerText;
    finally
      IDoc := nil;
    end;
  finally
    Strl.Free;
  end;
end;
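
A possible way to wire the HtmltoText function above to a file, as a sketch only. LoadHtmlFileAsText is a hypothetical name, and the code assumes the file is UTF-8 encoded (as the sample page in this thread declares); a file in another encoding would need a different TEncoding.

```pascal
// Hypothetical helper: reads an HTML file and puts its plain text into Lines.
// Assumes SysUtils and Classes are in the uses clause and the file is UTF-8.
procedure LoadHtmlFileAsText(const FileName: string; Lines: TStrings);
var
  Html: TStringList;
begin
  Html := TStringList.Create;
  try
    // Decode the file explicitly rather than relying on the ANSI default
    Html.LoadFromFile(FileName, TEncoding.UTF8);
    // True asks HtmltoText to trim leading and trailing whitespace
    Lines.Text := HtmltoText(Html.Text, True);
  finally
    Html.Free;
  end;
end;
```

From the list box handler this could be called as LoadHtmlFileAsText(ListBox1.Items[ListBox1.ItemIndex], Memo1.Lines) once the extension has been checked for .htm or .html.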

 

Author Closing Comment

by:rincewind666
Many thanks for your help.
 

Author Comment

by:rincewind666
Many thanks for your help.

A note to jimyX: Thanks for the link to the ATViewer component. It looks great and I would have used it, but it is not compatible with Delphi XE. A great shame. Anyway, thanks for your help.