rincewind666
asked on
Extracting text only (500 points)
The user clicks on a list of files in a listbox and, if it is human readable (.txt, .doc, .htm, html, pdf), the text is extracted and displayed in a memo. Just the plain text without anything else.
Either the code or a component.
I am using Delphi Starter XE.
Many thanks for your help.
This depends on what type of data you are loading.
If it's a simple txt file, then it's very simple:
Text file
var
  List: TStringList;
begin
  List := TStringList.Create;
  try
    // Load the file selected in the list box and show it in the memo
    List.LoadFromFile(ListBox.Items[ListBox.ItemIndex]);
    Memo1.Lines.Assign(List);
  finally
    List.Free;
  end;
end;
Only txt files will be human-readable in the memo. .doc and .pdf files will not be, and htm/html files will show the HTML markup as well.
You need special handling for each file type.
As jimyx said, only txt files will load as plain text. If you are happy to see the raw HTML source, or the binary content of a pdf, this code does exactly that:
procedure TForm1.ListBox1Click(Sender: TObject);
const MyExt = '.txt.doc.pdf.html.htm';
begin
  if Pos(ExtractFileExt(ListBox1.Items[ListBox1.ItemIndex]), MyExt) > 0 then
    Memo1.Lines.LoadFromFile(ListBox1.Items[ListBox1.ItemIndex]);
end;
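A side note on the extension test above: Pos does a case-sensitive substring search, so 'notes.TXT' is rejected while a partial extension such as '.ht' is accepted. Below is a minimal sketch of an exact, case-insensitive check. The function name IsSupportedExt and the sample file names are made up for illustration; only ExtractFileExt and SameText from SysUtils are assumed.

```pascal
program ExtCheckSketch;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper (not from the thread): True only when the file's
// extension is exactly one of the listed ones, ignoring case.
function IsSupportedExt(const FileName: string): Boolean;
const
  Supported: array[0..4] of string =
    ('.txt', '.doc', '.pdf', '.html', '.htm');
var
  Ext: string;
  I: Integer;
begin
  Result := False;
  Ext := ExtractFileExt(FileName);   // includes the leading dot
  for I := Low(Supported) to High(Supported) do
    if SameText(Ext, Supported[I]) then
    begin
      Result := True;
      Exit;
    end;
end;

begin
  Writeln(IsSupportedExt('notes.TXT'));  // upper-case extension is accepted
  Writeln(IsSupportedExt('page.ht'));    // partial extension is rejected
end.
```

In the click handler, the Pos(...) > 0 condition could then be swapped for a call to this helper with the selected item.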
For files like pdf or html, you have to know how to remove the tags, but this is not really worth it.
It's much better to use a viewer for each type of file, e.g. TWebBrowser, TOleContainer, TRichEdit, etc.
var
Buffer: string;
Stream: TFileStream;
I: Integer;
begin
Stream := TFileStream.Create(ListBox
try
SetLength(Buffer, Stream.Size);
Stream.Read(Pointer(Buffer
// Here you would have to strip out the tags, depending on what type of file you opened
memo1.Lines.Add(Buffer);
finally
FreeAndNil(Stream);
end;
ASKER
An example for a htm file:
I need "I want just this text to be extracted" in the memo, using the following html file (the same principle applies to doc and pdf files: the text without any tags):
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="en-gb" http-equiv="Content-Language" />
<meta content="text/html; charset=utf-8" http-equiv="Content-Type" />
<title>Untitled 1</title>
</head>
<body>
<p>I want just this text</p>
<p><strong>to be extracted</strong></p>
</body>
</html>
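For the HTML case only, a rough sketch of how the wanted text could be pulled out of a file like the one above. It is deliberately naive: it keeps what lies between the body tags, drops everything inside angle brackets, and collapses whitespace. It does not handle script or style blocks, comments, or entities such as &amp;amp;, and it does nothing for doc or pdf files. The name ExtractBodyText and the shortened sample string are assumptions made for illustration.

```pascal
program StripTagsSketch;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper (not from the thread): returns the text between
// <body ...> and </body> with tags removed and whitespace runs collapsed.
// Every tag counts as a word break, so 'wo<b>rd</b>' becomes 'wo rd'.
function ExtractBodyText(const Html: string): string;
var
  Lower, Body: string;
  StartPos, EndPos, I: Integer;
  InTag, PendingSpace: Boolean;
  Ch: Char;
begin
  Result := '';
  Body := Html;
  Lower := LowerCase(Html);            // same length, so positions match
  StartPos := Pos('<body', Lower);
  if StartPos > 0 then
  begin
    EndPos := Pos('</body', Lower);
    if EndPos = 0 then
      EndPos := Length(Html) + 1;      // no closing tag: take the rest
    Body := Copy(Html, StartPos, EndPos - StartPos);
  end;

  InTag := False;
  PendingSpace := False;
  for I := 1 to Length(Body) do
  begin
    Ch := Body[I];
    if Ch = '<' then
    begin
      InTag := True;
      PendingSpace := True;
    end
    else if Ch = '>' then
      InTag := False
    else if not InTag then
    begin
      if Ch <= ' ' then
        PendingSpace := True           // spaces, tabs, line breaks
      else
      begin
        if PendingSpace and (Result <> '') then
          Result := Result + ' ';
        PendingSpace := False;
        Result := Result + Ch;
      end;
    end;
  end;
end;

const
  // Shortened version of the sample file from the question
  Sample =
    '<html><head><title>Untitled 1</title></head>' + #13#10 +
    '<body>' + #13#10 +
    '<p>I want just this text</p>' + #13#10 +
    '<p><strong>to be extracted</strong></p>' + #13#10 +
    '</body></html>';

begin
  Writeln(ExtractBodyText(Sample));
end.
```

For the sample this prints "I want just this text to be extracted"; the title is skipped because it sits outside the body. In the form, the file could be loaded into a TStringList as in the first answer and the result assigned to Memo1.Text.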
There is one component I know which can help you here:
http://atorg.net.ru/delphi/atviewer.htm
It reads all the mentioned formats and even more.
As I said, this is not trivial: the tags in a PDF file are not the same as the tags in an HTML file, and MS Word files are binary, so you won't get much out of them.
You really need a viewer to do this. Just use a TOleContainer; that should be able to display all of these file types.
ASKER CERTIFIED SOLUTION
ASKER
Many thanks for your help.
A note to jimyx: Thanks for the link to the atviewer component. It looks great and I would have used it, but it is not compatible with Delphi XE. A great shame. Anyway, thanks for your help.
procedure TForm1.ListBox1Click(Sende
begin
if ExtractFileExt(ListBox1.It
Memo1.Lines.LoadFromFile(L
end;