Solved

Reading a PDF file Stored in BLOB table Column as Text and display same

Posted on 2016-10-14
14
55 Views
Last Modified: 2016-11-05
Hi I would like to be able to read a PDF file which is stored in a BLOB column on my Oracle Table. Is it possible, using either SQL or PL/SQL to extract the data as text for display or storing?

Thanks
0
Comment
Question by:Des Des
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 5
  • 5
  • 2
14 Comments
 
LVL 77

Expert Comment

by:slightwv (䄆 Netminder)
ID: 41844628
Sorry but the question is still too vague.

When you say "display" do you mean with all the formatting?
Displayed in what program or application?
Is the PDF actual text based or is it more of an image with text as part of the picture and you need OCR ability?
Storing is completely different.  Stored as what?
What versions of PDF do you need the ability to handle?
What version of Oracle database can you use?

All that said, you might take a look at using Oracle Text and CTX_DOC.FILTER to extract text from a PDF:
http://docs.oracle.com/cd/E11882_01/text.112/e24436/cdocpkg.htm#CCREF0702

It won't handle the formatting and is limited to PDF version based on the Oracle database version.

There are probably some 3rd party products that offer some alternatives but what ones would be based on your exact requirements.
0
 
LVL 35

Expert Comment

by:johnsone
ID: 41844919
The CTX_DOC.FILTER procedure requires that a Oracle Text index be created for the column.  If you don't need that, look at the CTX_DOC.POLICY_FILTER procedure which does not require a Text index.  Doc for that is -> http://docs.oracle.com/cd/E11882_01/text.112/e24436/cdocpkg.htm#CCREF2116  Likely a few pages below the reference that was posted for CTX_DOC.FILTER.
0
 
LVL 77

Expert Comment

by:slightwv (䄆 Netminder)
ID: 41845269
Yes, I agree that the policy_filter is probably better if you don't need the Text index.
0
PeopleSoft Has Never Been Easier

PeopleSoft Adoption Made Smooth & Simple!

On-The-Job Training Is made Intuitive & Easy With WalkMe's On-Screen Guidance Tool.  Claim Your Free WalkMe Account Now

 

Author Comment

by:Des Des
ID: 41846223
Apologies, could not get logged in over the weekend. The PDF file is text based and not very large. Each archived record is stored in a PDF, which is stored in a BLOB column. I just want to extract the data from the PDF and display, no formatting present. I will attempt to use the policy_filter option suggested above and let you know how I get on.

Thanks in anticipation.

Des
0
 

Author Comment

by:Des Des
ID: 41846355
Would it be possible to get an example of using this approach. I have a table called archived_logs and inside this table a column called documents which is BLOB and contains an individual PDF file. Each of these files are basic text. I have tried to use the policy_filter function however I don't fully understand the parameters used in the call.  For example where do I create the policy, the first parameter and how do I store the result from the call to the function?

Thanks
0
 
LVL 77

Assisted Solution

by:slightwv (䄆 Netminder)
slightwv (䄆 Netminder) earned 400 total points (awarded by participants)
ID: 41846625
Below is the simplest example I can think of.

What we cannot answer for you is all the different options you have with Oracle Text.

Do you need to have special options set for the LEXER or WORDLIST?

You can read up on everything here:
http://docs.oracle.com/cd/E11882_01/text.112/e24436/cdatadic.htm#CCREF0200

The important part of the policy is the inso_filter.
--do once and forget about it
exec CTX_DDL.CREATE_POLICY( policy_name => 'pdf_policy', filter => 'ctxsys.inso_filter');


declare
	v_result	clob;
	v_pdf		blob;
begin
	--get a PDF into a variable
	select mypdf into v_pdf from bob where rownum=1;
	ctx_doc.policy_filter(policy_name => 'pdf_policy',
                      document => v_pdf,
                      restab => v_result,
                      plaintext => true);

	dbms_output.put_line(v_result);
end;
/

Open in new window

0
 

Author Comment

by:Des Des
ID: 41846921
Thank you very much for the code snippet above, appreciate it. I got the code to compile but when I tried to execute I could not see the results in the DBMS_OUTPUT window. I decided to move it into a table, CLOB column, and after I ran I just get the message inside the column, BIGCLOB. Not sure where to go, any advice would be great as I am new to this type of processing.

Thanks
0
 
LVL 77

Expert Comment

by:slightwv (䄆 Netminder)
ID: 41846927
What tool are you using to access the database?

If sqlplus or SQL Developer make sure to enable server output:
set severoutput on

Note that SQL Developer has a menu option for this as well but I don't use it so not sure exactly where it is.  SQL Developer should accept the set command and the output is in it's own window.

>>Not sure where to go, any advice would be great as I am new to this type of processing.

Not sure I can answer this for you.

What do you want to display it in?  I'm guessing you want it parsed out for some specific application and/or reason.
0
 

Author Comment

by:Des Des
ID: 41847012
Hi, I am using Toad and have the DBMS_OUTPUT window enabled. It tries to show details but nothing appears.  I am using Oracle 11g and Toad 10. The details are old receipts from our club, which were stored as PDF's and I need to generate a report showing the details from each of the receipts.

I only want the output, like an SQL simple report but can put the code inside a Stored Procedure if required.

Thanks
0
 

Author Comment

by:Des Des
ID: 41847043
When I use a simple select statement SELECT payment_col FROM receipts WHERE rownum = 1; I get HugeClob displayed in the result set.
0
 
LVL 77

Accepted Solution

by:
slightwv (䄆 Netminder) earned 400 total points (awarded by participants)
ID: 41847051
>>The details are old receipts from our club, which were stored as PDF's

When I read "old receipts" I'm thinking scanned images.  You cannot extract the text from an image without an OCR program.  Oracle filtering won't do that.

If you open the PDF can you manually copy text and paste it into Notepad?  If not, it isn't a text-based PDF.
0
 
LVL 35

Assisted Solution

by:johnsone
johnsone earned 100 total points (awarded by participants)
ID: 41847148
My guess is also images rather than text.

It took me a while to find a document that was usable and set up Oracle Text (last time I did that was 9i and a very long time ago), but the sample code that slightwv posted should work.  The only issue I found was that if the document is large enough, the DBMS_OUTPUT will fail with an error.  Putting a substring on the clob to only grab the first 4000 characters did the trick to at least prove it works.

So, to put the text into a temporary location, create a global temporary table (only have to do this once):
CREATE global TEMPORARY TABLE document_text 
  ( 
     doc_text CLOB 
  ) 
ON COMMIT DELETE ROWS; 

Open in new window

Then run the PL/SQL to put the text into the temp table:
DECLARE 
    v_result CLOB; 
    v_pdf    BLOB; 
BEGIN 
    --get a PDF into a variable 
    SELECT document 
    INTO   v_pdf 
    FROM   archived_logs 
    WHERE  ...; 

    ctx_doc.Policy_filter(policy_name => 'pdf_policy', document => v_pdf, 
    restab => v_result, plaintext => TRUE); 

    INSERT INTO document_text 
    VALUES      (v_result); 
END; 

/ 

Open in new window

Ideally, I would turn that anonymous block into a procedure and send some sort of key value to it.  Then the result should be the only row in the temporary table.  After a commit, the row should disappear.
0

Featured Post

Revamp Your Training Process

Drastically shorten your training time with WalkMe's advanced online training solution that Guides your trainees to action.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Note: this article covers simple compression. Oracle introduced in version 11g release 2 a new feature called Advanced Compression which is not covered here. General principle of Oracle compression Oracle compression is a way of reducing the d…
How to Create User-Defined Aggregates in Oracle Before we begin creating these things, what are user-defined aggregates?  They are a feature introduced in Oracle 9i that allows a developer to create his or her own functions like "SUM", "AVG", and…
Via a live example, show how to restore a database from backup after a simulated disk failure using RMAN.
This video shows how to copy an entire tablespace from one database to another database using Transportable Tablespace functionality.

739 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question