Reading a PDF file Stored in BLOB table Column as Text and display same

Hi I would like to be able to read a PDF file which is stored in a BLOB column on my Oracle Table. Is it possible, using either SQL or PL/SQL to extract the data as text for display or storing?

Thanks
Des DesAsked:
Who is Participating?
 
slightwv (䄆 Netminder)Connect With a Mentor Commented:
>>The details are old receipts from our club, which were stored as PDF's

When I read "old receipts" I'm thinking scanned images.  You cannot extract the text from an image without an OCR program.  Oracle filtering won't do that.

If you open the PDF can you manually copy text and paste it into Notepad?  If not, it isn't a text-based PDF.
0
 
slightwv (䄆 Netminder) Commented:
Sorry but the question is still too vague.

When you say "display" do you mean with all the formatting?
Displayed in what program or application?
Is the PDF actual text based or is it more of an image with text as part of the picture and you need OCR ability?
Storing is completely different.  Stored as what?
What versions of PDF do you need the ability to handle?
What version of Oracle database can you use?

All that said, you might take a look at using Oracle Text and CTX_DOC.FILTER to extract text from a PDF:
http://docs.oracle.com/cd/E11882_01/text.112/e24436/cdocpkg.htm#CCREF0702

It won't handle the formatting and is limited to PDF version based on the Oracle database version.

There are probably some 3rd party products that offer some alternatives but what ones would be based on your exact requirements.
0
 
johnsoneSenior Oracle DBACommented:
The CTX_DOC.FILTER procedure requires that a Oracle Text index be created for the column.  If you don't need that, look at the CTX_DOC.POLICY_FILTER procedure which does not require a Text index.  Doc for that is -> http://docs.oracle.com/cd/E11882_01/text.112/e24436/cdocpkg.htm#CCREF2116  Likely a few pages below the reference that was posted for CTX_DOC.FILTER.
0
Free Tool: Path Explorer

An intuitive utility to help find the CSS path to UI elements on a webpage. These paths are used frequently in a variety of front-end development and QA automation tasks.

One of a set of tools we're offering as a way of saying thank you for being a part of the community.

 
slightwv (䄆 Netminder) Commented:
Yes, I agree that the policy_filter is probably better if you don't need the Text index.
0
 
Des DesAuthor Commented:
Apologies, could not get logged in over the weekend. The PDF file is text based and not very large. Each archived record is stored in a PDF, which is stored in a BLOB column. I just want to extract the data from the PDF and display, no formatting present. I will attempt to use the policy_filter option suggested above and let you know how I get on.

Thanks in anticipation.

Des
0
 
Des DesAuthor Commented:
Would it be possible to get an example of using this approach. I have a table called archived_logs and inside this table a column called documents which is BLOB and contains an individual PDF file. Each of these files are basic text. I have tried to use the policy_filter function however I don't fully understand the parameters used in the call.  For example where do I create the policy, the first parameter and how do I store the result from the call to the function?

Thanks
0
 
slightwv (䄆 Netminder)Connect With a Mentor Commented:
Below is the simplest example I can think of.

What we cannot answer for you is all the different options you have with Oracle Text.

Do you need to have special options set for the LEXER or WORDLIST?

You can read up on everything here:
http://docs.oracle.com/cd/E11882_01/text.112/e24436/cdatadic.htm#CCREF0200

The important part of the policy is the inso_filter.
--do once and forget about it
exec CTX_DDL.CREATE_POLICY( policy_name => 'pdf_policy', filter => 'ctxsys.inso_filter');


declare
	v_result	clob;
	v_pdf		blob;
begin
	--get a PDF into a variable
	select mypdf into v_pdf from bob where rownum=1;
	ctx_doc.policy_filter(policy_name => 'pdf_policy',
                      document => v_pdf,
                      restab => v_result,
                      plaintext => true);

	dbms_output.put_line(v_result);
end;
/

Open in new window

0
 
Des DesAuthor Commented:
Thank you very much for the code snippet above, appreciate it. I got the code to compile but when I tried to execute I could not see the results in the DBMS_OUTPUT window. I decided to move it into a table, CLOB column, and after I ran I just get the message inside the column, BIGCLOB. Not sure where to go, any advice would be great as I am new to this type of processing.

Thanks
0
 
slightwv (䄆 Netminder) Commented:
What tool are you using to access the database?

If sqlplus or SQL Developer make sure to enable server output:
set severoutput on

Note that SQL Developer has a menu option for this as well but I don't use it so not sure exactly where it is.  SQL Developer should accept the set command and the output is in it's own window.

>>Not sure where to go, any advice would be great as I am new to this type of processing.

Not sure I can answer this for you.

What do you want to display it in?  I'm guessing you want it parsed out for some specific application and/or reason.
0
 
Des DesAuthor Commented:
Hi, I am using Toad and have the DBMS_OUTPUT window enabled. It tries to show details but nothing appears.  I am using Oracle 11g and Toad 10. The details are old receipts from our club, which were stored as PDF's and I need to generate a report showing the details from each of the receipts.

I only want the output, like an SQL simple report but can put the code inside a Stored Procedure if required.

Thanks
0
 
Des DesAuthor Commented:
When I use a simple select statement SELECT payment_col FROM receipts WHERE rownum = 1; I get HugeClob displayed in the result set.
0
 
johnsoneConnect With a Mentor Senior Oracle DBACommented:
My guess is also images rather than text.

It took me a while to find a document that was usable and set up Oracle Text (last time I did that was 9i and a very long time ago), but the sample code that slightwv posted should work.  The only issue I found was that if the document is large enough, the DBMS_OUTPUT will fail with an error.  Putting a substring on the clob to only grab the first 4000 characters did the trick to at least prove it works.

So, to put the text into a temporary location, create a global temporary table (only have to do this once):
CREATE global TEMPORARY TABLE document_text 
  ( 
     doc_text CLOB 
  ) 
ON COMMIT DELETE ROWS; 

Open in new window

Then run the PL/SQL to put the text into the temp table:
DECLARE 
    v_result CLOB; 
    v_pdf    BLOB; 
BEGIN 
    --get a PDF into a variable 
    SELECT document 
    INTO   v_pdf 
    FROM   archived_logs 
    WHERE  ...; 

    ctx_doc.Policy_filter(policy_name => 'pdf_policy', document => v_pdf, 
    restab => v_result, plaintext => TRUE); 

    INSERT INTO document_text 
    VALUES      (v_result); 
END; 

/ 

Open in new window

Ideally, I would turn that anonymous block into a procedure and send some sort of key value to it.  Then the result should be the only row in the temporary table.  After a commit, the row should disappear.
0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.