Solved

Reading a PDF file Stored in BLOB table Column as Text and display same

Posted on 2016-10-14
14
40 Views
Last Modified: 2016-11-05
Hi I would like to be able to read a PDF file which is stored in a BLOB column on my Oracle Table. Is it possible, using either SQL or PL/SQL to extract the data as text for display or storing?

Thanks
0
Comment
Question by:Des Des
  • 5
  • 5
  • 2
14 Comments
 
LVL 76

Expert Comment

by:slightwv (䄆 Netminder)
ID: 41844628
Sorry but the question is still too vague.

When you say "display" do you mean with all the formatting?
Displayed in what program or application?
Is the PDF actual text based or is it more of an image with text as part of the picture and you need OCR ability?
Storing is completely different.  Stored as what?
What versions of PDF do you need the ability to handle?
What version of Oracle database can you use?

All that said, you might take a look at using Oracle Text and CTX_DOC.FILTER to extract text from a PDF:
http://docs.oracle.com/cd/E11882_01/text.112/e24436/cdocpkg.htm#CCREF0702

It won't handle the formatting and is limited to PDF version based on the Oracle database version.

There are probably some 3rd party products that offer some alternatives but what ones would be based on your exact requirements.
0
 
LVL 34

Expert Comment

by:johnsone
ID: 41844919
The CTX_DOC.FILTER procedure requires that a Oracle Text index be created for the column.  If you don't need that, look at the CTX_DOC.POLICY_FILTER procedure which does not require a Text index.  Doc for that is -> http://docs.oracle.com/cd/E11882_01/text.112/e24436/cdocpkg.htm#CCREF2116  Likely a few pages below the reference that was posted for CTX_DOC.FILTER.
0
 
LVL 76

Expert Comment

by:slightwv (䄆 Netminder)
ID: 41845269
Yes, I agree that the policy_filter is probably better if you don't need the Text index.
0
 

Author Comment

by:Des Des
ID: 41846223
Apologies, could not get logged in over the weekend. The PDF file is text based and not very large. Each archived record is stored in a PDF, which is stored in a BLOB column. I just want to extract the data from the PDF and display, no formatting present. I will attempt to use the policy_filter option suggested above and let you know how I get on.

Thanks in anticipation.

Des
0
 

Author Comment

by:Des Des
ID: 41846355
Would it be possible to get an example of using this approach. I have a table called archived_logs and inside this table a column called documents which is BLOB and contains an individual PDF file. Each of these files are basic text. I have tried to use the policy_filter function however I don't fully understand the parameters used in the call.  For example where do I create the policy, the first parameter and how do I store the result from the call to the function?

Thanks
0
 
LVL 76

Assisted Solution

by:slightwv (䄆 Netminder)
slightwv (䄆 Netminder) earned 400 total points (awarded by participants)
ID: 41846625
Below is the simplest example I can think of.

What we cannot answer for you is all the different options you have with Oracle Text.

Do you need to have special options set for the LEXER or WORDLIST?

You can read up on everything here:
http://docs.oracle.com/cd/E11882_01/text.112/e24436/cdatadic.htm#CCREF0200

The important part of the policy is the inso_filter.
--do once and forget about it
exec CTX_DDL.CREATE_POLICY( policy_name => 'pdf_policy', filter => 'ctxsys.inso_filter');


declare
	v_result	clob;
	v_pdf		blob;
begin
	--get a PDF into a variable
	select mypdf into v_pdf from bob where rownum=1;
	ctx_doc.policy_filter(policy_name => 'pdf_policy',
                      document => v_pdf,
                      restab => v_result,
                      plaintext => true);

	dbms_output.put_line(v_result);
end;
/

Open in new window

0
PRTG Network Monitor: Intuitive Network Monitoring

Network Monitoring is essential to ensure that computer systems and network devices are running. Use PRTG to monitor LANs, servers, websites, applications and devices, bandwidth, virtual environments, remote systems, IoT, and many more. PRTG is easy to set up & use.

 

Author Comment

by:Des Des
ID: 41846921
Thank you very much for the code snippet above, appreciate it. I got the code to compile but when I tried to execute I could not see the results in the DBMS_OUTPUT window. I decided to move it into a table, CLOB column, and after I ran I just get the message inside the column, BIGCLOB. Not sure where to go, any advice would be great as I am new to this type of processing.

Thanks
0
 
LVL 76

Expert Comment

by:slightwv (䄆 Netminder)
ID: 41846927
What tool are you using to access the database?

If sqlplus or SQL Developer make sure to enable server output:
set severoutput on

Note that SQL Developer has a menu option for this as well but I don't use it so not sure exactly where it is.  SQL Developer should accept the set command and the output is in it's own window.

>>Not sure where to go, any advice would be great as I am new to this type of processing.

Not sure I can answer this for you.

What do you want to display it in?  I'm guessing you want it parsed out for some specific application and/or reason.
0
 

Author Comment

by:Des Des
ID: 41847012
Hi, I am using Toad and have the DBMS_OUTPUT window enabled. It tries to show details but nothing appears.  I am using Oracle 11g and Toad 10. The details are old receipts from our club, which were stored as PDF's and I need to generate a report showing the details from each of the receipts.

I only want the output, like an SQL simple report but can put the code inside a Stored Procedure if required.

Thanks
0
 

Author Comment

by:Des Des
ID: 41847043
When I use a simple select statement SELECT payment_col FROM receipts WHERE rownum = 1; I get HugeClob displayed in the result set.
0
 
LVL 76

Accepted Solution

by:
slightwv (䄆 Netminder) earned 400 total points (awarded by participants)
ID: 41847051
>>The details are old receipts from our club, which were stored as PDF's

When I read "old receipts" I'm thinking scanned images.  You cannot extract the text from an image without an OCR program.  Oracle filtering won't do that.

If you open the PDF can you manually copy text and paste it into Notepad?  If not, it isn't a text-based PDF.
0
 
LVL 34

Assisted Solution

by:johnsone
johnsone earned 100 total points (awarded by participants)
ID: 41847148
My guess is also images rather than text.

It took me a while to find a document that was usable and set up Oracle Text (last time I did that was 9i and a very long time ago), but the sample code that slightwv posted should work.  The only issue I found was that if the document is large enough, the DBMS_OUTPUT will fail with an error.  Putting a substring on the clob to only grab the first 4000 characters did the trick to at least prove it works.

So, to put the text into a temporary location, create a global temporary table (only have to do this once):
CREATE global TEMPORARY TABLE document_text 
  ( 
     doc_text CLOB 
  ) 
ON COMMIT DELETE ROWS; 

Open in new window

Then run the PL/SQL to put the text into the temp table:
DECLARE 
    v_result CLOB; 
    v_pdf    BLOB; 
BEGIN 
    --get a PDF into a variable 
    SELECT document 
    INTO   v_pdf 
    FROM   archived_logs 
    WHERE  ...; 

    ctx_doc.Policy_filter(policy_name => 'pdf_policy', document => v_pdf, 
    restab => v_result, plaintext => TRUE); 

    INSERT INTO document_text 
    VALUES      (v_result); 
END; 

/ 

Open in new window

Ideally, I would turn that anonymous block into a procedure and send some sort of key value to it.  Then the result should be the only row in the temporary table.  After a commit, the row should disappear.
0

Featured Post

PRTG Network Monitor: Intuitive Network Monitoring

Network Monitoring is essential to ensure that computer systems and network devices are running. Use PRTG to monitor LANs, servers, websites, applications and devices, bandwidth, virtual environments, remote systems, IoT, and many more. PRTG is easy to set up & use.

Join & Write a Comment

Suggested Solutions

Title # Comments Views Activity
ORACLE SQL DEVELOPER Query ? 5 58
UNIX SCP 5 47
Export BLOB data from Oracle 10g 4 25
Outer Query not returning data - SQL HELP 16 40
How to Unravel a Tricky Query Introduction If you browse through the Oracle zones or any of the other database-related zones you'll come across some complicated solutions and sometimes you'll just have to wonder how anyone came up with them.  …
Using SQL Scripts we can save all the SQL queries as files that we use very frequently on our database later point of time. This is one of the feature present under SQL Workshop in Oracle Application Express.
This video shows how to copy a database user from one database to another user DBMS_METADATA.  It also shows how to copy a user's permissions and discusses password hash differences between Oracle 10g and 11g.
Via a live example, show how to restore a database from backup after a simulated disk failure using RMAN.

707 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

16 Experts available now in Live!

Get 1:1 Help Now