Solved

Convert pdf to free text form for fulltext search

Posted on 2011-09-10
9
458 Views
Last Modified: 2012-05-12
I have a bunch of pdf files that I have stored as large blobs in a database, which seems to be working well for managing them and displaying them.  I also want to be able to search the text in the documents using PHP and MySQL to look for keywords or phrases using either the LIKE mysql function or FULLTEXT search.  It seems that I would best off converting the PDF blob data into a simple text version with the extracted data in free text form.  Is there a PHP class or function that can do this, and can you suggest how I would then go about searching the free text data for a keyword or phrase.  

I already have the PDF's in a MySQL database table called pdf_files, with the large blob field simply called pdf.
0
Comment
Question by:sscotti
  • 4
  • 4
9 Comments
 
LVL 23

Expert Comment

by:nemws1
ID: 36517513
Do the PDFs contain actual text data (as opposed to scanned pages of text, which would need to be OCR'ed first).  If they do contain text, there's a command line utility as part of Ghostscript called "pdf2text" that will easily extract out all the textual data from your PDFs.

If you're running windows, you might be able to easily install it using Cygwin.
0
 
LVL 5

Author Comment

by:sscotti
ID: 36517548
It is actually probably a bit more complicated question.  I've been playing around and it seems there many different formats for PDF documents.  I have basically 2 types and a fair bit of software on my computer.

1.  Powerpoint presentations that I want to convert to PDF files.  I can do this in several ways:

a.  Use Powerpoint to save them as a PDF.
b.  Use the Mac OS print feature to save them as a PDF to disk.
b.  Use the Mac OS print feature to save them as an Adobe PDF (I have Adobe Acrobat Pro also).

2.  Various journal articles from medicine that I get from a number of resources.


It seems that the journal articles and the documents that I could create using the Adobe PDF creator have more information (i.e. the text) than the ones that I create the other ways, which seem to be bitmaps.

I do have a program called Wondershare PDF creator that can apparently also convert PDF's to text.

Just kind of wondering what the best way to go is.

For some of the journal articles I am able to search the PDF LargeBlob directly and get hits with my search string in MySQL, although it seems to be case sensitive there.  Seems like it might be best to create text version of all of the PDF's and store those in the database as well when performing searches since I guess I can do full text searches and get around the case sensitivity as well.  Is that true.

Can you give an example of how to create a query to search for keywords or phrases in the pdf or text doc?
0
 
LVL 110

Expert Comment

by:Ray Paseur
ID: 36517556
I think we need to see your test data.  Please post the information you have in your data base blob columns, thanks. ~Ray
0
Complete VMware vSphere® ESX(i) & Hyper-V Backup

Capture your entire system, including the host, with patented disk imaging integrated with VMware VADP / Microsoft VSS and RCT. RTOs is as low as 15 seconds with Acronis Active Restore™. You can enjoy unlimited P2V/V2V migrations from any source (even from a different hypervisor)

 
LVL 5

Author Comment

by:sscotti
ID: 36519161
Well, here are some examples of the PDF created in various ways, as well as the original ppt file, and a converted text file.

They are at:

http://sscotti.com/CaseBreast1adobe.pdf , saved with save as adobe pdf
http://sscotti.com/CaseBreast1macos.pdf, saved as pdf with os x print
http://sscotti.com/CaseBreast1ppt.pdf, saved as pdf from save as in ppt
http://sscotti.com/CaseBreast1.ppt, original powerpoint file
http://sscotti.com/CaseBreast1.txt, original powerpoint file

Just examining them with a text/hex editor, using BBEdit, it looks like the one saved as an Adobe PDF file retains the searchable text data, although in binary format since I have it stored as a large blob.

So question.  If I store the raw pdf data as long text instead of a large blob, can I search the raw data without case insensitivity and still use the field to store the pdf.

I think I have the mysql queries worked out for using the LIKE match.

DATABASE with table pdf_files has one table with 6 columns:

pid  INT
section VARCHAR  (Breast, Cardiac, etc)
name VARCHAR (Title for document)
casenumber INT (case number for document)
pdfdata LONGLBLOB (contains binary data for pdf or other type of doc, i.e. word, ppt, etc., just using pdf for now)  can I change to LONGTEXT and do FULLTEXT search and still use for pdfdata, think the pdfdata has to be binary.
pdftext LONGTEXT, trying this to store converted plain text version

query is like:

$query = "SELECT * from pdf_files WHERE pdftext LIKE '%".postUrlStringValue('searchText',null)."%' ORDER BY section, casenumber, name ASC";

or

$query = "SELECT * from pdf_files WHERE pdfdata LIKE '%".postUrlStringValue('searchText',null)."%' ORDER BY section, casenumber, name ASC";

I don't know how to do a FULLTEXT search, but you can't do that on a LONGBLOB anyway I think.
0
 
LVL 110

Accepted Solution

by:
Ray Paseur earned 500 total points
ID: 36519386
I have not made it to testing these files with MySQL yet, but I can show you this much.  The script makes a case-insensitive search of the PDF and PPT files using a regular expression.  It finds three instances of "idiopathic" in each file.
http://www.laprbass.com/RAY_temp_sscotti.php

These files are hundreds of thousands of bytes long, and that is not what you would usually want to put into a data base.  So I might organize the application in a way that would enable it to "learn" as it did searches of the files.  In this environment, the pid, section, name and case number would stay in the pdf_files table, but the PDF itself would reside in the server file system.  You would add a DATETIME column along with a URL column that pointed to the PDF or PPT in the server's file system.  You would add a new pdf_keywords table.  This table would contain a search_word, a DATETIME column and a pid column that pointed to the rows of the pdf_files table for each of the PDF files that contained the word.  Whenever a search request arrived, the search script would look for PDF files (if any) that were newer than the most recent row of the pdf_keywords table matching the keyword.  It would search these newer PDF files and update the pdf_keywords table to add a row for every PDF document that matched the search keyword.  Then it would build its client response from the freshly updated pdf_keywords table.

That is a little more complicated to write than a fulltext search, but it will perform many times better.  With the fulltext search, you will do a table scan for every search.  A table scan, to data base programmers, is a little like dog poop - out of place wherever you find it.  If your data base grows to any appreciable size you run the risk that the table scans will seriously degrade performance as you repeatedly scan the large blob or text fields.  In contrast the learning search design will never have to re-scan any of the PDF or PPT files for the same keyword.  Once the keyword is in the pdf_keywords table, only the new, unscanned files will be inspected.  A sentient being might even create lists of "unscannable" words like the, that, is, set, etc.  You could also prime the pump by scanning all the files for expected search words.  You might consider an algorithm that would keep track of words that appear too frequently (like in 90% of the documents) so you could tell the client that the word is unspecific.

Philosophically, you would probably want to keep the files as close as possible to the original text.  So simple PPT files (not PPTX) and PDF files that are compatible with older versions of Adobe Reader will probably serve you best.

HTH, ~Ray
<?php // RAY_temp_sscotti.php
error_reporting(E_ALL);
echo "<pre>"; // MAKE IT EASY TO READ


// FROM EE: http://sscotti.com/CaseBreast1adobe.pdf , saved with save as adobe pdf
echo PHP_EOL . '********************************** PDF *****' . PHP_EOL;
$pdf = file_get_contents('RAY_temp_sscotti.pdf');

// ACTIVATE THIS TO SEE THE ENTIRE PDF
// var_dump($pdf);

// MAKE A CASE-INSENSITIVE SEARCH
preg_match_all('/IDIOPATHIC/i', $pdf, $mat);
var_dump($mat);


echo PHP_EOL;


// FROM EE: http://sscotti.com/CaseBreast1.ppt, original powerpoint file
echo PHP_EOL . '********************************** PPT *****' . PHP_EOL;
$ppt = file_get_contents('RAY_temp_sscotti.ppt');

// ACTIVATE THIS TO SEE THE ENTIRE PPT
// var_dump($ppt);

// MAKE A CASE-INSENSITIVE SEARCH
preg_match_all('/IDIOPATHIC/i', $ppt, $mat);
var_dump($mat);
echo PHP_EOL;

Open in new window

0
 
LVL 5

Author Comment

by:sscotti
ID: 36520220
Hey, thanks for that.  I have gone back and forth with storing the data in a database vs. on the server.  I have another app for radiology that works in a similar fashion, although the BLOB data are mostly pictures there and generally are not any larger than 500K at most each, with a sprinkling of other documents.  I like having them in a database as it is a single file, a little more portable and in some ways easier to manage.  With the arrangement I have now I probably would be better of storing them on the server and putting other parameters in the database.

The actual app is here:  sscotti.com

Haven't implemented your script, but will, thanks.

Stephen
0
 
LVL 110

Expert Comment

by:Ray Paseur
ID: 36520268
Thanks for the points.  This is a really great question, ~Ray
0
 
LVL 5

Author Comment

by:sscotti
ID: 36520332
Here is another question.  In the PDF documents that I've created it looks like the pdf creator is putting text on the slides from the ppt presentation encloded between parentheses:  e.g.

"(52 yo male with palpable )Tj 2.042 -1.182 Td [(lump left breast)"

surrounded by formatting data.  The text that I actually want to search in the pdf is the text between the parentheses.  Is there a way to use a regex in the preg_match function to look for an occurrence of my search text in between parentheses?  If I can do that things may actually work OK.  I don't need preg_match_all.  I am just looking for at least one match, so preg_match works.  I tried it in my code, and the performance seems similar to searching the database with SQL using LIKE "%match%".

Thanks.

0
 
LVL 110

Expert Comment

by:Ray Paseur
ID: 36520381
Yes, you can search for the text between parentheses.  Your REGEX will need to escape the parens (the are metacharacters to REGEX).
0

Featured Post

Free Tool: ZipGrep

ZipGrep is a utility that can list and search zip (.war, .ear, .jar, etc) archives for text patterns, without the need to extract the archive's contents.

One of a set of tools we're offering as a way to say thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Developers of all skill levels should learn to use current best practices when developing websites. However many developers, new and old, fall into the trap of using deprecated features because this is what so many tutorials and books tell them to u…
Since pre-biblical times, humans have sought ways to keep secrets, and share the secrets selectively.  This article explores the ways PHP can be used to hide and encrypt information.
The viewer will learn how to create and use a small PHP class to apply a watermark to an image. This video shows the viewer the setup for the PHP watermark as well as important coding language. Continue to Part 2 to learn the core code used in creat…
This tutorial will teach you the core code needed to finalize the addition of a watermark to your image. The viewer will use a small PHP class to learn and create a watermark.

679 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question