Convert pdf to free text form for fulltext search

I have a bunch of PDF files stored as large BLOBs in a database, which seems to be working well for managing and displaying them.  I also want to be able to search the text in the documents using PHP and MySQL, looking for keywords or phrases with either the MySQL LIKE operator or a FULLTEXT search.  It seems that I would be best off converting the PDF blob data into a simple text version with the extracted data in free-text form.  Is there a PHP class or function that can do this, and can you suggest how I would then go about searching the free-text data for a keyword or phrase?

I already have the PDFs in a MySQL database table called pdf_files, with the large BLOB field simply called pdf.
Ray Paseur Commented:
I have not made it to testing these files with MySQL yet, but I can show you this much.  The script makes a case-insensitive search of the PDF and PPT files using a regular expression.  It finds three instances of "idiopathic" in each file.

These files are hundreds of thousands of bytes long, and that is not what you would usually want to put into a data base.  So I might organize the application in a way that would enable it to "learn" as it did searches of the files.  In this environment, the pid, section, name and case number would stay in the pdf_files table, but the PDF itself would reside in the server file system.  You would add a DATETIME column along with a URL column that pointed to the PDF or PPT in the server's file system.  You would add a new pdf_keywords table.  This table would contain a search_word, a DATETIME column and a pid column that pointed to the rows of the pdf_files table for each of the PDF files that contained the word.  Whenever a search request arrived, the search script would look for PDF files (if any) that were newer than the most recent row of the pdf_keywords table matching the keyword.  It would search these newer PDF files and update the pdf_keywords table to add a row for every PDF document that matched the search keyword.  Then it would build its client response from the freshly updated pdf_keywords table.
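As a sketch, the learning-search schema described above might look like this; the table and column names are illustrative, taken from the description, and the DATETIME and URL columns are the proposed additions to the existing pdf_files table:

```sql
-- Additions to the existing pdf_files table: when the file arrived,
-- and where the PDF/PPT now lives in the server file system
ALTER TABLE pdf_files
  ADD COLUMN created DATETIME     NOT NULL,
  ADD COLUMN url     VARCHAR(255) NOT NULL;

-- One row per (keyword, document) pair that matched a scan
CREATE TABLE pdf_keywords (
  search_word VARCHAR(64) NOT NULL,
  scanned     DATETIME    NOT NULL,  -- when this keyword scan ran
  pid         INT         NOT NULL,  -- points at pdf_files.pid
  INDEX (search_word, scanned)
);
```

A search for a keyword would then look for pdf_files rows with a created value newer than the latest scanned for that search_word, scan only those files, insert a pdf_keywords row for each new match, and build the response from pdf_keywords.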

That is a little more complicated to write than a fulltext search, but it will perform many times better.  With the fulltext search, you will do a table scan for every search.  A table scan, to data base programmers, is a little like dog poop - out of place wherever you find it.  If your data base grows to any appreciable size you run the risk that the table scans will seriously degrade performance as you repeatedly scan the large blob or text fields.  In contrast the learning search design will never have to re-scan any of the PDF or PPT files for the same keyword.  Once the keyword is in the pdf_keywords table, only the new, unscanned files will be inspected.  A sentient being might even create lists of "unscannable" words like the, that, is, set, etc.  You could also prime the pump by scanning all the files for expected search words.  You might consider an algorithm that would keep track of words that appear too frequently (like in 90% of the documents) so you could tell the client that the word is unspecific.

Philosophically, you would probably want to keep the files as close as possible to the original text.  So simple PPT files (not PPTX) and PDF files that are compatible with older versions of Adobe Reader will probably serve you best.

HTH, ~Ray
<?php // RAY_temp_sscotti.php
echo "<pre>"; // MAKE IT EASY TO READ

// FROM EE: , saved with save as adobe pdf
echo PHP_EOL . '********************************** PDF *****' . PHP_EOL;
$pdf = file_get_contents('RAY_temp_sscotti.pdf');

// var_dump($pdf);

// CASE-INSENSITIVE SEARCH OF THE RAW PDF DATA
preg_match_all('/IDIOPATHIC/i', $pdf, $mat);
echo 'MATCHES FOUND: ' . count($mat[0]);
echo PHP_EOL;

// FROM EE:, original powerpoint file
echo PHP_EOL . '********************************** PPT *****' . PHP_EOL;
$ppt = file_get_contents('RAY_temp_sscotti.ppt');

// var_dump($ppt);

// CASE-INSENSITIVE SEARCH OF THE RAW PPT DATA
preg_match_all('/IDIOPATHIC/i', $ppt, $mat);
echo 'MATCHES FOUND: ' . count($mat[0]);
echo PHP_EOL;


nemws1 (Database Administrator) Commented:
Do the PDFs contain actual text data (as opposed to scanned pages of text, which would need to be OCR'ed first)?  If they do contain text, there's a command-line utility called pdftotext (part of the Xpdf/Poppler tools) that will easily extract all the textual data from your PDFs.

If you're running Windows, you might be able to install it easily using Cygwin.
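Calling pdftotext from PHP might look like the sketch below; it assumes the utility is installed and on the PATH, and the file name is only illustrative:

```php
<?php
// Build the shell command; the trailing "-" tells pdftotext to write
// the extracted text to stdout instead of a file.
function build_pdftotext_cmd($pdfPath)
{
    return 'pdftotext ' . escapeshellarg($pdfPath) . ' -';
}

// Run the command and return the extracted plain text (empty string
// if the command produced no output).
function pdf_to_text($pdfPath)
{
    $text = shell_exec(build_pdftotext_cmd($pdfPath));
    return $text === null ? '' : $text;
}

// Usage (hypothetical file name): store the result in a text column
// for searching, e.g. $text = pdf_to_text('RAY_temp_sscotti.pdf');
```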
sscotti (Author) Commented:
It is actually probably a bit more complicated a question.  I've been playing around, and it seems there are many different formats for PDF documents.  I basically have two types, and a fair bit of software on my computer.

1.  Powerpoint presentations that I want to convert to PDF files.  I can do this in several ways:

a.  Use Powerpoint to save them as a PDF.
b.  Use the Mac OS print feature to save them as a PDF to disk.
c.  Use the Mac OS print feature to save them as an Adobe PDF (I have Adobe Acrobat Pro also).

2.  Various journal articles from medicine that I get from a number of resources.

It seems that the journal articles and the documents that I could create using the Adobe PDF creator have more information (i.e. the text) than the ones that I create the other ways, which seem to be bitmaps.

I do have a program called Wondershare PDF creator that can apparently also convert PDFs to text.

Just kind of wondering what the best way to go is.

For some of the journal articles I am able to search the PDF LONGBLOB directly and get hits with my search string in MySQL, although the comparison is case-sensitive there (BLOB comparisons are byte-by-byte, while TEXT columns use a collation that can be case-insensitive).  It seems like it might be best to create a text version of all of the PDFs and store those in the database as well for searching, since I guess I can then do full-text searches and get around the case sensitivity too.  Is that true?

Can you give an example of how to create a query to search for keywords or phrases in the pdf or text doc?

Ray Paseur Commented:
I think we need to see your test data.  Please post the information you have in your data base blob columns, thanks. ~Ray
sscotti (Author) Commented:
Well, here are some examples of the PDF created in various ways, as well as the original ppt file, and a converted text file.

They are at: , saved with save as adobe pdf, saved as pdf with os x print, saved as pdf from save as in ppt, original powerpoint file, original powerpoint file

Just examining them with a text/hex editor, using BBEdit, it looks like the one saved as an Adobe PDF file retains the searchable text data, although in binary format since I have it stored as a large blob.

So, a question: if I store the raw PDF data as LONGTEXT instead of a LONGBLOB, can I search the raw data case-insensitively and still use the field to store the PDF?

I think I have the mysql queries worked out for using the LIKE match.

The database has one table, pdf_files, with six columns:

pid  INT
section VARCHAR  (Breast, Cardiac, etc.)
name VARCHAR (title for document)
casenumber INT (case number for document)
pdfdata LONGBLOB (contains binary data for PDF or other type of doc, i.e. Word, PPT, etc.; just using PDF for now.  Can I change this to LONGTEXT and do a FULLTEXT search and still use it for pdfdata?  I think the pdfdata has to be binary.)
pdftext LONGTEXT (trying this to store the converted plain-text version)

query is like:

$query = "SELECT * from pdf_files WHERE pdftext LIKE '%".postUrlStringValue('searchText',null)."%' ORDER BY section, casenumber, name ASC";


$query = "SELECT * from pdf_files WHERE pdfdata LIKE '%".postUrlStringValue('searchText',null)."%' ORDER BY section, casenumber, name ASC";

I don't know how to do a FULLTEXT search, but I don't think you can do that on a LONGBLOB anyway.
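For reference, a FULLTEXT search on the pdftext column might look like the sketch below.  It assumes a FULLTEXT index has been added to pdftext (MyISAM, or InnoDB as of MySQL 5.6), and it uses a prepared statement, which also avoids the injection risk of concatenating the search string into the SQL:

```php
<?php
// One-time setup in MySQL (not run here):
//   ALTER TABLE pdf_files ADD FULLTEXT INDEX ft_pdftext (pdftext);

// Natural-language FULLTEXT search; case-insensitive by default
// with a *_ci collation on the pdftext column.
$sql = "SELECT pid, section, name, casenumber
          FROM pdf_files
         WHERE MATCH(pdftext) AGAINST(:kw IN NATURAL LANGUAGE MODE)
         ORDER BY section, casenumber, name ASC";

// Usage with a hypothetical PDO connection ($pdo) and search term:
// $stmt = $pdo->prepare($sql);
// $stmt->execute(array(':kw' => $searchText));
// $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
```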
sscotti (Author) Commented:
Hey, thanks for that.  I have gone back and forth with storing the data in a database vs. on the server.  I have another app for radiology that works in a similar fashion, although the BLOB data are mostly pictures there and generally are not any larger than 500K at most each, with a sprinkling of other documents.  I like having them in a database as it is a single file, a little more portable and in some ways easier to manage.  With the arrangement I have now I probably would be better off storing them on the server and putting other parameters in the database.

The actual app is here:

Haven't implemented your script, but will, thanks.

Ray Paseur Commented:
Thanks for the points.  This is a really great question, ~Ray
sscotti (Author) Commented:
Here is another question.  In the PDF documents that I've created, it looks like the PDF creator is putting the text from the PPT slides enclosed between parentheses, e.g.:

"(52 yo male with palpable )Tj 2.042 -1.182 Td [(lump left breast)"

surrounded by formatting data.  The text that I actually want to search in the PDF is the text between the parentheses.  Is there a way to use a regex in the preg_match function to look for an occurrence of my search text between parentheses?  If I can do that, things may actually work OK.  I don't need preg_match_all; I am just looking for at least one match, so preg_match works.  I tried it in my code, and the performance seems similar to searching the database with SQL using LIKE '%match%'.


Ray Paseur Commented:
Yes, you can search for the text between parentheses.  Your REGEX will need to escape the parens (they are metacharacters to REGEX).
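A minimal sketch, assuming the search term sits inside a single parenthesized text operand as in the fragment quoted above: \( and \) match the literal parens, and [^)]* keeps the match inside one pair.

```php
<?php
// Look for the search term anywhere between a literal "(" and ")"
// in the raw PDF data.  preg_quote() escapes any regex metacharacters
// that might appear in the user's search text.
function found_in_parens($pdfData, $searchText)
{
    $pattern = '/\([^)]*' . preg_quote($searchText, '/') . '[^)]*\)/i';
    return preg_match($pattern, $pdfData) === 1;
}

// Example against the fragment quoted above:
$fragment = '(52 yo male with palpable )Tj 2.042 -1.182 Td [(lump left breast)';
var_dump(found_in_parens($fragment, 'palpable')); // bool(true)
```

One caveat: a phrase that the PDF creator split across separate (...)Tj runs will not match this pattern; extracting the PDF to plain text first avoids that problem.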
Question has a verified solution.
