Convert PDF to free-text form for fulltext search

Posted on 2011-09-10
Last Modified: 2012-05-12
I have a bunch of PDF files that I have stored as large BLOBs in a database, which seems to be working well for managing and displaying them.  I also want to be able to search the text in the documents using PHP and MySQL, looking for keywords or phrases with either the MySQL LIKE operator or a FULLTEXT search.  It seems that I would be best off converting the PDF BLOB data into a simple text version with the extracted data in free-text form.  Is there a PHP class or function that can do this, and can you suggest how I would then go about searching the free-text data for a keyword or phrase?

I already have the PDFs in a MySQL database table called pdf_files, with the large BLOB field simply called pdf.
Question by:sscotti
LVL 23

Expert Comment

ID: 36517513
Do the PDFs contain actual text data (as opposed to scanned pages of text, which would need to be OCR'ed first)?  If they do contain text, there's a command-line utility called pdftotext, part of the Xpdf/Poppler tools, that will easily extract all the textual data from your PDFs.
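Assuming the PDFs do contain a text layer, the extraction step might be scripted from PHP like the sketch below.  The function names are mine, and it assumes the pdftotext binary (from the Xpdf/Poppler tools) is on the server's PATH:

```php
<?php
// Sketch: extract the plain text of a PDF by shelling out to pdftotext.
// Assumes the pdftotext binary (Xpdf/Poppler tools) is on the PATH.
function pdftotext_cmd($pdf_path)
{
    // "-" sends the extracted text to stdout instead of a .txt file
    return 'pdftotext ' . escapeshellarg($pdf_path) . ' -';
}

function pdf_to_text($pdf_path)
{
    return shell_exec(pdftotext_cmd($pdf_path)); // NULL if the tool is missing
}

// Example: $text = pdf_to_text('RAY_temp_sscotti.pdf');
```

The returned string could then be stored in a plain-text column alongside the BLOB for searching.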

If you're running Windows, you might be able to install it easily via Cygwin.

Author Comment

ID: 36517548
It is actually probably a bit more complicated than that.  I've been playing around, and it seems there are many different formats for PDF documents.  I have basically 2 types, and a fair bit of software on my computer.

1.  Powerpoint presentations that I want to convert to PDF files.  I can do this in several ways:

a.  Use Powerpoint to save them as a PDF.
b.  Use the Mac OS print feature to save them as a PDF to disk.
c.  Use the Mac OS print feature to save them as an Adobe PDF (I have Adobe Acrobat Pro also).

2.  Various journal articles from medicine that I get from a number of resources.

It seems that the journal articles and the documents that I could create using the Adobe PDF creator have more information (i.e. the text) than the ones that I create the other ways, which seem to be bitmaps.

I do have a program called Wondershare PDF Creator that can apparently also convert PDFs to text.

Just kind of wondering what the best way to go is.

For some of the journal articles I am able to search the PDF LONGBLOB directly and get hits with my search string in MySQL, although the matching seems to be case-sensitive there.  It seems like it might be best to create a text version of all of the PDFs and store those in the database as well for searching, since I guess I could then do full-text searches and get around the case sensitivity too.  Is that true?

Can you give an example of how to create a query to search for keywords or phrases in the pdf or text doc?
LVL 110

Expert Comment

by:Ray Paseur
ID: 36517556
I think we need to see your test data.  Please post the information you have in your data base blob columns, thanks. ~Ray


Author Comment

ID: 36519161
Well, here are some examples of the PDF created in various ways, as well as the original ppt file, and a converted text file.

They are at: , saved with save as adobe pdf, saved as pdf with os x print, saved as pdf from save as in ppt, original powerpoint file, original powerpoint file

Just examining them with a text/hex editor, using BBEdit, it looks like the one saved as an Adobe PDF file retains the searchable text data, although in binary format since I have it stored as a large blob.

So, a question: if I store the raw PDF data as LONGTEXT instead of a LONGBLOB, can I search the raw data case-insensitively and still use the field to store the PDF?

I think I have the mysql queries worked out for using the LIKE match.

The database has one table, pdf_files, with 6 columns:

pid INT
section VARCHAR (Breast, Cardiac, etc.)
name VARCHAR (title for document)
casenumber INT (case number for document)
pdfdata LONGBLOB (contains binary data for the PDF or other type of doc, i.e. Word, PPT, etc.; just using PDF for now).  Can I change this to LONGTEXT and do a FULLTEXT search and still use it for pdfdata?  I think the pdfdata has to be binary.
pdftext LONGTEXT (trying this to store the converted plain-text version)

query is like:

$query = "SELECT * from pdf_files WHERE pdftext LIKE '%".postUrlStringValue('searchText',null)."%' ORDER BY section, casenumber, name ASC";


$query = "SELECT * from pdf_files WHERE pdfdata LIKE '%".postUrlStringValue('searchText',null)."%' ORDER BY section, casenumber, name ASC";

I don't know how to do a FULLTEXT search, but I don't think you can do that on a LONGBLOB anyway.
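As a sketch of what the FULLTEXT side might look like, assuming the extracted plain text ends up in the pdftext column and an index has been added with ALTER TABLE pdf_files ADD FULLTEXT(pdftext).  Note that MySQL FULLTEXT indexes of this era require a MyISAM table and cannot be built on a LONGBLOB column, and MATCH ... AGAINST is case-insensitive with the default collations, which sidesteps the LIKE case-sensitivity issue:

```php
<?php
// Sketch: FULLTEXT queries against the pdftext column.  Assumes
//   ALTER TABLE pdf_files ADD FULLTEXT(pdftext);
// has been run on a MyISAM table.
$searchText = 'idiopathic'; // in practice: mysql_real_escape_string(postUrlStringValue('searchText', null))

// Natural-language mode (case-insensitive by default):
$query = "SELECT * FROM pdf_files WHERE MATCH(pdftext) AGAINST('$searchText') ORDER BY section, casenumber, name ASC";

// Boolean mode allows quoted phrases and +required/-excluded words:
$query2 = "SELECT * FROM pdf_files WHERE MATCH(pdftext) AGAINST('+\"$searchText\"' IN BOOLEAN MODE)";
```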
LVL 110

Accepted Solution

Ray Paseur earned 500 total points
ID: 36519386
I have not made it to testing these files with MySQL yet, but I can show you this much.  The script makes a case-insensitive search of the PDF and PPT files using a regular expression.  It finds three instances of "idiopathic" in each file.

These files are hundreds of thousands of bytes long, and that is not what you would usually want to put into a data base.  So I might organize the application in a way that would enable it to "learn" as it did searches of the files.

In this design, the pid, section, name and case number would stay in the pdf_files table, but the PDF itself would reside in the server file system.  You would add a DATETIME column along with a URL column that points to the PDF or PPT in the server's file system.  You would also add a new pdf_keywords table.  This table would contain a search_word, a DATETIME column and a pid column that points to the rows of the pdf_files table for each of the PDF files that contain the word.

Whenever a search request arrives, the search script would look for PDF files (if any) that are newer than the most recent row of the pdf_keywords table matching the keyword.  It would search these newer PDF files and update pdf_keywords with a row for every PDF document that matches the search keyword.  Then it would build its client response from the freshly updated pdf_keywords table.

That is a little more complicated to write than a fulltext search, but it will perform many times better.  With the fulltext search, you will do a table scan for every search.  A table scan, to data base programmers, is a little like dog poop - out of place wherever you find it.  If your data base grows to any appreciable size, you run the risk that the table scans will seriously degrade performance as you repeatedly scan the large blob or text fields.

In contrast, the learning search design will never have to re-scan any of the PDF or PPT files for the same keyword.  Once the keyword is in the pdf_keywords table, only the new, unscanned files will be inspected.  A sentient being might even create lists of "unscannable" words like the, that, is, set, etc.  You could also prime the pump by scanning all the files for expected search words.  You might consider an algorithm that keeps track of words that appear too frequently (say, in 90% of the documents) so you could tell the client that the word is unspecific.
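The pdf_keywords table described above might be sketched like this; all names and column sizes are illustrative, not prescriptive:

```php
<?php
// Sketch of the "learning search" keyword table described above.
// Table and column names are illustrative only.
$createKeywords = "
CREATE TABLE pdf_keywords (
  search_word VARCHAR(64) NOT NULL,
  pid         INT         NOT NULL,  -- row in pdf_files containing the word
  scanned_at  DATETIME    NOT NULL,  -- when that file was scanned
  KEY kw (search_word, scanned_at)
)";
// Per search: scan only pdf_files rows newer than the newest scanned_at
// for this search_word, INSERT a row per matching file, then answer the
// client entirely from pdf_keywords.
```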

Philosophically, you would probably want to keep the files as close as possible to the original text.  So simple PPT files (not PPTX) and PDF files that are compatible with older versions of Adobe Reader will probably serve you best.

HTH, ~Ray
<?php // RAY_temp_sscotti.php
echo "<pre>"; // MAKE IT EASY TO READ

// FROM EE: , saved with save as adobe pdf
echo PHP_EOL . '********************************** PDF *****' . PHP_EOL;
$pdf = file_get_contents('RAY_temp_sscotti.pdf');

// var_dump($pdf);

preg_match_all('/IDIOPATHIC/i', $pdf, $mat);
echo 'INSTANCES FOUND: ' . count($mat[0]);
echo PHP_EOL;

// FROM EE:, original powerpoint file
echo PHP_EOL . '********************************** PPT *****' . PHP_EOL;
$ppt = file_get_contents('RAY_temp_sscotti.ppt');

// var_dump($ppt);

preg_match_all('/IDIOPATHIC/i', $ppt, $mat);
echo 'INSTANCES FOUND: ' . count($mat[0]);
echo PHP_EOL;



Author Comment

ID: 36520220
Hey, thanks for that.  I have gone back and forth on storing the data in a database vs. on the server.  I have another app for radiology that works in a similar fashion, although the BLOB data there are mostly pictures and generally no larger than 500K each, with a sprinkling of other documents.  I like having them in a database as it is a single file, a little more portable and in some ways easier to manage.  With the arrangement I have now, I probably would be better off storing them on the server and putting the other parameters in the database.

The actual app is here:

Haven't implemented your script, but will, thanks.

LVL 110

Expert Comment

by:Ray Paseur
ID: 36520268
Thanks for the points.  This is a really great question, ~Ray

Author Comment

ID: 36520332
Here is another question.  In the PDF documents that I've created, it looks like the PDF creator is putting the text from the ppt presentation's slides enclosed between parentheses, e.g.

"(52 yo male with palpable )Tj 2.042 -1.182 Td [(lump left breast)"

surrounded by formatting data.  The text that I actually want to search in the pdf is the text between the parentheses.  Is there a way to use a regex in the preg_match function to look for an occurrence of my search text in between parentheses?  If I can do that things may actually work OK.  I don't need preg_match_all.  I am just looking for at least one match, so preg_match works.  I tried it in my code, and the performance seems similar to searching the database with SQL using LIKE "%match%".


LVL 110

Expert Comment

by:Ray Paseur
ID: 36520381
Yes, you can search for the text between parentheses.  Your REGEX will need to escape the parens (they are metacharacters in REGEX).
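A sketch of that approach, with the function name and structure mine: preg_quote() neutralizes any regex metacharacters in the user's search term, the literal parens are backslash-escaped in the pattern, and the /i modifier keeps the match case-insensitive:

```php
<?php
// Sketch: does $needle occur anywhere between '(' and ')' in the raw
// PDF data?  preg_quote() escapes regex metacharacters in the needle;
// /i makes the match case-insensitive, /s lets '.' span newlines.
function found_in_parens($pdfData, $needle)
{
    $pattern = '/\([^)]*' . preg_quote($needle, '/') . '[^)]*\)/is';
    return preg_match($pattern, $pdfData) === 1;
}
```

One caveat: as the example above shows, the PDF creator can split a phrase across multiple (...)Tj operators, so a multi-word phrase spanning two operators will not match; extracting and concatenating the parenthesized runs first would avoid that.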
