Want to protect your cyber security and still get fast solutions? Ask a secure question today.Go Premium


Convert pdf to free text form for fulltext search

Posted on 2011-09-10
Medium Priority
Last Modified: 2012-05-12
I have a bunch of pdf files that I have stored as large blobs in a database, which seems to be working well for managing them and displaying them.  I also want to be able to search the text in the documents using PHP and MySQL to look for keywords or phrases using either the LIKE mysql function or FULLTEXT search.  It seems that I would best off converting the PDF blob data into a simple text version with the extracted data in free text form.  Is there a PHP class or function that can do this, and can you suggest how I would then go about searching the free text data for a keyword or phrase.  

I already have the PDF's in a MySQL database table called pdf_files, with the large blob field simply called pdf.
Question by:sscotti
  • 4
  • 4
LVL 23

Expert Comment

ID: 36517513
Do the PDFs contain actual text data (as opposed to scanned pages of text, which would need to be OCR'ed first).  If they do contain text, there's a command line utility as part of Ghostscript called "pdf2text" that will easily extract out all the textual data from your PDFs.

If you're running windows, you might be able to easily install it using Cygwin.

Author Comment

ID: 36517548
It is actually probably a bit more complicated question.  I've been playing around and it seems there many different formats for PDF documents.  I have basically 2 types and a fair bit of software on my computer.

1.  Powerpoint presentations that I want to convert to PDF files.  I can do this in several ways:

a.  Use Powerpoint to save them as a PDF.
b.  Use the Mac OS print feature to save them as a PDF to disk.
b.  Use the Mac OS print feature to save them as an Adobe PDF (I have Adobe Acrobat Pro also).

2.  Various journal articles from medicine that I get from a number of resources.

It seems that the journal articles and the documents that I could create using the Adobe PDF creator have more information (i.e. the text) than the ones that I create the other ways, which seem to be bitmaps.

I do have a program called Wondershare PDF creator that can apparently also convert PDF's to text.

Just kind of wondering what the best way to go is.

For some of the journal articles I am able to search the PDF LargeBlob directly and get hits with my search string in MySQL, although it seems to be case sensitive there.  Seems like it might be best to create text version of all of the PDF's and store those in the database as well when performing searches since I guess I can do full text searches and get around the case sensitivity as well.  Is that true.

Can you give an example of how to create a query to search for keywords or phrases in the pdf or text doc?
LVL 111

Expert Comment

by:Ray Paseur
ID: 36517556
I think we need to see your test data.  Please post the information you have in your data base blob columns, thanks. ~Ray
Free learning courses: Active Directory Deep Dive

Get a firm grasp on your IT environment when you learn Active Directory best practices with Veeam! Watch all, or choose any amount, of this three-part webinar series to improve your skills. From the basics to virtualization and backup, we got you covered.


Author Comment

ID: 36519161
Well, here are some examples of the PDF created in various ways, as well as the original ppt file, and a converted text file.

They are at:

http://sscotti.com/CaseBreast1adobe.pdf , saved with save as adobe pdf
http://sscotti.com/CaseBreast1macos.pdf, saved as pdf with os x print
http://sscotti.com/CaseBreast1ppt.pdf, saved as pdf from save as in ppt
http://sscotti.com/CaseBreast1.ppt, original powerpoint file
http://sscotti.com/CaseBreast1.txt, original powerpoint file

Just examining them with a text/hex editor, using BBEdit, it looks like the one saved as an Adobe PDF file retains the searchable text data, although in binary format since I have it stored as a large blob.

So question.  If I store the raw pdf data as long text instead of a large blob, can I search the raw data without case insensitivity and still use the field to store the pdf.

I think I have the mysql queries worked out for using the LIKE match.

DATABASE with table pdf_files has one table with 6 columns:

pid  INT
section VARCHAR  (Breast, Cardiac, etc)
name VARCHAR (Title for document)
casenumber INT (case number for document)
pdfdata LONGLBLOB (contains binary data for pdf or other type of doc, i.e. word, ppt, etc., just using pdf for now)  can I change to LONGTEXT and do FULLTEXT search and still use for pdfdata, think the pdfdata has to be binary.
pdftext LONGTEXT, trying this to store converted plain text version

query is like:

$query = "SELECT * from pdf_files WHERE pdftext LIKE '%".postUrlStringValue('searchText',null)."%' ORDER BY section, casenumber, name ASC";


$query = "SELECT * from pdf_files WHERE pdfdata LIKE '%".postUrlStringValue('searchText',null)."%' ORDER BY section, casenumber, name ASC";

I don't know how to do a FULLTEXT search, but you can't do that on a LONGBLOB anyway I think.
LVL 111

Accepted Solution

Ray Paseur earned 2000 total points
ID: 36519386
I have not made it to testing these files with MySQL yet, but I can show you this much.  The script makes a case-insensitive search of the PDF and PPT files using a regular expression.  It finds three instances of "idiopathic" in each file.

These files are hundreds of thousands of bytes long, and that is not what you would usually want to put into a data base.  So I might organize the application in a way that would enable it to "learn" as it did searches of the files.  In this environment, the pid, section, name and case number would stay in the pdf_files table, but the PDF itself would reside in the server file system.  You would add a DATETIME column along with a URL column that pointed to the PDF or PPT in the server's file system.  You would add a new pdf_keywords table.  This table would contain a search_word, a DATETIME column and a pid column that pointed to the rows of the pdf_files table for each of the PDF files that contained the word.  Whenever a search request arrived, the search script would look for PDF files (if any) that were newer than the most recent row of the pdf_keywords table matching the keyword.  It would search these newer PDF files and update the pdf_keywords table to add a row for every PDF document that matched the search keyword.  Then it would build its client response from the freshly updated pdf_keywords table.

That is a little more complicated to write than a fulltext search, but it will perform many times better.  With the fulltext search, you will do a table scan for every search.  A table scan, to data base programmers, is a little like dog poop - out of place wherever you find it.  If your data base grows to any appreciable size you run the risk that the table scans will seriously degrade performance as you repeatedly scan the large blob or text fields.  In contrast the learning search design will never have to re-scan any of the PDF or PPT files for the same keyword.  Once the keyword is in the pdf_keywords table, only the new, unscanned files will be inspected.  A sentient being might even create lists of "unscannable" words like the, that, is, set, etc.  You could also prime the pump by scanning all the files for expected search words.  You might consider an algorithm that would keep track of words that appear too frequently (like in 90% of the documents) so you could tell the client that the word is unspecific.

Philosophically, you would probably want to keep the files as close as possible to the original text.  So simple PPT files (not PPTX) and PDF files that are compatible with older versions of Adobe Reader will probably serve you best.

HTH, ~Ray
<?php // RAY_temp_sscotti.php
echo "<pre>"; // MAKE IT EASY TO READ

// FROM EE: http://sscotti.com/CaseBreast1adobe.pdf , saved with save as adobe pdf
echo PHP_EOL . '********************************** PDF *****' . PHP_EOL;
$pdf = file_get_contents('RAY_temp_sscotti.pdf');

// var_dump($pdf);

preg_match_all('/IDIOPATHIC/i', $pdf, $mat);

echo PHP_EOL;

// FROM EE: http://sscotti.com/CaseBreast1.ppt, original powerpoint file
echo PHP_EOL . '********************************** PPT *****' . PHP_EOL;
$ppt = file_get_contents('RAY_temp_sscotti.ppt');

// var_dump($ppt);

preg_match_all('/IDIOPATHIC/i', $ppt, $mat);
echo PHP_EOL;

Open in new window


Author Comment

ID: 36520220
Hey, thanks for that.  I have gone back and forth with storing the data in a database vs. on the server.  I have another app for radiology that works in a similar fashion, although the BLOB data are mostly pictures there and generally are not any larger than 500K at most each, with a sprinkling of other documents.  I like having them in a database as it is a single file, a little more portable and in some ways easier to manage.  With the arrangement I have now I probably would be better of storing them on the server and putting other parameters in the database.

The actual app is here:  sscotti.com

Haven't implemented your script, but will, thanks.

LVL 111

Expert Comment

by:Ray Paseur
ID: 36520268
Thanks for the points.  This is a really great question, ~Ray

Author Comment

ID: 36520332
Here is another question.  In the PDF documents that I've created it looks like the pdf creator is putting text on the slides from the ppt presentation encloded between parentheses:  e.g.

"(52 yo male with palpable )Tj 2.042 -1.182 Td [(lump left breast)"

surrounded by formatting data.  The text that I actually want to search in the pdf is the text between the parentheses.  Is there a way to use a regex in the preg_match function to look for an occurrence of my search text in between parentheses?  If I can do that things may actually work OK.  I don't need preg_match_all.  I am just looking for at least one match, so preg_match works.  I tried it in my code, and the performance seems similar to searching the database with SQL using LIKE "%match%".


LVL 111

Expert Comment

by:Ray Paseur
ID: 36520381
Yes, you can search for the text between parentheses.  Your REGEX will need to escape the parens (the are metacharacters to REGEX).

Featured Post

Industry Leaders: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

There are times when I have encountered the need to decompress a response from a PHP request. This is how it's done, but you must have control of the request and you can set the Accept-Encoding header.
It’s a season to be thankful, and we’re thankful for users like you who engage on site, solve technology problems, and network with others in the industry. What tech are we most thankful for? Keep reading.
In this video, Percona Solution Engineer Dimitri Vanoverbeke discusses why you want to use at least three nodes in a database cluster. To discuss how Percona Consulting can help with your design and architecture needs for your database and infras…
In this video, Percona Solution Engineer Rick Golba discuss how (and why) you implement high availability in a database environment. To discuss how Percona Consulting can help with your design and architecture needs for your database and infrastr…
Suggested Courses

579 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question