Solved

Convert pdf to free text form for fulltext search

Posted on 2011-09-10
9
445 Views
Last Modified: 2012-05-12
I have a bunch of pdf files that I have stored as large blobs in a database, which seems to be working well for managing them and displaying them.  I also want to be able to search the text in the documents using PHP and MySQL to look for keywords or phrases using either the LIKE mysql function or FULLTEXT search.  It seems that I would best off converting the PDF blob data into a simple text version with the extracted data in free text form.  Is there a PHP class or function that can do this, and can you suggest how I would then go about searching the free text data for a keyword or phrase.  

I already have the PDF's in a MySQL database table called pdf_files, with the large blob field simply called pdf.
0
Comment
Question by:sscotti
  • 4
  • 4
9 Comments
 
LVL 23

Expert Comment

by:nemws1
ID: 36517513
Do the PDFs contain actual text data (as opposed to scanned pages of text, which would need to be OCR'ed first).  If they do contain text, there's a command line utility as part of Ghostscript called "pdf2text" that will easily extract out all the textual data from your PDFs.

If you're running windows, you might be able to easily install it using Cygwin.
0
 
LVL 5

Author Comment

by:sscotti
ID: 36517548
It is actually probably a bit more complicated question.  I've been playing around and it seems there many different formats for PDF documents.  I have basically 2 types and a fair bit of software on my computer.

1.  Powerpoint presentations that I want to convert to PDF files.  I can do this in several ways:

a.  Use Powerpoint to save them as a PDF.
b.  Use the Mac OS print feature to save them as a PDF to disk.
b.  Use the Mac OS print feature to save them as an Adobe PDF (I have Adobe Acrobat Pro also).

2.  Various journal articles from medicine that I get from a number of resources.


It seems that the journal articles and the documents that I could create using the Adobe PDF creator have more information (i.e. the text) than the ones that I create the other ways, which seem to be bitmaps.

I do have a program called Wondershare PDF creator that can apparently also convert PDF's to text.

Just kind of wondering what the best way to go is.

For some of the journal articles I am able to search the PDF LargeBlob directly and get hits with my search string in MySQL, although it seems to be case sensitive there.  Seems like it might be best to create text version of all of the PDF's and store those in the database as well when performing searches since I guess I can do full text searches and get around the case sensitivity as well.  Is that true.

Can you give an example of how to create a query to search for keywords or phrases in the pdf or text doc?
0
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 36517556
I think we need to see your test data.  Please post the information you have in your data base blob columns, thanks. ~Ray
0
 
LVL 5

Author Comment

by:sscotti
ID: 36519161
Well, here are some examples of the PDF created in various ways, as well as the original ppt file, and a converted text file.

They are at:

http://sscotti.com/CaseBreast1adobe.pdf , saved with save as adobe pdf
http://sscotti.com/CaseBreast1macos.pdf, saved as pdf with os x print
http://sscotti.com/CaseBreast1ppt.pdf, saved as pdf from save as in ppt
http://sscotti.com/CaseBreast1.ppt, original powerpoint file
http://sscotti.com/CaseBreast1.txt, original powerpoint file

Just examining them with a text/hex editor, using BBEdit, it looks like the one saved as an Adobe PDF file retains the searchable text data, although in binary format since I have it stored as a large blob.

So question.  If I store the raw pdf data as long text instead of a large blob, can I search the raw data without case insensitivity and still use the field to store the pdf.

I think I have the mysql queries worked out for using the LIKE match.

DATABASE with table pdf_files has one table with 6 columns:

pid  INT
section VARCHAR  (Breast, Cardiac, etc)
name VARCHAR (Title for document)
casenumber INT (case number for document)
pdfdata LONGLBLOB (contains binary data for pdf or other type of doc, i.e. word, ppt, etc., just using pdf for now)  can I change to LONGTEXT and do FULLTEXT search and still use for pdfdata, think the pdfdata has to be binary.
pdftext LONGTEXT, trying this to store converted plain text version

query is like:

$query = "SELECT * from pdf_files WHERE pdftext LIKE '%".postUrlStringValue('searchText',null)."%' ORDER BY section, casenumber, name ASC";

or

$query = "SELECT * from pdf_files WHERE pdfdata LIKE '%".postUrlStringValue('searchText',null)."%' ORDER BY section, casenumber, name ASC";

I don't know how to do a FULLTEXT search, but you can't do that on a LONGBLOB anyway I think.
0
What Should I Do With This Threat Intelligence?

Are you wondering if you actually need threat intelligence? The answer is yes. We explain the basics for creating useful threat intelligence.

 
LVL 108

Accepted Solution

by:
Ray Paseur earned 500 total points
ID: 36519386
I have not made it to testing these files with MySQL yet, but I can show you this much.  The script makes a case-insensitive search of the PDF and PPT files using a regular expression.  It finds three instances of "idiopathic" in each file.
http://www.laprbass.com/RAY_temp_sscotti.php

These files are hundreds of thousands of bytes long, and that is not what you would usually want to put into a data base.  So I might organize the application in a way that would enable it to "learn" as it did searches of the files.  In this environment, the pid, section, name and case number would stay in the pdf_files table, but the PDF itself would reside in the server file system.  You would add a DATETIME column along with a URL column that pointed to the PDF or PPT in the server's file system.  You would add a new pdf_keywords table.  This table would contain a search_word, a DATETIME column and a pid column that pointed to the rows of the pdf_files table for each of the PDF files that contained the word.  Whenever a search request arrived, the search script would look for PDF files (if any) that were newer than the most recent row of the pdf_keywords table matching the keyword.  It would search these newer PDF files and update the pdf_keywords table to add a row for every PDF document that matched the search keyword.  Then it would build its client response from the freshly updated pdf_keywords table.

That is a little more complicated to write than a fulltext search, but it will perform many times better.  With the fulltext search, you will do a table scan for every search.  A table scan, to data base programmers, is a little like dog poop - out of place wherever you find it.  If your data base grows to any appreciable size you run the risk that the table scans will seriously degrade performance as you repeatedly scan the large blob or text fields.  In contrast the learning search design will never have to re-scan any of the PDF or PPT files for the same keyword.  Once the keyword is in the pdf_keywords table, only the new, unscanned files will be inspected.  A sentient being might even create lists of "unscannable" words like the, that, is, set, etc.  You could also prime the pump by scanning all the files for expected search words.  You might consider an algorithm that would keep track of words that appear too frequently (like in 90% of the documents) so you could tell the client that the word is unspecific.

Philosophically, you would probably want to keep the files as close as possible to the original text.  So simple PPT files (not PPTX) and PDF files that are compatible with older versions of Adobe Reader will probably serve you best.

HTH, ~Ray
<?php // RAY_temp_sscotti.php
error_reporting(E_ALL);
echo "<pre>"; // MAKE IT EASY TO READ


// FROM EE: http://sscotti.com/CaseBreast1adobe.pdf , saved with save as adobe pdf
echo PHP_EOL . '********************************** PDF *****' . PHP_EOL;
$pdf = file_get_contents('RAY_temp_sscotti.pdf');

// ACTIVATE THIS TO SEE THE ENTIRE PDF
// var_dump($pdf);

// MAKE A CASE-INSENSITIVE SEARCH
preg_match_all('/IDIOPATHIC/i', $pdf, $mat);
var_dump($mat);


echo PHP_EOL;


// FROM EE: http://sscotti.com/CaseBreast1.ppt, original powerpoint file
echo PHP_EOL . '********************************** PPT *****' . PHP_EOL;
$ppt = file_get_contents('RAY_temp_sscotti.ppt');

// ACTIVATE THIS TO SEE THE ENTIRE PPT
// var_dump($ppt);

// MAKE A CASE-INSENSITIVE SEARCH
preg_match_all('/IDIOPATHIC/i', $ppt, $mat);
var_dump($mat);
echo PHP_EOL;

Open in new window

0
 
LVL 5

Author Comment

by:sscotti
ID: 36520220
Hey, thanks for that.  I have gone back and forth with storing the data in a database vs. on the server.  I have another app for radiology that works in a similar fashion, although the BLOB data are mostly pictures there and generally are not any larger than 500K at most each, with a sprinkling of other documents.  I like having them in a database as it is a single file, a little more portable and in some ways easier to manage.  With the arrangement I have now I probably would be better of storing them on the server and putting other parameters in the database.

The actual app is here:  sscotti.com

Haven't implemented your script, but will, thanks.

Stephen
0
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 36520268
Thanks for the points.  This is a really great question, ~Ray
0
 
LVL 5

Author Comment

by:sscotti
ID: 36520332
Here is another question.  In the PDF documents that I've created it looks like the pdf creator is putting text on the slides from the ppt presentation encloded between parentheses:  e.g.

"(52 yo male with palpable )Tj 2.042 -1.182 Td [(lump left breast)"

surrounded by formatting data.  The text that I actually want to search in the pdf is the text between the parentheses.  Is there a way to use a regex in the preg_match function to look for an occurrence of my search text in between parentheses?  If I can do that things may actually work OK.  I don't need preg_match_all.  I am just looking for at least one match, so preg_match works.  I tried it in my code, and the performance seems similar to searching the database with SQL using LIKE "%match%".

Thanks.

0
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 36520381
Yes, you can search for the text between parentheses.  Your REGEX will need to escape the parens (the are metacharacters to REGEX).
0

Featured Post

Find Ransomware Secrets With All-Source Analysis

Ransomware has become a major concern for organizations; its prevalence has grown due to past successes achieved by threat actors. While each ransomware variant is different, we’ve seen some common tactics and trends used among the authors of the malware.

Join & Write a Comment

Suggested Solutions

Title # Comments Views Activity
How to Insert a File Using Text Editor 9 53
MySQL database data submission 7 39
Not needed 13 55
Setting Up Local Lan Web Server 1 15
Foreword In the years since this article was written, numerous hacking attacks have targeted password-protected web sites.  The storage of client passwords has become a subject of much discussion, some of it useful and some of it misguided.  Of cou…
As a database administrator, you may need to audit your table(s) to determine whether the data types are optimal for your real-world data needs.  This Article is intended to be a resource for such a task. Preface The other day, I was involved …
Explain concepts important to validation of email addresses with regular expressions. Applies to most languages/tools that uses regular expressions. Consider email address RFCs: Look at HTML5 form input element (with type=email) regex pattern: T…
The viewer will learn how to count occurrences of each item in an array.

705 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

18 Experts available now in Live!

Get 1:1 Help Now