Marthaj

Extracting Data from PDF Using Java or PHP

I need some advice/guidance/suggestions. I am using localhost on Win 10. Both IIS and WampServer (Apache 2.4.23) are running (each has its own listening port), and I have both PHP 5.6.25 and 7.0.10 installed. I sometimes work with classic ASP. With all that being said, I need to extract data from PDF files and insert the data into a db.
I thought it would be easier for me to convert them to XML documents. This needs to be achieved programmatically and not at the CLI level.
I have pretty much decided that Java would be good to use (although I am far more comfortable with PHP) and have already posted two questions concerning the project.
What I think I need is an IDE for my Java. I normally use Notepad++ (and sometimes CodeIgniter). Would NetBeans IDE be good to use?
I am not very well versed in Java; I am better with PHP.
If anyone has any suggestions and/or pointers for me about any of this - I am all ears and listening.
Thank you.
Scott Fell

Don't be afraid of the CLI tools. You can call CLI tools from PHP using exec: https://www.php.net/manual/en/function.exec.php. With that, there are some open-source tools you can use.

Xpdf reader is one: http://www.xpdfreader.com/support.html. You can then call its command-line tools through exec.
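
As a rough sketch of the exec route, the snippet below shells out to Xpdf's pdftotext tool and reads back the text file it writes. Both function names, the binary path, and the file names are placeholders made up for illustration; pdftotext has to be installed already, and its location will differ from machine to machine.

```php
<?php
// Hypothetical helper: builds the pdftotext command line.
// escapeshellarg() quotes each argument so paths with spaces stay intact.
function buildPdfToTextCommand($binary, $pdfPath, $txtPath)
{
    // -layout asks pdftotext to keep the original physical layout
    return escapeshellarg($binary) . ' -layout '
        . escapeshellarg($pdfPath) . ' ' . escapeshellarg($txtPath);
}

// Hypothetical helper: runs pdftotext and returns the extracted text.
function extractPdfText($binary, $pdfPath)
{
    // pdftotext writes its result to a file, so use a temporary one
    $txtPath = tempnam(sys_get_temp_dir(), 'pdf');
    if ($txtPath === false) {
        throw new RuntimeException('Could not create a temporary file');
    }

    $output = array();
    $exitCode = 0;
    exec(buildPdfToTextCommand($binary, $pdfPath, $txtPath), $output, $exitCode);

    // pdftotext exits with 0 on success
    if ($exitCode !== 0) {
        @unlink($txtPath);
        throw new RuntimeException('pdftotext failed with exit code ' . $exitCode);
    }

    $text = file_get_contents($txtPath);
    unlink($txtPath);
    return $text;
}
```

A call would then look something like extractPdfText('C:\\xpdf\\pdftotext.exe', 'PDF-book.pdf'), where the first argument is wherever pdftotext lives on your system (the path shown is only an example).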

To make it easier, there is a PHP wrapper for it: https://php-xpdf.readthedocs.io/en/latest/
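// Load Composer's autoloader so the classes below can be found
require __DIR__ . '/vendor/autoload.php';

use Monolog\Logger;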
use Monolog\Handler\NullHandler;
use XPDF\PdfToText;

// Create a logger
$logger = new Logger('MyLogger');
$logger->pushHandler(new NullHandler());

// You have to pass a Monolog logger
// This logger provides some useful information about what's happening
$pdfToText = PdfToText::load($logger);

// open PDF
$pdfToText->open('PDF-book.pdf');

// PDF text is now in the $text variable
$text = $pdfToText->getText();
$pdfToText->close();

There is also https://github.com/mgufrone/pdf-to-html, whose required downloads can be handled with Composer.
// create the converter (class name as given in the project's README)
$pdf = new Gufy\PdfToHtml\Pdf('file.pdf');

// convert to an HTML string
$html = $pdf->html();

// convert a specific page to an HTML string
$page = $pdf->html(3);

// convert to HTML and return it as a DOM object (https://github.com/paquettg/php-html-parser)
$dom = $pdf->getDom();

// check how many pages your PDF has
$total_pages = $pdf->getPages();

// if your PDF has more than one page, this changes the current page to page 3
$dom->goToPage(3);

// then you can do as you please with that DOM and find any element you want
$paragraphs = $dom->find('body > p');

There are also commercial products, but they can end up costing more than they are worth. On some projects I ended up using http://www.pdfsharp.net/ and pdftk https://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/


I would try the php-xpdf library first though. When you use it, start with a PDF file that contains just a simple sentence, then keep adding to it. This will be helpful if your PDF is complex to extract. Errors can have several possible causes, and starting off simple helps eliminate possibilities.
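If you do go the command-line route, you can easily use classic ASP too.

' In classic ASP, create the shell object through Server.CreateObject
Set WshShell = Server.CreateObject("WScript.Shell")
' 0 hides the window; True waits for the program to finish
WshShell.Run "program.exe " & "command line code", 0, True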
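
Once the text is out of the PDF, the remaining step from the original question is loading it into a database. The sketch below assumes, purely for illustration, that the extracted text holds one "Label: value" pair per line. The function names, the pdf_fields table, and its columns are all hypothetical, and real PDFs will usually need their own parsing rules.

```php
<?php
// Hypothetical parser: turns "Label: value" lines into an associative array.
// Lines without a colon (or with an empty label) are skipped.
function parseFields($text)
{
    $fields = array();
    // \R matches any line ending (\n, \r\n, or \r)
    foreach (preg_split('/\R/', $text) as $line) {
        // Split on the first colon only, so values may contain colons
        $parts = explode(':', $line, 2);
        if (count($parts) === 2 && trim($parts[0]) !== '') {
            $fields[trim($parts[0])] = trim($parts[1]);
        }
    }
    return $fields;
}

// Hypothetical loader: writes each field as one row with a prepared statement.
// Assumes a table pdf_fields (source_file, field_label, field_value) exists.
function saveFields(PDO $db, $sourceFile, array $fields)
{
    $stmt = $db->prepare(
        'INSERT INTO pdf_fields (source_file, field_label, field_value) VALUES (?, ?, ?)'
    );
    foreach ($fields as $label => $value) {
        $stmt->execute(array($sourceFile, $label, $value));
    }
}
```

With a PDO connection in $db and the extracted text in $text, the two steps would be chained as saveFields($db, 'PDF-book.pdf', parseFields($text)). Keeping the parsing separate from the insert makes it easy to test the parser against a one-sentence PDF before any database is involved.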

Marthaj

ASKER

Thank you Scott for responding. You have given much to digest and have helped to clear some of the clouds away. I will close this question in another day or two....just in case !! Again, I appreciate the sharing of knowledge.
Marthaj

ASKER

I have been working my way through this process and have encountered a snag.
I am receiving the error: "Class 'XPDF\PdfToText' not found in C:\wamp\www\PDFConvert\monolog.php on line 16".
That makes sense to me, since I do not have a "php-xpdf" directory anywhere that I can find!

I followed the instructions at this URL:
https://php-xpdf.readthedocs.io/en/latest/

I have confirmed that Composer is installed in my project directory. I have the composer.json, and the vendor folder was created. The only thing I could find to download was at this URL:
http://www.xpdfreader.com/support.html

This is what my composer.json contains: (I added the "php-xpdf/php-xpdf": "master" line)
{
    "require": {
        "monolog/monolog": "^1.25",
        "php-console/php-console": "^3.1"
	"php-xpdf/php-xpdf": "master"
    }
}

I am thinking that I need to download and install the PHP-XPDF wrapper from GitHub at this URL:
https://github.com/alchemy-fr/PHP-XPDF

Am I right that I need to download the PHP-XPDF wrapper from GitHub?
Thank you for any help....
ASKER CERTIFIED SOLUTION
Scott Fell
Marthaj

ASKER

I apologize for being so late to close this question out - got sidetracked with the coding !!
And thank you again for responding and sharing your knowledge. I do appreciate it.
Marthaj

ASKER

Thank you again !!