Extracting Data from PDF Using Java or PHP

dogsareit asked:
I need some advice/guidance/suggestions. I am working on localhost under Windows 10. Both IIS and WampServer (Apache 2.4.23) are running (each has its own listening port), and both PHP 5.6.25 and 7.0.10 are installed. I sometimes work with classic ASP. With all that said, I need to extract data from PDF files and insert the data into a database.
I thought it would be easier to convert them to XML documents first. This needs to be done programmatically, not at the CLI level.
I have pretty much decided that Java would be good to use (although I am far more comfortable with PHP) and have already posted two questions concerning the project.
What I think I need is an IDE for my Java work. I normally use Notepad++ (and sometimes CodeIgniter). Would NetBeans IDE be a good choice?
I am not very well versed in Java; I am better with PHP.
If anyone has any suggestions and/or pointers for me about any of this, I am all ears.
Thank you.

Scott Fell, Developer & EE Moderator
Fellow 2018
Most Valuable Expert 2013

Commented:
Don't be afraid of the CLI tools. You can call a command-line program from PHP using exec (https://www.php.net/manual/en/function.exec.php). With that, there are some open source tools you can use.

Xpdf is one such tool: http://www.xpdfreader.com/support.html. Its command-line utilities, such as pdftotext, can be run from PHP this way.
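
As a rough illustration of the exec approach, here is a minimal sketch. It assumes the Xpdf `pdftotext` binary is installed and on the PATH, and that PHP 7.0 or later is in use (for the type declarations). The function names `buildPdfToTextCommand` and `extractPdfText` are made up for this example and are not part of any library.

```php
<?php
// Hypothetical helper: builds the pdftotext command line for one PDF file.
function buildPdfToTextCommand(string $pdfPath): string
{
    // escapeshellarg() quotes the path so spaces and special characters are safe.
    // "-layout" keeps the physical layout; the trailing "-" sends the text to stdout.
    return 'pdftotext -layout ' . escapeshellarg($pdfPath) . ' -';
}

// Hypothetical helper: runs pdftotext and returns the extracted text.
function extractPdfText(string $pdfPath): string
{
    $lines = [];
    $exitCode = 0;
    exec(buildPdfToTextCommand($pdfPath), $lines, $exitCode);
    if ($exitCode !== 0) {
        throw new RuntimeException("pdftotext failed with exit code $exitCode");
    }
    // exec() returns the output one line per array element.
    return implode("\n", $lines);
}
```

Calling `extractPdfText('PDF-book.pdf')` would then return the text of that file, or throw if the tool is missing or the file cannot be read.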

To make it easier, there is a PHP wrapper for it: https://php-xpdf.readthedocs.io/en/latest/

// Load Composer's autoloader so the Monolog and XPDF classes can be found
// (assumes this script sits next to the vendor folder)
require __DIR__ . '/vendor/autoload.php';

use Monolog\Logger;
use Monolog\Handler\NullHandler;
use XPDF\PdfToText;

// Create a logger
$logger = new Logger('MyLogger');
$logger->pushHandler(new NullHandler());

// You have to pass a Monolog logger
// This logger provides some useful information about what's happening
$pdfToText = PdfToText::load($logger);

// open PDF
$pdfToText->open('PDF-book.pdf');

// PDF text is now in the $text variable
$text = $pdfToText->getText();
$pdfToText->close();
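
Once the text is in a variable, it still has to be broken into fields before it can go into a database. The sketch below is only one possible approach: it assumes the PDF text contains "Label: value" lines, which may not match your documents, and `parseLabelledLines` is a made-up name. It needs PHP 7.0 or later.

```php
<?php
// Hypothetical helper: turns "Label: value" lines into an associative array.
function parseLabelledLines(string $text): array
{
    $fields = [];
    foreach (preg_split('/\R/', $text) as $line) {
        // Split on the first colon only, so values may themselves contain colons.
        $parts = explode(':', $line, 2);
        if (count($parts) === 2 && trim($parts[0]) !== '') {
            $fields[trim($parts[0])] = trim($parts[1]);
        }
    }
    return $fields;
}

// Sample text standing in for the extracted PDF text; the labels are invented.
$sample = "Invoice: 1001\nCustomer: Jane Doe\nfree text without a label\nTime: 10:30";
print_r(parseLabelledLines($sample));
```

Each resulting array could then be bound to a prepared statement (for example with PDO) to insert one row per PDF.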



Another option is https://github.com/mgufrone/pdf-to-html, which has required downloads that can be handled with Composer.

// load Composer's autoloader and create the converter, as in the project README
// ('file.pdf' is a placeholder for your own file)
include 'vendor/autoload.php';
$pdf = new Gufy\PdfToHtml\Pdf('file.pdf');

// convert to an HTML string
$html = $pdf->html();

// convert a specific page to an HTML string
$page = $pdf->html(3);

// convert to HTML and return it as a DOM object (https://github.com/paquettg/php-html-parser)
$dom = $pdf->getDom();

// check whether your PDF has more than one page
$total_pages = $pdf->getPages();

// if your PDF has more than one page, this changes the current page to page 3
$dom->goToPage(3);

// then you can do as you please with that DOM, e.g. find any element you want
$paragraphs = $dom->find('body > p');



There are also commercial products, but they can end up costing more than they are worth. On some projects I ended up using PDFsharp (http://www.pdfsharp.net/) and pdftk (https://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/).


I would try the php-xpdf library first, though. When you use it, start with a PDF file that contains just one simple sentence, then keep adding content. This helps if your PDF is complex to extract: errors can have several possible causes, and starting simple helps eliminate possibilities.

Scott Fell, Developer & EE Moderator

Commented:
If you do go the command-line route, you can easily use classic ASP too.

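' In classic ASP, create the shell object through Server.CreateObject
Set WshShell = Server.CreateObject("WScript.Shell")
' Replace program.exe and its arguments with the tool you want to run
WshShell.Run "program.exe " & "command line code"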


Author

Commented:
Thank you Scott for responding. You have given much to digest and have helped to clear some of the clouds away. I will close this question in another day or two....just in case !! Again, I appreciate the sharing of knowledge.

Author

Commented:
I have been working my way through this process and have hit a snag.
I am receiving the error: "Class 'XPDF\PdfToText' not found in C:\wamp\www\PDFConvert\monolog.php on line 16".
And that makes sense to me, since I do not have a "php-xpdf" directory anywhere that I can find!

I followed the instructions at this URL:
https://php-xpdf.readthedocs.io/en/latest/



I have confirmed that composer is installed in my project directory. I have the composer.json and the vendor folder was created.  And the only thing I could find to download at url:
http://www.xpdfreader.com/support.html




This is what my composer.json contains: (I added the "php-xpdf/php-xpdf": "master" line)
{
    "require": {
        "monolog/monolog": "^1.25",
        "php-console/php-console": "^3.1"
	"php-xpdf/php-xpdf": "master"
    }
}



I am thinking that I need to download and install the PHP-XPDF wrapper from GitHub at this URL:
https://github.com/alchemy-fr/PHP-XPDF



Am I right that I need to download the PHP-XPDF wrapper from GitHub?
Thank you for any help....

Scott Fell, Developer & EE Moderator
Commented:
Yes, it looks like you may need to install Xpdf. The GitHub page shows that the composer.json entry should be "php-xpdf/php-xpdf": "~0.2".

https://github.com/alchemy-fr/PHP-XPDF

It also reads, "In order to use PHP-XPDF, you need to install XPDF. Depending of your configuration, please follow the instructions at on the XPDF website." http://www.xpdfreader.com/download.html

Make sure that is installed too. The php-xpdf library just makes things easier for you than using the command line.
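
A "Class not found" error often means that either Composer's autoloader was never included or the package was never installed. Note also that the composer.json as posted is missing a comma after the php-console line, which Composer would reject. The sketch below is one way to narrow this down. It assumes PHP 7.0 or later and a standard Composer layout with a vendor folder in the project directory; `checkXpdfWrapper` is a made-up helper name.

```php
<?php
// Hypothetical helper: reports why XPDF\PdfToText might not be loadable.
function checkXpdfWrapper(string $projectDir): string
{
    $autoload = $projectDir . '/vendor/autoload.php';
    if (!is_file($autoload)) {
        return 'autoloader missing: run "composer install" in ' . $projectDir;
    }
    // Including the autoloader lets PHP find classes installed by Composer.
    require_once $autoload;
    if (!class_exists('XPDF\PdfToText')) {
        return 'class missing: php-xpdf/php-xpdf was not installed by Composer';
    }
    return 'ok';
}

// Check the directory this script lives in.
echo checkXpdfWrapper(__DIR__), PHP_EOL;
```

If this reports the class as missing, fixing the composer.json entry and running `composer update` in the project directory would be the next thing to try.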

Author

Commented:
I apologize for being so late to close this question out. I got sidetracked with the coding!
And thank you again for responding and sharing your knowledge. I do appreciate it.

Author

Commented:
Thank you again !!