PDF Parsers and PHP

dogsareit asked
I am using PDFParser (https://www.pdfparser.org/documentation). I understand that it builds upon the TCPDF library, and both PDFParser and TCPDF provide examples. Does anyone know where I can find a more complete manual? I am really looking for a PDF-to-XML parser for PHP 5.x.
Thank you.

Scott Fell, Developer & EE Moderator
Fellow 2018
Most Valuable Expert 2013

Commented:
What happened to the other libraries?  

With this one, the examples are at https://www.pdfparser.org/documentation. For instance:
// Include Composer autoloader if not already done.
include 'vendor/autoload.php';
 
// Parse pdf file and build necessary objects.
$parser = new \Smalot\PdfParser\Parser();
$pdf    = $parser->parseFile('document.pdf');
 
$text = $pdf->getText();
echo $text;


That will extract the text from a PDF, and there are other examples on the page for extracting specific pages or metadata.

Using the example they show, it looks like $text will contain just text. If you go to the demo page, https://www.pdfparser.org/demo, it shows exactly how it works. That is all this library does.
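
For pulling the text one page at a time, a minimal sketch might look like the following. It assumes the object returned by parseFile() offers getPages(), and that each page offers getText(), as the library's documentation page describes. The helper name pages_to_text is made up for illustration.

```php
<?php
// Hypothetical helper: collect the text of each page separately.
// $pdf is assumed to be the object returned by $parser->parseFile().
function pages_to_text($pdf) {
    $texts = array();
    foreach ($pdf->getPages() as $number => $page) {
        // Each page is assumed to expose getText(), like the document itself.
        $texts[$number] = $page->getText();
    }
    return $texts;
}
```

With the parser from the example above, this would be called as pages_to_text($parser->parseFile('document.pdf')), giving one array entry per page.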

Now that you have the variable $text, you can do whatever you want with the text. If the PDF is in a specific format, let's say each one contains only 3 paragraphs, you can use PHP's explode (https://www.php.net/manual/en/function.explode.php) on the line break. That gives you an array with 3 items. You can then turn that array into XML. If it is just one node, you can loop through and create lines with "<something>$data</something>".PHP_EOL. If it is more complex, look at this option: https://www.codexworld.com/convert-array-to-xml-in-php/
//function definition to convert array to xml
function array_to_xml($array, &$xml_user_info) {
    foreach($array as $key => $value) {
        //XML element names cannot start with a digit, so prefix numeric keys
        $name = is_numeric($key) ? "item$key" : "$key";
        if(is_array($value)) {
            $subnode = $xml_user_info->addChild($name);
            array_to_xml($value, $subnode);
        }else {
            //addChild does not escape ampersands, so escape the value first
            $xml_user_info->addChild($name,htmlspecialchars("$value"));
        }
    }
}

//creating object of SimpleXMLElement
$xml_user_info = new SimpleXMLElement("<?xml version=\"1.0\"?><user_info></user_info>");

//function call to convert array to xml
//($users_array is assumed to be defined earlier and to hold the data to convert)
array_to_xml($users_array,$xml_user_info);

//saving generated xml file
$xml_file = $xml_user_info->asXML('users.xml');

//success and error message based on xml creation
if($xml_file){
    echo 'XML file has been generated successfully.';
}else{
    echo 'XML file generation error.';
}
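
For the simpler single-node case described above (explode on the line break, then wrap each piece in a tag), a rough sketch could look like this. The sample text and the element names document and paragraph are made up for illustration; in practice $text would come from $pdf->getText().

```php
<?php
// Hypothetical sample standing in for the $text returned by $pdf->getText().
$text = "First paragraph.\nSecond paragraph & more.\n\nThird paragraph.";

// Split on line breaks, trim each piece, and drop empty lines.
$paragraphs = array_values(array_filter(
    array_map('trim', explode("\n", $text)),
    'strlen'
));

// Build a small XML document; <document> and <paragraph> are made-up names.
$xml = new SimpleXMLElement('<?xml version="1.0"?><document></document>');
foreach ($paragraphs as $paragraph) {
    // addChild() does not escape ampersands, so escape the value first.
    $xml->addChild('paragraph', htmlspecialchars($paragraph));
}

echo $xml->asXML();
```

This only works as well as the line breaks in the extracted text match the real paragraph boundaries, which depends on the PDF.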


Author

Commented:
Thank you, Scott, for responding. Once again you have helped point me in the right direction. I am going to review and experiment with the documentation and see what happens! I will update you in a day or two.
Scott Fell, Developer & EE Moderator

Commented:
You are welcome.  

Would you like somebody else to review this and help too?

Author

Commented:
Thank you, Scott, for responding; I appreciate it. Very helpful. Yes, PDFParser is very nice and I have used it before. I was just exploring what else might be around and perhaps be a better solution. I am always of the mind that it's good to know what else might be out there, and I thought someone might know of an alternative. Sharing is good. Look at what happened to users of iText.

Using explode is pretty handy, and it is something I have used many times in the past, breaking on spaces or inserting an asterisk as a break point. But in this case, it is not going to work well. I have two elements that are both alpha and butt up against each other, separated by a space, maybe two or more spaces at most (e.g. John Doe   Maine State). I have been able to extract the data with pinpoint accuracy by chopping up the file, i.e. getting rid of header/footer information (and other blah blah stuff), which leaves just the rows of data.

Each row of data has several columns containing numerical data, and two columns are strictly alpha. These two columns, name and location, are giving me fits. The reason is that the name may or may not contain spaces, and the same goes for the location. What I have done so far is isolate the numerical data, which is several columns occurring after the name/location. They were very easy to pick out of the data line. I then eliminate all numerals, commas, periods, and currency symbols from the line, which leaves me just the name/location. Most of the time the name does not contain a space (usually it is just a last name), but in reviewing the incoming data, that is not always the case. If the name has a space, it has always been at the beginning, e.g. John Doe or Mary E Smith. So I decided to reverse this remaining string (trimmed, of course) so the last name is first and the location second. I have made the assumption, based on samples of incoming data, that if there are two consecutive spaces, the name ends and the location begins. And naturally, before flopping it into the db, I reverse the name/location back so it is as it should be. I am now puttering around with the coding to count spaces; we will see if it works well!

I would much rather create an XML document, and once this coding is done I am going to try my hand at it. I have been reading about converting it, and I just think creating an XML document would be a cleaner approach. One odd thing: just to see how 'XML friendly' the document really was, I processed it through 3 different online converters. One did very well and the other two had problems: the second reported "Not a valid pdf format" and the third mangled some of the data.

Just curious. I am going to close out this question, but I am sure I will be posting again concerning one thing or another!
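
As an illustration of the numeric step described above, here is a minimal sketch. It assumes a made-up row layout (alpha columns first, numeric columns after) and a dollar sign as the currency symbol; the sample values are hypothetical, not taken from the real data.

```php
<?php
// Hypothetical row: the alpha columns come first, the numeric columns after.
$row = 'Mary E Smith   Maine State   1,234.50   $78.00   12';

// Collect the numeric columns: digits with optional "$", "," and "." characters.
preg_match_all('/\$?\d[\d,]*(?:\.\d+)?/', $row, $matches);
$numbers = $matches[0];

// Strip digits, commas, periods and the currency symbol, then trim what is left.
$alpha = trim(preg_replace('/[\d,.$]+/', '', $row));

echo $alpha . PHP_EOL;
print_r($numbers);
```

One caveat with this approach: stripping every period would also remove one from an initial such as "Mary E. Smith", so it only suits data where the alpha columns contain no punctuation.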
Scott Fell, Developer & EE Moderator

Commented:
It sounds like there are multiple things going on. Just take it one at a time.

The first thing it sounds like you need to do is solve the spacing. Regardless of whether this is XML or not, that will still be a problem. That may require something as simple as trim($data) or something a little more complex. When you need help with something like that, just stop there and ask a new question. In your new question, just present what you need for that portion.

$str = <<<EOD
Example of string
             spanning multiple lines
using   heredoc syntax.

Another paragraph

John Doe   Maine State   
EOD;



Your question may then be, "How can I get 'John Doe' and 'Maine State' into separate new fields called field1 and field2?"
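
One possible answer to a question like that, sketched under the assumption that the two fields are always separated by a run of two or more whitespace characters (an assumption about the data, not a general rule):

```php
<?php
// The last line of the heredoc above, including its trailing spaces.
$line = 'John Doe   Maine State   ';

// Trim the ends, then split once on the first run of two or more whitespace characters.
// array_pad() keeps $field2 defined when no such run is found.
list($field1, $field2) = array_pad(preg_split('/\s{2,}/', trim($line), 2), 2, '');

echo $field1 . '|' . $field2 . PHP_EOL;
```

If the name and location are ever separated by a single space, this heuristic cannot tell them apart, and $field2 comes back empty.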

Get that solved, then move on to the next.

Author

Commented:
Yes, there was a multitude of things happening. Except for name/location, they have been resolved, and I believe I have used good coding practices. If I cannot get the output to be consistently correct, I sure will ask. Thank you again; I appreciate you sharing your knowledge and advice. :)
I decided to abandon the XML route and strictly did a data extraction by text. It has turned out pretty well. One item I consistently discovered was that the white spaces were not really true white spaces, but some strange character from another world. I found it was always at the beginning and end of a data element. Once I uncovered that fact, it was easy to extract and split where needed. Also, reading the PDF page by page worked very well too.
I thank you, Scott, for responding and for your willingness to help. And to everyone who responded to my questions, not just this one, I thank you all. :)
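
For anyone hitting the same thing, here is a minimal sketch of spotting and normalizing such characters. The thread does not say which character it actually was; the non-breaking space (U+00A0) used here is only an assumed, common candidate, and the sample value is hypothetical.

```php
<?php
// Hypothetical data element wrapped in non-breaking spaces (U+00A0, bytes C2 A0).
$raw = "\xC2\xA0John Doe\xC2\xA0";

// Dump the bytes as hex so the odd characters become visible.
echo bin2hex($raw) . PHP_EOL;

// Replace every Unicode space separator (\p{Zs}) with a plain space, then trim.
// With the /u modifier, preg_replace() returns null if $raw is not valid UTF-8.
$clean = trim(preg_replace('/\p{Zs}/u', ' ', $raw));

echo $clean . PHP_EOL;
```

A plain trim() alone leaves these characters in place, because by default it strips only ASCII whitespace.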
Scott Fell, Developer & EE Moderator

Commented:

That sounds like a good plan to just grab the text. 


It is funny how you discover things you didn't know existed when scraping data, such as how a space is made.