PDF parser


I need to extract content from a PDF file in which each block of text is formatted with different font styles. I will be storing this content in a database. For example:

<p style="text-align: justify; margin-bottom: 7px; line-height: 13px;">
<span style="font-family: 'sans-serif'; font-size: 12px; font-weight: bold; color: rgb(0, 0, 0);">name here</span><span style="font-size: 13px; color: #000;"> some text</span><span style="font-family: 'sans-serif'; font-size: 15px; font-weight: normal; color:#000;"> some more text here </span></p>

In the above example, you can see that different formatting is used, i.e. bold text, increased font size, and so on. Also, not all the items follow the same styling. Some might not have the bold text, and some might have the 15px content repeated twice, and so on.

Is there any PDF extraction technique or tool that I can use to select some text, extract everything that has the same format, and save the output?


Is there a good PHP parser that can parse based on the text formatting? Please let me know.
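If the PDF can first be converted to HTML that looks like the sample above, one possible approach is to group the text of each span by its inline style. The following is only a rough sketch in Python (not PHP) using the standard library; the names `parse_style`, `SpanStyleGrouper`, and `group_by_style` are made up for illustration, and it assumes flat, non-nested spans and that a span without `font-weight` counts as normal.

```python
from collections import defaultdict
from html.parser import HTMLParser


def parse_style(style):
    """Turn an inline style string into a dict of lowercase property -> value."""
    props = {}
    for declaration in style.split(";"):
        if ":" in declaration:
            name, value = declaration.split(":", 1)
            props[name.strip().lower()] = value.strip()
    return props


class SpanStyleGrouper(HTMLParser):
    """Collects the text of each <span>, keyed by (font-size, font-weight)."""

    def __init__(self):
        super().__init__()
        self.groups = defaultdict(list)
        self._key = None  # style key of the span currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            props = parse_style(dict(attrs).get("style") or "")
            # Assumption: a span without font-weight is treated as "normal".
            self._key = (props.get("font-size"), props.get("font-weight", "normal"))

    def handle_endtag(self, tag):
        if tag == "span":
            self._key = None

    def handle_data(self, data):
        # Only keep non-blank text that sits inside a span.
        if self._key is not None and data.strip():
            self.groups[self._key].append(data.strip())


def group_by_style(html):
    """Return {(font-size, font-weight): [text, ...]} for all spans in html."""
    parser = SpanStyleGrouper()
    parser.feed(html)
    parser.close()
    return dict(parser.groups)
```

Each key in the returned dictionary could then map to a column or a row type in the database, for example treating every `("12px", "bold")` entry as a name. Which style properties make up the key is a design choice; font family or color could be added in the same way.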

Igiwwa (Author) commented:

Thanks for the reply. I also found these tools and gave them a try, but they were not helpful. I will deal with this differently.

Wow, that's a tall order.

If the PDF is an image, then the answer is no.

I haven't done this myself, but here is a link that says you can use xpdf's pdftotext.
This might be a step in the right direction for you.
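For what it's worth, here is a minimal sketch of calling pdftotext from a script, assuming the `pdftotext` binary is installed and on the PATH. The function names and the file path are hypothetical. Note that pdftotext produces plain text, so font information is not kept; selecting text by style would need a further step or a different tool.

```python
import subprocess


def build_pdftotext_command(pdf_path, keep_layout=True):
    """Build the argument list for pdftotext; "-" sends the text to stdout."""
    command = ["pdftotext"]
    if keep_layout:
        # -layout tries to preserve the physical layout of the page.
        command.append("-layout")
    command += [pdf_path, "-"]
    return command


def extract_text(pdf_path):
    """Run pdftotext and return the extracted plain text.

    Raises subprocess.CalledProcessError if pdftotext exits with an error,
    and FileNotFoundError if the binary is not installed.
    """
    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Usage would be something like `text = extract_text("document.pdf")`, where `document.pdf` stands in for the real file.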
Question has a verified solution.
