I need to extract content from a PDF file in which each block of text is formatted with different font styles. I will be storing this content in a database. For example:
<p style="text-align: justify; margin-bottom: 7px; line-height: 13px;">
<span style="font-family: 'sans-serif'; font-size: 12px; font-weight: bold; color: rgb(0, 0, 0);">name here</span><span style="font-size: 13px; color: #000;"> some text</span><span style="font-family: 'sans-serif'; font-size: 15px; font-weight: normal; color:#000;"> some more text here </span></p>
In the above example, different formatting is used: bold text, a larger font size, and so on. Not all items follow the same styling, either. Some might not have the bold text, and some might have the font-size: 15px content repeated twice, and so on.
Is there any PDF extraction technique or tool that I can use to select some text, extract everything that has the same formatting, and save the output?
Also, if there is a good PHP parser that can parse based on the text formatting, please let me know.
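To make the goal more concrete, here is a rough sketch of the kind of grouping I am after. It is written in Python purely for illustration and assumes the PDF content is already available as HTML like the sample above. The names `normalize_style`, `SpanStyleGrouper` and `group_by_style` are made up for this example. It only looks at flat (non-nested) `<span>` elements and compares style values as plain strings, so `#000` and `rgb(0, 0, 0)` would end up in different groups.

```python
from collections import defaultdict
from html.parser import HTMLParser


def normalize_style(style):
    """Turn a CSS style string into a sorted tuple of (property, value) pairs."""
    pairs = []
    for declaration in style.split(";"):
        if ":" not in declaration:
            continue  # skip empty pieces such as the one after a trailing ';'
        prop, value = declaration.split(":", 1)
        pairs.append((prop.strip().lower(), value.strip().lower()))
    return tuple(sorted(pairs))


class SpanStyleGrouper(HTMLParser):
    """Collects the text of each <span>, grouped by its normalized style."""

    def __init__(self):
        super().__init__()
        self.groups = defaultdict(list)
        self._current = None  # style key of the span being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self._current = normalize_style(dict(attrs).get("style") or "")

    def handle_endtag(self, tag):
        if tag == "span":
            self._current = None

    def handle_data(self, data):
        text = data.strip()
        if self._current is not None and text:
            self.groups[self._current].append(text)


def group_by_style(html):
    """Return a dict mapping each distinct span style to the texts using it."""
    parser = SpanStyleGrouper()
    parser.feed(html)
    parser.close()
    return dict(parser.groups)
```

Calling `group_by_style(...)` on the HTML of one block would give a dictionary keyed by style, so every piece of text that shares, say, the bold 12px style could be saved to the same database column.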