Solved

How to extract the text from a PDF using PHP?

Posted on 2009-07-10
2,009 Views
Last Modified: 2013-12-13
I have converted the PDF to images, but now I need to extract the content from the PDF as text or HTML. How can I do it?

Question by:Rajmd
2 Comments
 
LVL 110

Accepted Solution

by:
Ray Paseur earned 250 total points
ID: 24823642
This may be either a big undertaking or an impossible dream, depending on what is in the PDF file.  You are probably better off going back to the original data BEFORE it became a PDF.  If you cannot get that information in clear text, here is the path to follow...

You can read the PDF files into PHP with file_get_contents();

You can use var_dump() to print out the data you read from the PDF.
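
As a minimal sketch of the two steps above: a PDF is mostly binary, so dumping the whole string tends to flood the screen. The helper below (previewPdfBytes() is a made-up name, not a PHP built-in) reads the file with file_get_contents() and returns only the first bytes, with non-printable bytes shown as dots.

```php
<?php
// Hypothetical helper (not part of PHP): returns a printable preview
// of the first $length bytes of a file.
function previewPdfBytes(string $path, int $length = 200): string
{
    $data = file_get_contents($path);
    if ($data === false) {
        throw new RuntimeException("Could not read $path");
    }
    // Replace non-printable bytes with dots so the dump stays readable.
    return preg_replace('/[^\x20-\x7E\r\n]/', '.', substr($data, 0, $length));
}
```

You would call it as var_dump(previewPdfBytes('example.pdf')); where example.pdf is a placeholder for your own file.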

You can visually scan the data string for extraction points and perhaps create a regular expression or a set of explode() statements to pull the information you want.
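
Here is a rough sketch of what such a regular expression might look like, under some big assumptions: the text is stored as literal strings shown with the PDF "Tj" operator, the page streams are either uncompressed or zlib-compressed (FlateDecode), and the file is small. The function name extractSimplePdfText() is made up for illustration. Many real PDFs use TJ arrays, hex strings or custom font encodings, and this will return little or nothing for those.

```php
<?php
// Hypothetical helper: pulls literal strings shown with the Tj operator
// out of raw PDF bytes. A sketch only; it will miss text in many real PDFs.
function extractSimplePdfText(string $pdf): array
{
    $text = [];
    // Grab the body of each "stream ... endstream" section.
    if (!preg_match_all('/stream\r?\n(.*?)\r?\nendstream/s', $pdf, $streams)) {
        return $text;
    }
    foreach ($streams[1] as $body) {
        // Try to inflate zlib-compressed streams; fall back to the raw bytes.
        $inflated = @gzuncompress($body);
        $content = ($inflated === false) ? $body : $inflated;
        // Match "(some text) Tj", allowing backslash escapes inside the parentheses.
        if (preg_match_all('/\(((?:[^()\\\\]|\\\\.)*)\)\s*Tj/s', $content, $m)) {
            foreach ($m[1] as $s) {
                // Undo only the escapes for "(", ")" and "\"; others are left as-is.
                $text[] = preg_replace('/\\\\([()\\\\])/', '$1', $s);
            }
        }
    }
    return $text;
}
```

Typical use would be print_r(extractSimplePdfText(file_get_contents('example.pdf'))); with example.pdf standing in for your own file. gzuncompress() needs the zlib extension.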

Do not become too dependent on this technique. Different levels of PDF files will have different encodings, and you may not be able to control what you will find in there.
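
If "levels" is taken to mean PDF versions (an assumption on my part), one cheap check is the header at the start of the file, which normally reads %PDF- followed by a version number. The sketch below uses a made-up helper name, pdfHeaderVersion(), and only reports what the header claims. It tells you nothing about how the text inside is encoded.

```php
<?php
// Hypothetical helper: returns the version from the "%PDF-x.y" header,
// or null when the data does not start with such a header.
function pdfHeaderVersion(string $data): ?string
{
    if (preg_match('/^%PDF-(\d+\.\d+)/', $data, $m)) {
        return $m[1];
    }
    return null;
}
```

For example, pdfHeaderVersion(file_get_contents('example.pdf')) could be logged before attempting any extraction, so you can see which versions your patterns were tested against.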

Best of luck with your project, ~Ray
 
LVL 3

Expert Comment

by:Pedro Chagas
ID: 24825600
What is the goal?
Where do you get the PDFs? Do you create them yourself? If so, I think you can use file_get_contents() (as @ray suggests), because the encoding is always the same!

Regards, JC


Question has a verified solution.
