Solved

How to extract text from a PDF using PHP?

Posted on 2009-07-10
Last Modified: 2013-12-13
I have converted the PDF to images, but now I need to extract its content as text or HTML. How can I do it?

Question by:Rajmd
2 Comments
 
LVL 111

Accepted Solution

by:
Ray Paseur earned 750 total points
ID: 24823642
This may be either a big undertaking or an impossible dream, depending on what you have in the PDF file. You are probably better off going back to the original data BEFORE it became a PDF. If you cannot get that information in clear text, here is the path to follow...

You can read the PDF files into PHP with file_get_contents();

You can use var_dump() to print out the data you read from the PDF.

You can visually scan the data string for extraction points and perhaps write a regular expression or a set of explode() calls to pull out the information you want.
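As a rough illustration of the steps above, here is a minimal sketch. It is not a general PDF parser. It assumes the text sits in "stream ... endstream" blocks that are either uncompressed or Flate (zlib) compressed, and that it is drawn as literal strings with the Tj operator. The function name extractPdfText() is my own invention for this example, and the sample PDF fragment is built in memory just to keep the snippet self-contained. With a real file you would start from file_get_contents().

```php
<?php
// Hypothetical helper for illustration only: a naive text scraper for simple PDFs.
function extractPdfText(string $pdf): string
{
    $text = '';
    // Page content lives in "stream ... endstream" blocks.
    if (!preg_match_all('/stream\r?\n(.*?)\nendstream/s', $pdf, $streams)) {
        return $text;
    }
    foreach ($streams[1] as $raw) {
        // Assumption: the stream is zlib (FlateDecode) data; otherwise use it as-is.
        $data = @gzuncompress($raw);
        if ($data === false) {
            $data = $raw;
        }
        // Literal strings shown with the Tj operator, e.g. (Hello) Tj
        if (preg_match_all('/\(((?:\\\\.|[^\\\\()])*)\)\s*Tj/s', $data, $m)) {
            foreach ($m[1] as $s) {
                // stripcslashes() only approximates the PDF escape rules.
                $text .= stripcslashes($s) . "\n";
            }
        }
    }
    return $text;
}

// Usage with an in-memory fragment; for a real file: $pdf = file_get_contents('some.pdf');
$stream = gzcompress("BT /F1 12 Tf (Hello, PDF) Tj ET");
$pdf = "%PDF-1.4\n1 0 obj\n<< /Filter /FlateDecode >>\nstream\n" . $stream . "\nendstream\nendobj\n";
echo extractPdfText($pdf);
```

Expect this to miss text in many real documents: TJ arrays, hex strings, custom font encodings, other stream filters, and encrypted files are all ignored here.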

Do not become too dependent on this technique: different versions of PDF files use different encodings, and you may not be able to control what you find in there.

Best of luck with your project, ~Ray
 
LVL 3

Expert Comment

by:Pedro Chagas
ID: 24825600
What is the goal? What is the objective?
Where do you get the PDFs? Do you create them yourself? If so, I think you can use file_get_contents() (as @ray suggests), because the encoding will always be the same!

Regards, JC

Question has a verified solution.
