trevor1940

extracting text and images from PDFs

Hi

I have a bunch of PDF files that I need to extract text and images from.

I have had some limited success using the Perl swish filter, which extracts the text to unformatted HTML. I'm guessing that is the best I can hope for?

E.g. all headings, font sizes, colors, links etc. come out as plain text.

Has anyone had any experience of extracting both text and images in batches?

Although I'm a Perl programmer, I'm willing to explore other avenues.

Thanx for your help
ASKER CERTIFIED SOLUTION
Joe Winograd

trevor1940

ASKER

Thanx Joe
This has been really helpful

This is what I've established:

pdfimages.exe extracts images. With the -j switch most are exported as JPG, but some come out as PPM.

pdftotext.exe extracts unformatted text but will maintain the layout.

pdftohtml.exe creates an HTML page for each page of the PDF, plus a background.png file that combines all the images on that page. Although text formatting is maintained, hyperlinks are lost, and the HTML cannot link back to the individually exported images.

Any suggestions on how to maintain the hyperlinks, or how to insert a tag for each exported image?

Also, I'm getting this error:

Config Error: No display font for 'Symbol'
Config Error: No display font for 'ZapfDingbats'



thanx again
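
For the batch side of the question, here is a minimal Python sketch that runs pdfimages and pdftotext over every PDF in a folder. It assumes both executables are on the PATH; the folder names and the helper names `build_commands` and `extract_all` are made up for illustration, and the `-layout` switch is included on the assumption that the layout-preserving output is what is wanted.

```python
import subprocess
from pathlib import Path


def build_commands(pdf: Path, out_dir: Path) -> list[list[str]]:
    """Return the pdfimages and pdftotext command lines for one PDF."""
    stem = pdf.stem
    return [
        # -j saves JPEG-encoded images as .jpg; the rest come out as PPM/PBM.
        # The last argument is the image root (prefix) for the output files.
        ["pdfimages", "-j", str(pdf), str(out_dir / stem)],
        # -layout asks pdftotext to keep the physical layout of the page.
        ["pdftotext", "-layout", str(pdf), str(out_dir / f"{stem}.txt")],
    ]


def extract_all(src_dir: Path, out_root: Path) -> None:
    """Run both tools on every PDF in src_dir, one output folder per PDF."""
    for pdf in sorted(src_dir.glob("*.pdf")):
        out_dir = out_root / pdf.stem
        out_dir.mkdir(parents=True, exist_ok=True)
        for cmd in build_commands(pdf, out_dir):
            # check=False: report a failing PDF and carry on with the batch.
            result = subprocess.run(cmd, check=False)
            if result.returncode != 0:
                print(f"{cmd[0]} failed on {pdf.name} (exit {result.returncode})")
```

Calling `extract_all(Path("pdfs"), Path("extracted"))` would process a hypothetical `pdfs` folder and leave the text and images for each file in its own subfolder of `extracted`.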
> Any suggestions how to maintain the hyper links or how to insert a tag to the exported image

I don't know how to do that. But I am in direct contact with the developer by private email and will be happy to pass the question along.

> Config Error: No display font for 'Symbol'
> Config Error: No display font for 'ZapfDingbats'

The answer for that is here:
http://www.glyphandcog.com/support/q0016.html

I followed the steps in the article when I first ran into the problem (quite a while ago) and it worked a charm back then. I presume it will still work. Regards, Joe
Thanx, that would be a huge help.

My thinking is to somehow combine pdftohtml with pdfimages,

so that instead of having one background.png image you have multiple JPGs with individual <img> tags.

No idea about the hyperlinks.

Incidentally the article you wrote is that VB?
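
One rough way to approximate the "individual <img> tags" idea is to generate a separate HTML page that lists the files pdfimages wrote. The sketch below is only that: it does not place the images at their original positions on the page, it skips PPM/PBM files because browsers will not display them, and it assumes the HTML file is written into the same folder as the images. The function names are invented for this example.

```python
from html import escape
from pathlib import Path

# Formats a browser can display; PPM/PBM output from pdfimages would
# need converting first (not shown here).
WEB_SUFFIXES = {".jpg", ".jpeg", ".png"}


def image_tags(image_dir: Path) -> str:
    """Build one <img> tag per web-displayable file in image_dir."""
    tags = []
    for path in sorted(image_dir.iterdir()):
        if path.suffix.lower() in WEB_SUFFIXES:
            name = escape(path.name, quote=True)
            tags.append(f'<img src="{name}" alt="{name}">')
    return "\n".join(tags)


def write_gallery(image_dir: Path, html_file: Path) -> None:
    """Write a standalone page listing the extracted images."""
    body = image_tags(image_dir)
    html_file.write_text(
        f"<!DOCTYPE html>\n<html><body>\n{body}\n</body></html>\n",
        encoding="utf-8",
    )
```

The resulting page could then be linked by hand from the pdftohtml output for the matching PDF.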
> Incidentally the article you wrote is that VB?

No. It's in a language called AutoHotkey, my programming/scripting language of choice for the past few years — excellent and free! There have been many forks of the original language and recently a new community was established to move the language forward. The latest release at the new community has a Windows installer, an offline help file, and a compiler that turns the AHK source code (plain text) into a stand-alone/no-install executable (an EXE file).

There is excellent documentation:
http://ahkscript.org/docs/AutoHotkey.htm

...including an alphabetical command and function index:
http://ahkscript.org/docs/commands/index.htm

...a good tutorial:
http://ahkscript.org/docs/Tutorial.htm

...and an active user forum:
http://ahkscript.org/boards/

Regards, Joe
OK Thanx

I very much doubt I'd be able to get it past Sys Man; they are already having a fit.

Please let me know if you hear from the developer. Maybe there is a switch in pdftohtml.exe that exports individual JPGs?
> maybe there is a switch in pdftohtml.exe that exports individual jpg's

I don't think so. Here are all of the options in the doc file for pdftohtml:

-f number
Specifies the first page to convert.

-l number
Specifies the last page to convert.

-r number
Specifies the resolution, in DPI, for background images. The default is 150 DPI.

-opw password
Specify the owner password for the PDF file. Providing this will bypass all security restrictions.

-upw password
Specify the user password for the PDF file.

-q
Don't print any messages or errors. [config file: errQuiet]

-cfg config-file
Read config-file in place of ~/.xpdfrc or the system-wide config file.

-v
Print copyright and version information.

-h
Print usage information. (-help and --help are equivalent.)
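
As an illustration of how those options combine, here is a small Python sketch that assembles a pdftohtml command line from the page-range, resolution and quiet switches listed above. The function name is hypothetical, and the order of the two positional arguments (PDF file first, then the output HTML directory) is an assumption that should be checked against the usage line printed by `pdftohtml -h`.

```python
from typing import Optional


def pdftohtml_command(
    pdf_file: str,
    html_dir: str,
    first_page: Optional[int] = None,
    last_page: Optional[int] = None,
    resolution: Optional[int] = None,
    quiet: bool = False,
) -> list[str]:
    """Assemble a pdftohtml argument list from the documented switches."""
    cmd = ["pdftohtml"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]   # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]    # last page to convert
    if resolution is not None:
        cmd += ["-r", str(resolution)]   # DPI for the background images
    if quiet:
        cmd.append("-q")                 # suppress messages and errors
    # Assumed positional order: input PDF, then output directory.
    cmd += [pdf_file, html_dir]
    return cmd
```

The returned list can be passed straight to `subprocess.run`, which avoids quoting problems with file names that contain spaces.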
I want to make sure I've interpreted your issues correctly, so please let me know if the following is accurate:

Two follow-up questions on pdftohtml:

(1) The HTML file that it creates does not retain hyperlinks that are in the PDF. Can you recommend a work-around for that? Is this feature on the roadmap for a future version of pdftohtml?

(2) The HTML file that it creates has a single <img id="background"> tag that points to the PNG file with the images on that page (all of the images together). Is there any way to have pdftohtml create individual JPGs (like pdfimages) and have separate <img> tags for them in the HTML? If not, can you recommend a work-around for that, perhaps utilizing pdfimages? Is this feature on the roadmap for a future version of pdftohtml?

I'll send this message to the developer, so please recommend changes if I've misunderstood anything. Thanks, Joe
Thanx Joe, that is exactly the issue
OK, I sent it to the developer's private email. Will let you know what I hear back. Regards, Joe
Thanx
I'll leave this thread open until I hear back
Here's the response from the developer:
Yes, creating hyperlinks is on the roadmap for a future release.

Creating separate images will be harder. The current background image contains everything that's not "simple text" -- that includes images, and also vector graphics, rotated text, etc.

There's also the issue of clipping: the visible image may not be exactly the same as the raw bitmap image (which is what pdfimages extracts).
So, basically, there's nothing we can do about it now. Sorry the news isn't better. Regards, Joe
Thanx for the quick reply

Now I know what is and is not possible, I can move forward with what I've got
Joe

Thank you very much for your outstanding help
You're very welcome. Good luck with the project! All the best, Joe