extracting text and images from PDFs

Hi

I have a bunch of PDF files that I need to extract text and images from.

I have had some limited success using the Perl swish filter. It extracts the text to unformatted HTML, which I'm guessing is the best I can hope for?

E.g. all headings, font sizes, colors, links, etc. come out as plain text.

Has anyone had any experience of extracting both text and images in batches?

Although I'm a Perl programmer, I'm willing to explore other avenues.

Thanx for your help

ASKER CERTIFIED SOLUTION

Joe Winograd

trevor1940

ASKER

Thanx Joe
This has been really helpful

This is what I've established:

pdfimages.exe extracts images. With the -j switch most are exported as JPG, but some come out as PPM.

pdftotext.exe extracts unformatted text but will maintain layout.

pdftohtml.exe creates an HTML page for each page of the PDF. A background.png file is created combining all images on the page. Although text formatting is maintained, hyperlinks are lost, and I cannot link back to the individual exported images.
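For the batch part, the pdfimages and pdftotext steps above could be driven from a small script. The following is only a minimal sketch in Python, assuming both tools are on the PATH and that pdftotext is run with its -layout option; the function names and the folder layout are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_commands(pdf: Path, out_dir: Path) -> list[list[str]]:
    """Return the pdfimages and pdftotext command lines for one PDF."""
    stem = pdf.stem
    return [
        # -j saves JPEG-encoded images as .jpg; other images are written in
        # the tool's default format (PPM/PBM). The last argument is the
        # image-root prefix used for the output file names.
        ["pdfimages", "-j", str(pdf), str(out_dir / stem)],
        # -layout asks pdftotext to keep the physical layout of the text.
        ["pdftotext", "-layout", str(pdf), str(out_dir / f"{stem}.txt")],
    ]


def extract_all(pdf_dir: Path, out_dir: Path) -> None:
    """Run both tools on every *.pdf file found directly in pdf_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        for cmd in build_commands(pdf, out_dir):
            # check=True raises CalledProcessError if a tool exits non-zero.
            subprocess.run(cmd, check=True)
```

Calling extract_all(Path("pdfs"), Path("extracted")) would process a hypothetical pdfs folder and leave the text file and the images for each PDF side by side in extracted.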

Any suggestions on how to maintain the hyperlinks, or how to insert a tag for each exported image?

Also, I'm getting this error:

Config Error: No display font for 'Symbol'
Config Error: No display font for 'ZapfDingbats'

thanx again

Joe Winograd

> Any suggestions how to maintain the hyper links or how to insert a tag to the exported image

I don't know how to do that. But I communicate directly with the developer via our private emails and I will be happy to send along the question.

> Config Error: No display font for 'Symbol'
> Config Error: No display font for 'ZapfDingbats'

The answer for that is here:
http://www.glyphandcog.com/support/q0016.html

I followed the steps in the article when I first ran into the problem (quite a while ago) and it worked a charm back then. I presume it will still work. Regards, Joe

trevor1940

ASKER

Thanx that would be a huge help

My thinking is to somehow combine pdftohtml with pdfimages?

So instead of having one background.png image, you would have multiple JPGs with individual <img> tags.
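One rough way to approximate this idea is to post-process each generated HTML page and add an <img> tag for every image that pdfimages exported. The sketch below is only an illustration in Python: it assumes you already know which image files belong to which page (that mapping has to come from somewhere else), it does not recover the position of each image on the page, and the function name and file names are hypothetical.

```python
from html import escape


def add_image_tags(page_html: str, image_files: list[str]) -> str:
    """Insert one <img> tag per extracted image just before </body>."""
    tags = "\n".join(
        f'<img src="{escape(name, quote=True)}" alt="">' for name in image_files
    )
    # Find the last closing body tag, ignoring case.
    idx = page_html.lower().rfind("</body>")
    if idx == -1:
        # No closing body tag found: append the tags at the end instead.
        return page_html + "\n" + tags + "\n"
    return page_html[:idx] + tags + "\n" + page_html[idx:]
```

The images simply end up listed at the bottom of the page, after the text and the background image, so this is a gallery rather than a faithful layout.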

No idea about the hyperlinks.
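For the hyperlinks, one possible route outside the xpdf tools is to read the link annotations straight from the PDF and match them up with the HTML afterwards. The sketch below is in Python and assumes the third-party pypdf library is installed, which nobody in this thread has mentioned or tested; it only collects the URL targets per page and ignores internal (within-document) links. The function names are made up for illustration.

```python
def _resolve(obj):
    """Follow an indirect PDF reference if the object supports it."""
    return obj.get_object() if hasattr(obj, "get_object") else obj


def links_on_page(page) -> list[str]:
    """Return the URI targets of all link annotations on one page."""
    uris = []
    annots = _resolve(page.get("/Annots")) or []
    for ref in annots:
        annot = _resolve(ref)
        if annot.get("/Subtype") != "/Link":
            continue  # not a link annotation (e.g. a note or highlight)
        action = _resolve(annot.get("/A"))
        # Only URI actions carry an external URL; skip everything else.
        if action and "/URI" in action:
            uris.append(str(_resolve(action["/URI"])))
    return uris


def links_in_pdf(path: str) -> dict[int, list[str]]:
    """Map 1-based page numbers to the link URLs found on each page."""
    from pypdf import PdfReader  # assumption: pypdf is installed

    reader = PdfReader(path)
    return {n: links_on_page(p) for n, p in enumerate(reader.pages, start=1)}
```

This gives the list of URLs per page but not the text they were attached to; working that out would need the annotation rectangles ("/Rect") and the text positions, which is beyond this sketch.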

Incidentally the article you wrote is that VB?

Joe Winograd

> Incidentally the article you wrote is that VB?

No. It's in a language called AutoHotkey, my programming/scripting language of choice for the past few years — excellent and free! There have been many forks of the original language and recently a new community was established to move the language forward. The latest release at the new community has a Windows installer, an offline help file, and a compiler that turns the AHK source code (plain text) into a stand-alone/no-install executable (an EXE file).

There is excellent documentation:
http://ahkscript.org/docs/AutoHotkey.htm

...including an alphabetical command and function index:
http://ahkscript.org/docs/commands/index.htm

...a good tutorial:
http://ahkscript.org/docs/Tutorial.htm

...and an active user forum:
http://ahkscript.org/boards/

Regards, Joe

trevor1940

ASKER

OK Thanx

I very much doubt I'd be able to get it past Sys Man; they are already having a fit.

Please let me know if you hear from the developer. Maybe there is a switch in pdftohtml.exe that exports individual JPGs?

Joe Winograd

> maybe there is a switch in pdftohtml.exe that exports individual jpg's

I don't think so. Here are all of the options in the doc file for pdftohtml:

-f number
Specifies the first page to convert.

-l number
Specifies the last page to convert.

-r number
Specifies the resolution, in DPI, for background images. The default is 150 DPI.

-opw password
Specify the owner password for the PDF file. Providing this will bypass all security restrictions.

-upw password
Specify the user password for the PDF file.

-q
Don't print any messages or errors. [config file: errQuiet]

-cfg config-file
Read config-file in place of ~/.xpdfrc or the system-wide config file.

-v
Print copyright and version information.

-h
Print usage information. (-help and --help are equivalent.)
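To show how the options listed above fit together on a command line, here is a small Python sketch that builds a pdftohtml call. It uses only the -f, -l, -r and -q options from the list; the invocation form "pdftohtml [options] PDF-file HTML-dir" is an assumption, so check it against the usage text printed by -h for your version. The function name is made up for illustration.

```python
def pdftohtml_command(pdf, html_dir, first=None, last=None, dpi=None, quiet=False):
    """Build an argument list for pdftohtml from the documented options."""
    cmd = ["pdftohtml"]
    if first is not None:
        cmd += ["-f", str(first)]   # first page to convert
    if last is not None:
        cmd += ["-l", str(last)]    # last page to convert
    if dpi is not None:
        cmd += ["-r", str(dpi)]     # resolution of the background images
    if quiet:
        cmd.append("-q")            # suppress messages and errors
    # Assumed argument order: input PDF first, then the output directory.
    cmd += [pdf, html_dir]
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True); for example, pdftohtml_command("in.pdf", "out", first=2, last=5, dpi=300) would convert pages 2 to 5 with 300 DPI background images.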

Joe Winograd

I want to make sure I've interpreted your issues correctly, so please let me know if the following is accurate:

Two follow-up questions on pdftohtml:

(1) The HTML file that it creates does not retain hyperlinks that are in the PDF. Can you recommend a work-around for that? Is this feature on the roadmap for a future version of pdftohtml?

(2) The HTML file that it creates has a single <img id="background"> tag that points to the PNG file with the images on that page (all of the images together). Is there any way to have pdftohtml create individual JPGs (like pdfimages) and have separate <img> tags for them in the HTML? If not, can you recommend a work-around for that, perhaps utilizing pdfimages? Is this feature on the roadmap for a future version of pdftohtml?

I'll send this message to the developer, so please recommend changes if I've misunderstood anything. Thanks, Joe

trevor1940

ASKER

Thanx Joe, that is exactly the issue

Joe Winograd

OK, I sent it to the developer's private email. Will let you know what I hear back. Regards, Joe

trevor1940

ASKER

Thanx
I'll leave this thread open until I hear back

Joe Winograd

Here's the response from the developer:

Yes, creating hyperlinks is on the roadmap for a future release.

Creating separate images will be harder. The current background image contains everything that's not "simple text" -- that includes images, and also vector graphics, rotated text, etc.

There's also the issue of clipping: the visible image may not be exactly the same as the raw bitmap image (which is what pdfimages extracts).

So, basically, there's nothing we can do about it now. Sorry the news isn't better. Regards, Joe

trevor1940

ASKER

Thanx for the quick reply

Now that I know what is and is not possible, I can move forward with what I've got

trevor1940

ASKER

Joe

Thank you very much for your outstanding help

Joe Winograd

You're very welcome. Good luck with the project! All the best, Joe