trevor1940
asked on
extracting text and images from PDFs
Hi
I have a bunch of PDF files that I need to extract text and images from.
I have had some limited success using the Perl swish filter, which extracts the text to unformatted HTML. I'm guessing that is the best I can hope for?
E.g. all headings, font sizes, colors, links etc. come out as plain text.
Has anyone had any experience of extracting both text and images in batches?
Although I'm a Perl programmer, I'm willing to explore other avenues.
Thanx for your help
> Any suggestions how to maintain the hyper links or how to insert a tag to the exported image
I don't know how to do that. But I communicate directly with the developer via our private emails and I will be happy to send along the question.
> Config Error: No display font for 'Symbol'
> Config Error: No display font for 'ZapfDingbats'
The answer for that is here:
http://www.glyphandcog.com/support/q0016.html
I followed the steps in the article when I first ran into the problem (quite a while ago) and it worked a charm back then. I presume it will still work. Regards, Joe
ASKER
Thanx that would be a huge help
My thinking is to somehow combine pdftohtml with pdfimages,
so instead of having one background.png image you have multiple JPGs with individual <img> tags.
No idea about the hyperlinks.
Incidentally, the article you wrote: is that VB?
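The idea of pairing pdftohtml output with pdfimages output could be sketched roughly as follows. This is only an illustration, not something either tool does itself: the function name `append_image_tags` and the file names `img-0000.jpg` and `img-0001.jpg` are hypothetical, and the sketch simply appends one <img> tag per extracted image at the end of the page body, because the extracted files carry no information about where each image sat on the page.

```python
import html
from typing import Iterable


def append_image_tags(page_html: str, image_files: Iterable[str]) -> str:
    """Insert one <img> tag per extracted image just before </body>.

    Hypothetical helper: images are appended in list order, not positioned.
    """
    tags = "\n".join(
        '<img src="{}" alt="">'.format(html.escape(name, quote=True))
        for name in image_files
    )
    if not tags:
        # Nothing to add, return the page unchanged.
        return page_html
    # Look for the last closing body tag, ignoring case.
    idx = page_html.lower().rfind("</body>")
    if idx == -1:
        # No closing body tag found: append the tags at the end.
        return page_html + "\n" + tags + "\n"
    return page_html[:idx] + tags + "\n" + page_html[idx:]


if __name__ == "__main__":
    page = "<html><body><p>text</p></body></html>"
    print(append_image_tags(page, ["img-0000.jpg", "img-0001.jpg"]))
```

A real page would also need the images matched to the right HTML page, which this sketch leaves to the caller.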
> Incidentally the article you wrote is that VB?
No. It's in a language called AutoHotkey, my programming/scripting language of choice for the past few years — excellent and free! There have been many forks of the original language and recently a new community was established to move the language forward. The latest release at the new community has a Windows installer, an offline help file, and a compiler that turns the AHK source code (plain text) into a stand-alone/no-install executable (an EXE file).
There is excellent documentation:
http://ahkscript.org/docs/AutoHotkey.htm
...including an alphabetical command and function index:
http://ahkscript.org/docs/commands/index.htm
...a good tutorial:
http://ahkscript.org/docs/Tutorial.htm
...and an active user forum:
http://ahkscript.org/boards/
Regards, Joe
ASKER
OK Thanx
I very much doubt I'd be able to get it past Sys Man; they are already having a fit.
Please let me know if you hear from the developer. Maybe there is a switch in pdftohtml.exe that exports individual JPGs?
> maybe there is a switch in pdftohtml.exe that exports individual jpg's
I don't think so. Here are all of the options in the doc file for pdftohtml:
-f number
Specifies the first page to convert.
-l number
Specifies the last page to convert.
-r number
Specifies the resolution, in DPI, for background images. The default is 150 DPI.
-opw password
Specify the owner password for the PDF file. Providing this will bypass all security restrictions.
-upw password
Specify the user password for the PDF file.
-q
Don't print any messages or errors. [config file: errQuiet]
-cfg config-file
Read config-file in place of ~/.xpdfrc or the system-wide config file.
-v
Print copyright and version information.
-h
Print usage information. (-help and --help are equivalent.)
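For batch use, the options listed above can be assembled into a command line programmatically. The following is a minimal sketch, assuming pdftohtml is on the PATH and takes the PDF file followed by an output directory as positional arguments (that calling convention is an assumption, not stated in the option list). The function name `build_pdftohtml_cmd` and the file names are hypothetical.

```python
from typing import List, Optional


def build_pdftohtml_cmd(
    pdf_path: str,
    html_dir: str,
    first_page: Optional[int] = None,
    last_page: Optional[int] = None,
    dpi: Optional[int] = None,
    quiet: bool = False,
) -> List[str]:
    """Build an argument list for pdftohtml from the documented options."""
    cmd = ["pdftohtml"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]  # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]   # last page to convert
    if dpi is not None:
        cmd += ["-r", str(dpi)]         # background image resolution in DPI
    if quiet:
        cmd.append("-q")                # suppress messages and errors
    # Assumption: PDF file and output directory are positional arguments.
    cmd += [pdf_path, html_dir]
    return cmd


if __name__ == "__main__":
    print(build_pdftohtml_cmd("report.pdf", "report_html", 1, 5, 300))
```

The returned list can be passed to `subprocess.run(cmd)` to actually run the conversion.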
I want to make sure I've interpreted your issues correctly, so please let me know if the following is accurate:
Two follow-up questions on pdftohtml:
(1) The HTML file that it creates does not retain hyperlinks that are in the PDF. Can you recommend a work-around for that? Is this feature on the roadmap for a future version of pdftohtml?
(2) The HTML file that it creates has a single <img id="background"> tag that points to the PNG file with the images on that page (all of the images together). Is there any way to have pdftohtml create individual JPGs (like pdfimages) and have separate <img> tags for them in the HTML? If not, can you recommend a work-around for that, perhaps utilizing pdfimages? Is this feature on the roadmap for a future version of pdftohtml?
I'll send this message to the developer, so please recommend changes if I've misunderstood anything. Thanks, Joe
ASKER
Thanx Joe, that is exactly the issue
OK, I sent it to the developer's private email. Will let you know what I hear back. Regards, Joe
ASKER
Thanx
I'll leave this thread open until I hear back
Here's the response from the developer:
Yes, creating hyperlinks is on the roadmap for a future release.
Creating separate images will be harder. The current background image contains everything that's not "simple text" -- that includes images, and also vector graphics, rotated text, etc.
There's also the issue of clipping: the visible image may not be exactly the same as the raw bitmap image (which is what pdfimages extracts).
So, basically, there's nothing we can do about it now. Sorry the news isn't better. Regards, Joe
ASKER
Thanx for the quick reply
Now that I know what is and is not possible, I can move forward with what I've got
ASKER
Joe
Thank you very much for your outstanding help
You're very welcome. Good luck with the project! All the best, Joe
ASKER
This has been really helpful
This is what I've established:
pdfimages.exe extracts images. With the -j switch most are exported as JPG, but some come out as PPM.
pdftotext.exe extracts unformatted text but will maintain layout.
pdftohtml.exe creates an HTML page for each page of the PDF. A background.png file is created combining all images on the page. Although text formatting is maintained, hyperlinks are lost, and the HTML cannot link back to the individual exported images.
Any suggestions on how to maintain the hyperlinks, or how to insert a tag for each exported image?
Also, I'm getting this error:
Config Error: No display font for 'Symbol'
Config Error: No display font for 'ZapfDingbats'
thanx again
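For running pdfimages and pdftotext over a whole folder, a batch driver could look something like the sketch below. It assumes both tools are on the PATH, that pdfimages takes a PDF file followed by an image root name, and that pdftotext accepts a `-layout` switch to keep the page layout (the thread only says layout is maintained, so the switch is an assumption). The function names, the `img` root and the `text.txt` file name are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import Dict, List


def plan_batch(pdf_dir: str, out_dir: str) -> Dict[str, List[List[str]]]:
    """Return, for each PDF in pdf_dir, the commands to run against it."""
    plan: Dict[str, List[List[str]]] = {}
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        target = Path(out_dir) / pdf.stem  # one output folder per PDF
        plan[pdf.name] = [
            # -j writes JPEG where possible; other images come out as PPM
            ["pdfimages", "-j", str(pdf), str(target / "img")],
            # Assumption: -layout keeps the original physical layout
            ["pdftotext", "-layout", str(pdf), str(target / "text.txt")],
        ]
    return plan


def run_batch(pdf_dir: str, out_dir: str) -> None:
    """Create the output folders and run every planned command."""
    for name, commands in plan_batch(pdf_dir, out_dir).items():
        (Path(out_dir) / Path(name).stem).mkdir(parents=True, exist_ok=True)
        for cmd in commands:
            # check=False so one bad PDF does not stop the whole batch
            subprocess.run(cmd, check=False)
```

Calling `run_batch("pdfs", "extracted")` would then process every PDF in a `pdfs` folder. Keeping the planning step separate from the running step makes it easy to print and inspect the commands before anything is executed.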