hadrons

Extract non-ASCII text from a PDF cleanly

I have a PDF file containing some non-ASCII text (Hebrew letters) that I want to extract as text and then convert from Unicode to HTML (that part I have covered), but I've been unable to extract the Hebrew letters cleanly using cut and paste or a couple of other methods. The file in question is attached here. Thanks, Mike
Adobe Acrobat, Document Imaging


8/22/2022 - Mon
Joe Winograd

Hi Mike,
I may have a solution for you, but I'd like to test it first on your file, which wasn't attached. Please attach it and I'll work on it right away. Regards, Joe
hadrons

ASKER
Hi, Joe, I've left work and don't have the original file (I can get it later), but this file I downloaded is a good representation (just the second page; I couldn't figure out how to extract only that page and cut the rest). Thanks, Mike
9780521885423_excerpt.pdf
ASKER CERTIFIED SOLUTION
Joe Winograd

THIS SOLUTION ONLY AVAILABLE TO MEMBERS.
hadrons

ASKER
Hi, Joe, thanks for all the work looking into this. Here's the original file (we were off for the holiday). Thanks, Mike
9783039111398_Excerpt_005.pdf
Joe Winograd

Here's the output from pdffonts for that file:

name                       type              emb sub uni object ID
-------------------------- ----------------- --- --- --- ---------
TimesNewRoman              TrueType          no  no  no       8  0
Verdana                    TrueType          no  no  no      14  0
MDCCHI+AGaramond-Italic    Type 1C           yes yes yes     18  0
MDCCJJ+AGaramond-Regular   Type 1C           yes yes yes     23  0
MDCCNI+MSTT31c344          Type 1C           yes yes no      28  0
MDCCPI+MSTT31c34f          Type 1C           yes yes no      32  0

The Hebrew letters are in the fonts MSTT31c344 and MSTT31c34f (I'm not familiar with either one).
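
A side note on reading that table: the "uni" column reports whether a font carries a ToUnicode map, and an embedded font without one is a common reason that copy and paste returns the wrong characters. Here is a minimal Python sketch that picks out such fonts, assuming the column layout shown above. The function names are hypothetical, and run_pdffonts assumes the pdffonts command-line tool is installed and on the PATH.

```python
import subprocess

# Sample pdffonts output, copied from the listing in this thread.
SAMPLE = """\
name                       type              emb sub uni object ID
-------------------------- ----------------- --- --- --- ---------
TimesNewRoman              TrueType          no  no  no       8  0
Verdana                    TrueType          no  no  no      14  0
MDCCHI+AGaramond-Italic    Type 1C           yes yes yes     18  0
MDCCJJ+AGaramond-Regular   Type 1C           yes yes yes     23  0
MDCCNI+MSTT31c344          Type 1C           yes yes no      28  0
MDCCPI+MSTT31c34f          Type 1C           yes yes no      32  0
"""


def parse_pdffonts(text):
    """Parse pdffonts output into a list of dicts (hypothetical helper).

    Fields are read from the right-hand side of each row because the
    type column can contain spaces (for example "Type 1C").
    """
    fonts = []
    in_body = False
    for line in text.splitlines():
        if line.startswith("---"):
            in_body = True  # rows begin after the dashed separator
            continue
        if not in_body:
            continue
        tokens = line.split()
        if len(tokens) < 7:  # name, type, emb, sub, uni, object number, generation
            continue
        fonts.append({
            "name": tokens[0],
            "emb": tokens[-5] == "yes",
            "sub": tokens[-4] == "yes",
            "uni": tokens[-3] == "yes",
        })
    return fonts


def fonts_without_unicode_map(fonts):
    """Return names of embedded fonts that have no ToUnicode map."""
    return [f["name"] for f in fonts if f["emb"] and not f["uni"]]


def run_pdffonts(path):
    """Run pdffonts on a PDF and return its text output (assumes it is on PATH)."""
    result = subprocess.run(
        ["pdffonts", path], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling fonts_without_unicode_map(parse_pdffonts(SAMPLE)) on the listing above returns the two MSTT fonts, which matches where the Hebrew letters live. Text in such fonts usually has to be recovered another way, for example by identifying the font and its character mapping, or by OCR.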

I want to let you know that I'm going offline soon for the rest of today and tonight. Will check back into the thread tomorrow morning to see how you're doing. Regards, Joe
hadrons

ASKER
Identifying the font and importing it into whatever application is being used is the best approach, as Joe worked out. The unknown font threw off a complete solution, but the overall approach suggested is the best.
Joe Winograd

Mike,
Thanks for the update. Regards, Joe