Solved

Extract non-ASCII text from a PDF cleanly

Posted on 2016-09-02
7
Medium Priority
162 Views
Last Modified: 2016-09-07
I have a PDF file containing some non-ASCII text (Hebrew letters) that I want to extract as text and then convert from Unicode to HTML (that part I have covered). However, I've been unable to extract the Hebrew letters cleanly using cut and paste or a couple of other methods. The file in question is attached here. Thanks, Mike
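For readers who do not already have the Unicode-to-HTML step covered, here is a minimal sketch of one way to do it in Python. It is not the asker's method (which is not shown in the thread); the function name `to_html_entities` is made up for illustration, and it assumes numeric character references are the desired HTML form.

```python
def to_html_entities(text: str) -> str:
    """Replace every non-ASCII character with a numeric HTML reference."""
    # The "xmlcharrefreplace" error handler turns each character that
    # cannot be encoded as ASCII into a decimal reference such as &#1513;
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    # Four Hebrew letters written as escapes so the source stays ASCII-only.
    sample = "\u05e9\u05dc\u05d5\u05dd"
    print(to_html_entities(sample))  # prints &#1513;&#1500;&#1493;&#1501;
```

ASCII text passes through unchanged, so the function can be applied to a whole extracted page rather than only to the Hebrew runs.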
Question by:hadrons
7 Comments
 
LVL 56

Expert Comment

by:Joe Winograd, EE MVE 2015&2016
ID: 41782161
Hi Mike,
I may have a solution for you, but I'd like to test it first on your file, which wasn't attached. Please attach it and I'll work on it right away. Regards, Joe
 

Author Comment

by:hadrons
ID: 41782289
Hi Joe, I've left work and don't have the original file (I can get it later), but this file I downloaded is a good representation. Only the second page matters; I couldn't figure out how to extract just that page and cut the rest. Thanks, Mike
9780521885423_excerpt.pdf
 
LVL 56

Accepted Solution

by:
Joe Winograd, EE MVE 2015&2016 earned 2000 total points
ID: 41782337
Mike,
I extracted just page 2 into a new PDF (attached). I'll see what I can do with it. Shalom, Joe
MikePage2.pdf

Update: I think the problem is that the PDF uses a font called NewJerusalem for the Hebrew letters. It's likely that whatever you're extracting the letters into (such as Word) does not have that font. So my first suggestion is to install the NewJerusalem font so that it is available to whatever product you're extracting the letters into.

Btw, I used a utility called PDFfonts to see what fonts are in that file. Here they are:

name                        type              emb sub uni object ID
--------------------------- ----------------- --- --- --- ---------
ILOLMG+NewJerusalem         Type 1C           yes yes yes     45  0
JGMAGA+Georgia              TrueType          yes yes no      47  0
IMGGBM+Times-BoldItalic     TrueType          yes yes no      49  0
IMGGDM+Times-Roman          TrueType          yes yes no      51  0
IMGGML+Times-Bold           TrueType          yes yes no      53  0
IMGKCA+TranslitLS-Bold      TrueType          yes yes no      55  0
IMGKFN+TimesNewRoman-Bold   TrueType          yes yes no      57  0
IMGKNG+TranslitLS           TrueType          yes yes no      59  0
ILOHMB+TranslitLS           CID TrueType      yes yes no      62  0
IMGOPD+TranslitLS-Bold      TrueType          yes yes no      64  0
IMGPDA+TranslitLS           TrueType          yes yes no      66  0
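The listing above can also be inspected programmatically. The following is a rough sketch, assuming the seven-column layout shown here (other pdffonts builds may print extra columns) and assuming font names contain no spaces. The `uni` column reports whether a font carries a ToUnicode map, which is what text extraction generally relies on. `parse_pdffonts` and `SAMPLE` are hypothetical names; the sample rows are copied from the listing above.

```python
SAMPLE = """\
name                        type              emb sub uni object ID
--------------------------- ----------------- --- --- --- ---------
ILOLMG+NewJerusalem         Type 1C           yes yes yes     45  0
JGMAGA+Georgia              TrueType          yes yes no      47  0
ILOHMB+TranslitLS           CID TrueType      yes yes no      62  0
"""


def parse_pdffonts(output):
    """Parse a pdffonts listing into a list of dicts, one per font."""
    fonts = []
    in_table = False
    for line in output.splitlines():
        if line.startswith("---"):
            in_table = True  # rows start after the dashed separator
            continue
        parts = line.split()
        if not in_table or len(parts) < 7:
            continue
        # The type may contain spaces ("Type 1C"), so read from the right:
        # the last five tokens are emb, sub, uni, object number, generation.
        emb, sub, uni, obj_num, gen = parts[-5:]
        fonts.append({
            "name": parts[0],
            "type": " ".join(parts[1:-5]),
            "embedded": emb == "yes",
            "subset": sub == "yes",
            "to_unicode": uni == "yes",
            "object_id": (int(obj_num), int(gen)),
        })
    return fonts


if __name__ == "__main__":
    for font in parse_pdffonts(SAMPLE):
        if not font["to_unicode"]:
            print("no ToUnicode map:", font["name"])
```

Fonts flagged this way are the first place to look when copied text comes out as the wrong characters.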


But since you said that you left work and don't have the original file, it's possible that the original file is using some other font for the Hebrew letters. Post the original file when you get back to work and I'll let you know what fonts are in it. Or you can do it yourself, as explained in this 5-minute EE video Micro Tutorial:
Xpdf - PDFfonts - Command Line Utility to List Fonts Used in a PDF File

You should also view the first 5-minute video in the series, which explains how to download all the Xpdf utilities:
Xpdf - Command Line Utility for PDF Files

Regards, Joe

 

Author Comment

by:hadrons
ID: 41786726
Hi Joe, thanks for all the work looking into this. Here's the original file (we were off for the holiday). Thanks, Mike
9783039111398_Excerpt_005.pdf
 
LVL 56

Expert Comment

by:Joe Winograd, EE MVE 2015&2016
ID: 41786829
Here's the output from pdffonts for that file:

name                       type              emb sub uni object ID
-------------------------- ----------------- --- --- --- ---------
TimesNewRoman              TrueType          no  no  no       8  0
Verdana                    TrueType          no  no  no      14  0
MDCCHI+AGaramond-Italic    Type 1C           yes yes yes     18  0
MDCCJJ+AGaramond-Regular   Type 1C           yes yes yes     23  0
MDCCNI+MSTT31c344          Type 1C           yes yes no      28  0
MDCCPI+MSTT31c34f          Type 1C           yes yes no      32  0


The Hebrew letters are in the fonts MSTT31c344 and MSTT31c34f (I'm not familiar with either one).
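Both MSTT fonts show `uni` = no in the listing, so they carry no ToUnicode map, which may be why copied text does not come out as Hebrew. One way to check whether a given extraction attempt actually produced Hebrew characters is to scan the result for code points in the Unicode Hebrew block. This is only a sketch: `is_hebrew` and `hebrew_runs` are hypothetical helper names, and the text to check is assumed to come from whatever extraction tool is being tried.

```python
def is_hebrew(ch: str) -> bool:
    """True if ch is in the Hebrew block or the Hebrew presentation forms."""
    code = ord(ch)
    return 0x0590 <= code <= 0x05FF or 0xFB1D <= code <= 0xFB4F


def hebrew_runs(text: str) -> list:
    """Return the consecutive runs of Hebrew characters found in text."""
    runs, current = [], []
    for ch in text:
        if is_hebrew(ch):
            current.append(ch)
        elif current:
            # A non-Hebrew character ends the current run.
            runs.append("".join(current))
            current = []
    if current:
        runs.append("".join(current))
    return runs


if __name__ == "__main__":
    extracted = "word \u05e9\u05dc\u05d5\u05dd word"
    print(len(hebrew_runs(extracted)), "Hebrew run(s) found")
```

An empty result for a page that visibly contains Hebrew suggests the extraction returned raw glyph codes rather than Unicode text.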

I want to let you know that I'm going offline soon for the rest of today and tonight. Will check back into the thread tomorrow morning to see how you're doing. Regards, Joe
 

Author Closing Comment

by:hadrons
ID: 41788484
Identifying the font and importing it into whatever application is being used is the best approach, as Joe worked out. The unknown font threw off a complete solution, but the overall approach suggested is the best.
 
LVL 56

Expert Comment

by:Joe Winograd, EE MVE 2015&2016
ID: 41788638
Mike,
Thanks for the update. Regards, Joe
Question has a verified solution.
