Solved

Replace wrong PDF non-latin characters

Posted on 2014-03-07
Last Modified: 2014-03-11
Hello,

I can use many operating systems, many applications, many programming languages.
But the best language for me is C# .NET. I can learn to use any application.
PDF files are often created with this problem: non-Latin characters are displayed correctly, but the underlying text does not contain them. For example, Ž is displayed correctly, but when it is copied and pasted it comes out as a different character.
I open the PDF file with 'Foxit Reader' on Windows and 'Goodreader' on iOS. Both let you copy text anywhere with copy-paste.
I don't have the original file (txt, Word) from which the PDF was created.
The problem is the font: somebody created the PDF with a font that does not support non-Latin characters.
Wrong font name: WtUStoneSerItcTMed7qREumM

Times New Roman supports non-Latin characters, but I cannot change the font for the whole PDF. I tried changing it to Times New Roman for a whole PDF page, and the result was a disaster.
I can only change the font for specific letters with the 'Replace All' command (Ctrl+H).

I need correct non-Latin characters so that I can search for any word on every page of the PDF:
Problem shown in picture
All incorrect symbols :
Incorrect Lithuanian symbols
I am able to fix these wrong letters with 'Foxit Advanced PDF Editor'. However, I am not able to fix them when they are at the beginning of a paragraph (wildcards are not working, for whatever reason):
Lithuanian symbols fixed
But I am not able to fix these :
Lithuanian symbols non-fixed
I need to replace the incorrect symbols with the correct ones. What other PDF editor applications, systems, etc. do you suggest?
Here's how I work with "Foxit Advanced PDF Editor":
Foxit Advanced PDF Editor Command 'Replace All' Settings
Thanks
Question by:urban20
20 Comments
 
LVL 35

Expert Comment

by:Dan Craciun
ID: 39912648
Try Adobe Illustrator.

HTH,
Dan
 

Author Comment

by:urban20
ID: 39920526
I have tried Adobe Illustrator CC 17.1 and Adobe Acrobat 11 Pro. Neither worked.
Maybe I need some plugins?
Here :

Illustrator does not understand font
 
LVL 35

Expert Comment

by:Dan Craciun
ID: 39920567
No, you just need the fonts.
From what I see, you need ITC Stone Serif and the other one I don't recognize.
Try to replace it with a font that you know supports those characters, since you don't really care about appearance.

HTH,
Dan
 

Author Comment

by:urban20
ID: 39920624
Adobe software is even worse.
What I've noticed: the problem may lie not in the software but in the operating system and its installed fonts.
I've found that with Foxit Reader (not the editor) it is possible to find these letters:
Able to find but not able to edit
But Foxit Reader is not meant for editing PDF files.
With 'Foxit Advanced PDF Editor' there is no way to find the letters above.

Interesting
 
LVL 35

Expert Comment

by:Dan Craciun
ID: 39920637
If you can post a page from the pdf, I can try to find a solution (fonts needed or settings).

Foxit Reader will use the embedded font to search. The pdf Editor will use the fonts on your system.
 

Author Comment

by:urban20
ID: 39920709
Here is the PDF file without any modification :

http://www.sendspace.com/file/tjbsb3
 

Author Comment

by:urban20
ID: 39920720
I use 'Foxit Advanced PDF Editor' version 3.07
 

Author Comment

by:urban20
ID: 39920846
Where could I find ITC stone serif free version ?
 

Author Comment

by:urban20
ID: 39920889
Here are two pages with some fixed š letters. However, I've noticed that the wrong š letters at the start of a paragraph have not been fixed. Link:
http://www.sendspace.com/file/jidnyc
 
LVL 35

Expert Comment

by:Dan Craciun
ID: 39920914
After working a bit on the file, I can tell you that it's not you, it's the file: the Unicode map is corrupt.
See here for an explanation and here for a possible solution: convert the pdf to curves (so all text info is removed) then OCR it.
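
To make the diagnosis a bit more concrete: a PDF font can carry a ToUnicode CMap, a small table that maps each glyph code to the Unicode character that copy and search should return. When that table is wrong, the page still renders correctly but copied text is garbage. The sketch below parses `bfchar` entries from such a table and flags codes that map into the Private Use Area, which is one possible symptom and not a complete test. The sample CMap text and the helper names are hypothetical and were not taken from the file in this thread; real CMaps also use `bfrange` sections, which this sketch ignores.

```python
import re

# Hypothetical ToUnicode CMap fragment (NOT extracted from the PDF in this thread).
# Each bfchar line maps a glyph code to the UTF-16BE text that copy/search returns.
SAMPLE_CMAP = """
3 beginbfchar
<01> <0041>
<02> <017D>
<03> <F0A9>
endbfchar
"""

BFCHAR_SECTION = re.compile(r"beginbfchar(.*?)endbfchar", re.S)
BFCHAR_ENTRY = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_text):
    """Return {glyph_code: unicode_string} for all bfchar entries."""
    mapping = {}
    for section in BFCHAR_SECTION.findall(cmap_text):
        for code_hex, uni_hex in BFCHAR_ENTRY.findall(section):
            text = bytes.fromhex(uni_hex).decode("utf-16-be", errors="replace")
            mapping[int(code_hex, 16)] = text
    return mapping


def suspicious_codes(mapping):
    """Glyph codes whose target lies in the Private Use Area (U+E000..U+F8FF)."""
    return sorted(
        code for code, text in mapping.items()
        if any(0xE000 <= ord(ch) <= 0xF8FF for ch in text)
    )


if __name__ == "__main__":
    table = parse_bfchar(SAMPLE_CMAP)
    for code, text in sorted(table.items()):
        print(f"glyph {code:#04x} -> U+{ord(text[0]):04X}")
    print("suspicious:", suspicious_codes(table))
```

In this invented sample, glyph 2 maps to Ž (U+017D) as it should, while glyph 3 maps to a Private Use code point that no search for a Lithuanian letter would ever match.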

HTH,
Dan
 
LVL 35

Expert Comment

by:Dan Craciun
ID: 39920981
Attached is a page from the file, converted to curves and then OCRed. Did not know if it's Lithuanian or Latvian, so the OCR might be wrong, but I can copy and search the text.
Pages-from-NT.pdf
 

Author Comment

by:urban20
ID: 39921003
That's the right path. I see that there are several settings for converting to curves and for OCR.
Could you give exact instructions? I don't care which software I use. Also, Acrobat 9 is different from 11 Pro, so for whatever software you suggest, which version do you recommend?

I have previously tried converting the whole document to TIFF files and running OCR with ABBYY FineReader Pro 11. That works, but I need a better way. The problem is always the same: some letters lose their bold formatting (in the original some letters are bold). Every page in ABBYY has the same problem.

Of course, if there is a better way to OCR the file, I need EXACT instructions.

Thanks
 

Author Comment

by:urban20
ID: 39921006
The language is Lithuanian.
 

Author Comment

by:urban20
ID: 39921022
-
 
LVL 35

Accepted Solution

by:
Dan Craciun earned 500 total points
ID: 39921040
Have you looked at the link I gave you? http://forums.adobe.com/message/3938668

I used the steps from there, modified for Acrobat 10
1. Tools->Pages->Watermark->Add Watermark
1'. Press space, press OK
2. Tools->Print Production->Flattener Preview
2'. Check "Convert all text to outlines", check "All pages in document", click Apply
3. Tools->Recognize text->In this file
3'. Click Edit and choose the language (Lithuanian), check "All pages", click OK

If you can't find something in the Tools menu on the right side of the page, go to View->Tools and select it from there.
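
For readers without Acrobat, the idea behind steps 2 and 3 (throw away the broken text layer, then OCR in Lithuanian) can be approximated with the open-source OCRmyPDF command-line tool. This is a substitute technique, not what was used in this thread: instead of converting text to outlines, the `--force-ocr` option rasterizes every page before recognition. The sketch assumes `ocrmypdf` and Tesseract's Lithuanian language data (`lit`) are installed; the file names and function names are placeholders.

```python
import shutil
import subprocess
import sys


def build_ocr_command(src, dst, language="lit"):
    """Build an OCRmyPDF command that ignores the existing (broken) text layer.

    --force-ocr rasterizes each page and runs OCR on it, so the corrupt
    character mapping in the source file is never consulted.
    """
    return ["ocrmypdf", "--force-ocr", "-l", language, src, dst]


def run_ocr(src, dst, language="lit"):
    """Run the command and return its exit code (0 means success)."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    return subprocess.run(build_ocr_command(src, dst, language)).returncode


if __name__ == "__main__":
    # Usage: python ocr_pdf.py input.pdf output.pdf
    if len(sys.argv) == 3:
        sys.exit(run_ocr(sys.argv[1], sys.argv[2]))
    # Without arguments, just show the command that would be run.
    print(" ".join(build_ocr_command("input.pdf", "output.pdf")))
```

Because the pages are rasterized, the output is no longer vector text, and the usual OCR caveats about recognition errors still apply.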

Dan
 

Author Comment

by:urban20
ID: 39921104
I've found all the tools in Adobe Acrobat 11 Pro. I will try different settings in Acrobat. Right now the solution seems to have flaws, but maybe I will find the right settings.
Flaws:
Solutions error
 
LVL 35

Expert Comment

by:Dan Craciun
ID: 39921171
Redo the OCR using Lithuanian instead of Latvian and you'll see some major improvements :)
Pages-from-NT.pdf
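
A quick way to check whether an OCR pass ran with the wrong language is to look for letters outside the Lithuanian alphabet in the recognized text. The sketch below does that for a plain string. The sample string is invented for illustration and the function name is hypothetical; note that q, w and x are also flagged, since they are not part of the Lithuanian alphabet even though they can appear in foreign names.

```python
import unicodedata

# The 32 letters of the Lithuanian alphabet (lowercase), normalized to NFC.
LITHUANIAN_LETTERS = set(
    unicodedata.normalize("NFC", "aąbcčdeęėfghiįyjklmnoprsštuųūvzž")
)


def foreign_letters(text):
    """Return {letter: count} for alphabetic characters not used in Lithuanian."""
    counts = {}
    for ch in unicodedata.normalize("NFC", text).lower():
        if ch.isalpha() and ch not in LITHUANIAN_LETTERS:
            counts[ch] = counts.get(ch, 0) + 1
    return counts


if __name__ == "__main__":
    # Invented sample: "ā" and "ņ" are Latvian letters, a hint that the OCR
    # engine was set to the wrong language.
    sample = "Žodis šalis ācis viņas"
    print(foreign_letters(sample))
```

A result full of Latvian-only letters such as ā, ē, ī, ģ, ķ, ļ or ņ suggests rerunning the OCR with Lithuanian selected.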
 

Author Comment

by:urban20
ID: 39921204
There are improvements, but there are other problems :)
Is there a workaround without OCR? We all know that OCR always produces errors.
Foxit Advanced PDF Editor did fix some Lithuanian letters. Even if I installed the right fonts, would PDF editor software still fail to find these letters?
non-fixed lithuanian characters
Other OCR problems:
OCR is always wrong
 

Author Closing Comment

by:urban20
ID: 39921219
This is an OCR solution, but are there any ways without OCR?
 
LVL 35

Expert Comment

by:Dan Craciun
ID: 39921247
I do think this document has a corrupt Unicode map on purpose, for copy protection.
And it's quite effective.

I have come across similar corrupted documents, but I did not find any solution other than OCR.

Thanks for the points.

Dan
