Solved

Replace wrong PDF non-latin characters

Posted on 2014-03-07
Last Modified: 2014-03-11
Hello,

I can use many operating systems, many applications, many programming languages.
But the best language for me is C# .NET. I can learn to use any application.
PDF files are often created with this problem: non-Latin characters are displayed but are not actually stored in the text. For example, Ž is displayed correctly, but when copied it comes out as a different character.
I open the PDF file with 'Foxit Reader' on Windows and 'Goodreader' on iOS. Both let you copy text out with copy-paste.
I don't have the original file (txt, Word) from which the PDF was created.
The problem is the font: somebody created the PDF with a font that does not support non-Latin characters.
Wrong font name : WtUStoneSerItcTMed7qREumM

Times New Roman supports non-Latin characters, but I cannot change the font for the whole PDF. I tried changing a whole PDF page to Times New Roman, and the result was a disaster.
I can only change the font for specific letters with the 'Replace All' command (Ctrl+H).

I need correct non-Latin characters so that I can search for any word on every page of the PDF:
[Image: Problem shown in picture]
All incorrect symbols:
[Image: Incorrect Lithuanian symbols]
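One way to collect the full list of wrong symbols without checking every page by eye is to paste the copied text into a small script and flag every letter that is not part of the Lithuanian alphabet. This is only a sketch: the function name and the sample string are made up for illustration, and legitimate foreign words (with q, w or x, for example) will also be flagged.

```python
from collections import Counter

# The 32 letters of the Lithuanian alphabet, lower case.
LITHUANIAN_LETTERS = set("aąbcčdeęėfghiįyjklmnoprsštuųūvzž")
ALLOWED = LITHUANIAN_LETTERS | {c.upper() for c in LITHUANIAN_LETTERS}


def suspicious_characters(text):
    """Count letters in `text` that are not in the Lithuanian alphabet."""
    return Counter(ch for ch in text if ch.isalpha() and ch not in ALLOWED)


# Hypothetical sample of text pasted from a broken PDF.
sample = "ðeði žmonės"
for ch, count in suspicious_characters(sample).most_common():
    print(f"U+{ord(ch):04X} {ch!r} appears {count} time(s)")
```

The output gives the code point of each wrong symbol, which is easier to type into a 'Replace All' dialog than a character copied from the page.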
I am able to fix these wrong letters with 'Foxit Advanced PDF Editor'. However, I am not able to fix them when they are at the beginning of a paragraph (wildcards are not working, for whatever reason):
[Image: Lithuanian symbols fixed]
But I am not able to fix these:
[Image: Lithuanian symbols non-fixed]
I need to replace the incorrect symbols with correct ones. What other PDF editor applications, systems, etc. do you suggest?
Here's how I work with 'Foxit Advanced PDF Editor':
[Image: Foxit Advanced PDF Editor Command 'Replace All' Settings]
Thanks
0
Question by:urban20
20 Comments
 
LVL 35

Expert Comment

by:Dan Craciun
ID: 39912648
Try Adobe Illustrator.

HTH,
Dan
0
 

Author Comment

by:urban20
ID: 39920526
I have tried Adobe Illustrator CC 17.1 and Adobe Acrobat 11 Pro. Neither worked.
Maybe I need some plugins?
Here:

[Image: Illustrator does not understand font]
0
 
LVL 35

Expert Comment

by:Dan Craciun
ID: 39920567
No, you just need the fonts.
From what I see, you need ITC Stone Serif and the other one I don't recognize.
Try to replace it with a font that you know supports those characters, since you don't really care about appearance.

HTH,
Dan
0
 

Author Comment

by:urban20
ID: 39920624
Adobe software is even worse.
One thing I've noticed: the problem may lie not in the software but in the operating system and its installed fonts.
I've found that with Foxit Reader (not the editor) it is possible to find these letters:
[Image: Able to find but not able to edit]
But Foxit Reader is not meant for editing PDF files.
'Foxit Advanced PDF Editor': no possibility of finding the above letters.

Interesting
0
 
LVL 35

Expert Comment

by:Dan Craciun
ID: 39920637
If you can post a page from the pdf, I can try to find a solution (fonts needed or settings).

Foxit Reader will use the embedded font to search. The pdf Editor will use the fonts on your system.
0
 

Author Comment

by:urban20
ID: 39920709
Here is the PDF file without any modification :

http://www.sendspace.com/file/tjbsb3
0
 

Author Comment

by:urban20
ID: 39920720
I use 'Foxit Advanced PDF Editor' version 3.07
0
 

Author Comment

by:urban20
ID: 39920846
Where could I find a free version of ITC Stone Serif?
0
 

Author Comment

by:urban20
ID: 39920889
Here are two pages with some fixed š letters. However, I've noticed that the wrong š letters at the start of a paragraph have not been fixed. Link:
http://www.sendspace.com/file/jidnyc
0
 
LVL 35

Expert Comment

by:Dan Craciun
ID: 39920914
After working a bit on the file, I can tell you that it's not you, it's the file: the Unicode map is corrupt.
See here for an explanation and here for a possible solution: convert the pdf to curves (so all text info is removed) then OCR it.
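
To make the diagnosis concrete: a PDF draws glyphs by font-internal codes, and a separate ToUnicode CMap tells copy and search which Unicode character each code stands for. The sketch below uses two tiny hand-written CMap fragments (hypothetical, not taken from the file in this thread) to show how the same page can look right and still copy wrong. Only `bfchar` entries are handled; real CMaps also use `bfrange`.

```python
import re

# Hypothetical ToUnicode fragments: glyph code -> Unicode value (UTF-16BE hex).
GOOD_CMAP = """
2 beginbfchar
<01> <017D>
<02> <0061>
endbfchar
"""

# Same font, same glyphs, but code 01 now points at U+00A6 instead of U+017D.
BROKEN_CMAP = """
2 beginbfchar
<01> <00A6>
<02> <0061>
endbfchar
"""


def parse_bfchar(cmap_text):
    """Return {glyph_code: unicode_string} from the bfchar sections of a CMap."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def extract_text(glyph_codes, cmap_text):
    """Simulate copy-paste: translate glyph codes through the ToUnicode map."""
    mapping = parse_bfchar(cmap_text)
    return "".join(mapping.get(code, "\ufffd") for code in glyph_codes)


page_glyphs = [0x01, 0x02]  # rendered as the shapes "Ž" and "a" in both cases
print(extract_text(page_glyphs, GOOD_CMAP))
print(extract_text(page_glyphs, BROKEN_CMAP))
```

The glyph shapes come from the embedded font and are unaffected, which is why the page displays correctly while copied or searched text does not match.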

HTH,
Dan
0
 
LVL 35

Expert Comment

by:Dan Craciun
ID: 39920981
Attached is a page from the file, converted to curves and then OCRed. Did not know if it's Lithuanian or Latvian, so the OCR might be wrong, but I can copy and search the text.
Pages-from-NT.pdf
0
 

Author Comment

by:urban20
ID: 39921003
That's the right path. I see that there are several settings for the conversion to curves and for the OCR.
Could you give exact instructions? I don't care which software I use. Also, Acrobat 9 is different from 11 Pro. Whatever software you suggest, which version do you suggest?

I have previously tried converting the whole document to TIFF files and running OCR with ABBYY FineReader Pro 11. That works, but I need a better way. The problem is always the same: some letters lose their bold formatting (in the original, some letters are bold). Every page in ABBYY has the same problem.

Of course, if there is a better way to OCR the file, I need EXACT instructions.

Thanks
0
 

Author Comment

by:urban20
ID: 39921006
The language is Lithuanian.
0
 

Author Comment

by:urban20
ID: 39921022
-
0
 
LVL 35

Accepted Solution

by:
Dan Craciun earned 500 total points
ID: 39921040
Have you looked at the link I gave you? http://forums.adobe.com/message/3938668

I used the steps from there, modified for Acrobat 10
1. Tools->Pages->Watermark->Add Watermark
1'. Press space, press OK
2. Tools->Print Production->Flattener Preview
2'. Check "Convert all text to outlines", check "All pages in document", click Apply
3. Tools->Recognize text->In this file
3'. Click Edit and choose the language (Lithuanian), check "All pages", click OK

If you don't find something in the Tools menu in the right side of the page, go to View->Tools and select it from there.
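
For anyone without Acrobat, roughly the same idea (throw away the existing text layer, then OCR in Lithuanian) can be scripted with a different tool. The sketch below assumes the open-source OCRmyPDF command-line program and the Tesseract Lithuanian language data (`lit`) are installed; it is a swapped-in alternative, not the method used in this thread, and its `--force-ocr` option rasterizes each page rather than converting text to outlines. The file names are placeholders.

```python
import subprocess


def build_ocr_command(input_pdf, output_pdf, language="lit"):
    """Build an OCRmyPDF command that re-OCRs every page in `language`."""
    return ["ocrmypdf", "--force-ocr", "-l", language, input_pdf, output_pdf]


def run_ocr(input_pdf, output_pdf, language="lit"):
    """Run the command; raises CalledProcessError if OCRmyPDF reports failure."""
    subprocess.run(build_ocr_command(input_pdf, output_pdf, language), check=True)


# Example call (placeholder file names):
# run_ocr("original.pdf", "searchable.pdf")
print(" ".join(build_ocr_command("original.pdf", "searchable.pdf")))
```

As with the Acrobat steps, the result is only as good as the OCR, so choosing the correct language matters.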

Dan
0
 

Author Comment

by:urban20
ID: 39921104
I've found all the tools in Adobe Acrobat 11 Pro. I will try different settings in Adobe Acrobat. Right now the solution seems to have flaws, but maybe I will find the right settings.
Flaws:
[Image: Solutions error]
0
 
LVL 35

Expert Comment

by:Dan Craciun
ID: 39921171
Redo the OCR using Lithuanian instead of Latvian and you'll see some major improvements :)
Pages-from-NT.pdf
0
 

Author Comment

by:urban20
ID: 39921204
There are improvements, but there are other problems :)
Is there a workaround without OCR? We all know that OCR always produces errors.
Foxit Advanced PDF Editor did fix some Lithuanian letters. Even if I installed the right fonts, would the PDF editor software still not find these letters?
[Image: non-fixed Lithuanian characters]
Other OCR problems:
[Image: OCR is always wrong]
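
A partial workaround that avoids OCR, offered as an untested assumption rather than a confirmed fix: leave the PDF alone, copy or extract the text, and remap the wrong characters with a substitution table, much like 'Replace All' but independent of where a letter sits in a paragraph. The table below is hypothetical; the real pairs have to be collected by comparing the page with the pasted text. It only works if each wrong character consistently stands for exactly one correct letter, and it fixes the extracted text, not the PDF itself.

```python
# Hypothetical pairs: character as pasted -> letter shown on the page.
FIXES = {
    "ð": "š",
    "þ": "ž",
    "è": "č",
    "ë": "ė",
}


def fix_text(text, fixes=FIXES):
    """Replace every wrong character with its intended letter."""
    return text.translate(str.maketrans(fixes))


pasted = "ðeði þmonës"  # made-up sample of text copied from a broken PDF
print(fix_text(pasted))
```

The corrected text can then be saved next to the PDF and searched with any text editor.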
0
 

Author Closing Comment

by:urban20
ID: 39921219
An OCR solution, but is there any way without OCR?
0
 
LVL 35

Expert Comment

by:Dan Craciun
ID: 39921247
I think the Unicode map in this document was corrupted on purpose, for copy protection.
And it's quite effective.

I have come across similar corrupted documents, but I did not find any solution other than OCR.

Thanks for the points.

Dan
0
