Solved

Replace wrong PDF non-latin characters

Posted on 2014-03-07
Last Modified: 2014-03-11
Hello,

I can use many operating systems, applications, and programming languages, but the language I know best is C# .NET. I can learn to use any application.
PDF files are often created with this problem: non-Latin characters are displayed correctly but are not stored correctly. For example, Ž is displayed correctly but comes out as a different character when copied.
I open the PDF file with 'Foxit Reader' on Windows and 'Goodreader' on iOS. Both let you copy text anywhere with copy-paste.
I don't have the original file (txt, Word) from which the PDF was created.
The problem is the font: somebody created the PDF with a font that does not support non-Latin characters.
Wrong font name: WtUStoneSerItcTMed7qREumM

Times New Roman supports non-Latin characters, but I cannot change the font for the whole PDF. I tried changing it to Times New Roman for a whole PDF page: disaster.
I can only change the font for specific letters with the 'Replace All' command (Ctrl+H).

I need correct non-Latin characters so that I can search for any word on every page of the PDF:
[Image: Problem shown in picture]
All incorrect symbols:
[Image: Incorrect Lithuanian symbols]
I am able to fix these wrong letters with 'Foxit Advanced PDF Editor'. However, I am not able to fix them with 'Foxit Advanced PDF Editor' when they are at the beginning of a paragraph (wildcards do not work there, for whatever reason):
[Image: Lithuanian symbols fixed]
But I am not able to fix these:
[Image: Lithuanian symbols non-fixed]
I need to replace the incorrect symbols with the correct ones. What other PDF editor applications, systems, etc. do you suggest?
Here's how I work with 'Foxit Advanced PDF Editor':
[Image: Foxit Advanced PDF Editor Command 'Replace All' Settings]
Thanks
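
The 'Replace All' idea can also be sketched outside a PDF editor. Below is a minimal Python sketch, assuming the page text has already been extracted to a plain string and that each wrong character consistently stands for one correct letter. The mapping is hypothetical (the real pairs would have to be read off the PDF), and the sketch only fixes extracted text; it does not repair the PDF itself.

```python
# Hypothetical wrong -> correct pairs; the real ones must be read off the PDF.
WRONG_TO_CORRECT = {
    "\u00f0": "š",  # 'ð' in copied text where 'š' is displayed
    "\u00d0": "Š",  # 'Ð' -> 'Š'
    "\u00fe": "ž",  # 'þ' -> 'ž'
    "\u00de": "Ž",  # 'Þ' -> 'Ž'
}

# A translation table works per character, so it makes no difference
# whether the wrong character sits at the start of a paragraph or not.
TABLE = str.maketrans(WRONG_TO_CORRECT)


def fix_text(extracted: str) -> str:
    """Replace every wrong character with its assumed correct letter."""
    return extracted.translate(TABLE)


sample = "\u00d0iandien\n\u00deodis ir \u00f0irdis"
fixed = fix_text(sample)
print(fixed)
```

This only helps if the substitution is consistent and the wrong characters never occur legitimately in the text.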
Question by:urban20
20 Comments
 
LVL 35

Expert Comment

by:Dan Craciun
ID: 39912648
Try Adobe Illustrator.

HTH,
Dan
 

Author Comment

by:urban20
ID: 39920526
I have tried Adobe Illustrator CC 17.1 and Adobe Acrobat 11 Pro. Neither worked.
Maybe I need some plugins?
Here:

[Image: Illustrator does not understand font]
 
LVL 35

Expert Comment

by:Dan Craciun
ID: 39920567
No, you just need the fonts.
From what I see, you need ITC Stone Serif and the other one I don't recognize.
Try to replace it with a font that you know supports those characters, since you don't really care about appearance.

HTH,
Dan

 

Author Comment

by:urban20
ID: 39920624
Adobe software is even worse.
One thing I've noticed: the problem may lie not in the software but in the operating system and its installed fonts.
I've found that with Foxit Reader (not the editor) it is possible to find these letters:
[Image: Able to find but not able to edit]
But Foxit Reader is not meant for editing PDF files.
'Foxit Advanced PDF Editor': no way of finding the letters above.

Interesting
 
LVL 35

Expert Comment

by:Dan Craciun
ID: 39920637
If you can post a page from the pdf, I can try to find a solution (fonts needed or settings).

Foxit Reader uses the embedded font to search. The PDF editor uses the fonts installed on your system.
 

Author Comment

by:urban20
ID: 39920709
Here is the PDF file without any modification :

http://www.sendspace.com/file/tjbsb3
 

Author Comment

by:urban20
ID: 39920720
I use 'Foxit Advanced PDF Editor' version 3.07
 

Author Comment

by:urban20
ID: 39920846
Where could I find a free version of ITC Stone Serif?
 

Author Comment

by:urban20
ID: 39920889
Here are two pages with some fixed š letters. However, I've noticed that the wrong š letters at the start of a paragraph have not been fixed. Link:
http://www.sendspace.com/file/jidnyc
 
LVL 35

Expert Comment

by:Dan Craciun
ID: 39920914
After working a bit on the file, I can tell you that it's not you, it's the file: the Unicode map is corrupt.
See here for an explanation and here for a possible solution: convert the pdf to curves (so all text info is removed) then OCR it.
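
As a rough illustration of what a corrupt Unicode map means, here is a minimal Python sketch. The glyph codes and mappings are hypothetical and are not taken from the PDF in this thread; the point is only that what is drawn on screen and what is copied or searched come from two separate tables.

```python
# Hypothetical glyph codes and mappings, for illustration only.

# What the embedded font draws on screen for each character code.
GLYPH_SHAPES = {1: "Ž", 2: "o", 3: "d", 4: "i", 5: "s"}

# A correct code -> Unicode map, used for copy and search.
GOOD_TO_UNICODE = {1: "Ž", 2: "o", 3: "d", 4: "i", 5: "s"}

# A corrupt map: the non-Latin glyph points at the wrong character.
BAD_TO_UNICODE = dict(GOOD_TO_UNICODE)
BAD_TO_UNICODE[1] = "\u00de"  # 'Þ' instead of 'Ž'


def render(codes):
    """Text as the reader sees it (glyph shapes from the font)."""
    return "".join(GLYPH_SHAPES[c] for c in codes)


def extract(codes, to_unicode):
    """Text as copy/search sees it (via the code -> Unicode map)."""
    return "".join(to_unicode.get(c, "\ufffd") for c in codes)


page_codes = [1, 2, 3, 4, 5]
displayed = render(page_codes)
copied_good = extract(page_codes, GOOD_TO_UNICODE)
copied_bad = extract(page_codes, BAD_TO_UNICODE)
print(displayed, copied_good, copied_bad)
```

With the corrupt map the page still looks right, but a search for the displayed word finds nothing, which is why removing the text layer and running OCR gives a fresh, consistent one.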

HTH,
Dan
 
LVL 35

Expert Comment

by:Dan Craciun
ID: 39920981
Attached is a page from the file, converted to curves and then OCRed. Did not know if it's Lithuanian or Latvian, so the OCR might be wrong, but I can copy and search the text.
Pages-from-NT.pdf
 

Author Comment

by:urban20
ID: 39921003
That's the right path. I see that there are several settings for converting to curves and for running OCR on the file.
Could you give exact instructions? I don't care which software I use. Also, Acrobat 9 is different from 11 Pro, so for whatever software you suggest, which version do you suggest?

I have previously tried converting the whole document to TIFF files and running OCR with ABBYY FineReader Pro 11. That works, but I need a better way. The problem is always the same: some letters lose their bold formatting (in the original, some letters are bold). Every page in ABBYY has the same problem.

Of course, if there is a better way to OCR the file, I need EXACT instructions.

Thanks
 

Author Comment

by:urban20
ID: 39921006
The language is Lithuanian.
 

Author Comment

by:urban20
ID: 39921022
-
 
LVL 35

Accepted Solution

by:
Dan Craciun earned 500 total points
ID: 39921040
Have you looked at the link I gave you? http://forums.adobe.com/message/3938668

I used the steps from there, modified for Acrobat 10:
1. Tools->Pages->Watermark->Add Watermark
1'. Press space, press OK
2. Tools->Print Production->Flattener Preview
2'. Check "Convert all text to outlines", check "All pages in document", click Apply
3. Tools->Recognize text->In this file
3'. Click Edit and choose the language (Lithuanian), check "All pages", click OK

If you don't find something in the Tools menu in the right side of the page, go to View->Tools and select it from there.
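
The same remove-the-text-then-OCR idea can also be scripted. The sketch below is a different technique from the Acrobat steps above: it assumes the open-source `ocrmypdf` command-line tool is installed along with Tesseract's Lithuanian language data (code `lit`), and the file names are hypothetical. `--force-ocr` tells ocrmypdf to rasterize every page and OCR it, so the existing (corrupt) text layer is discarded; unlike outlines, the pages become images.

```python
import subprocess


def build_ocr_command(src: str, dst: str, lang: str = "lit") -> list[str]:
    """Build an ocrmypdf command line that rasterizes and re-OCRs every page."""
    return ["ocrmypdf", "--force-ocr", "-l", lang, src, dst]


def run_ocr(src: str, dst: str, lang: str = "lit") -> None:
    """Run ocrmypdf; raises CalledProcessError if the tool reports an error."""
    subprocess.run(build_ocr_command(src, dst, lang), check=True)


# Hypothetical file names; the command is only built here, not executed.
command = build_ocr_command("input.pdf", "output-ocr.pdf")
print(" ".join(command))
```

Calling `run_ocr("input.pdf", "output-ocr.pdf")` would execute the command. OCR errors are still possible, as with any OCR route.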

Dan
 

Author Comment

by:urban20
ID: 39921104
I've found all the tools in Adobe Acrobat 11 Pro. I will try different settings in Acrobat. Right now the solution seems to have flaws, but maybe I will find the right settings.
Flaws:
[Image: Solutions error]
 
LVL 35

Expert Comment

by:Dan Craciun
ID: 39921171
Redo the OCR using Lithuanian instead of Latvian and you'll see some major improvements :)
Pages-from-NT.pdf
 

Author Comment

by:urban20
ID: 39921204
There are improvements, but also other problems :)
Is there a workaround without OCR? We all know that OCR always produces errors.
Foxit Advanced PDF Editor did fix some Lithuanian letters. Even if I installed the right fonts, would PDF editor software still not find these letters?
[Image: non-fixed Lithuanian characters]
Other OCR problems:
[Image: OCR is always wrong]
 

Author Closing Comment

by:urban20
ID: 39921219
The solution uses OCR, but is there any way to do this without OCR?
 
LVL 35

Expert Comment

by:Dan Craciun
ID: 39921247
I think this document's Unicode map was corrupted on purpose, for copy protection.
And it's quite effective.

I have come across similar corrupted documents, but I did not find any solution other than OCR.

Thanks for the points.

Dan