Solved

Replace wrong PDF non-latin characters

Posted on 2014-03-07
Last Modified: 2014-03-11
Hello,

I can work with many operating systems, applications, and programming languages, but the language I know best is C# .NET. I can learn to use any application.
PDF files are often created with this problem: non-Latin characters are displayed correctly, but the underlying text is wrong. For example, Ž is displayed correctly, but when copied it is pasted as a different character.
I open the PDF file with 'Foxit Reader' on Windows and 'Goodreader' on iOS. Both let you copy text with the copy-paste function.
I don't have the original file (txt, Word) from which the PDF was created.
The problem is the font: somebody created the PDF with a font that does not support non-Latin characters.
Wrong font name : WtUStoneSerItcTMed7qREumM

Times New Roman supports non-Latin characters, but I cannot change the font for the whole PDF. I tried changing a whole PDF page to Times New Roman, and it was a disaster.
I can only change the font for specific letters with the 'Replace All' command (Ctrl+H).

I need correct non-Latin characters so that I can search for any word on every page of the PDF :
[Image: Problem shown in picture]
All incorrect symbols :
[Image: Incorrect Lithuanian symbols]
I am able to fix these wrong letters with 'Foxit Advanced PDF Editor'. However, I am not able to fix them when they are at the beginning of a paragraph (wildcards do not work there, for whatever reason) :
[Image: Lithuanian symbols fixed]
But I am not able to fix these :
[Image: Lithuanian symbols non-fixed]
I need to replace the incorrect symbols with the correct ones. What other PDF editor applications, systems, etc. do you suggest ?
Here's how I work with 'Foxit Advanced PDF Editor' :
[Image: Foxit Advanced PDF Editor Command 'Replace All' Settings]
Thanks

Question by:urban20

Expert Comment

by:Dan Craciun
Try Adobe Illustrator.

HTH,
Dan

Author Comment

by:urban20
I have tried Adobe Illustrator CC 17.1 and Adobe Acrobat 11 Pro. Neither worked.
Maybe I need some plugins ?
Here :

[Image: Illustrator does not understand font]

Expert Comment

by:Dan Craciun
No, you just need the fonts.
From what I see, you need ITC Stone Serif and another font that I don't recognize.
Try to replace it with a font that you know supports those characters, since you don't really care about appearance.

HTH,
Dan

Author Comment

by:urban20
Adobe software is even worse.
One thing I've noticed : the problem may lie not in the software but in the operating system and its installed fonts.
I've found that with Foxit Reader (not the editor) it is possible to find these letters :
[Image: Able to find but not able to edit]
But Foxit Reader is not meant for editing PDF files.
'Foxit Advanced PDF Editor' : no way to find the letters above.

Interesting

Expert Comment

by:Dan Craciun
If you can post a page from the pdf, I can try to find a solution (fonts needed or settings).

Foxit Reader will use the embedded font to search. The pdf Editor will use the fonts on your system.

Author Comment

by:urban20
Here is the PDF file without any modification :

http://www.sendspace.com/file/tjbsb3

Author Comment

by:urban20
I use 'Foxit Advanced PDF Editor' version 3.07

Author Comment

by:urban20
Where could I find a free version of ITC Stone Serif ?

Author Comment

by:urban20
Here are two pages with some fixed š letters. However, I've noticed that the wrong š letters at the start of a paragraph have not been fixed. Link :
http://www.sendspace.com/file/jidnyc

Expert Comment

by:Dan Craciun
After working a bit on the file, I can tell you that it's not you, it's the file: the Unicode map is corrupt.
See here for an explanation and here for a possible solution: convert the pdf to curves (so all text info is removed) then OCR it.

HTH,
Dan
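
To illustrate how a PDF can display a letter correctly and still copy it wrongly: a viewer paints glyphs by looking up codes in the embedded font, while copy and search go through a separate code-to-Unicode table (the font's ToUnicode map). The following is only a toy model in Python. The glyph codes and both tables are invented for illustration and are not taken from the posted file.

```python
# Toy model of a PDF font: the page stores glyph codes, not Unicode text.
# All codes and tables below are invented for illustration.

# What the embedded font draws for each code (drives what you SEE).
GLYPH_SHAPES = {0x01: "Ž", 0x02: "o", 0x03: "d", 0x04: "i", 0x05: "s"}

# Code -> Unicode tables (drive what you COPY or SEARCH).
GOOD_TO_UNICODE = {0x01: "Ž", 0x02: "o", 0x03: "d", 0x04: "i", 0x05: "s"}
# Corrupt variant: the code for Ž points at a stray caron character instead.
BAD_TO_UNICODE = {**GOOD_TO_UNICODE, 0x01: "\u02c7"}


def render(codes):
    """What the viewer paints: glyph shapes looked up by code."""
    return "".join(GLYPH_SHAPES[c] for c in codes)


def extract(codes, to_unicode):
    """What copy/search returns: the Unicode mapping, or U+FFFD if missing."""
    return "".join(to_unicode.get(c, "\ufffd") for c in codes)


page = [0x01, 0x02, 0x03, 0x04, 0x05]  # the Lithuanian word "Žodis"
print(render(page))                    # same on screen in both cases
print(extract(page, GOOD_TO_UNICODE))  # copy matches the display
print(extract(page, BAD_TO_UNICODE))   # copy differs from the display
```

In this model the rendering never consults the Unicode table, which is why the page can look fine while search and copy-paste fail.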

Expert Comment

by:Dan Craciun
Attached is a page from the file, converted to curves and then OCRed. I did not know whether the language is Lithuanian or Latvian, so the OCR might be wrong, but I can copy and search the text.
Pages-from-NT.pdf

Author Comment

by:urban20
That's the right path. I see that there are several settings for converting to curves and for OCR.
Could you give exact instructions ? I don't care which software I use. Also, Acrobat 9 is different from Acrobat 11 Pro, so for whatever software you suggest, which version do you suggest ?

I have previously tried converting the whole document to TIFF files and running OCR with ABBYY FineReader Pro 11. That works, but I need a better way. The problem is always the same : some letters lose their bold formatting (in the original some letters are bold). Every page in ABBYY has the same problem.

Of course, if there is a better way to OCR the file, I need EXACT instructions.

Thanks

Author Comment

by:urban20
The language is Lithuanian.

Accepted Solution

by:Dan Craciun
Have you looked at the link I gave you? http://forums.adobe.com/message/3938668

I used the steps from there, modified for Acrobat 10
1. Tools->Pages->Watermark->Add Watermark
1'. Press space, press OK
2. Tools->Print Production->Flattener Preview
2'. Check "Convert all text to outlines", check "All pages in document", click Apply
3. Tools->Recognize text->In this file
3'. Click Edit and choose the language (Lithuanian), check "All pages", click OK

If you don't find something in the Tools menu on the right side of the window, go to View->Tools and select it from there.

Dan
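
For readers without Acrobat, roughly the same idea (ignore the existing text layer, then OCR the pages) can be sketched with the open-source OCRmyPDF command-line tool, driven here from Python. This is a different technique from the steps above: its `--force-ocr` option rasterizes each page instead of converting text to outlines. The file names are placeholders, and `lit` is the Tesseract language code for Lithuanian, which assumes that language pack is installed.

```python
import shutil
import subprocess


def build_ocr_command(src, dst, language="lit"):
    """Build an OCRmyPDF command that ignores the existing text layer."""
    return ["ocrmypdf", "--force-ocr", "-l", language, src, dst]


def reocr(src, dst, language="lit"):
    """Run the command; raises if OCRmyPDF is missing or exits with an error."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf not found on PATH")
    subprocess.run(build_ocr_command(src, dst, language), check=True)


if __name__ == "__main__":
    # Placeholder file names; this only prints the command, it does not run it.
    print(" ".join(build_ocr_command("input.pdf", "output.pdf")))
```

As with any OCR route, the recognized text can contain errors, and rasterizing may cost some visual quality compared with outlines.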

Author Comment

by:urban20
I've found all the tools in Adobe Acrobat 11 Pro. I will try different settings in Acrobat. Right now the solution seems to have flaws, but maybe I will find the right settings.
Flaws :
[Image: Solutions error]

Expert Comment

by:Dan Craciun
Redo the OCR using Lithuanian instead of Latvian and you'll see some major improvements :)
Pages-from-NT.pdf

Author Comment

by:urban20
There are improvements, but also other problems :)
Is there a workaround without OCR ? We all know that OCR always produces errors.
Foxit Advanced PDF Editor did fix some Lithuanian letters. Even if I installed the right fonts, would the PDF editor still fail to find these letters ?
[Image: non-fixed lithuanian characters]
Other OCR problems :
[Image: OCR is always wrong]

Author Closing Comment

by:urban20
The solution uses OCR, but is there any way to do it without OCR ?

Expert Comment

by:Dan Craciun
I do think the Unicode map in this document was corrupted on purpose, for copy protection.
And it's quite effective.

I have come across similar corrupted documents, but I have not found any solution other than OCR.

Thanks for the points.

Dan
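
One observation from the thread is that 'Replace All' did fix some letters, which suggests the wrong characters are at least partly consistent substitutions. If that holds, text extracted from the PDF could in principle be repaired with a fixed substitution table, regardless of where in a paragraph a character sits. The Python sketch below assumes a one-to-one mapping. The table entries are hypothetical and would have to be filled in by comparing displayed and copied text. It repairs extracted text only, not the PDF itself.

```python
# Hypothetical mapping: wrong copied character -> intended Lithuanian letter.
# These pairs are invented; derive the real ones from the actual document.
FIXES = {
    "\u00f0": "š",  # e.g. "ð" pasted where "š" is displayed
    "\u00fe": "ž",  # e.g. "þ" pasted where "ž" is displayed
    "\u00e8": "č",  # e.g. "è" pasted where "č" is displayed
}

TABLE = str.maketrans(FIXES)


def repair(text):
    """Replace every known wrong character, wherever it occurs in the text."""
    return text.translate(TABLE)


sample = "\u00f0uo ir ka\u00fekas"  # stands in for text copied from the PDF
print(repair(sample))
```

If one wrong character stands for several different letters, a table like this cannot work, and OCR remains the fallback.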