Member_2_6672227 asked on

Replace wrong PDF non-latin characters

Hello,

I can use many operating systems, applications, and programming languages, but my preferred language is C# .NET. I can learn to use any application.
PDF files are often created with this problem: non-Latin characters are displayed correctly, but the underlying text does not contain the right characters. For example, Ž is displayed correctly, but when copied it is pasted as a different character.
I open the PDF file with 'Foxit Reader' on Windows and 'Goodreader' on iOS. In both, you can copy text anywhere with the copy-paste function.
I don't have the original file (txt, Word) from which the PDF was created.
The problem is the font: somebody created the PDF with a font that does not support non-Latin characters.
Wrong font name: WtUStoneSerItcTMed7qREumM

Times New Roman supports non-Latin characters, but I cannot change the font for the whole PDF. I tried changing a whole PDF page to Times New Roman, and the result was a disaster.
I can only change the font for specific letters with the 'Replace All' command (Ctrl+H).

I need correct non-Latin characters so that I can search for any word on every page of the PDF:
[Image: Problem shown in picture]
All incorrect symbols:
[Image: Incorrect Lithuanian symbols]
I am able to fix these wrong letters with 'Foxit Advanced PDF Editor'. However, I am not able to fix them with 'Foxit Advanced PDF Editor' when they are at the beginning of a paragraph (wildcards do not work there, for whatever reason):
[Image: Lithuanian symbols fixed]
But I am not able to fix these:
[Image: Lithuanian symbols non-fixed]
I need to replace the incorrect symbols with the correct ones. What other PDF editor applications, systems, etc. do you suggest?
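As a rough illustration of the replacement being asked for, here is a minimal Python sketch that fixes text after it has been copied or extracted from the PDF. It does not edit the PDF itself. The wrong-to-right pairs in `FIXES` are assumptions chosen for illustration (the thread shows the real pairs only in images), and `fix_text` is a hypothetical helper name.

```python
# Hypothetical wrong -> right character pairs. The real pairs for this
# PDF appear only in the thread's images, so these are assumptions.
FIXES = str.maketrans({
    "Þ": "Ž",
    "þ": "ž",
    "Ð": "Š",
    "ð": "š",
    "ë": "ė",
    "è": "č",
})


def fix_text(text: str) -> str:
    """Replace every wrongly mapped character, wherever it occurs."""
    return text.translate(FIXES)


if __name__ == "__main__":
    # The first character of a paragraph is handled like any other.
    print(fix_text("Þemë"))
    print(fix_text("ðuo"))
```

Because the replacement is a plain per-character translation, it does not depend on a character's position in a paragraph. It only helps if every wrong character maps to exactly one correct character.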
Here's how I work with 'Foxit Advanced PDF Editor':
[Image: Foxit Advanced PDF Editor Command 'Replace All' Settings]
Thanks
Topics: Office Productivity, Linux, Microsoft Development

Last comment: Dan Craciun, 8/22/2022 (Mon)

Dan Craciun

Try Adobe Illustrator.

HTH,
Dan
ASKER
Member_2_6672227

I have tried Adobe Illustrator CC 17.1 and Adobe Acrobat 11 Pro. Neither worked.
Maybe I need some plugins?
Here:

[Image: Illustrator does not understand font]
Dan Craciun

No, you just need the fonts.
From what I see, you need ITC Stone Serif and the other one I don't recognize.
Try to replace it with a font that you know supports those characters, since you don't really care about appearance.

HTH,
Dan
ASKER
Member_2_6672227

Adobe software is even worse.
One thing I've noticed: the problem may lie not in the software but in the operating system and its installed fonts.
I've found that with Foxit Reader (not the editor) it is possible to find these letters:
[Image: Able to find but not able to edit]
But Foxit Reader is not meant for editing PDF files.
'Foxit Advanced PDF Editor' cannot find the letters above.

Interesting
Dan Craciun

If you can post a page from the pdf, I can try to find a solution (fonts needed or settings).

Foxit Reader will use the embedded font to search. The pdf Editor will use the fonts on your system.
ASKER
Member_2_6672227

Here is the PDF file without any modification :

http://www.sendspace.com/file/tjbsb3
ASKER
Member_2_6672227

I use 'Foxit Advanced PDF Editor' version 3.07
ASKER
Member_2_6672227

Where could I find a free version of ITC Stone Serif?
ASKER
Member_2_6672227

Here are two pages with some of the š letters fixed. However, I've noticed that the wrong š letters at the start of a paragraph have not been fixed. Link:
http://www.sendspace.com/file/jidnyc
Dan Craciun

After working a bit on the file, I can tell you that it's not you, it's the file: the Unicode map is corrupt.
See here for an explanation and here for a possible solution: convert the pdf to curves (so all text info is removed) then OCR it.
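
To make the failure mode concrete, here is a minimal Python sketch of how a PDF can display a character correctly yet copy it incorrectly. The glyph codes and both mapping tables are made up for illustration; they are not taken from the posted file, and the real structures in a PDF are more involved.

```python
# Illustrative model only: the codes and tables below are assumptions.

# What the embedded font draws for each glyph code (used for display).
GLYPHS = {1: "Ž", 2: "e", 3: "m", 4: "ė"}

# What the Unicode map reports for each glyph code (used for copy/search).
GOOD_TOUNICODE = {1: "Ž", 2: "e", 3: "m", 4: "ė"}
BAD_TOUNICODE = {1: "Þ", 2: "e", 3: "m", 4: "ë"}  # assumed corruption


def render(codes):
    """Display path: only the font's glyphs matter."""
    return "".join(GLYPHS[c] for c in codes)


def extract(codes, to_unicode):
    """Copy/search path: only the Unicode map matters."""
    return "".join(to_unicode.get(c, "\ufffd") for c in codes)


if __name__ == "__main__":
    page = [1, 2, 3, 4]
    print("displayed:", render(page))
    print("copied with good map:", extract(page, GOOD_TOUNICODE))
    print("copied with bad map:", extract(page, BAD_TOUNICODE))
```

The display path never consults the Unicode map, which is why the page looks right while copied text comes out wrong. Converting to curves discards both tables, leaving only shapes for OCR to read.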

HTH,
Dan
Dan Craciun

Attached is a page from the file, converted to curves and then OCRed. Did not know if it's Lithuanian or Latvian, so the OCR might be wrong, but I can copy and search the text.
Pages-from-NT.pdf
ASKER
Member_2_6672227

That's the right path. I see that there are several settings for converting to curves and for OCR.
Could you give exact instructions? I don't care which software I use. Also, Acrobat 9 is different from 11 Pro, so whatever software you suggest, which version do you recommend?

I have previously tried converting the whole document to TIFF files and running OCR with ABBYY FineReader Pro 11. That works, but I need a better way. The problem is always the same: some letters lose their bold formatting (in the original, some letters are bold). Every page in ABBYY has the same problem.

Of course, if there is a better way to OCR the file, I need EXACT instructions.

Thanks
ASKER
Member_2_6672227

The language is Lithuanian.
ASKER CERTIFIED SOLUTION
Dan Craciun

Log in or sign up to see answer
ASKER
Member_2_6672227

I've found all the tools in Adobe Acrobat 11 Pro. I will try different settings in Acrobat. Right now the solution seems to have flaws, but maybe I will find the right configuration.
Flaws:
[Image: Solutions error]
Dan Craciun

Redo the OCR using Lithuanian instead of Latvian and you'll see some major improvements :)
Pages-from-NT.pdf
ASKER
Member_2_6672227

There are improvements, but also other problems :)
Is there a workaround without OCR? We all know that OCR always produces errors.
Foxit Advanced PDF Editor did fix some Lithuanian letters. Even if I installed the right fonts, would PDF editor software still fail to find these letters?
[Image: non-fixed lithuanian characters]
Other OCR problems:
[Image: OCR is always wrong]
ASKER
Member_2_6672227

OCR is a solution, but are there any ways without OCR?
Dan Craciun

I think this document's Unicode map was corrupted on purpose, for copy protection.
And it's quite effective.

I have come across similarly corrupted documents, but I have not found any solution other than OCR.

Thanks for the points.

Dan