rgb192
asked on
- end of lines
This question is a followup to
https://www.experts-exchange.com/questions/28429437/delete-page-numbers.html
I scanned in an old book using neat.com ocr.
The text has many hyphens (-) splitting words at the end of a line, for example:
sep-
arating
This looks good in an old book but not in a .doc file.
Note: the file was a .pdf before it became a .doc,
but a .doc solution would be easier because I already have the file.
I could also easily apply an Acrobat or Nuance PDF solution and then convert to .doc.
Hi rgb192,
The problem is that there's no way to distinguish between a word hyphenated at a line break and a compound word. For example, look at the one page from your previous question, where OCR fixed some of the hyphenations:
sup-port
princi-ples
informa-tion
heal-ing
But it did not fix:
SELF-HEALTH
well-being
It's good that it didn't fix those, because those are compound words and the hyphen should remain. Even if those were split across lines, such as "well-" at the end of one line and "being" at the beginning of the next, the hyphen should remain. So it's a tricky issue. And words like "re-creation" and "recreation" make it even trickier.
Btw, I can't explain why OCR handled "sup-port", "princi-ples", "informa-tion", and "heal-ing" correctly on that page, but did not handle "Smother-ing", "re-currence", "ho-listic", and "be-cause". Regards, Joe
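To illustrate why this is tricky, here is a rough Python sketch of one possible heuristic for plain text (not what the OCR or Word actually does): rejoin a word split at a line end only when the joined form is in a known-word list, and keep the hyphen otherwise. The word lists and the function names are made up for the example; a real run would need a full dictionary.

```python
import re

# Hypothetical word lists for illustration only.
KNOWN_WORDS = {"support", "principles", "information", "healing", "because", "holistic"}
# Compounds that keep their hyphen even when split across lines.
KNOWN_COMPOUNDS = {"well-being", "self-health"}

# Word fragment, hyphen, optional spaces, line break, rest of the word.
LINE_END_HYPHEN = re.compile(r"(\w+)-[ \t]*\r?\n(\w+)")

def rejoin(match):
    left, right = match.group(1), match.group(2)
    joined = left + right
    hyphenated = left + "-" + right
    if hyphenated.lower() in KNOWN_COMPOUNDS:
        return hyphenated      # real compound: keep the hyphen
    if joined.lower() in KNOWN_WORDS:
        return joined          # ordinary word broken at the line end
    return hyphenated          # unknown: keep the hyphen for manual review

def dehyphenate(text):
    return LINE_END_HYPHEN.sub(rejoin, text)

print(dehyphenate("sup-\nport and well- \nbeing"))
```

A pair like "re-creation" and "recreation" still cannot be decided this way, since both are real words, so anything not in the lists is left hyphenated for a human to check.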
ASKER
Find: -^p
Replace:
(Nothing)
This seems like a good idea, but which text editor can I use this in?
And do I copy and paste the result back into Microsoft Word or Nuance PDF?
Maybe it can cure
"Smother-ing", "re-currence", "ho-listic", and "be-cause"
That is specific to Microsoft Word.
Have you posted a sample Word document anywhere? If not, it might help to see a bit of what you are dealing with.
ASKER
Thank you.
Each line is, in Word terms, a paragraph. Also, the hyphenation has introduced a space after the hyphen, so that my suggestion won't work. Instead, put a space after the hyphen in the Find, so that it becomes
Find: - ^p
Replace:
(Still nothing)
I will try to find a way to join the lines of the original paragraphs together so that the text flows as intended (it may need some VBA coding).
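For anyone working on a plain-text export instead of inside Word, the same replacement can be sketched in Python. This is only an illustration: the function names are invented, and the line-joining step assumes that blank lines separate the real paragraphs, which may not hold for this document.

```python
import re

def remove_line_end_hyphens(text):
    # Hyphen, one space, then the line break: the plain-text
    # counterpart of Find "- ^p" with an empty Replace.
    return re.sub(r"- \r?\n", "", text)

def join_lines(text):
    # Assumption: blank lines separate the real paragraphs.
    paragraphs = re.split(r"\n\s*\n", text)
    joined = []
    for paragraph in paragraphs:
        lines = [line.strip() for line in paragraph.splitlines() if line.strip()]
        joined.append(" ".join(lines))
    return "\n\n".join(joined)

sample = "The words are sep- \narating here\nand there.\n\nNext paragraph."
print(join_lines(remove_line_end_hyphens(sample)))
```

Like the Word replacement, this removes every hyphen that is followed by a space and a line break, so a compound word split at the line end loses its hyphen too.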
Hi Graham,
Attached is a one-page Word doc and the one-page PDF file from which it was created (both posted under Fair Use from a 242-page copyrighted book). It's interesting that the PDF-to-Word conversion program (Nuance's Power PDF Advanced) pieced together some of the hyphenated words but not others. Regards, Joe
finalPdf-page14.doc
finalPdf-page14.pdf
ASKER
Find: ^31
Replace:
(Nothing)
This removed some.
Please tell me if there is another find and replace,
or tell me if that is all that can be done.
some-dashes-removed.docx
ASKER
-^31
returned 0 results in some-dashes-removed.docx
using Microsoft Word 2007
You entered the wrong string. I'll say it again...please read it carefully this time. You should enter:
- ^13
To be clear, that's a normal hyphen (dash) followed by a normal space followed by a caret (Shift-6) and then the number 13.
You forgot the space after the hyphen and you entered 31 instead of 13.
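The difference between the two numbers is easier to see once the codes are expanded. In Word's Find box, a caret followed by a number stands for the character with that code; 13 is the carriage return that Word uses as the paragraph mark, while 31 is a different control character (as far as I know, Word uses it for the optional hyphen, which would explain why ^31 alone "removed some"). A small Python sketch, with a made-up helper name, shows what each string actually contains:

```python
import re

def expand_word_codes(find_text):
    # Replace each ^N numeric code with the character that has code N.
    # Named codes such as ^p are not numeric and are left unchanged.
    return re.sub(r"\^(\d+)", lambda m: chr(int(m.group(1))), find_text)

print(repr(expand_word_codes("- ^13")))
print(repr(expand_word_codes("-^31")))
```

The first string is hyphen, space, carriage return; the second is a hyphen followed by character 31 with no space, which does not occur at the ends of these lines.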
ASKER
I think all the
-
are gone
thanks
You're welcome. That's great news! Cheers, Joe
If, however, the lines are being terminated early by being divided into separate paragraphs, then you might be able to use Find and Replace
Find: -^p
Replace:
(Nothing)
If that doesn't work satisfactorily, can you post a sample document portion please?