
  • Status: Solved

Converting from PDF to text, preserving formatting

We publish on our website pages from a magazine that comes out every two months.
I am sent the whole magazine in PDF format. I find the process of converting it into a form suitable for pasting onto a web page very laborious.

If I save an article from the PDF file as text, it has a paragraph mark at the end of every line and the boldface is lost. Even with find and replace in Word, using the special characters option, it is terribly hit and miss and time-consuming.

Is there any other way of streamlining the process? The only formatting I have to preserve is boldface and the paragraphs.
1 Solution
Unfortunately, this cannot be done directly within Office without some work. You would need to create macros to change the most common things and then do the rest by hand. The only other way to reduce the angst and accomplish this task is with a third-party PDF converter.

ABBYY PDF Transformer is the best that I have used.  They have a Try and Buy feature that allows you to use the full version for a trial period.

Others may have used another application.
The most scalable and robust solution would be to save the PDF as plain text and apply CSS to format the web pages. Failing that, and looking to keep it as simple as possible, try selecting the text directly off the PDF page with the Acrobat Reader select tool (in the toolbar at the top). Once you have copied it, open Microsoft Word and, under the 'Edit' menu, select 'Paste Special'. Choose 'RTF - rich text format' and most of your formatting should carry over (except columns and tables). If you want a software solution, try Adobe Acrobat, which can save directly to an HTML file and does so very well (expensive, though).

PS - the paragraph marks may be from MS Word: go to Tools:Options on the Word menu bar and open the View tab. About halfway down you will see the formatting marks visibility choices. Make sure 'All' is unchecked.
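For the plain-text-plus-CSS route mentioned above, here is a minimal sketch of one way it could be scripted. It assumes the saved text already has a blank line between paragraphs; the function name `paragraphs_to_html` and the class name `article` are made up for illustration. Note that plain text carries no boldface, so bold would still have to be marked up separately.

```python
from html import escape


def paragraphs_to_html(text: str, css_class: str = "article") -> str:
    """Wrap each blank-line-separated paragraph in a <p> tag.

    The css_class value is a hypothetical hook for your own stylesheet.
    """
    # Split on blank lines and drop empty chunks.
    blocks = [b.strip() for b in text.split("\n\n") if b.strip()]
    # Escape &, < and > so the article text cannot break the page markup.
    return "\n".join(f'<p class="{css_class}">{escape(b)}</p>' for b in blocks)


if __name__ == "__main__":
    print(paragraphs_to_html("Fish & chips\n\nSecond paragraph"))
```

The paragraphs can then be styled in one place with a CSS rule on `p.article` rather than by formatting each page by hand.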
bogorman (Author) commented:
Thanks to both of you.
The PDF-to-Word utility suggested by geneus is excellent, but in fact emphaticdigital has solved my particular problem in a simpler (and less expensive!) way.
'All' was checked. Having unchecked it, I then just did a find and replace for two consecutive paragraph marks, replacing them with two line breaks. Then I replaced the remaining paragraph marks with nothing and the text is almost right.
Thanks so much.
Have assigned the points to emphaticdigital. Hope you think this is fair.
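For anyone who would rather script the same clean-up than repeat the find and replace in Word every two months, here is a rough Python sketch of the idea. It assumes the article has been saved as plain text with a line break at the end of every line and a blank line between real paragraphs; `unwrap_paragraphs` is a made-up name. One deliberate difference from the steps described above: single line breaks are replaced with a space rather than with nothing, so that words at line ends do not run together.

```python
import re


def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines, keeping blank-line paragraph breaks."""
    # Normalise Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Two or more consecutive line ends mark a real paragraph break.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = []
    for para in paragraphs:
        # Within a paragraph, a single line end is only a wrap point.
        lines = [line.strip() for line in para.split("\n") if line.strip()]
        if lines:
            cleaned.append(" ".join(lines))
    return "\n\n".join(cleaned)


if __name__ == "__main__":
    sample = "First line\nsecond line.\n\nNew paragraph\ncontinues here."
    print(unwrap_paragraphs(sample))
```

This does nothing for boldface, which is lost as soon as the article is saved as plain text.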
