The ability to edit PDF documents can be useful, but it is not always a straightforward process. Many non-technical people don't realise that a PDF document is often closer to an image than to a text file, even if it contains nothing but text.
If the PDF document was created via tools in MS Word or similar, then a simple copy and paste should get you the text and most formatting. However, if the PDF was a scanned document or created from a bitmap image, that option is not available.
At times you may also receive a protected PDF that doesn't allow copying of text but does allow printing.
You may also have an old version of Adobe Reader that doesn't have the tool to copy text, and for various reasons you cannot perform the update.
In these cases, to extract the text from a PDF document you need to perform Optical Character Recognition (OCR). The level of success depends on the quality of the original document you are converting and the quality of the OCR software.
Products such as Adobe Acrobat Professional do have quality OCR, and the Export tool is specifically designed to convert a PDF into a Word document or other format. This software doesn't come cheap, but there are some alternatives. Depending on the level of security on your network, you may not be able to install the free alternatives. Before seeking your IT department's assistance, you can utilise software they have most likely already provided you.
This article assumes that you have Adobe Reader 7 or higher and Microsoft Office 2003 or higher installed. Most businesses have these available for all staff.
1. Open the PDF document and go to File > Print.
2. Choose Microsoft Office Document Image Writer from the list of printers.
Make any other changes required such as pages to print.
Click OK. This will create a TIF file, which is required in the next steps.
3. Choose a location to save the file.
The Desktop is the preferred location for simplicity.
4. Load the MS Office Document Imaging software.
This can be launched by double clicking on C:\Program Files\Common Files\Microsoft Shared\MODI\11.0\MSPVIEW.E
Depending on your version of Office, the exact location may differ. For Office 2003 it is ...\MODI\11.0. For Office 2007 it is ...\MODI\12.0. The numbered folder should be the only folder in that directory unless you have multiple versions of Office installed.
You then need to open the file saved from the previous step.
5. Select Tools > Send Text to Word.
6. Retain the default values. The default folder can be changed to anything you like. The output file will be the Word document and will have the same name you chose in step 3.
7. Click OK on the confirmation message.
8. The program will process the request.
9. Once complete, MS Word will open with the text for editing. Edit and save as necessary. Here is a sample of the source PDF followed by the output.
As you can see, this process is not perfect and your results will vary. Formatting may be lost and images may not transfer, although any text on the images, such as logos or letterheads, will. The text is presented either in tables and left-aligned, or as HTML. Which one you get depends on the version of Word you have installed.
This method is useful for obtaining the text from a scanned document so that it can be reformatted into a different layout. If you want an exact conversion with formatting and images intact, then you will require other OCR software, which will usually entail an investment.
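If you plan to reformat the recovered text, it usually helps to tidy it first. The following is only a minimal sketch in Python, assuming you have saved or pasted the OCR output as plain text. The function name `tidy_ocr_text` is hypothetical and is not part of Word or MS Office Document Imaging.

```python
import re


def tidy_ocr_text(raw: str) -> str:
    """Tidy plain text recovered by OCR (hypothetical helper for illustration)."""
    # Normalise Windows and old Mac line endings to "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Rejoin words that were hyphenated across a line break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Treat one or more blank lines as a paragraph break.
    paragraphs = re.split(r"\n\s*\n", text)
    # Within each paragraph, collapse line breaks and repeated spaces.
    cleaned = [" ".join(p.split()) for p in paragraphs]
    # Drop empty paragraphs and separate the rest with a single blank line.
    return "\n\n".join(p for p in cleaned if p)


if __name__ == "__main__":
    sample = "Optical Char-\nacter   Recognition\nworks.\n\n\n\nSecond   paragraph.\r\n"
    print(tidy_ocr_text(sample))
```

Note that the hyphen rule also joins genuine compounds that happen to break across a line (for example "well-" followed by "known"), so proofread the result rather than trusting it blindly.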
While there are several free programs that do an adequate job - often better than the above - many IT departments won't allow freeware on the network. Adobe Acrobat Standard offers a nice halfway point between Reader and Professional in both price and features. I would recommend that product if you require better text recognition.
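As a footnote to step 4, the version-numbered folder (11.0, 12.0 and so on) can also be found with a short script instead of browsing by hand. This is a rough sketch in Python, assuming Python is available on the machine. The function name `find_modi_version_dirs` is hypothetical, and the folder passed in is the MODI directory from step 4, which may differ on your system.

```python
import re
from pathlib import Path

# Folder names such as "11.0" or "12.0".
VERSION_NAME = re.compile(r"\d+\.\d+")


def find_modi_version_dirs(modi_root):
    """Return the version-numbered subfolders of modi_root, lowest version first."""
    root = Path(modi_root)
    if not root.is_dir():
        # MODI is not installed here, or the path is different on this machine.
        return []
    found = [
        entry
        for entry in root.iterdir()
        if entry.is_dir() and VERSION_NAME.fullmatch(entry.name)
    ]
    # Sort numerically so that "9.0" would come before "11.0".
    return sorted(found, key=lambda e: tuple(int(part) for part in e.name.split(".")))


if __name__ == "__main__":
    # Path taken from step 4; adjust it if Office is installed elsewhere.
    modi_root = r"C:\Program Files\Common Files\Microsoft Shared\MODI"
    for folder in find_modi_version_dirs(modi_root):
        print(folder)
```

If more than one folder is listed, you have multiple versions of Office installed, and the highest number corresponds to the newest version.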