Advait Kawthalkar

How to convert a PDF file into a TXT file which has a predefined format.

I receive certain information from my clients in a PDF file. We need to feed this information into our ERP system. The system allows import of a TXT file, but that file must be in a particular format. At present we are keying in the data manually. If I could devise something that exports the data from the PDF into a TXT file in the format (template) my system requires, that would be a great help.

Here is what I want to do: convert a PDF into TXT, then import that TXT into our system. (The catch is that the TXT file must be in a standard format. I have the specification of the required format.) Can this be automated?

Let me know if you need any more info.

Awaiting your comments and suggestions.

Thanks
Myinfo.
ASKER CERTIFIED SOLUTION
Joe Winograd
This solution is only available to members.
I would say the easier method would be to get it done at the source. See how the client is exporting the data from their system, and whether they can export it for you in a standard format that is useful to you.

Normally programs will have an export to CSV etc., which can be imported by most applications.
I do something similar at a client site -- extract text and then parse each text file with regular expressions.  I'm currently using a version of PDFtoText, mentioned by Joe, that might be from foolabs, PDF Technologies, or some other software vendor.

I am going to migrate that workstation to a virtual server soon and may take the opportunity to explore alternatives.  If I do explore alternatives, I will probably look at PowerShell.  Using the iTextSharp assembly, from SourceForge, it has the potential to do both text extraction and text parsing.

I might also consider a Python solution.
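
As a rough idea of what the Python version of the "extract text, then parse with regular expressions" approach could look like, here is a minimal sketch. Everything specific in it is an assumption for illustration: the sample text, the field names, the regular expressions and the pipe-delimited output layout are all hypothetical, since the real PDF contents and the ERP import specification have not been posted.

```python
import re

# Hypothetical sample of text as it might come out of a PDF-to-text tool.
SAMPLE_TEXT = """\
Invoice No: INV-1042
Date: 2014-03-17
Customer: Acme Traders
Total: 1,250.50
"""

# Hypothetical field patterns; the real ones depend on the client's PDF layout.
PATTERNS = {
    "invoice": re.compile(r"^Invoice No:\s*(\S+)", re.MULTILINE),
    "date": re.compile(r"^Date:\s*(\d{4}-\d{2}-\d{2})", re.MULTILINE),
    "customer": re.compile(r"^Customer:\s*(.+?)\s*$", re.MULTILINE),
    "total": re.compile(r"^Total:\s*([\d,]+\.\d{2})", re.MULTILINE),
}


def parse_fields(text):
    """Pull each named field out of the extracted text."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match is None:
            # Fail loudly so a changed PDF layout is noticed, not imported.
            raise ValueError(f"field not found: {name}")
        fields[name] = match.group(1)
    return fields


def format_record(fields):
    """Build one import line in a hypothetical pipe-delimited layout."""
    total = fields["total"].replace(",", "")  # drop thousands separators
    return "|".join(
        [fields["invoice"], fields["date"], fields["customer"], total]
    )


if __name__ == "__main__":
    print(format_record(parse_fields(SAMPLE_TEXT)))
```

The parsing and the formatting are kept in separate functions so that only `format_record` needs to change once the real import specification is known.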
Hello myinfo,

You can convert PDF files to TXT with either of two utilities: Adobe Reader or a third-party tool.

1. Download & Run Adobe Reader: http://get.adobe.com/reader/

Now follow the steps below:

1.1. Open the PDF file in Adobe Reader and click File >> Save As Other... >> Text...

1.2. Save the file as text; this converts your file from PDF to TXT.

2. Download and run PDFWARE PDF TOOLBOX (third-party tool): http://www.pdfware.org/pdf-toolbox.html

Now follow the steps below:

2.1. Open PDF TOOLBOX and click on Extract Text;

2.2. Add your file by clicking on the Add file(s) button;

2.3. Choose the destination where you want to save the file;

2.4. Choose your settings and click on the Next button;

2.5. Your file will now be extracted.

Note: PDFWARE PDF Toolbox works perfectly for me. I have tried it personally, which is why I am suggesting it. I am not promoting any software or company.

Thanks
Edwin
@Edwin

Can that be run from a batch file or command line?  In other words, is there a non-GUI interface to running that utility?
The software supports batch conversion; you can add a folder if you have multiple PDF files.

About the GUI: there is no command line to run the software. But the GUI is easy to understand. If you want to try the software, I would suggest its demo version. If it satisfies you, you can then purchase the software. It costs only $69 for a Personal Licence.

Thanks
Edwin
@Edwin
"Can this be automated."
Given this requirement in the question, how can the PDFWARE PDF TOOLBOX be automated without a command line interface?
@aikimark

The process is automatic; you only have to select the file(s) and click the Extract button.

I don't think this can be done by any fully automatic process. At the least, a user has to select the file, and the same has to be done in this software too.
@Edwin

The problem posed stated that PDF files would be received in an on-going basis and there is a need to replace the manual entry (or copy/paste) of data.  If the PDFWARE PDF TOOLBOX can not be automated (no command line interface), then it is not a solution to the stated problem.
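
To make "automated" concrete: with a command-line converter, a scheduled script can pick up every newly received PDF and convert it with no user interaction. The following is only a sketch under assumptions. It assumes a `pdftotext` executable is on the PATH, and the inbox/outbox folder arrangement and the function names are hypothetical.

```python
import subprocess
from pathlib import Path


def pending_pdfs(inbox, outbox):
    """Return the PDFs in inbox that have no matching .txt in outbox yet."""
    inbox, outbox = Path(inbox), Path(outbox)
    return sorted(
        pdf
        for pdf in inbox.glob("*.pdf")
        if not (outbox / (pdf.stem + ".txt")).exists()
    )


def convert_all(inbox, outbox, converter="pdftotext"):
    """Convert every pending PDF; intended to be run from a scheduled task."""
    outbox = Path(outbox)
    outbox.mkdir(parents=True, exist_ok=True)
    converted = []
    for pdf in pending_pdfs(inbox, outbox):
        target = outbox / (pdf.stem + ".txt")
        # Assumes the converter accepts "<input.pdf> <output.txt>" arguments,
        # as pdftotext does; check=True raises if the conversion fails.
        subprocess.run([converter, str(pdf), str(target)], check=True)
        converted.append(target)
    return converted
```

A GUI-only tool cannot be dropped into the `subprocess.run` call above, which is the point of the objection.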

ASKER

Hello Joe,

Thanks for the suggestion. Can this tool export the TXT in a specific layout or template? That is all that matters.

My info
No, this tool converts to a normal .txt file. If you want a specific layout or template, you can convert your PDF document to Word, because .txt doesn't support any layout or template.
@Edwin

The question was addressed to Joe Winograd.

==============
@Myinfo

I have used or tested a couple of the utilities referenced in Joe's comment.  The answer is yes.  At least one of them, PdfToText, has a couple of text output formatting options.  Two of these options resemble the layout of the PDF document.
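
For reference, here is a small sketch of selecting an output format when calling the utility from Python. It assumes the Xpdf/Poppler build of `pdftotext`, where `-layout` keeps the physical layout of the page and `-raw` keeps the text in content-stream order. Other builds, including the one I use at the client site, may offer different options, so check the tool's own help output. The function names are hypothetical.

```python
import subprocess

# Options of the Xpdf/Poppler pdftotext; other builds may differ.
MODES = {
    "default": [],
    "layout": ["-layout"],  # keep the physical layout of the page
    "raw": ["-raw"],        # keep the text in content-stream order
}


def build_command(pdf_path, txt_path, mode="layout"):
    """Return the argument list for one conversion in the chosen mode."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    return ["pdftotext", *MODES[mode], pdf_path, txt_path]


def extract(pdf_path, txt_path, mode="layout"):
    """Run the conversion; requires pdftotext to be installed and on the PATH."""
    subprocess.run(build_command(pdf_path, txt_path, mode), check=True)
```

Layout mode tends to be the easier starting point when the fields are later picked out by position or by regular expression, since columns stay aligned as they appear on the page.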
I'm sorry guys, but I found a product called PaperPort which did the job. Thank you for your efforts, and sorry.
Hi myinfo,

I've been using PaperPort for around 20 years, ever since V3. I'm currently on the latest version, 14.5 Pro. I've written numerous articles about it here at EE that you may find helpful now that you're using it:

PaperPort Upgrade: How to download and install updated versions of PaperPort 11 and 12
https://www.experts-exchange.com/Web_Development/Document_Imaging/A_6537-PaperPort-Upgrade-How-to-download-and-install-updated-versions-of-PaperPort-11-and-12.html

Automatic Duplex Scanning in PaperPort Versions 11, 12, 14
https://www.experts-exchange.com/Software/Server_Software/Document_Management/A_10331-Automatic-Duplex-Scanning-in-PaperPort-Versions-11-12-14.html

PaperPort - Blank Page Job Separator with Duplex Scanning
https://www.experts-exchange.com/Software/Server_Software/Document_Management/A_11344-PaperPort-Blank-Page-Job-Separator-with-Duplex-Scanning.html

PaperPort - How To Achieve More Than Five Scanning Profiles in the Standard Edition
https://www.experts-exchange.com/Web_Development/Document_Imaging/A_12864-PaperPort-How-To-Achieve-More-Than-Five-Scanning-Profiles-in-the-Standard-Edition.html

PaperPort - How To Reorder/Rearrange Scanning Profiles
https://www.experts-exchange.com/Web_Development/Document_Imaging/A_12875-PaperPort-How-To-Reorder-Rearrange-Scanning-Profiles.html

I've also published a couple of 5-minute video Micro Tutorials that may be helpful:

Document Imaging: PaperPort Send To Bar - Part 1
https://www.experts-exchange.com/VP_207.html

Document Imaging: PaperPort Send To Bar - Part 2
https://www.experts-exchange.com/VP_208.html

But all of that said, I actually do not understand how it solved the problem as you described it. Are you using it to make what PaperPort calls a PDF Searchable Image file? Or to make a Word file, perhaps via drag-and-drop onto the Word icon on the Send To Bar? I'm very interested to hear how PaperPort is the answer to your problem. Thanks, Joe