• Status: Solved
  • Priority: Medium
  • Security: Public

How to convert a PDF file into a TXT file which has a predefined format.

I receive certain information from my clients in a PDF file. We need to feed this information into our ERP system. The system allows import of a TXT file, but that file must be in a particular format. At present we are entering the data manually. If I could devise something that exports the data in the PDF to a TXT file while maintaining the format, template, or style that my system requires, that would be a great help.

Here is what I want to do: convert a PDF into TXT, then import that TXT into our system. (The catch is that the TXT file must be in a standard format. I have the specification of the required format.) Can this be automated?

Let me know if you need any more info.

Awaiting your comments and suggestions.

Thanks
Myinfo.
Asked by: Advait Kawthalkar
1 Solution
 
Joe Winograd, Fellow & MVE (Developer) commented:
You may be able to do what you need with a tool called Xpdf:
http://www.foolabs.com/xpdf/

It is a set of eight command line executables. The only one you'll need is <pdftotext.exe>, which converts PDF files to plain text. You may download the package here:
http://www.foolabs.com/xpdf/download.html

To learn how to download and install it, I suggest watching Xpdf - Command Line Utility for PDF Files - Part 1, a 5-minute Experts Exchange video Micro Tutorial:
http://www.experts-exchange.com/Web_Development/Document_Imaging/VP_213.html

That is Part 1 of a 3-part series. Another 5-minute EE video Micro Tutorial, Xpdf - Convert PDF Files to Plain Text Files - Part 3, explains the specific tool you need, which is PDFtoText (Part 2 of the series is not relevant for you):
http://www.experts-exchange.com/Web_Development/Document_Imaging/VP_217.html

For your purposes, you should use the -layout option when you call PDFtoText, because it maintains as much as possible the original physical layout of the text. Regards, Joe
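A minimal sketch of such a call from a script, assuming the pdftotext executable is on the PATH. The function names and file names are hypothetical; only the -layout option comes from the comment above.

```python
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path, exe="pdftotext"):
    """Return the argument list for a layout-preserving conversion."""
    # -layout asks pdftotext to keep the original physical layout
    # of the text as far as possible.
    return [exe, "-layout", str(pdf_path), str(txt_path)]


def convert_pdf(pdf_path, exe="pdftotext"):
    """Convert one PDF to a .txt file next to it and return the new path."""
    txt_path = Path(pdf_path).with_suffix(".txt")
    # check=True raises CalledProcessError if pdftotext reports a failure.
    subprocess.run(build_pdftotext_command(pdf_path, txt_path, exe), check=True)
    return txt_path
```

Calling convert_pdf("order.pdf") would be expected to write order.txt beside the PDF, provided the executable is installed.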
 
Scott Thomson commented:
I would say the easier method would be to get it done at the source. Find out how the client exports the data from their system, and see whether they can export it in a standard format that is useful to you.

Normally programs will have export to CSV etc which can be imported by most applications.
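If the client can supply CSV, reshaping it into the import format could look roughly like this. It is only a sketch: the fixed-width layout, field names, and widths in FIELDS are made up for illustration, and the real ones would come from the ERP specification.

```python
import csv
import io

# Hypothetical target layout as (field name, width) pairs.
# Replace with the fields and widths from the ERP specification.
FIELDS = [("code", 10), ("desc", 20), ("qty", 6)]


def csv_to_fixed_width(csv_text):
    """Convert CSV text with a header row into fixed-width lines."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Truncate values that are too long, then pad with spaces to the width.
        lines.append("".join(row[name][:width].ljust(width) for name, width in FIELDS))
    return lines
```

Each returned line has the same total length (36 characters with the widths above), so it can be written to the TXT file one record per line.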
 
aikimark commented:
I do something similar at a client site -- extract text and then parse each text file with regular expressions.  I'm currently using a version of PDFtoText, mentioned by Joe, that might be from foolabs, PDF Technologies, or some other software vendor.

I am going to migrate that workstation to a virtual server soon and may take the opportunity to explore alternatives.  If I do, I will probably look at PowerShell.  Using the iTextSharp assembly, from SourceForge, it has the potential to do both text extraction and text parsing.

I might also consider a Python solution.
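As a rough illustration of the extract-then-parse approach in Python: the sketch below assumes a hypothetical line shape (item code, description, quantity, unit price, separated by runs of two or more spaces). The pattern and field names are invented and would need to be adapted to the real PDF text.

```python
import re

# Hypothetical line shape: code, description, quantity, price,
# with columns separated by two or more spaces.
LINE_RE = re.compile(
    r"^\s*(?P<code>[A-Z0-9-]+)\s{2,}(?P<desc>.+?)\s{2,}"
    r"(?P<qty>\d+)\s{2,}(?P<price>\d+\.\d{2})\s*$"
)


def parse_lines(text):
    """Return one dict per line that matches LINE_RE; other lines are skipped."""
    records = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            records.append(match.groupdict())
    return records
```

Headers, totals, and other lines that do not fit the pattern are simply ignored, which keeps the parser tolerant of page furniture in the extracted text.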
 
Edwin Hoffer (Technical Expert) commented:
Hello myinfo,

You can convert PDF files to TXT with either of two utilities: Adobe Reader or a third-party tool.

1. Download & Run Adobe Reader: http://get.adobe.com/reader/

Then follow these steps:

1.1. Open the PDF file in Adobe Reader and click File >> Save As Other... >> Text...

1.2. Save the file as text to complete the conversion from PDF to TXT.

2. Download and run PDFWARE PDF TOOLBOX (a third-party tool): http://www.pdfware.org/pdf-toolbox.html

Then follow these steps:

2.1. Open PDF TOOLBOX and click Extract Text.

2.2. Add your file by clicking the Add file(s) button.

2.3. Choose the destination folder where you want to save the file.

2.4. Choose your settings and click the Next button.

2.5. Your file will then be extracted.

Note: PDFWARE PDF Toolbox works well for me. I have tried it personally, which is why I am suggesting it. I am not promoting any software or company.

Thanks
Edwin
 
aikimark commented:
@Edwin

Can that be run from a batch file or command line?  In other words, is there a non-GUI interface to running that utility?
 
Edwin Hoffer (Technical Expert) commented:
The software supports batch conversion; you can add a folder if you have multiple PDF files.

About the GUI: there is no command line interface to run the software, but the GUI is easy to understand. If you want to try the software, I suggest the demo version. If it satisfies you, you can purchase it. The software costs only $69 for a Personal Licence.

Thanks
Edwin
 
aikimark commented:
@Edwin
"Can this be automated."
Given this requirement in the question, how can the PDFWARE PDF TOOLBOX be automated without a command line interface?
 
Edwin Hoffer (Technical Expert) commented:
@aikimark

The process is automatic; you only have to select the file(s) and click the extract button.

I don't think this can be done by a fully automatic process. At the least, a user has to select the file, and the same step has to be followed in the software as well.
 
aikimark commented:
@Edwin

The problem as posed stated that PDF files would be received on an ongoing basis and that there is a need to replace the manual entry (or copy/paste) of data.  If the PDFWARE PDF TOOLBOX cannot be automated (no command line interface), then it is not a solution to the stated problem.
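For contrast, a command line converter can be driven unattended. The sketch below assumes a hypothetical inbox/outbox folder arrangement and uses the pdftotext tool mentioned earlier in this thread; it could be run on a schedule so that newly received PDFs are converted without anyone opening a GUI.

```python
import subprocess
from pathlib import Path


def run_pdftotext(pdf_path, txt_path):
    """Convert one PDF with pdftotext, keeping the physical layout."""
    subprocess.run(["pdftotext", "-layout", str(pdf_path), str(txt_path)], check=True)


def convert_new_pdfs(inbox, outbox, convert=run_pdftotext):
    """Convert every PDF in inbox that has no matching .txt in outbox yet."""
    inbox, outbox = Path(inbox), Path(outbox)
    outbox.mkdir(parents=True, exist_ok=True)
    converted = []
    for pdf_path in sorted(inbox.glob("*.pdf")):
        txt_path = outbox / (pdf_path.stem + ".txt")
        if txt_path.exists():
            continue  # already handled on an earlier run
        convert(pdf_path, txt_path)
        converted.append(txt_path)
    return converted
```

The convert parameter is there so that a different converter can be swapped in; anything callable with a PDF path and a TXT path will do.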
 
Advait Kawthalkar (Sr. Manager IT, Author) commented:
Hello Joe,

Thanks for the suggestion. Can this tool export the TXT in a particular layout or template? That is all that matters.

My info
 
Edwin Hoffer (Technical Expert) commented:
No, this tool converts to a plain .txt file. If you want a particular layout or template, you can convert your PDF document to Word, because .txt doesn't support any layout or template.
 
aikimark commented:
@Edwin

The question was addressed to Joe Winograd.

==============
@Myinfo

I have used or tested a couple of the utilities referenced in Joe's comment.  The answer is yes.  At least one of them, PdfToText, has a couple of text output formatting options.  Two of these options resemble the layout of the PDF document.
 
Advait Kawthalkar (Sr. Manager IT, Author) commented:
I'm sorry, guys, but I found a product called PaperPort that did the job. Thank you for your efforts, and sorry.
 
Joe Winograd, Fellow & MVE (Developer) commented:
Hi myinfo,

I've been using PaperPort for around 20 years, ever since V3. I'm currently on the latest version, 14.5 Pro. I've written numerous articles about it here at EE that you may find helpful now that you're using it:

PaperPort Upgrade: How to download and install updated versions of PaperPort 11 and 12
http://www.experts-exchange.com/Web_Development/Document_Imaging/A_6537-PaperPort-Upgrade-How-to-download-and-install-updated-versions-of-PaperPort-11-and-12.html

Automatic Duplex Scanning in PaperPort Versions 11, 12, 14
http://www.experts-exchange.com/Software/Server_Software/Document_Management/A_10331-Automatic-Duplex-Scanning-in-PaperPort-Versions-11-12-14.html

PaperPort - Blank Page Job Separator with Duplex Scanning
http://www.experts-exchange.com/Software/Server_Software/Document_Management/A_11344-PaperPort-Blank-Page-Job-Separator-with-Duplex-Scanning.html

PaperPort - How To Achieve More Than Five Scanning Profiles in the Standard Edition
http://www.experts-exchange.com/Web_Development/Document_Imaging/A_12864-PaperPort-How-To-Achieve-More-Than-Five-Scanning-Profiles-in-the-Standard-Edition.html

PaperPort - How To Reorder/Rearrange Scanning Profiles
http://www.experts-exchange.com/Web_Development/Document_Imaging/A_12875-PaperPort-How-To-Reorder-Rearrange-Scanning-Profiles.html

I have also published a couple of 5-minute video Micro Tutorials that may be helpful:

Document Imaging: PaperPort Send To Bar - Part 1
http://www.experts-exchange.com/VP_207.html

Document Imaging: PaperPort Send To Bar - Part 2
http://www.experts-exchange.com/VP_208.html

But all of that said, I actually do not understand how it solved the problem as you described it. Are you using it to make what PaperPort calls a PDF Searchable Image file? Or to make a Word file, perhaps via drag-and-drop onto the Word icon on the Send To Bar? I'm very interested to hear how PaperPort is the answer to your problem. Thanks, Joe
Question has a verified solution.
