Solved

How to convert a PDF file into a TXT file which has a predefined format.

Posted on 2014-08-15
14
Medium Priority
302 Views
Last Modified: 2014-10-11
I receive certain information from my clients in a PDF file. We need to feed this information into our ERP system. The system allows import of a TXT file, but that file must be in a particular format. At present we are keying in the data manually. If I could devise something that exports the data from the PDF into a TXT file while maintaining the format or template my system requires, that would be a great help.

Here is what I want to do: convert a PDF into TXT, then import that TXT into our system. (The catch is that the TXT file must be in a standard format. I have the specification of the required format.) Can this be automated?

Let me know if you need any more info.

Awaiting your comments and suggestions.

Thanks
Myinfo.
Question by:Advait Kawthalkar
14 Comments
 
LVL 57

Accepted Solution

by:
Joe Winograd, EE MVE 2015&2016 earned 2000 total points
ID: 40264639
You may be able to do what you need with a tool called Xpdf:
http://www.foolabs.com/xpdf/

It is a set of eight command line executables. The only one you'll need is <pdftotext.exe>, which converts PDF files to plain text. You may download the package here:
http://www.foolabs.com/xpdf/download.html

To learn how to download and install it, I suggest watching Xpdf - Command Line Utility for PDF Files - Part 1, a 5-minute Experts Exchange video Micro Tutorial:
http://www.experts-exchange.com/Web_Development/Document_Imaging/VP_213.html

That is Part 1 of a 3-part series. Another 5-minute EE video Micro Tutorial, Xpdf - Convert PDF Files to Plain Text Files - Part 3, explains the specific tool you need, which is PDFtoText (Part 2 of the series is not relevant for you):
http://www.experts-exchange.com/Web_Development/Document_Imaging/VP_217.html

For your purposes, you should use the -layout option when you call PDFtoText, because it maintains as much as possible the original physical layout of the text. Regards, Joe
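A minimal sketch of how that call could be scripted, assuming pdftotext is on the PATH and that Python is available on the machine doing the conversion. The function names and the file names in the usage comment are hypothetical; only the pdftotext executable and its -layout option come from the answer above.

```python
import subprocess


def build_pdftotext_command(pdf_path, txt_path, executable="pdftotext"):
    """Return the argument list for one pdftotext call using -layout."""
    # -layout asks pdftotext to keep the physical layout of the page text.
    return [executable, "-layout", pdf_path, txt_path]


def convert_pdf_to_text(pdf_path, txt_path, executable="pdftotext"):
    """Run pdftotext; raises CalledProcessError if the tool reports a failure."""
    command = build_pdftotext_command(pdf_path, txt_path, executable)
    subprocess.run(command, check=True)


# Hypothetical usage:
# convert_pdf_to_text("client_order.pdf", "client_order.txt")
```

The resulting text file still has to be reshaped into the layout the ERP import expects; this step only gets the text out of the PDF.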
 
LVL 10

Expert Comment

by:Scott Thomson
ID: 40264673
I would say the easier method would be to get it done at the source. Find out how the client is exporting the data from their system and see whether they can export it for you in a standard format that is useful to you.

Most programs have an export to CSV or similar, which can be imported by most applications.
 
LVL 46

Expert Comment

by:aikimark
ID: 40264679
I do something similar at a client site -- extract text and then parse each text file with regular expressions.  I'm currently using a version of PDFtoText, mentioned by Joe, that might be from foolabs, PDF Technologies, or some other software vendor.

I am going to migrate that workstation to a virtual server soon and may take the opportunity to explore alternatives.  If I do, I will probably look at PowerShell.  Combined with the iTextSharp assembly, from SourceForge, it has the potential to do both text extraction and text parsing.

I might also consider a Python solution.
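As a rough illustration of the extract-then-parse approach in Python, the sketch below pulls fields out of already extracted text with regular expressions and writes them as one delimited line. Everything specific here is an assumption: the labels "Order No", "Date", and "Total", the field names, and the pipe-delimited output are hypothetical stand-ins, since neither the client's PDF layout nor the ERP import specification is given in this thread.

```python
import re

# Hypothetical patterns; the real labels depend on the client's PDF.
FIELD_PATTERNS = {
    "order_no": re.compile(r"Order No\.?\s*:\s*(\S+)"),
    "date": re.compile(r"Date\s*:\s*(\d{4}-\d{2}-\d{2})"),
    "amount": re.compile(r"Total\s*:\s*([\d.]+)"),
}


def parse_fields(text):
    """Return a dict of field name -> captured value ('' when not found)."""
    record = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        record[name] = match.group(1) if match else ""
    return record


def format_record(record):
    """Join the fields in a fixed order (hypothetical pipe-delimited layout)."""
    return "|".join(record[name] for name in FIELD_PATTERNS)
```

In practice the patterns and the output routine would be rewritten against the actual import specification, and an empty field would probably be treated as an error rather than written out.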
LVL 15

Expert Comment

by:Edwin Hoffer
ID: 40264707
Hello myinfo,

You can convert PDF files into TXT with either of two utilities: Adobe Reader or a third-party tool.

1. Download & Run Adobe Reader: http://get.adobe.com/reader/

Now follow below steps:

1.1. Open the PDF file in Adobe Reader and click File >> Save As Other... >> Text...

1.2. Save the file as text, and your PDF will be converted to TXT.

2. Download and Run PDFWARE PDF TOOLBOX: http://www.pdfware.org/pdf-toolbox.html - Third Party Tool

Now follow below steps:

2.1. Open PDF TOOLBOX and click Extract Text.

2.2. Add your file by clicking the Add file(s) button.

2.3. Choose the destination where you want to save the file.

2.4. Choose your settings and click the Next button.

2.5. Your file will now be extracted.

Note: PDFWARE PDF Toolbox works well for me. I have personally tried it, which is why I am suggesting it. I am not promoting any software or company.

Thanks
Edwin
 
LVL 46

Expert Comment

by:aikimark
ID: 40265012
@Edwin

Can that be run from a batch file or command line?  In other words, is there a non-GUI interface to running that utility?
 
LVL 15

Expert Comment

by:Edwin Hoffer
ID: 40267225
The software supports batch conversion; you can add a folder if you have multiple PDF files.

About the GUI: there is no command line interface to run the software, but the GUI is easy to understand. If you want to try the software, I would suggest trying its demo version. If it satisfies you, you can then purchase it. The software costs only $69 for a Personal Licence.

Thanks
Edwin
 
LVL 46

Expert Comment

by:aikimark
ID: 40267239
@Edwin
Can this be automated.
Given this requirement in the question, how can the PDFWARE PDF TOOLBOX be automated without a command line interface?
 
LVL 15

Expert Comment

by:Edwin Hoffer
ID: 40267385
@aikimark

The process is automatic; you just have to select the file(s) and click the extract button.

I don't think this can be done by a fully automatic process. At the very least a user has to select the file, and the same applies in this software.
 
LVL 46

Expert Comment

by:aikimark
ID: 40267392
@Edwin

The problem posed stated that PDF files would be received on an ongoing basis and that there is a need to replace the manual entry (or copy/paste) of data.  If the PDFWARE PDF TOOLBOX cannot be automated (no command line interface), then it is not a solution to the stated problem.
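To make the point concrete, here is a sketch of what unattended processing could look like with a command line converter such as pdftotext. It assumes incoming PDFs land in one folder and converted text goes to another; the folder arguments and function names are hypothetical, and the script would still need to be scheduled (for example with Windows Task Scheduler) to run without a user.

```python
import subprocess
from pathlib import Path


def find_pending(inbox, outbox):
    """List PDFs in inbox that have no matching .txt in outbox yet."""
    inbox, outbox = Path(inbox), Path(outbox)
    return sorted(
        pdf for pdf in inbox.glob("*.pdf")
        if not (outbox / (pdf.stem + ".txt")).exists()
    )


def convert_pending(inbox, outbox, executable="pdftotext"):
    """Convert every pending PDF; assumes pdftotext is on the PATH."""
    for pdf in find_pending(inbox, outbox):
        target = Path(outbox) / (pdf.stem + ".txt")
        subprocess.run([executable, "-layout", str(pdf), str(target)], check=True)
```

A GUI-only tool cannot be driven this way, which is why the lack of a command line interface matters for this question.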
 

Author Comment

by:Advait Kawthalkar
ID: 40269484
Hello Joe,

Thanks for the suggestion. Can this tool export the TXT in a particular layout or template? That is all that matters.

My info
 
LVL 15

Expert Comment

by:Edwin Hoffer
ID: 40269492
No, this tool converts to a normal .txt file. If you want a particular layout or template, you can convert your PDF document to Word, because .txt does not support any layout or template.
 
LVL 46

Expert Comment

by:aikimark
ID: 40269882
@Edwin

The question was addressed to Joe Winograd.

==============
@Myinfo

I have used or tested a couple of the utilities referenced in Joe's comment.  The answer is yes.  At least one of them, PdfToText, has a couple of text output formatting options.  Two of these options resemble the layout of the PDF document.
 

Author Comment

by:Advait Kawthalkar
ID: 40367673
I'm sorry, guys, but I found a product called PaperPort which did the job. Thank you for your efforts, and sorry.
 
LVL 57

Expert Comment

by:Joe Winograd, EE MVE 2015&2016
ID: 40368364
Hi myinfo,

I've been using PaperPort for around 20 years, ever since V3. I'm currently on the latest version, 14.5 Pro. I've written numerous articles about it here at EE that you may find helpful now that you're using it:

PaperPort Upgrade: How to download and install updated versions of PaperPort 11 and 12
http://www.experts-exchange.com/Web_Development/Document_Imaging/A_6537-PaperPort-Upgrade-How-to-download-and-install-updated-versions-of-PaperPort-11-and-12.html

Automatic Duplex Scanning in PaperPort Versions 11, 12, 14
http://www.experts-exchange.com/Software/Server_Software/Document_Management/A_10331-Automatic-Duplex-Scanning-in-PaperPort-Versions-11-12-14.html

PaperPort - Blank Page Job Separator with Duplex Scanning
http://www.experts-exchange.com/Software/Server_Software/Document_Management/A_11344-PaperPort-Blank-Page-Job-Separator-with-Duplex-Scanning.html

PaperPort - How To Achieve More Than Five Scanning Profiles in the Standard Edition
http://www.experts-exchange.com/Web_Development/Document_Imaging/A_12864-PaperPort-How-To-Achieve-More-Than-Five-Scanning-Profiles-in-the-Standard-Edition.html

PaperPort - How To Reorder/Rearrange Scanning Profiles
http://www.experts-exchange.com/Web_Development/Document_Imaging/A_12875-PaperPort-How-To-Reorder-Rearrange-Scanning-Profiles.html

I have also published a couple of 5-minute video Micro Tutorials that may be helpful:

Document Imaging: PaperPort Send To Bar - Part 1
http://www.experts-exchange.com/VP_207.html

Document Imaging: PaperPort Send To Bar - Part 2
http://www.experts-exchange.com/VP_208.html

But all of that said, I actually do not understand how it solved the problem as you described it. Are you using it to make what PaperPort calls a PDF Searchable Image file? Or to make a Word file, perhaps via drag-and-drop onto the Word icon on the Send To Bar? I'm very interested to hear how PaperPort is the answer to your problem. Thanks, Joe
Question has a verified solution.
