Solved

Extracting clean text from a PDF

Posted on 2010-08-18
Last Modified: 2013-12-17
I have Adobe Acrobat 9 Pro installed. Can I use the PDF library in its .dlls to access objects in a PDF?
What I want to do is programmatically (not manually) produce output like the attached 2927oc NTL_rulesOnly.txt from 2927oc NTL.pdf.

The PDF has all sorts of special characters that I’d like to skip.

Is there another tool that I can integrate into a Visual Studio project to do this?

Attachment: 0181749251.zip
Question by: mihaisz

Accepted Solution

by: SylvainDrapeau (earned 250 total points)
Hello!

Take a look at iText: http://itextpdf.com/

It should do what you need.

There is an example here: http://www.codeproject.com/KB/cs/PDFToText.aspx

Syldra

Assisted Solution

by: DanSo1 (earned 250 total points)
You don't need external libraries.
Just use the clipboard: programmatically open your document in any version of Acrobat Reader, select all, copy, and paste the result into a .txt file.
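
Whichever route produces the raw text (a library or the clipboard), the special characters mentioned in the question still have to be filtered out afterwards. Here is a minimal sketch of that clean-up step in Python. It assumes "special" means anything other than printable ASCII, tabs, and newlines. The function name `clean_extracted_text` is made up for illustration and does not come from any library.

```python
import re

# Assumption: "special" = anything that is not printable ASCII, a tab, or a newline.
_SPECIAL = re.compile(r"[^\x20-\x7E\t\n]")


def clean_extracted_text(raw: str) -> str:
    """Drop special characters and tidy whitespace in text pulled out of a PDF."""
    # Normalise Windows and old Mac line endings to "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Remove bullets, form feeds, symbols, and other non-ASCII or control characters.
    text = _SPECIAL.sub("", text)
    # Collapse runs of spaces and tabs, then trim each line.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")]
    # Skip lines that are empty after cleaning.
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    sample = "Rule 1:\u2022 keep\x0c this\r\n\r\n\u00a9  Rule 2\ttext\u2122"
    print(clean_extracted_text(sample))
```

Note that this filter also drops typographic quotes and accented letters. If the rules text needs those, widen the character class accordingly.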