Adobe Acrobat export: character capitalization and spacing issue

Posted on 2010-01-07
Last Modified: 2013-12-02
I am trying to export a PDF document in Acrobat Pro, and when I export it to HTML, XML, etc., the capitalization and spacing of the words are all wrong. Yet the text displays correctly on the page.

For example, a title:
Self-Limiting Growth of Metal Oxide Thin Films Using Pulsed PECVD

Exports as
Self-limiting Growth of Metal oxide thin films using Pulsed PeCVd

Even a simple copy and paste from the document into a text editor does the same thing.

Also, words get clumped together.

For example: I like to run over to the river and sit.

ends up being:

Iliketorunover to the river and sit.
I've got about 1000 documents that I need to parse text out of, so I'm trying to get them into some kind of format that I can parse.

Can someone explain why this is happening and whether there is anything I can do to resolve it?


FYI: the metadata is unusable, as it contains only part of the text that I actually need. According to the document properties, the creating application is Adobe InDesign CS2.

I've attached the XMP data of the document in case it gives any idea as to the issue.
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>

<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 4.2.1-c043 52.372728, 2009/01/18-15:56:37        ">

   <rdf:RDF xmlns:rdf="">

      <rdf:Description rdf:about=""





         <xmp:CreatorTool>Adobe InDesign CS2 (4.0)</xmp:CreatorTool>


      <rdf:Description rdf:about=""





               <rdf:li xml:lang="x-default">Self-Limiting Growth of Metal Oxide Thin Films Using Pulsed PECVD</rdf:li>





               <rdf:li>C.A. Wolden</rdf:li>




      <rdf:Description rdf:about=""





      <rdf:Description rdf:about=""


         <pdf:Producer>Adobe PDF Library 7.0</pdf:Producer>


      <rdf:Description rdf:about=""



























<?xpacket end="w"?>


Question by: polobruce

    Accepted Solution

    Unfortunately, you are out of luck. To extract text from a PDF, the PDF needs to contain a "ToUnicode" mapping table, which lets the extractor take a glyph code (which references the character shape drawn on the PDF page) and map it back to a real character in the extracted text. That table is corrupt in your PDF document.
    Your best chance of recovering all the data is to run the documents through an OCR program (e.g. ABBYY FineReader).
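
To make the ToUnicode explanation concrete, here is a minimal toy sketch in Python. It is not how Acrobat or any real extractor is implemented: the glyph codes, both tables, and the `extract_text` helper are invented for illustration. It only shows why the page can look right (the glyph shapes are drawn correctly) while the extracted text comes out with the wrong case (the lookup table used for extraction is wrong).

```python
# Toy model of a PDF font's ToUnicode table: glyph code -> Unicode character.
# All codes and tables below are hypothetical; a real PDF stores this mapping
# as a CMap stream attached to each font.

def extract_text(glyph_codes, to_unicode):
    """Map each glyph code drawn on the page back to a character."""
    # Codes missing from the table become U+FFFD (the replacement character).
    return "".join(to_unicode.get(code, "\ufffd") for code in glyph_codes)

# The page "draws" the word PECVD with five glyph codes of a made-up font.
page_glyphs = [1, 2, 3, 4, 5]

# Correct table: every glyph maps to the letter its outline depicts.
good_table = {1: "P", 2: "E", 3: "C", 4: "V", 5: "D"}

# Damaged table: two entries point at the wrong (lowercase) characters.
bad_table = {1: "P", 2: "e", 3: "C", 4: "V", 5: "d"}

print(extract_text(page_glyphs, good_table))  # PECVD
print(extract_text(page_glyphs, bad_table))   # PeCVd
```

Rendering never consults this table, which is why the on-screen page is unaffected. OCR sidesteps the table entirely by reading the rendered shapes.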
