11.05.2007 at 08:45AM PST, ID: 22939487
Perl parsing PDF

Asked by saibsk in Perl Programming Language

Tags: perl, pdf, parse

I want to parse the data in a PDF file using Perl. I need to look for the string "Department" and retrieve the value assigned to it. This is the header of the PDF file. Then there are 4 columns in the PDF file, and I need to parse the values in these columns. I can use a regular expression for the parsing, but how do I do that with a PDF file?

      Student id       credits      fee    scholarship

John
Harry
About this solution

Zone: Perl Programming Language
Tags: perl, pdf, parse
Solution Provided By: saibsk
Participating Experts: 2
Solution Grade: B
 
 
11.05.2007 at 09:45AM PST, ID: 20217877
That will depend on how the data is set up in the PDF file.  Can you post the PDF file somewhere?
 
11.05.2007 at 11:15AM PST, ID: 20218546
It has this format.

 Department:

            Student id       credits      fee    scholarship

John
Harry

But I can't post the PDF file data. I need to parse the above data in the PDF file.
 
11.05.2007 at 12:00PM PST, ID: 20219003
Without seeing the actual data, it will be hard to help.  Anyways, here is something that should get you going.


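use strict;
use warnings;
use CAM::PDF;

my $pdf = CAM::PDF->new('test1.pdf') or die "Could not open test1.pdf\n";

# getPageText returns the extracted text of a page. getPageContent would
# return the raw content stream (PDF drawing operators), which the
# patterns below are unlikely to match.
my $page1 = $pdf->getPageText(1);  # or whatever page you need

my @lines = split(/\n/, $page1);
my $names = 0;  # set to 1 once the column header has been seen
foreach (@lines) {
      if (/Student id\s+credits\s+fee\s+scholarship/) {
            $names = 1;
      }
      elsif ($names == 1) {
            # Five whitespace-separated fields: name plus the four columns
            if (/(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)/) {
                  print "Name=$1\n";
                  print "ID=$2\n";
                  print "Credits=$3\n";
                  print "Fee=$4\n";
                  print "Scholarship=$5\n";
            }
      }
}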

 
11.05.2007 at 12:58PM PST, ID: 20219420
But what if there is more than 1 page in a PDF document?
 
11.05.2007 at 01:20PM PST, ID: 20219601
Does each page have the "Student id       credits      fee    scholarship" header, or is that only listed once?

You can get the number of pages, and loop through all of them with:
my $pages = $pdf->numPages();
for(1..$pages) {
    my $page = $pdf->getPageText($_);
    .....
}
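
If the header is repeated on every page, the per-page loop can reset its "header seen" flag at the start of each page. Below is a minimal sketch, assuming the extracted text keeps each student on one line with five whitespace-separated fields. The names collect_rows and $get_text are hypothetical, and the sample page texts (including the department name) are made up for illustration. With CAM::PDF, the callback would presumably wrap $pdf->getPageText($n).

```perl
use strict;
use warnings;

# Hypothetical helper: walk pages 1..$pages, fetch each page's text through
# the $get_text callback, and collect the rows that follow the column header.
sub collect_rows {
    my ($pages, $get_text) = @_;
    my @rows;
    for my $n (1 .. $pages) {
        my $in_table = 0;    # reset per page: the header repeats on each page
        my $text = $get_text->($n);
        next unless defined $text;    # skip pages with no extractable text
        for my $line (split /\n/, $text) {
            if ($line =~ /Student id\s+credits\s+fee\s+scholarship/) {
                $in_table = 1;
            }
            elsif ($in_table
                && $line =~ /^\s*(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s*$/) {
                push @rows, { name => $1, id => $2, credits => $3,
                              fee => $4, scholarship => $5, page => $n };
            }
        }
    }
    return @rows;
}

# Made-up page texts standing in for real PDF output.
my @sample = (
    "Department: Physics\n  Student id  credits  fee  scholarship\nJohn 1001 12 3000 500\n",
    "  Student id  credits  fee  scholarship\nHarry 1002 9 2250 0\n",
);
my @rows = collect_rows(scalar @sample, sub { $sample[$_[0] - 1] });
print "$_->{name} (page $_->{page}): id=$_->{id}\n" for @rows;
```

Passing the page reader as a callback keeps the parsing logic independent of the PDF library, so the same loop can be exercised with plain strings.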


 
11.05.2007 at 01:26PM PST, ID: 20219658
#!/usr/bin/perl
      
      use CAM::PDF;
      
      $fileName = 'test.pdf';
      
      print "File: $fileName";
      
      my $pdf = CAM::PDF->new($filename);
      
      print "PDF:$pdf";
      
      
      my $page1 = $pdf->getPageContent(1);

When I execute this code, it says it cannot call the method getPageContent on an undefined variable. I also tried printing $pdf, and it is empty.

The headers are listed on each page.
 
11.05.2007 at 01:31PM PST, ID: 20219695
What is the output from this:
...
my $pdf = CAM::PDF->new($filename) or die "Could not create CAM::PDF:\n  $!\n  $@\n";
...
 
11.05.2007 at 01:43PM PST, ID: 20219775
It says no such file or directory. I am in the /export/home/user/Students directory, and my Perl script and the PDF file are both located in that directory. I tried running it both with the full path to the file and with just the file name. Both give the same error.
 
11.05.2007 at 01:55PM PST, ID: 20219869
Do you have proper permissions on the directory and file?

my $fileName = 'test.pdf';
die "File does not exist\n" unless -e $fileName;
die "File is not readable\n" unless -r $fileName;
my $pdf = CAM::PDF->new($filename);
die "Could not create CAM::PDF:\n  $!\n  $@\n" unless $pdf;
 
11.05.2007 at 02:09PM PST, ID: 20219966
Use of uninitialized value in pattern match (m//) at /usr/perl5/site_perl/5.8.4/CAM/PDF.pm line 293.
Use of uninitialized value in length at /usr/perl5/site_perl/5.8.4/CAM/PDF.pm line 303.
Use of uninitialized value in string eq at /usr/perl5/site_perl/5.8.4/CAM/PDF.pm line 306.
Use of uninitialized value in open at /usr/perl5/site_perl/5.8.4/CAM/PDF.pm line 320.
Use of uninitialized value in concatenation (.) or string at /usr/perl5/site_perl/5.8.4/CAM/PDF.pm line 322.
Could not create CAM::PDF:
  No such file or directory

It prints the above.
 
11.06.2007 at 07:22AM PST, ID: 20224561
Maybe the CAM::PDF module isn't installed properly, or you have an old version.  Try upgrading to the latest, or reinstalling.
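
One detail in the posted snippets may also explain these warnings: the file name is assigned to $fileName, but $filename (lowercase "n") is passed to CAM::PDF->new. Perl variable names are case-sensitive, so without "use strict" the second name is a separate, undefined variable, which would be consistent with the "uninitialized value" warnings and the failed open. This is a reading of the code as posted, not something confirmed in the thread. Below is a minimal sketch of a guard; checked_path is a hypothetical helper, not part of CAM::PDF.

```perl
use strict;      # a misspelled variable name becomes a compile-time error
use warnings;

# Hypothetical helper: validate a path before handing it to a PDF library.
sub checked_path {
    my ($path) = @_;
    die "No file name given (undefined or empty)\n"
        unless defined $path && length $path;
    die "File does not exist: $path\n"  unless -e $path;
    die "File is not readable: $path\n" unless -r $path;
    return $path;
}

my $fileName = 'test.pdf';
# Pass the same variable that was assigned above.
my $path = eval { checked_path($fileName) };
print defined $path ? "OK: $path\n" : "Problem: $@";
```

With "use strict" in effect, a misspelled $filename stops the script at compile time instead of silently evaluating to undef.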
 
11.06.2007 at 08:19AM PST, ID: 20225062
I got the pdftotext tool installed on my system. For now I am able to convert the file to text.
Accepted Solution
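
Once the PDF has been converted to text, the original question reduces to ordinary regular-expression work. Below is a minimal sketch, assuming the text output puts the department on a "Department:" line and each student on one line with five whitespace-separated fields. The name parse_report is hypothetical and the sample text is invented, since the real file was never posted. The text would presumably come from a command such as "pdftotext -layout test.pdf test.txt", where -layout asks pdftotext to keep the physical column layout.

```perl
use strict;
use warnings;

# Hypothetical parser for text produced from the PDF (e.g. by pdftotext).
# Returns the Department value and a reference to a list of row hashes.
sub parse_report {
    my ($text) = @_;
    my ($department, @rows);
    my $in_table = 0;
    for my $line (split /[\n\f]/, $text) {    # \f: page break, if present
        if (!defined $department && $line =~ /Department\s*:\s*(\S.*?)\s*$/) {
            $department = $1;
        }
        elsif ($line =~ /Student id\s+credits\s+fee\s+scholarship/) {
            $in_table = 1;    # column header seen; data rows follow
        }
        elsif ($in_table) {
            my @f = split ' ', $line;
            next unless @f == 5;    # skip blank or malformed lines
            my %row;
            @row{qw(name id credits fee scholarship)} = @f;
            push @rows, \%row;
        }
    }
    return ($department, \@rows);
}

# Invented sample in the shape described in the question.
my $sample = <<'END';
 Department: Physics

            Student id       credits      fee    scholarship

John        1001             12           3000   500
Harry       1002             9            2250   0
END

my ($dept, $rows) = parse_report($sample);
print "Department=$dept\n";
printf "%s: id=%s credits=%s fee=%s scholarship=%s\n",
    @{$_}{qw(name id credits fee scholarship)} for @$rows;
```

Splitting on whitespace assumes names contain no spaces. A name such as "John Smith" would need a pattern anchored on the numeric columns instead.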
 
01.10.2008 at 02:25PM PST, ID: 20632048
A request has been made in Community Support to close this question:
http://www.experts-exchange.com/Q_23073853.html

If there are no objections, a moderator will finalize this question in approximately 4 days as follows:
PAQ with refund using {http:#a20225062}

Please leave any recommendations here.

Vee_Mod
Community Support Moderator
 
 
01.14.2008 at 03:38PM PST, ID: 20658465
Closed, 500 points refunded.
Vee_Mod
Community Support Moderator
 
 
 