adamshields asked:

Parsing PDF or TXT document and output to XML

I'm looking to parse some PDF or text documents (the PDFs saved as text files from Acrobat Reader) and output an XML file. I'm not really sure which language would be best suited to this task, but I'm guessing it's possible with Perl, Ruby, or Python...

Here are the documents. I'm interested in parsing all of the fields except for the tabular data:
http://iase.disa.mil/stigs/draft-stigs/draft_xenapp_stig_chklst_20090317.zip

For example, I guess each "key:" label could be used to create a section. "Vulnerability Key: " is where each nest should begin.

Some Key: V0012345
Some ID: CTX0100
etc...

Becomes:

<Some>
 <Some_Key_>V0012345</Some_Key_>
 <Some_ID_>CTX0100</Some_ID_>
etc...
</Some>
mrjoltcola

Do you wish to parse the actual PDF structure, or just form data?

For form data, iText is a good library to start with. There are Java and .NET ports for it.

A couple of years ago, we also used XPAAJ, which was provided by Adobe, but last time I checked it was no longer available.

If you use LiveCycle Designer (Adobe Acrobat Professional), you can create XFA forms that have both an XML structure and the PDF structure, but they have compatibility issues with some software. We use Acrobat 7 Static forms for best compatibility.

But I am thinking that what you mean is the XDP data collection that is submitted/saved from an Adobe form, and not the whole document itself? If it is simply the form data collection, you can parse the submitted XML (from email, file, or web service) in any of the mentioned languages with an XML parser. The AcroForm can be designed to submit the form data via various mechanisms that do not require you to parse the PDF itself.

Adam314

You can use the CAM::PDF module to read the PDF file and get the text it contains. You can then search for the pattern "Vulnerability Key" followed by a key, followed by "STIG ID" followed by an ID. When found, print the data. Here is an example:

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use CAM::PDF;
 
my $pdf = CAM::PDF->new('XenApp_WebInterface_Server_VL04.pdf');
my $text;
foreach (1..$pdf->numPages) {
	$text .= $pdf->getPageText($_);
}
 
while($text =~ /Vulnerability Key:\s*(\S+)\s+STIG ID:\s*(\S+)/g) {
	print "<some><some_key>$1</some_key><some_id>$2</some_id></some>\n";
}


adamshields (ASKER)

@Adam314,

So the regex on $text looks for the phrases "Vulnerability Key:" and "STIG ID:", takes both values, and places them in some_key and some_id?

I have to do this for all of the values, from "Vulnerability Key:" through "Fixes:". Will the script have any problems skipping material such as "MAC / Confidentiality Grid:"?

Would it be possible to modify the PDF version to handle a plain text file just as easily, in case we find that method easier?
XenApp-Secure-Gateway-Server-VL0.txt
The first set is not being picked up. Secondly, I attempted to add another variable, and now the script doesn't print anything, but it also doesn't display errors, so it's probably a logic problem...
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use CAM::PDF;
 
my $pdf = CAM::PDF->new('XenApp_WebInterface_Server_VL04.pdf');
my $text;
foreach (1..$pdf->numPages) {
        $text .= $pdf->getPageText($_);
}
 
#while($text =~ /Vulnerability Key:\s*(\S+)\s+STIG ID:\s*(\S+)/g) {
while($text =~ /Vulnerability Key:\s*(\S+)\s+STIG ID:\s*(\S+)\s+Checks:\s*(\S+)/g) {
print "<vuln><vulnerability_key>$1</vulnerability_key><stig_id>$2</stig_id><checks>$3</checks></vuln>\n";
}


The CAM::PDF module will read in an entire PDF file and return all of its text (using the getPageText function). If you have plain text files, you could skip this part and just read the text directly.
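
For the plain-text case, a minimal sketch might read the whole file into one string and then run the same pattern over it. The helper names slurp_text and find_pairs are hypothetical, and the sketch assumes the text export keeps "Vulnerability Key:" and "STIG ID:" in the same order as the PDF text.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read a whole text file into a single string.
sub slurp_text {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    local $/;                # undef record separator: read everything at once
    my $text = <$fh>;
    close($fh);
    return $text;
}

# Hypothetical helper: return a list of [key, id] pairs found in the text.
sub find_pairs {
    my ($text) = @_;
    my @pairs;
    while ($text =~ /Vulnerability Key:\s*(\S+)\s+STIG ID:\s*(\S+)/g) {
        push @pairs, [$1, $2];
    }
    return @pairs;
}

# Usage: perl script.pl XenApp-Secure-Gateway-Server-VL0.txt
if (@ARGV) {
    for my $pair (find_pairs(slurp_text($ARGV[0]))) {
        print "<some><some_key>$pair->[0]</some_key><some_id>$pair->[1]</some_id></some>\n";
    }
}
```

Reading the file in one piece matters here because the pattern spans more than one line; matching line by line would never see "Vulnerability Key:" and "STIG ID:" together.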

The code I posted looks for "Vulnerability Key" followed by something (which goes in some_key in the XML), followed by "STIG ID", followed by something (which goes in some_id in the XML).

The Checks section appears to have a lot of text following it. Do you want all of that text, or just the next word, such as CTX0790?  The \S+ means 1 or more non-whitespace characters (whitespace is space, tab, carriage return, line feed, or form feed).  
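
If the whole Checks text is wanted rather than just the next word, one possible approach is a non-greedy capture that runs until the next field label. The sketch below uses made-up sample text and assumes "Fixes:" is the label that follows "Checks:"; the helper name capture_section is hypothetical, and the real label order should be checked against the document.

```perl
use strict;
use warnings;

# Made-up sample text; the real field order and labels may differ.
my $sample = "Vulnerability Key: V0012345\n"
           . "STIG ID: CTX0100\n"
           . "Checks: CTX0790 Verify the setting\n"
           . "on every server in the farm.\n"
           . "Fixes: Apply the documented setting.\n";

# Capture everything after "$label:" up to the next "$stop:" (or end of text).
# The /s modifier lets "." match newlines, and ".*?" keeps the match as
# short as possible so it stops at the first "$stop:" it can reach.
sub capture_section {
    my ($text, $label, $stop) = @_;
    if ($text =~ /\Q$label\E:\s*(.*?)\s*(?=\Q$stop\E:|\z)/s) {
        return $1;
    }
    return;
}

my $checks = capture_section($sample, 'Checks', 'Fixes');
print "<checks>$checks</checks>\n" if defined $checks;
```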
I need to be able to handle the larger sections if possible?
I dropped the Checks section for the moment while experimenting, but other sections have similar issues.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use CAM::PDF;
 
my $pdf = CAM::PDF->new('XenApp_WebInterface_Server_VL04.pdf');
my $text;
foreach (1..$pdf->numPages) {
        $text .= $pdf->getPageText($_);
}
 
while($text =~ /Vulnerability Key:\s*
(\S+)\s+STIG ID:\s*
(\S+)\s+Release Number:\s*
(\S+)\s+Status:\s*
(\S+)\s+Short Name:\s*
(\S+)\s+Long Name:\s*
(\S+)\s+IA Controls:\s*
(\S+)\s+Categories:\s*
(\S+)\s+Effective Date:\s*
(\S+)\s+Condition:\s*
(\S+)\s+Policy:\s*
(\S+)/g) {
 
print "<Vuln>
<Vulnerability_Key_>$1</Vulnerability_Key_>
<STIG_ID_>$2</STIG_ID_>
<Release_Number_>$3</Release_Number_>
<Status_>$4</Status_>
<Short_Name_>$5</Short_Name_>
<Long_Name_>$6</Long_Name_>
<IA_Controls_><IA_Control><ID>$7</ID></IA_Control></IA_Controls_>
<Categories_>$8</Categories_>
<Effective_Date_>$9</Effective_Date_>
<Condition_><subitem><title>$10</title><data></data></subitem></Condition_>
<Policy_>$11</Policy_>
</Vuln>\n";
}


For getting all of the key/values, it will be easier to use the text file, as the key is always at the beginning of a line, and followed with a colon.  With the pdf text, this isn't the case.  Can you get everything in text file format?

Even with the text file, it may be difficult.  Is it possible to change the program that generates the reports?
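
With the text file, one way to take advantage of keys sitting at the start of a line is to read line by line, start a new record at each "Vulnerability Key:", and treat unlabelled lines as a continuation of the previous field. This is only a sketch: the label lists come from the names mentioned in this thread, and the helpers xml_escape, parse_records, and record_to_xml, as well as the element naming, are assumptions rather than anything the real files guarantee.

```perl
use strict;
use warnings;

# Labels taken from the thread; the real files may contain others.
my @LABELS = ('Vulnerability Key', 'STIG ID', 'Release Number', 'Status',
              'Short Name', 'Long Name', 'IA Controls', 'Categories',
              'Effective Date', 'Condition', 'Policy', 'Checks', 'Fixes');
# Sections to ignore entirely (assumed to hold the tabular data).
my @SKIP = ('MAC / Confidentiality Grid');
my $LABEL_RE = join '|', map { quotemeta($_) } @LABELS;
my $SKIP_RE  = join '|', map { quotemeta($_) } @SKIP;

# Escape the characters that are special in XML text.
sub xml_escape {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

# Split lines into records; each record is a list of [label, value] pairs.
sub parse_records {
    my (@lines) = @_;
    my (@records, $current, $field);
    for my $line (@lines) {
        $line =~ s/\s+\z//;                        # drop trailing whitespace
        if ($line =~ /^($LABEL_RE):\s*(.*)$/) {
            my ($label, $value) = ($1, $2);
            if ($label eq 'Vulnerability Key') {   # a new record starts here
                $current = [];
                push @records, $current;
            }
            next unless $current;                  # nothing before the first key
            $field = [$label, $value];
            push @$current, $field;
        }
        elsif ($line =~ /^(?:$SKIP_RE):/) {
            $field = undef;                        # ignore lines until the next label
        }
        elsif ($field && length $line) {
            # Unlabelled line: continuation of the previous field.
            $field->[1] .= ($field->[1] eq '' ? '' : ' ') . $line;
        }
    }
    return @records;
}

# Turn one record into XML, deriving element names from the labels.
sub record_to_xml {
    my ($record) = @_;
    my $xml = "<Vuln>\n";
    for my $pair (@$record) {
        (my $tag = $pair->[0]) =~ s/\s+/_/g;
        $xml .= " <$tag>" . xml_escape($pair->[1]) . "</$tag>\n";
    }
    return $xml . "</Vuln>\n";
}

# Usage: perl script.pl XenApp-Secure-Gateway-Server-VL0.txt
if (@ARGV) {
    open(my $fh, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
    print record_to_xml($_) for parse_records(<$fh>);
    close($fh);
}
```

Using an explicit list of labels avoids mistaking a colon inside free text for a new field, at the cost of having to keep the list in sync with the reports.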
@Adam314, I see what you mean. Using a text document would be fine, for example the one attached.

We cannot alter the output; that's the main reason I would like to figure it out.
XenApp-Secure-Gateway-Server-VL0.txt
Adam314,

Would you recommend that I post this to the Perl Beginners mailing list?
I'm trying to parse the TXT version since it may be a viable solution. Should I use a WHILE statement to read the FILE and then FOREACH to parse each set of data? Or the other way around? Thanks
#!/usr/bin/perl
use strict;
use warnings;
 
open (FILE, 'XenApp_WebInterface_Server_VL04.txt');
 
while(<FILE>)
{
foreach($_ =~ /Vulnerability Key:\s*
(\S+)\s+STIG ID:\s*
(\S+)\s+Release Number:\s*
(\S+)\s+Status:\s*
(\S+)\s+Short Name:\s*
(\S+)\s+Long Name:\s*
(\S+)\s+IA Controls:\s*
(\S+)\s+Categories:\s*
(\S+)\s+Effective Date:\s*
(\S+)\s+Condition:\s*
(\S+)\s+Policy:\s*
(\S+)/g) {
 
print "<Vuln>
<Vulnerability_Key_>$1</Vulnerability_Key_>
<STIG_ID_>$2</STIG_ID_>
<Release_Number_>$3</Release_Number_>
<Status_>$4</Status_>
<Short_Name_>$5</Short_Name_>
<Long_Name_>$6</Long_Name_>
<IA_Controls_><IA_Control><ID>$7</ID></IA_Control></IA_Controls_>
<Categories_>$8</Categories_>
<Effective_Date_>$9</Effective_Date_>
<Condition_><subitem><title>$10</title><data></data></subitem></Condition_>
<Policy_>$11</Policy_>
</Vuln>\n";
}
}


ASKER CERTIFIED SOLUTION
Adam314

Thanks!