Parsing PDF or TXT document and output to XML

I'm looking to parse some PDF or text documents (the text files are saved from the PDFs in Acrobat Reader) and output an XML file. I'm not sure which language is best suited to this task, but I'm guessing it's possible with Perl, Ruby, or Python...

Here are the documents. I'm interested in parsing all of the fields except the tabular data:
http://iase.disa.mil/stigs/draft-stigs/draft_xenapp_stig_chklst_20090317.zip

For example, I guess each "key:" could be used to create an element. "Vulnerability Key:" is where each nested section should begin.

Some Key: V0012345
Some ID: CTX0100
etc...

Becomes:

<Some>
 <Some_Key>V0012345</Some_Key>
 <Some_ID>CTX0100</Some_ID>
 etc...
</Some>
adamshieldsAsked:

mrjoltcolaCommented:
Do you wish to parse the actual PDF structure, or just form data?

For form data, iText is a good library to start with. It is available for both Java and .NET.

A couple of years ago, we also used XPAAJ, which was provided by Adobe, but last time I checked it was no longer available.

If you use LiveCycle Designer (Adobe Acrobat Professional) you can create XFA forms that have both an XML structure and the PDF structure, but they have compatibility issues with some software. We use Acrobat 7 static forms for best compatibility.

But I think what you mean is the XDP data collection that is submitted or saved from an Adobe form, not the whole document itself? If it is simply the form data collection, you can parse the submitted XML (from email, file, or web service) in any of the mentioned languages with an XML parser. The AcroForm can be designed to submit the form data via various mechanisms that do not require you to parse the PDF itself.

Adam314Commented:
You can use the CAM::PDF module to read the PDF file and get the text it contains. You can then search for the pattern "Vulnerability Key" followed by a key, followed by "STIG ID" followed by an ID. When the pattern is found, print the data. Here is an example:

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use CAM::PDF;
 
my $pdf = CAM::PDF->new('XenApp_WebInterface_Server_VL04.pdf');
my $text;
foreach (1..$pdf->numPages) {
	$text .= $pdf->getPageText($_);
}
 
while($text =~ /Vulnerability Key:\s*(\S+)\s+STIG ID:\s*(\S+)/g) {
	print "<some><some_key>$1</some_key><some_id>$2</some_id></some>\n";
}

adamshieldsAuthor Commented:
@Adam314,

So the regex run against $text looks for the phrases "Vulnerability Key:" and "STIG ID:", takes both values, and places them in some_key and some_id?

I have to do this for all of the values, from "Vulnerability Key:" through "Fixes:". Will the script have any problems skipping material such as the MAC / Confidentiality Grid?

Would it be possible to modify the PDF version to handle a plain text file just as easily, in case we find that method easier?
XenApp-Secure-Gateway-Server-VL0.txt
adamshieldsAuthor Commented:
The first set is not being picked up. Secondly, I attempted to add another variable and now the script does not print anything. It also doesn't display any errors, so it's probably a logic error...
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use CAM::PDF;
 
my $pdf = CAM::PDF->new('XenApp_WebInterface_Server_VL04.pdf');
my $text;
foreach (1..$pdf->numPages) {
        $text .= $pdf->getPageText($_);
}
 
#while($text =~ /Vulnerability Key:\s*(\S+)\s+STIG ID:\s*(\S+)/g) {
while($text =~ /Vulnerability Key:\s*(\S+)\s+STIG ID:\s*(\S+)\s+Checks:\s*(\S+)/g) {
print "<vuln><vulnerability_key>$1</vulnerability_key><stig_id>$2</stig_id><checks>$3</checks></vuln>\n";
}

Adam314Commented:
The CAM::PDF module reads in an entire PDF file and returns all of its text (using the getPageText function). If you have plain text files, you can skip this part and read the text directly.

The code I posted looks for "Vulnerability Key" followed by something (which goes in some_key in the XML), followed by "STIG ID", followed by something (which goes in some_id in the XML).

The Checks section appears to have a lot of text following it. Do you want all of that text, or just the next word, such as CTX0790?  The \S+ means 1 or more non-whitespace characters (whitespace is space, tab, carriage return, line feed, or form feed).  
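The difference between capturing one word and capturing a whole section can be sketched as follows. This is only an illustration: the sample text is made up, and it assumes the text export, where each label starts a line and "Fixes:" is the label that follows "Checks:".

```perl
use strict;
use warnings;

# Hypothetical sample; the real checklist text may be laid out differently.
my $text = <<'END';
Vulnerability Key: V0012345
STIG ID: CTX0100
Checks: CTX0790 (Manual)
Verify the setting on the server.
It may span several lines.
Fixes: CTX0790 (Manual)
END

# \S+ stops at the first whitespace, so only one word is captured.
sub first_word {
    my ($t) = @_;
    return $t =~ /Checks:\s*(\S+)/ ? $1 : undef;
}

# With /s, "." also matches newlines. The lazy (.*?) grows until the
# lookahead sees the next label at the start of a line (/m makes ^ match
# there). "Fixes:" as the next label is an assumption.
sub whole_section {
    my ($t) = @_;
    return $t =~ /Checks:\s*(.*?)\s*(?=^Fixes:)/ms ? $1 : undef;
}

print first_word($text), "\n";
print whole_section($text), "\n";
```

For text extracted from the PDF, where labels are not guaranteed to start a line, the `^` anchor would probably have to be dropped or replaced.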
adamshieldsAuthor Commented:
I need to be able to handle the larger sections, if possible.
adamshieldsAuthor Commented:
I dropped the Checks section for the moment while experimenting, but other sections have similar issues.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use CAM::PDF;
 
my $pdf = CAM::PDF->new('XenApp_WebInterface_Server_VL04.pdf');
my $text;
foreach (1..$pdf->numPages) {
        $text .= $pdf->getPageText($_);
}
 
while($text =~ /Vulnerability Key:\s*
(\S+)\s+STIG ID:\s*
(\S+)\s+Release Number:\s*
(\S+)\s+Status:\s*
(\S+)\s+Short Name:\s*
(\S+)\s+Long Name:\s*
(\S+)\s+IA Controls:\s*
(\S+)\s+Categories:\s*
(\S+)\s+Effective Date:\s*
(\S+)\s+Condition:\s*
(\S+)\s+Policy:\s*
(\S+)/g) {
 
print "<Vuln>
<Vulnerability_Key_>$1</Vulnerability_Key_>
<STIG_ID_>$2</STIG_ID_>
<Release_Number_>$3</Release_Number_>
<Status_>$4</Status_>
<Short_Name_>$5</Short_Name_>
<Long_Name_>$6</Long_Name_>
<IA_Controls_><IA_Control><ID>$7</ID></IA_Control></IA_Controls_>
<Categories_>$8</Categories_>
<Effective_Date_>$9</Effective_Date_>
<Condition_><subitem><title>$10</title><data></data></subitem></Condition_>
<Policy_>$11</Policy_>
</Vuln>\n";
}

Adam314Commented:
To get all of the key/value pairs, it will be easier to use the text file, because there each key is always at the beginning of a line and followed by a colon. With the text extracted from the PDF, this isn't the case. Can you get everything in text file format?

Even with the text file, it may be difficult.  Is it possible to change the program that generates the reports?
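A minimal sketch of that line-oriented idea, assuming the text export: a line counts as the start of a field only if it begins with one of the known labels followed by a colon. The label list is taken from the names mentioned in this thread and may be incomplete, and `split_field` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Labels mentioned in the thread; the real export may contain others.
my @labels = ('Vulnerability Key', 'STIG ID', 'Release Number', 'Status',
              'Short Name', 'Long Name', 'IA Controls', 'Categories',
              'Effective Date', 'Condition', 'Policy', 'Checks', 'Fixes');
my $label_re = join '|', map { quotemeta } @labels;

# Returns (label, value) for a "Label: value" line, or an empty list for
# any other line (continuation text, URLs, table rows, ...).
sub split_field {
    my ($line) = @_;
    return $line =~ /^($label_re):\s*(.*?)\s*$/ ? ($1, $2) : ();
}

for my $line ("STIG ID: CTX0100\n", "http://example.com/page\n") {
    my ($label, $value) = split_field($line);
    print defined $label ? "$label => $value\n" : "continuation: $line";
}
```

Matching against a fixed list avoids treating every line that happens to contain a colon as a new field, at the cost of having to keep the list in step with the reports.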
adamshieldsAuthor Commented:
@Adam314, I see what you mean. Using a text document would be fine, for example the one attached.

We can not alter the output, that's the main reason I would like to figure it out.
XenApp-Secure-Gateway-Server-VL0.txt
adamshieldsAuthor Commented:
Adam314,

Would you recommend that I post this to the Perl Beginners mailing list?
adamshieldsAuthor Commented:
I'm trying to parse the TXT version, since it may be a viable solution. Should I use a while loop to read the file and then a foreach to parse each set of data, or the other way around? Thanks.
#!/usr/bin/perl
use strict;
use warnings;
 
open (FILE, 'XenApp_WebInterface_Server_VL04.txt');
 
while(<FILE>)
{
foreach($_ =~ /Vulnerability Key:\s*
(\S+)\s+STIG ID:\s*
(\S+)\s+Release Number:\s*
(\S+)\s+Status:\s*
(\S+)\s+Short Name:\s*
(\S+)\s+Long Name:\s*
(\S+)\s+IA Controls:\s*
(\S+)\s+Categories:\s*
(\S+)\s+Effective Date:\s*
(\S+)\s+Condition:\s*
(\S+)\s+Policy:\s*
(\S+)/g) {
 
print "<Vuln>
<Vulnerability_Key_>$1</Vulnerability_Key_>
<STIG_ID_>$2</STIG_ID_>
<Release_Number_>$3</Release_Number_>
<Status_>$4</Status_>
<Short_Name_>$5</Short_Name_>
<Long_Name_>$6</Long_Name_>
<IA_Controls_><IA_Control><ID>$7</ID></IA_Control></IA_Controls_>
<Categories_>$8</Categories_>
<Effective_Date_>$9</Effective_Date_>
<Condition_><subitem><title>$10</title><data></data></subitem></Condition_>
<Policy_>$11</Policy_>
</Vuln>\n";
}

Adam314Commented:

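#!/usr/bin/perl
use strict;
use warnings;

my %record;    # fields of the vulnerability currently being read
my @order;     # keys in the order they first appeared
my $CurKey;    # field that continuation lines are appended to

open(my $in, '<', 'XenApp-Secure-Gateway-Server-VL0.txt') or die "Could not open file: $!\n";
while(my $line = <$in>) {
	# "Key: value" starts a new field; URLs also contain a colon, so skip them
	if($line =~ /^(.*?):(.*)$/ and $1 ne 'http' and $1 ne 'https') {
		my ($key, $value) = ($1, $2);
		# Each record begins at "Vulnerability Key", so print the previous one first
		PrintRecord() if $key =~ /^\s*Vulnerability Key$/;
		push @order, $key unless exists $record{$key};
		$CurKey = $key;
		$record{$CurKey} = $value;
	}
	elsif(defined $CurKey) {
		# Any other line continues the current field
		$record{$CurKey} .= $line;
	}
}
close($in);
PrintRecord();    # print the last record

# Print the collected fields as one <Vuln> element, then reset for the next record
sub PrintRecord {
	return unless %record;
	print "<Vuln>\n";
	foreach my $k (@order) {
		(my $tag = $k) =~ s/[^a-z]/_/ig;    # XML element names cannot contain spaces
		print "<$tag>$record{$k}</$tag>\n";
	}
	print "</Vuln>\n";
	%record = ();
	@order = ();
}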
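
One thing the script above does not handle: field values are printed verbatim, so any "&" or "<" in the report text would make the output ill-formed XML. A minimal sketch of escaping the values before printing follows; `xml_escape` is a hypothetical helper name, not part of the script above.

```perl
use strict;
use warnings;

# Replace the characters that are special in XML text and attribute values.
# "&" is handled first so the entities added afterwards are not re-escaped.
sub xml_escape {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    $s =~ s/"/&quot;/g;
    return $s;
}

print '<Checks>', xml_escape('Verify that value < 5 & logging is "on"'), "</Checks>\n";
```

In PrintRecord, the value could be passed through such a helper just before it is interpolated into the print statement.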

adamshieldsAuthor Commented:
Thanks!