• Status: Solved

Convert PDF File to XML with ASP.NET

I need to do the following conversion:
A person uploads a PDF file, my program re-saves this file as XML, and then it loads fields from the XML into the database...

I think this is the only way to load fields from a PDF file into an MS SQL database.
I would prefer to use ASP.NET.
How do I accomplish this? Does anyone know where to start?
Asked by: maximyshka
 
Karl Heinz Kremer commented:
Do you want to convert the complete PDF file to XML (e.g. content extraction), or just the data from interactive form fields?
When you write "uploads PDF file", do you mean the whole PDF file, or just the form data?

It makes a big difference whether you are only interested in form fields or in the whole content of your PDF file.

If it's just the form data, you can, for example, use the XFDF form submission method, and then you would never have to deal with the complete PDF file (just the form data in XML format).

Please provide more information.
 
maximyshka (Author) commented:
What is XFDF form submission and how does it work? I am a newbie with PDF...

Here is my problem in detail.

A user of my website will upload a PDF form with fields. This is a regular PDF document that I open with a PDF reader...

Some fields from this form I need to load into the MS SQL database, fields like project name, address, description, etc.

All my user will do is upload filename.pdf.
I need to somehow transfer some fields from this file to the SQL database, and leave the file available on the network...

I hope I described it better this time...

Thanks for the help.
 
maximyshka (Author) commented:
Somebody told me that it is impossible to do this directly (is that true?) and that I first need to convert the file to XML, and then transfer the data from the XML to the database...
 
Karl Heinz Kremer commented:
XML has nothing to do with what you want to do.

Here are a few facts that you need to know to build this solution:
The free Adobe Reader cannot save PDF files. So, if you use Reader to fill out a form, this form cannot be saved. The only mechanism to get access to the form data is to submit the form to a web server. Submission in this context means that Reader sends the data to the server as FDF or as XFDF. FDF is the Forms Data Format; XFDF is pretty much the same format, wrapped in XML. In addition to these two formats, you can also submit the data as a normal HTML form submission. You can even use the same CGI program that you would use for a web form.

Are you familiar with "normal" web forms, and how you would process forms submissions with ASP.NET?

The general idea is this:

The user downloads the PDF file from your web server and fills out the form. The form has a "Submit" button, which causes the form data to be submitted to the web server (as an HTML form, FDF or XFDF). In each case, the web server needs to run a program to accept and process the data (this would be your ASP.NET program). The important idea here is that you don't have to extract the data from the PDF file; it arrives in a format that is much easier to process (e.g. XML).
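
To make "much easier to process" concrete, here is a minimal sketch in JavaScript of reading field values out of an XFDF submission. It is only an illustration: the field names (ProjectName, Address) are made up, the regular expression handles flat fields only (no nested field elements), and a real server program should use a proper XML parser instead.

```javascript
// Minimal sketch: pull flat field name/value pairs out of an XFDF string.
// Assumes no nested <field> elements; use a real XML parser in production.
function decodeXmlEntities(text) {
    return text
        .replace(/&lt;/g, "<")
        .replace(/&gt;/g, ">")
        .replace(/&quot;/g, '"')
        .replace(/&apos;/g, "'")
        .replace(/&amp;/g, "&"); // decode "&amp;" last to avoid double-decoding
}

function parseXfdfFields(xfdf) {
    const fields = {};
    const pattern = /<field\s+name="([^"]*)"\s*>\s*<value>([\s\S]*?)<\/value>/g;
    let match;
    while ((match = pattern.exec(xfdf)) !== null) {
        fields[decodeXmlEntities(match[1])] = decodeXmlEntities(match[2]);
    }
    return fields;
}

// Hypothetical sample data; the field names are made up for illustration.
const sampleXfdf = [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">',
    '  <fields>',
    '    <field name="ProjectName"><value>Bridge &amp; Road Repair</value></field>',
    '    <field name="Address"><value>12 Main St</value></field>',
    '  </fields>',
    '</xfdf>'
].join("\n");

const xfdfFields = parseXfdfFields(sampleXfdf);
console.log(xfdfFields);
```

The result is a plain object keyed by field name, which is the shape you would want before writing the values to a database.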

The most straightforward approach would be to use the HTML form submission method. This way, you get key/value pairs, and you don't even have to parse any files for the data. The next step, writing the data into your database, should be pretty straightforward.
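
As a rough illustration of what those key/value pairs look like on the receiving end, the sketch below decodes a URL-encoded form body in JavaScript. The body string and its field names are hypothetical. In ASP.NET you would not write this yourself, since the posted fields are already available through Request.Form; the sketch only shows what the data amounts to.

```javascript
// Sketch: turn an HTML-form-style submission body into key/value pairs.
// URLSearchParams is built into modern browsers and Node.js.
function parseFormBody(body) {
    const fields = {};
    for (const [name, value] of new URLSearchParams(body)) {
        fields[name] = value;
    }
    return fields;
}

// Hypothetical body, as a PDF form submitting in HTML format might send it.
const formBody = "ProjectName=Bridge+Repair&Address=12+Main+St&Description=Deck+%26+rails";
const formFields = parseFormBody(formBody);
console.log(formFields);
```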

I don't use ASP.NET, so I would not be able to tell you how to write the program, but I can give you more information about the forms processing.
 
maximyshka (Author) commented:
Thanks for helping me.
Form processing is very valuable for me and I would like to learn about it, because I'll have to implement something like that in the future.
If you can, please also tell me what kind of software I will need to implement the solution you just described.

Right now I need an interim solution.

They use Adobe writer to fill in the file, so they can save it...

Currently they e-mail me the PDF file, and a secretary manually enters it into the database. This is a government agency; the problem is that the other side only wants to give us a ready-made PDF file. They have their own reasons for that. They do not want to fill out the info on a website...

Now I have been asked for a solution where they upload the PDF file and I extract some fields from it into the database...

This needs to be done very urgently, before I can implement anything for form processing...
 
maximyshka (Author) commented:
And yes, I know how to process forms with ASP.NET.
 
Karl Heinz Kremer commented:
Do the PDF files that you currently receive have form fields (I'm talking about interactive PDF form fields, not just boxes to fill in data)?
 
Karl Heinz Kremer commented:
Here are some links to software that can convert PDF to XML:

http://www.deque.com/products/undoc.html
http://www.exegenix.com/solutions/pdftoxml.html
http://www.cambridgedocs.com/technology_PDF_driver.htm

Here is a whitepaper about problems when converting PDF to XML: http://www.dclab.com/converting_from_pdf.asp

If you are lucky, you may not even need to go to XML. If you can extract the text from the PDF file and then parse that text for the data, you can save a lot of money. Give the pdftotext tool from Xpdf a try: http://www.foolabs.com/xpdf
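
Once the text has been extracted, the parsing step could look something like the JavaScript sketch below. It assumes the extracted text contains lines of the form "Label: value", which may not hold for a given PDF (the actual layout depends on how the file was produced, and pdftotext's -layout option can change it). The labels used here are hypothetical.

```javascript
// Sketch: extract "Label: value" pairs from plain text such as pdftotext output.
// The labels below are hypothetical; adjust them to the wording on your form.
function extractLabeledFields(text, labels) {
    const fields = {};
    for (const line of text.split(/\r?\n/)) {
        const separator = line.indexOf(":");
        if (separator === -1) continue; // not a "Label: value" line
        const label = line.slice(0, separator).trim();
        if (labels.includes(label)) {
            fields[label] = line.slice(separator + 1).trim();
        }
    }
    return fields;
}

// Made-up sample of what extracted text might look like.
const extractedText = [
    "Project Application",
    "Project name: Bridge Repair",
    "Address: 12 Main St",
    "Description: Replace deck and rails"
].join("\n");

const textFields = extractLabeledFields(
    extractedText,
    ["Project name", "Address", "Description"]
);
console.log(textFields);
```

Lines without a known label (such as the title line) are simply skipped, so unexpected text in the file does not end up in the database.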

 
maximyshka (Author) commented:
Thanks, I will definitely look into this software...

No, they don't have any form fields...
 
Karl Heinz Kremer commented:
Depending on how the PDF files were created, you may not be able to extract any useful information: If these are e.g. scanned images, there is no textual information in the files at all, and you first have to OCR the documents (this would be the worst-case scenario). Give it a try.
 
maximyshka (Author) commented:
No, these are forms which I created in Adobe Acrobat; they just filled them out...
 
Karl Heinz Kremer commented:
How are they filling out these static/non-interactive forms?
 
maximyshka (Author) commented:
Well, at present they are static forms which were created in Acrobat 4.0 writer. There are fields where they can input info. But now I can create any forms I want. The only requirement is that the form must be downloaded, filled out offline, and then uploaded back to me.

The rest of the process I explained before.

Also, I have a question: is there any way to create a form which can be filled out offline, but where, after it is uploaded and I open it, I can press something to add the fields I want to the DB?
 
Karl Heinz Kremer commented:
Let me repeat my question: How are these forms filled out today (e.g. TouchUp Text tool, free text annotation, ...)? Depending on which mechanism is used, you may have to perform another step before you can extract your data: if free text annotations are used, you may have to flatten the documents first.

You could write a JavaScript that is associated not with the form but with your instance of Acrobat. It would extract the data from the (interactive) form fields and submit the information to a database via ADBC.
This ZIP archive from the PlanetPDF forum (http://www.planetpdf.com) contains a form with JavaScript that "talks" to an Access database:

http://forum.planetpdf.com/scripts/wbpx.dll/~planetpdfforum/upload/Example.zip
 
maximyshka (Author) commented:
I am sorry, but I have no idea what it is called. I have a form, and I created the fields in Acrobat 4.0 with the "Form Tool (F)", so when I select the hand tool I can fill them out and save. I am a newbie with Adobe products.

I just looked at the example file. It doesn't have any JavaScript, just a call to connect();

But it gives me one idea, which I will try.
I just don't know JavaScript syntax. Can you help me with that?
 
maximyshka (Author) commented:
Do you know the difference between Adobe Acrobat Professional and Designer? Which one is better? It looks like both of them create forms...
 
Karl Heinz Kremer commented:
OK, so you are using interactive form fields. Yesterday you wrote "No, they dont have any form fields..."
Because you don't have any JavaScript background, I'll try to come up with something. This will take a little longer.

I don't have a lot of experience with Designer. I would stick with Acrobat, it's not as powerful, but a lot easier to use. If you can do everything you need to do with "normal" Acrobat forms, there is no need to change anything.
 
Karl Heinz Kremer commented:
Create a new text file named SubmitToDatabase.js in the C:\Program Files\Adobe\Acrobat 6.0\Acrobat\Javascripts directory, then copy and paste this JavaScript program into the file:

function submitToDataSource()
{
    // declare locally so the names do not leak into the global scope
    var con, statement;

    // connect to the data source
    try
    {
        con = ADBC.newConnection("TestDataSource");    // <-- change "TestDataSource" to your data source name
        if (con == null) throw "Error connecting to data source";
        statement = con.newStatement();
        if (statement == null) throw "Error executing newStatement";
    }
    catch (e)
    {
        app.alert(e);
        return;
    }

    // insert the new values into the database
    try
    {
        var updateStr = "INSERT INTO MyTable "                        // <-- change "MyTable" to your table name
            + " (VALUE1, VALUE2, VALUE3) "                            // <-- change "VALUE1".."VALUE3" to your DB field names
            + " VALUES ('" + this.getField("Field1").value + "',"     // <-- change "Field1".."Field3" to your Acrobat field names
            + "'" + this.getField("Field2").value + "',"
            + "'" + this.getField("Field3").value + "')";

        statement.execute(updateStr);
    }
    catch (e)
    {
        app.alert("execute: " + e);
    }
}


// create a new menu item in the "Document" menu; it is enabled only while a document is open
app.addMenuItem("JS_SubmitToDatabase", "Submit to Database", "Document", 0,
    "submitToDataSource()", "event.rc = (event.target != null)");

// end of script

Make sure that you replace the things that I marked with the correct names from your environment. This will add a new menu item to the "Document" menu. Whenever you use this menu item, it will write the contents of your fields to the database.
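
One caveat about the script above: it pastes the field values straight into the SQL string, so a value that contains an apostrophe (for example "O'Brien") would break the INSERT statement. A minimal sketch of one way to guard against that is to double any single quotes before building the statement. The helper names sqlQuote and buildInsert are made up for this illustration, and the table and column names are the same placeholders as in the script. The code uses only basic language features, but it has been written as a standalone sketch rather than tested inside Acrobat.

```javascript
// Sketch: quote a value for use inside a SQL string literal by doubling
// any single quotes, which is the standard SQL escape for string literals.
function sqlQuote(value) {
    return "'" + String(value).replace(/'/g, "''") + "'";
}

// Build an INSERT statement from parallel lists of column names and values.
// Table and column names are placeholders, as in the script above.
function buildInsert(table, columns, values) {
    var quoted = [];
    for (var i = 0; i < values.length; i++) {
        quoted.push(sqlQuote(values[i]));
    }
    return "INSERT INTO " + table
        + " (" + columns.join(", ") + ")"
        + " VALUES (" + quoted.join(", ") + ")";
}

var insertSql = buildInsert(
    "MyTable",
    ["VALUE1", "VALUE2", "VALUE3"],
    ["O'Brien Bridge", "12 Main St", "Deck repair"]
);
console.log(insertSql);
```

In the script, the idea would be to pass each this.getField(...).value through sqlQuote instead of wrapping it in quote characters by hand. Doubling quotes is a stopgap; where the database interface offers parameterized statements, those are the safer choice.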
 
maximyshka (Author) commented:
Thanks, it looks great. But I have a silly question: will it work with Acrobat 4? I am in the process of buying Acrobat 6...

Also, how do I create the menu item?

I appreciate your help a lot. I am totally new to Adobe Acrobat...
 
maximyshka (Author) commented:
I meant, how do I attach the script to the PDF file?
 
maximyshka (Author) commented:
Never mind my last question. I got it...
 
maximyshka (Author) commented:
In Acrobat Reader 6.0 it reports a reference error: ADBC is not defined. So I changed it to ODBC, and now it says ODBC is not defined...
 
Karl Heinz Kremer commented:
Reader does not support the ADBC object in JavaScript; you need Acrobat 6 for this (it may work with Acrobat 5, but it will definitely not work with Acrobat 4). The menu item is created by the last few lines in the script, so you don't have to do this yourself. The script also works on whichever PDF document is currently open, so you don't have to attach it to a PDF document.
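
If the same script file might end up in an application without ADBC, a small guard can turn the reference error into a readable message. The sketch below is only an illustration with made-up function names (hasAdbc, describeAdbcSupport). It prints with console.log so it can be tried in any JavaScript runtime; inside Acrobat you would show the message with app.alert instead.

```javascript
// Sketch: check whether the ADBC object exists before using it.
// Using typeof avoids the reference error that a direct mention of an
// undefined name would raise.
function hasAdbc() {
    return typeof ADBC !== "undefined" && ADBC !== null;
}

function describeAdbcSupport() {
    return hasAdbc()
        ? "ADBC is available; database submission can proceed."
        : "ADBC is not available in this application; database submission is disabled.";
}

console.log(describeAdbcSupport());
```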
 
Karl Heinz Kremer commented:
... and just in case it's not clear yet: The feature is really called ADBC and not ODBC. It uses ODBC to connect to your database.
