Extracting PDF title and displaying it in HTML via ASP

I wrote some ASP code that reads all the PDF files in a directory and lists them by filename.

In HTML (via ASP) I can display the files by filename, but I want to display them by their PDF Title instead. So basically, I need a way to extract the Title from each PDF file and display it in HTML.

ainapure

You will need some kind of component and/or tool to extract the title from the PDF file. Try searching for one on Google.

-amit

mikosha

I've got an idea, and maybe it will work:
If you open any PDF file as an ASCII file, you'll see something like this at the top:

%PDF-1.4
%âãÏÓ
1 0 obj
<<
/Producer (Acrobat Distiller Command 3.01 for Solaris 2.3 and later \(SPARC\))
/Creator (FrameMaker 5.5.6.)
/ModDate (D:20031202132023-05'00')
/CreationDate (D:19960530152336Z)
/Title (title)
>>

The actual title is in parentheses (in this example the title is "title"). So if the PDF file has a title, it should be in the same place, and you can find it (either by searching for the "/Title" keyword or by going to the exact line) and read the title, plus any other PDF info about the file (everything inside << >>).

Hope it will work.
cheers:)
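
A minimal VBScript sketch of the "/Title" search described above, assuming the file contents are already in a string. The function name ExtractInfoTitle and the sample string are made up for illustration. It only handles a plain parenthesized title; titles stored as UTF-16 or hex strings, or inside compressed or encrypted parts of the file, will not be found this way.

```vbscript
' Hypothetical helper (not from the thread): returns the text inside
' "/Title ( ... )", or "" if no parenthesized title follows the key.
Function ExtractInfoTitle(strPdf)
    Dim pos, i, ch, depth, result
    ExtractInfoTitle = ""

    pos = InStr(1, strPdf, "/Title", vbBinaryCompare)
    If pos = 0 Then Exit Function

    ' Skip whitespace between the key and its value.
    pos = pos + Len("/Title")
    Do While pos <= Len(strPdf)
        ch = Mid(strPdf, pos, 1)
        If ch <> " " And ch <> vbTab And ch <> vbCr And ch <> vbLf Then Exit Do
        pos = pos + 1
    Loop

    ' Only a literal string "( ... )" is handled here.
    If Mid(strPdf, pos, 1) <> "(" Then Exit Function

    depth = 1
    result = ""
    i = pos + 1
    Do While i <= Len(strPdf)
        ch = Mid(strPdf, i, 1)
        If ch = "\" Then
            ' Simplification: copy the escaped character as-is, so \( and \)
            ' come out right; other escapes such as \n are not decoded.
            i = i + 1
            If i <= Len(strPdf) Then result = result & Mid(strPdf, i, 1)
        ElseIf ch = "(" Then
            ' Unescaped parentheses may nest inside the string.
            depth = depth + 1
            result = result & ch
        ElseIf ch = ")" Then
            depth = depth - 1
            If depth = 0 Then Exit Do
            result = result & ch
        Else
            result = result & ch
        End If
        i = i + 1
    Loop

    ExtractInfoTitle = result
End Function

' Usage with a stand-in string instead of a real file:
Dim strSample
strSample = "<< /Creator (FrameMaker 5.5.6.) /Title (Annual Report \(Draft\)) >>"
Response.Write ExtractInfoTitle(strSample)   ' writes: Annual Report (Draft)
```

The scanner tracks escaped and nested parentheses because, as the /Producer line in the example shows, a value can itself contain "\(" and "\)".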

ainapure

Initial comments:

1) You have to open the PDF file and try to read the metadata
2) Store and display the extracted metadata

I don't know exactly how you would go about it at this time. I am sure there is some component that does this.

-amit

tobiason

ASKER

mikosha,
That is a big clue. Then the question is how to open each PDF file as a .txt file, search for the "/Title" keyword, and take the string inside its parentheses.

That's a lot more code to write. Does anyone have a simpler way to pull the title in (hopefully with a single library call ;) ? Otherwise, if you've done coding like mikosha describes, I could use some help with that.

amit,
What kind of component would you suggest I need or search for?

mikosha

OK, it was just an idea :)
If you're considering third-party components, I think it will be much easier to implement. But it costs :)
Actually, opening any file as text is not such a big deal: use the OpenTextFile method of FileSystemObject (you will get a TextStream object as a result), then store all the text in a string variable using the ReadAll method of the TextStream object. After that, you can use the VBScript InStr() function to find the position of a keyword.
That's all, folks (about 5-10 lines of code). And you'll use only built-in objects of IIS.
But the decision is yours; I just wanted to show that it is not such a big deal :)

cheers:)
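
The steps above could look roughly like this in an ASP page (inside <% %>). This is only a sketch: the path "Current/example.pdf" is a placeholder, and reading a binary PDF through a TextStream may not preserve every byte exactly on all server code pages.

```vbscript
Const ForReading = 1

Dim objFS, objTS, strPdf, pos

Set objFS = Server.CreateObject("Scripting.FileSystemObject")

' "Current/example.pdf" is a placeholder path, not a file from the thread.
Set objTS = objFS.OpenTextFile(Server.MapPath("Current/example.pdf"), ForReading)

' ReadAll raises an error on an empty file, so check first.
strPdf = ""
If Not objTS.AtEndOfStream Then strPdf = objTS.ReadAll
objTS.Close

' InStr returns the 1-based position of the keyword, or 0 if it is absent.
pos = InStr(strPdf, "/Title")
If pos > 0 Then
    Response.Write "Found /Title at position " & pos
Else
    Response.Write "No /Title keyword found"
End If

Set objTS = Nothing
Set objFS = Nothing
```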

tobiason

ASKER

This tactic is working great so far.
However, I am not sure how to pull the data out from between the <pdf:Title> tags.
To clarify, so far I can open the PDF file and locate the opening tag. But how do I get the title that sits between the tags (e.g. <pdf:Title>PDF Title Goes Here</pdf:Title>)?

Here is what I have:

PDFpath = "Current/" & PDFfilename

const ForReading = 1
const TristateFalse = 0
dim strSearchThis
dim objFS
dim objFile
dim objTS
set objFS = Server.CreateObject("Scripting.FileSystemObject")
set objFile = objFS.GetFile(Server.MapPath(PDFpath))
set objTS = objFile.OpenAsTextStream(ForReading, TristateFalse)

strSearchThis = objTS.Read(objFile.Size)

if instr(strSearchThis, "<pdf:Title>") > 0 then
Response.Write "Found Title Bracket!"
end if

mikosha

I think after this you have to search for the "</pdf:Title>" keyword.
Let's say A is the first position (from InStr(strSearchThis, "<pdf:Title>"))
and B is the second one (from InStr(strSearchThis, "</pdf:Title>")).
So your title will be between A+11 (11 is Len("<pdf:Title>")) and B.
I think the final thing will be something like this:

current_title = Mid(strSearchThis, A+11, B)

mikosha

And if it works, you can proudly call this "KindOfLittleXMLparser" :)
(By the way, you could use an XML parser to retrieve this title too.)

ASKER CERTIFIED SOLUTION

mikosha

This solution is only available to members.


tobiason

ASKER

It works... However, it comes out with all the additional crap from the PDF (ASCII read). It seems like the InStr() stuff needs some tweaking.

You can see for yourself, this is what I have:

PDFpath = "Current/" & PDFfilename
const ForReading = 1
const TristateFalse = 0
dim strSearchThis
dim objFS
dim objFile
dim objTS
set objFS = Server.CreateObject("Scripting.FileSystemObject")
set objFile = objFS.GetFile(Server.MapPath(PDFpath))
set objTS = objFile.OpenAsTextStream(ForReading, TristateFalse)

strSearchThis = objTS.Read(objFile.Size)

if instr(strSearchThis, "<pdf:Title>") > 0 then
TitleStart = instr(strSearchThis, "<pdf:Title>")
TitleEnd = instr(strSearchThis, "</pdf:Title>")
Title = Mid(strSearchThis, TitleStart+11, TitleEnd)
end if

Response.Write ("<tr><td valign='top'><a href=Current/" & PDFfilename & ">" & Title & "</a></td></tr>")

ALMOST THERE!!! This is kinda cool, by the way, just like XML parsing!
Paul
PS: I'm leaving work in ten minutes, will be back tomorrow, and will credit ya 500 points when this is complete. Thanks!!!
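
The trailing junk is consistent with how Mid works: its third argument is a length, not an end position, so Mid(strSearchThis, TitleStart+11, TitleEnd) returns TitleEnd characters starting from the title. The accepted correction is not visible in this thread, so the following is only one plausible fix. The helper name TextBetween and the sample string are made up for illustration.

```vbscript
' Hypothetical helper (not from the thread): returns the text between the
' first strOpen and the next strClose, or "" if either marker is missing.
Function TextBetween(strText, strOpen, strClose)
    Dim startPos, endPos
    TextBetween = ""

    startPos = InStr(strText, strOpen)
    If startPos = 0 Then Exit Function
    startPos = startPos + Len(strOpen)          ' first character after the opening tag

    ' Search for the closing tag only after the opening one.
    endPos = InStr(startPos, strText, strClose)
    If endPos = 0 Then Exit Function

    ' Third argument of Mid is a length: end position minus start position.
    TextBetween = Mid(strText, startPos, endPos - startPos)
End Function

' Usage with a stand-in string; in the page above, strSearchThis would be passed instead.
Dim strSample, strTitle
strSample = "junk<pdf:Title>PDF Title Goes Here</pdf:Title>more junk"
strTitle = TextBetween(strSample, "<pdf:Title>", "</pdf:Title>")

' HTMLEncode keeps characters such as < or & in a title from breaking the page.
Response.Write Server.HTMLEncode(strTitle)   ' writes: PDF Title Goes Here
```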

tobiason

ASKER

NEVER MIND! Your correction fixed it!
You should get your 500 points!
Thanks a bunch for this. Judging by my searches on google.com, this seems to be a common issue that has gone unsolved.

Cheers!!!

mikosha

Thanx :)