Solved

Urgent: Using an index to get the documents from database (JSP & MySQL)

Posted on 2003-03-27
Last Modified: 2007-12-19
Hello All

I have been able to upload PDF documents into a MySQL database. I now want to index them so that, as with Google, you type a word into a text field, submit it, and it searches the documents for a match.

I wish to do the same.

I'm using these technologies:
1. tomcat 4.0.4
2. MySQL
3. JSP

I would like some examples of how to do this, if anyone can help I would be very grateful.
Thanks again.

Juggy
Question by:Juggy1
17 Comments
 
LVL 19

Expert Comment

by:cheekycj
ID: 8218964
I am not aware of anything plug-and-play that does this.

My suggestion would be this:

Store the PDF in a directory.
Store the path to the PDF and the text of the PDF in the DB for searching and indexing.

Since the PDF is stored in binary format in the DB, I don't think searching it directly in the DB is doable, and if it is, it is definitely not simple :-)
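
A minimal sketch of the search side of this suggestion, assuming the extracted text already sits in a text column next to the file path. The table `documents` and its columns `path` and `body` are hypothetical names, not something from this thread, and the escaping assumes MySQL's default backslash escape character for LIKE.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

final class PdfTextSearch {

    /** Escapes the LIKE wildcards (% and _) and the backslash, then wraps the term in %...%. */
    static String toLikePattern(String term) {
        StringBuilder sb = new StringBuilder("%");
        for (char c : term.toCharArray()) {
            if (c == '%' || c == '_' || c == '\\') {
                sb.append('\\'); // assumes backslash is the LIKE escape character
            }
            sb.append(c);
        }
        return sb.append('%').toString();
    }

    /** Returns the stored paths of documents whose extracted text contains the term. */
    static List<String> findPaths(Connection conn, String term) throws SQLException {
        // Hypothetical schema: documents(path VARCHAR, body TEXT)
        String sql = "SELECT path FROM documents WHERE body LIKE ?";
        List<String> paths = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, toLikePattern(term));
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    paths.add(rs.getString("path"));
                }
            }
        }
        return paths;
    }
}
```

A JSP page would pass the submitted search word to `findPaths` and render the returned paths as links. A substring match like this scans every row, so it only suits a small document set.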

CJ
0
 
LVL 19

Expert Comment

by:cheekycj
ID: 8218993
Also keep in mind that if the PDFs are stored in the DB and one file gets corrupted, you have to refresh the DB; if the files are external, you can just replace the corrupt one.

CJ
0
 
LVL 14

Expert Comment

by:kennethxu
ID: 8219372
You need to know that you cannot search a PDF file directly, wherever it is stored, whether in the database or on the filesystem.

You have to extract the text from the PDF first, and then you can index it. There are some tools available to do this:
http://www.topshareware.com/PDF-Plain-Text-Extractor-download-2346.htm
http://www.verypdf.com/pdf2txt/pdf2txt.htm
http://www.pdfstore.com/details.asp?ProdID=591
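
Once the text has been extracted, the "index it" step can be sketched as a small in-memory inverted index: each word maps to the set of documents that contain it. This is an illustration only; the class `SimpleTextIndex` is hypothetical, it is not backed by MySQL, and it does not persist anything.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class SimpleTextIndex {
    // word -> sorted set of document ids (for example file paths) that contain it
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Splits the extracted text into lower-case words and records each one under docId. */
    void add(String docId, String extractedText) {
        String[] words = extractedText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        for (String word : words) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Returns the ids of all documents containing the word (case-insensitive). */
    Set<String> search(String word) {
        return postings.getOrDefault(word.toLowerCase(Locale.ROOT), Collections.emptySet());
    }
}
```

Calling `add` once per uploaded document and `search` once per submitted word gives the Google-style lookup described in the question, for a single word at a time.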
0
 

Author Comment

by:Juggy1
ID: 8247561
Thanks all for the info.

So how would I extract the text from a PDF before I upload the info to my DB?

I am using the UploadBean from 'JavaZoom' (www.javazoom.net). How would I be able to do this with it?

I would be grateful if you could explain.

Thanks again
Juggy

0
 

Author Comment

by:Juggy1
ID: 8251861
hello all

In my previous message, I asked how to extract the text using the tools that were given.

However, I have the UploadBean working but cannot get the file path. It actually inserts the file into the database.
As I stated earlier, the code is from www.javazoom.net,
using their UploadBean example.

Would it be possible for you to take a look? I can zip the code and email it to you both.

Thanks again
Juggy
0
 
LVL 14

Expert Comment

by:kennethxu
ID: 8258501
Found you an open source extractor:
http://www.jpedal.org/home.html

I think you should be able to extract the text from memory.
0
 
LVL 14

Expert Comment

by:kennethxu
ID: 8258505
look into the UploadFile class.
0
 
LVL 14

Expert Comment

by:kennethxu
ID: 8258536
UploadFile.getData will return the PDF content; you can then use jpedal to extract the text from it.

http://www.javazoom.net/jzservlets/uploadbean/documentation/api/javazoom/upload/UploadFile.html
0
 

Author Comment

by:Juggy1
ID: 8268525
Thanks for the info it is very helpful.

I have also looked at the UploadFile class and seen the getData() method. I have nearly finished writing the code to store the PDF documents in a folder and store their file paths in MySQL.
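
For the folder-storage part, a minimal sketch of writing the uploaded bytes to disk and getting back the path to record in the database. It uses `java.nio.file`, which is newer than the JDKs that shipped alongside Tomcat 4, so treat it as a modern-Java illustration; the class `PdfFileStore` is hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class PdfFileStore {

    /** Writes the bytes under baseDir and returns the path to store in the database. */
    static Path save(Path baseDir, String clientFileName, byte[] data) throws IOException {
        // Keep only the last path segment so "../" or "C:\..." in the
        // client-supplied name cannot escape baseDir.
        String name = clientFileName.replace('\\', '/');
        name = name.substring(name.lastIndexOf('/') + 1);
        if (name.isEmpty() || name.equals(".") || name.equals("..")) {
            throw new IllegalArgumentException("Unusable file name: " + clientFileName);
        }
        Files.createDirectories(baseDir);
        // Files.write creates the file or overwrites an existing one of the same name.
        return Files.write(baseDir.resolve(name), data);
    }
}
```

Because an existing file with the same name is overwritten, a real upload handler would probably add a unique prefix to each name.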

What I still need is a good example of using UploadFile.getData with jpedal to extract the text, because I couldn't find anything.

I would be grateful for any help or suggestions.
Thanks again
Juggy
0
 
LVL 14

Expert Comment

by:kennethxu
ID: 8270065
>> But using the UploadFile.getData with jpedal to extract text is there a good example of the code to do this because I couldn't find anything.
I'm afraid not, but think of it this way.

UploadFile.getData gives you the PDF content as a byte[], and jpedal's PdfDecoder takes a byte[] as the parameter to one of its openPdfFile methods:

void openPdfFile(byte[] data)

jpedal API doc: http://www.jpedal.org/docs/JPedalAPI.pdf

So any jpedal example should serve the purpose.

Also, check out the API; that's where we start when learning something new :)
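
Before handing the byte[] to a decoder, it can help to confirm that the upload really is a PDF. A minimal sketch, relying only on the fact that a PDF file opens with the marker `%PDF-`; the class `PdfSniffer` is a hypothetical helper and is not part of UploadBean or jpedal.

```java
import java.nio.charset.StandardCharsets;

final class PdfSniffer {
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if data begins with the "%PDF-" marker that opens a PDF file. */
    static boolean looksLikePdf(byte[] data) {
        if (data == null || data.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (data[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }
}
```

This is a strict check at offset 0. Some readers tolerate a few junk bytes before the marker, and this sketch would reject such files.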
0
 
LVL 14

Expert Comment

by:kennethxu
ID: 8270202
Ha, jpedal used to have examples available on their site; now they have started to charge for examples and support.

The only thing I can find is this discussion:
http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg02411.html

I guess you'll have to study the API.
0
 

Author Comment

by:Juggy1
ID: 8278875
Hello All

I have had a look at the Jpedal API and started working on some of its code from the download files.

Here is what I got:

package petshowbeans;

import javazoom.upload.UploadFile;
import java.io.*;
import java.util.*;
import java.sql.*;
import java.io.BufferedReader;
import org.jpedal.xobjects.*;
import org.jpedal.color.*;
import org.jpedal.fonts.*;
import org.jpedal.images.*;
import org.jpedal.io.*;
import org.jpedal.objects.*;
import org.jpedal.exception.*;
import org.jpedal.text.*;
import org.jpedal.utils.*;
import org.jpedal.utils.repositories.*;

public class PdfDecoder extends PdfEngine
{
      /**version number*/
      public static final String version = "1.90";
      
      /**flag to show if form*/
      private boolean isForm=false;
      
      /**holds id of all pages read*/
      private Map pagesRead = new Hashtable();
      
      /** holds the AcroForm data */
      private PdfFormData currentAcroFormData = new PdfFormData();
      
      /** holds the annotation data */
      private PdfAnnots annotsData = new PdfAnnots();
      
      /**Any XML metadata as a string*/
      private String metadata="";
      
      /** count of how many pages loaded */
      private int pageCount = 0;
      
      /**holds pdf id (ie 4 0 R) which stores each object*/
      private Map pagesReferences = new Hashtable();      
      
      
      /**
       * Create a PdfDecoder which will extract data from a
       * pdf file and store it in a PdfData object.
       */
      public PdfDecoder(boolean useClient) throws PdfException {
            
            /**get local handles onto flag passed in*/
            this.useClient = useClient;
            
            /***create Postscript command list lookup table*/
            setupCommandList();
            
            /**initialise objects used (mostly defined in pdfObjects.java)*/
            currentImageData = new PdfImages();
            currentPdfFile = new PdfObjectReader();
            currentFontData = new PdfFontsData(currentPdfFile);
            currentCIDFontData = new PdfCIDFontsData(currentPdfFile);
            currentXobjectData =new PdfXObjects(currentPdfFile, currentImageData);
            currentColorData = new PdfColor(currentPdfFile);
            pdfData = new PdfData();
            pdfImages = new PdfImageData();
            
            LogWriter.writeLog("Pdf code initialised");
            
      }
      
      /**
       * convenience method to close the pdf file
       *  (for access by outside class)
       */
      final public void closePdfFile() {
            currentPdfFile.closePdfFile();
      }
      
      /**
       * provide method for outside class to get data object containing information
       * on the page for calculating grouping
       */
      final public PdfPageData getPdfPageData() {
            return pageData;
      }
      
      /**
       * provide method for outside class to get data object containing text
       */
      final public PdfData getPdfData() {
            return pdfData;
      }
      
      /**
       * provide method for outside class to
       * clear store of objects once written out
       * to reclaim memory
       */
      final public void flushObjectValues(boolean reinit) {
            pdfData.flushTextList(reinit);
            annotsData.flushTextList();
            flushFormValues();
            
            if (reinit == false)
                  pdfImages.clearImageData();
      }
      
      /**
       * provide method for outside class to clear PdfData object
       */
      final public void clearPdfImageData() {
            pdfImages.clearImageData();
      }
      
      
      /**
       * read the form data from the file
       */
      final private void readAcroForm(String currentFormOffset,boolean showLines)
            throws PdfException {
            
            String value = "",formObject="";
            
            LogWriter.writeLog("Form data being read");
            
            /**
             * read form object metadata
             */
            Map values = readValues(currentFormOffset, showLines);
            
            /**read the fields*/
            value= (String) values.get("Fields");                  
            
            /**strip the braces*/
            value=Strip.removeArrayMarkers(value);
            
            /**read each form object*/
            try {
                  StringTokenizer initialValues =new StringTokenizer(value, "R");
                  while (initialValues.hasMoreTokens()) {
                        
                        formObject =initialValues.nextToken().trim() + " R";
                        
                        readAcroFormField(formObject);
                        
                  }
            } catch (Exception e) {
                  LogWriter.writeLog("Exception "+ e+ " reading form object "+formObject+" from "+values);
                  throw new PdfException("Exception");
            }      
      }
      
      /**
       * read the XML metadata from the file
       */
      final private String readMetadata(String currentOffset)
            throws PdfException {
            
            String line="";
            StringBuffer XMLObject=new StringBuffer();
            
            LogWriter.writeLog("XML Metadata being read");
            
            try {
                  
                  BufferedReader mappingStream =
                        currentPdfFile.readTextObjectData(currentOffset);
                  
                  //read values into lookup table
                  if (mappingStream != null) {
                        
                        while (true) {
                              line = mappingStream.readLine();
                              
                              if (line == null)
                                    break;
                              
                              //append to XML data
                              XMLObject.append(line);      
                              XMLObject.append('\n');
                        }
                  }
                  
            } catch (Exception e) {
                  LogWriter.writeLog("Exception "+ e+ " reading XML object "+currentOffset);
                  throw new PdfException("Exception");
            }      
            
            return XMLObject.toString();
      }
      
      /**
       * read page header and extract page metadata
       */
      final private void readResources(String resources) throws PdfException {
            String value;
            Map resource_values;
            
            //remember current location
            long current_pointer = currentPdfFile.getPointer();
            if (debugLevel > 1)
                  LogWriter.writeLog("Reading resources object " + resources);
            
            //from stream or indirect ref
            if (resources.endsWith("R"))
                  resource_values =
                  currentPdfFile.readObjectData(resources, false, false, false);
            else
                  resource_values =
                  currentPdfFile.decodeStringIntoValue(resources, false);
            
            //decode fonts
            value = (String) resource_values.get("Font");
            if (value != null)
                  readFonts(value);
            
            //decode colourspaces
            value = (String) resource_values.get("ColorSpace");
            if (value != null)
                  currentColorData.readColorSpaces(
                  currentPdfFile.getValue(value));
            
            //decode procs
            value = (String) resource_values.get("ProcSet");
            if (value != null)
                  readProc(value);
            
            //XObjects
            value = (String) resource_values.get("XObject");
            if ((value != null)&& (can_access_images == true)&& (processImages == true)) {
                  
                  //don't decode again if we already have it
                  try {
                        currentXobjectData.processXObjects(currentColorData,useClient,value);
                  } catch (Exception e) {
                        LogWriter.writeLog("Exception " + e + " processing XObjects");
                  }
            }
            
      }
      
      /**
       * read the font information from the page
       */
      final private void readFonts(String raw_values) throws PdfException {
            
            if (debugLevel > 1)
                  LogWriter.writeLog("Reading fonts");
            
            Map values;
            String font_object = "", type = "", subtype = "";
            String font_id;
            
            //get number of fonts
            long file_ref = currentPdfFile.getPointer();
            //save current file location
            currentPdfFile.movePointer(file_ref);
            
            //strip << >>
            raw_values = Strip.stripMainBraces(raw_values);
            StringTokenizer font_objects = new StringTokenizer(raw_values);
            
            //allow for an object
            if (font_objects.countTokens() == 3) {
                  raw_values = currentPdfFile.readXObjectHeader(raw_values);
                  font_objects = new StringTokenizer(raw_values);
            }
            
            //work through each item in turn
            while (font_objects.hasMoreTokens()) {
                  font_id = font_objects.nextToken().substring(2);
                  font_object =
                        font_objects.nextToken()
                        + " "
                        + font_objects.nextToken()
                        + " "
                        + font_objects.nextToken();
                  
                  //get values
                  values =
                        currentPdfFile.readObjectData(
                        font_object,
                        false,
                        false,
                        false);
                  type = (String) values.get("Type");
                  
                  /**
                   * make sure it is a font
                   */
                  if (type.equals("/Font")) {
                        //deal with types
                        subtype = (String) values.get("Subtype");
                        if (subtype.equals("/Type0")) {
                              currentCIDFontData.readFontType0(values, font_id);
                        }else if (subtype.equals("/CIDFontType0")) {
                              currentCIDFontData.readCIDFontType0(values, font_id);
                        }else if (subtype.equals("/CIDFontType2")) {
                              currentCIDFontData.readCIDFontType2(values, font_id);      
                        } else if (
                              subtype.equals("/Type1")
                              || subtype.equals("/Type1C")
                              || subtype.equals("/TrueType")
                              || subtype.equals("/Type3"))
                              currentFontData.readAllTypeFont(values, font_id);
                        else
                              LogWriter.writeLog(
                              "Font type " + subtype + " not supported");
                  } else
                        LogWriter.writeLog("Not a font object");
            }
            if (debugLevel > 1)
                  LogWriter.writeLog("Fonts read");
      }
      
      
      /**
       * routine to open as a byte stream and extract key info from pdf
       * file so we can decode any pages. Does not actually decode the
       * pages themselves.
       */
      final public void openPdfFile(byte[] data) throws PdfException {
            
            /**get reader object to open the file*/
            currentPdfFile.openPdfFile(data);
            
            /**read and log the version number of pdf used*/
            LogWriter.writeLog("Pdf version : " + currentPdfFile.getType());
            
            /**read reference table so we can find all objects and
             * also say if encrypted*/
            String root_id = currentPdfFile.readReferenceTable(useClient);
            
            /**
             * read the catalog
             */
            LogWriter.writeLog("Reading catalog");
            Map values =
                  currentPdfFile.readObjectData(root_id, false, false, false);
            
            /**read any XML info and assign to global value*/
            String value= (String) values.get("Metadata");
            if (value != null)
                  metadata=readMetadata(value);
            else
                  metadata="";
            
            /**get pointer to pages and read the page info*/
            value = (String) values.get("Pages");
            if (value != null)
                  readAllPageReferences(value);
            
            /**Read any form data*/
            value = (String) values.get("AcroForm");
            if (value != null){
                  readAcroForm(value,true);
                  isForm=true;
            }else{
                  isForm=false;
            }
            
            /**store file name for use elsewhere as part of ref key without .pdf*/
            currentXobjectData.storeFileName("<raw data>");
      }
}
      


Can you please take a look and tell me where I am going wrong? I have not yet used the UploadFile.getData() method.
I would be grateful for more suggestions and hints.

Thanks again for all your help.
Juggy
0
 
LVL 14

Accepted Solution

by:
kennethxu earned 900 total points
ID: 8280671
Juggy, I think you should make use of the PdfDecoder class instead of changing it. The steps I can see are:

1. Get the PDF content from the upload file.
UploadFile file = ....
byte[] pdf = file.getData();

2. Create a PdfDecoder instance.
PdfDecoder decoder = new PdfDecoder(...);

3. Pass the PDF content into the decoder.
decoder.openPdfFile( pdf );

4. Call methods in PdfDecoder to get the text content.
...... // try to figure it out from the sample and API doc.
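
The steps above can be sketched as one small pipeline. The thread never identifies which PdfDecoder methods return the text, so steps 2 to 4 are represented here by a placeholder `Function<byte[], String>`; that extractor, and the class `UploadTextPipeline`, are hypothetical and stand in for whatever the jpedal API actually provides.

```java
import java.util.function.Function;

final class UploadTextPipeline {

    /** Result of processing one upload: its file name and the text to index. */
    static final class IndexedDocument {
        final String fileName;
        final String text;

        IndexedDocument(String fileName, String text) {
            this.fileName = fileName;
            this.text = text;
        }
    }

    /**
     * pdfBytes is what step 1 (UploadFile.getData) would supply.
     * extractor is a placeholder for steps 2 to 4 (the decoder calls).
     */
    static IndexedDocument process(String fileName, byte[] pdfBytes,
                                   Function<byte[], String> extractor) {
        if (pdfBytes == null || pdfBytes.length == 0) {
            throw new IllegalArgumentException("Empty upload: " + fileName);
        }
        String raw = extractor.apply(pdfBytes);
        // Collapse runs of whitespace so the stored text is easier to search.
        String text = (raw == null) ? "" : raw.trim().replaceAll("\\s+", " ");
        return new IndexedDocument(fileName, text);
    }
}
```

Keeping the extractor behind an interface means the upload and database code can be written and tested before the jpedal calls are worked out.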
0
 

Author Comment

by:Juggy1
ID: 8317978
May I keep this question open? I might still have a few questions to ask.

Thanks again
Juggy
PS: I am still trying to work on the PDF extraction; it's a bit slow.
0
 

Author Comment

by:Juggy1
ID: 8486844
Thanks for all your help and guidance.

Juggy
ps Sorry about the long wait for awarding the points, I have been busy with other things.
0
 
LVL 14

Expert Comment

by:kennethxu
ID: 8495353
is it working?
0

Question has a verified solution.
