iTextSharp problem

Hi,

My code is written in VB.NET, but I am comfortable with C# too.
I am trying to extract the text of a PDF document. The extraction works, but the output includes the header and footer of every page. I want only the body content of each page, without the header and footer.

My code is like below:

Any advice?

Thanks

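' Extract the text of every page and return it as one string.
Dim oReader As New iTextSharp.text.pdf.PdfReader(filepath)
Dim sOut As New System.Text.StringBuilder()
Try
    For i As Integer = 1 To oReader.NumberOfPages
        ' Use a fresh strategy per page so text from earlier pages is not repeated.
        Dim its As New iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy()
        ' AppendLine keeps the last line of one page from running into the next page.
        sOut.AppendLine(iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(oReader, i, its))
    Next
Finally
    ' Release the file handle even if extraction fails.
    oReader.Close()
End Try
System.Diagnostics.Debug.WriteLine(sOut.ToString())
Return sOut.ToString()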

Jesse Houwing

In a PDF file, there isn't a notion of a header and footer, so you'll have to detect that yourself and remove it. PDF is just text on a page.

sindhuxyz

ASKER

Thanks for the message.

How can I detect them? The header and footer can be any text.

Any advice?

ASKER CERTIFIED SOLUTION

Jesse Houwing

This solution is only available to members.

sindhuxyz

ASKER

How can this be done with a regular expression, given that I am only getting plain text? Attached are a sample PDF file with a header and footer, and the text I get from it using iTextSharp, saved as a text file.

Any advice?

sample003.pdf
extracted.txt

Jesse Houwing

If you have multiple pages, you might be able to figure it out, for example by looking for text that repeats on every page. If the header and footer have specific styling, you might be able to use that. With just a single page, it is very hard to do in a precise and automated fashion.
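
To illustrate the multi-page idea, here is a minimal sketch that works on the plain text you already have, one string per page. It assumes the header and footer sit in the first or last few lines of each page and repeat on most pages. The module and function names, the `edgeLines` and `minShare` parameters, and the digit normalization (so that "Page 1" and "Page 2" compare equal) are all my own assumptions, not part of iTextSharp.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module HeaderFooterFilter
    ' Hypothetical helper: replaces digit runs with "#" so page numbers compare equal.
    Private Function NormalizeLine(line As String) As String
        Return Regex.Replace(line.Trim(), "\d+", "#")
    End Function

    ' True when the line is among the first or last edgeLines lines of its page.
    Private Function IsEdge(index As Integer, count As Integer, edgeLines As Integer) As Boolean
        Return index < edgeLines OrElse index >= count - edgeLines
    End Function

    ' Drops edge lines whose normalized text occurs on at least minShare of the pages
    ' (and on at least two pages, so a single page is returned unchanged).
    Public Function StripRepeatedEdgeLines(pages As IList(Of String),
                                           Optional edgeLines As Integer = 2,
                                           Optional minShare As Double = 0.6) As List(Of String)
        Dim separators As Char() = {ChrW(13), ChrW(10)}
        Dim split = pages.Select(
            Function(p) p.Split(separators, StringSplitOptions.RemoveEmptyEntries)).ToList()

        ' Count on how many pages each normalized edge line occurs.
        Dim pageCounts As New Dictionary(Of String, Integer)()
        For Each lines In split
            Dim seen As New HashSet(Of String)()
            For i As Integer = 0 To lines.Length - 1
                If IsEdge(i, lines.Length, edgeLines) Then seen.Add(NormalizeLine(lines(i)))
            Next
            For Each key In seen
                Dim n As Integer = 0
                pageCounts.TryGetValue(key, n)
                pageCounts(key) = n + 1
            Next
        Next

        Dim threshold As Integer = Math.Max(2, CInt(Math.Ceiling(minShare * pages.Count)))

        ' Rebuild each page without the repeated edge lines.
        Dim result As New List(Of String)()
        For Each lines In split
            Dim kept As New List(Of String)()
            For i As Integer = 0 To lines.Length - 1
                Dim n As Integer = 0
                Dim repeated As Boolean = IsEdge(i, lines.Length, edgeLines) AndAlso
                    pageCounts.TryGetValue(NormalizeLine(lines(i)), n) AndAlso n >= threshold
                If Not repeated Then kept.Add(lines(i))
            Next
            result.Add(String.Join(Environment.NewLine, kept))
        Next
        Return result
    End Function
End Module
```

To use it, collect the result of each `GetTextFromPage` call into a list instead of concatenating everything into one string, then pass that list to `StripRepeatedEdgeLines`. Body text that happens to repeat at the top or bottom of most pages would be removed as well, so check the output against a few real documents.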

sindhuxyz

ASKER

Is there any way to parse the PDF document so that I can extract the body and leave out the header and footer?

Jesse Houwing

In the end a PDF document consists of pages, and on those pages there is a DOM tree just like HTML, but with fewer distinct elements. The PdfRectangle objects usually specify where on the page a piece of text is placed; you might be able to use those to figure out the top and bottom text blocks.

The problem is that if your text is flowing around an image or has multiple columns, it becomes quite hard to correctly parse the documents. That is why the PdfTextExtractor can be supplied with a strategy to influence how the document should be read.

Again, the strategy that works with one document might not work with another, depending on how the document is structured internally.
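
As a rough sketch of the position-based approach, assuming iTextSharp 5.x: its parser namespace has a `RegionTextRenderFilter` that can be wrapped around an extraction strategy with `FilteredTextRenderListener`, so that only text inside a given rectangle is kept. The module and function names and the 72-point header and footer heights below are assumptions for illustration only; you would need to measure the real margins in your own documents.

```vbnet
Imports System.Text
Imports iTextSharp.text.pdf
Imports iTextSharp.text.pdf.parser

Module BodyTextExtractor
    ' headerHeight and footerHeight are assumed values in points (1/72 inch),
    ' not something iTextSharp can detect for you.
    Public Function ExtractBodyText(filepath As String,
                                    Optional headerHeight As Single = 72.0F,
                                    Optional footerHeight As Single = 72.0F) As String
        Dim reader As New PdfReader(filepath)
        Dim sb As New StringBuilder()
        Try
            For i As Integer = 1 To reader.NumberOfPages
                Dim size As iTextSharp.text.Rectangle = reader.GetPageSize(i)
                ' PDF coordinates start at the bottom-left corner,
                ' so the body rectangle begins just above the footer band.
                Dim body As New System.util.RectangleJ(
                    size.Left, size.Bottom + footerHeight,
                    size.Width, size.Height - headerHeight - footerHeight)
                ' Keep only the text chunks that the region filter accepts for this rectangle.
                Dim filter As New RegionTextRenderFilter(body)
                Dim strategy As New FilteredTextRenderListener(
                    New LocationTextExtractionStrategy(), filter)
                sb.AppendLine(PdfTextExtractor.GetTextFromPage(reader, i, strategy))
            Next
        Finally
            reader.Close()
        End Try
        Return sb.ToString()
    End Function
End Module
```

Fixed margins only work when the header and footer stay within the same bands on every page. Rotated pages, or pages whose crop box differs from the page size, would need the rectangle computed differently.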

sindhuxyz

ASKER

Thanks