sindhuxyz

itextsharp problem

Hi,

My code is written in VB.NET, but I am comfortable with C# too.
I am trying to extract the text of a PDF document. The extraction works, but the output includes the header and footer of each page. I only want the page content, without the header and footer.

My code is below:

Dim oReader As New iTextSharp.text.pdf.PdfReader(filepath)
Dim sOut As String = ""
' Extract the text of each page (page numbers are 1-based) and append it to sOut
For i As Integer = 1 To oReader.NumberOfPages
    Dim its As New iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy
    sOut &= iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(oReader, i, its)
Next
System.Diagnostics.Debug.WriteLine(sOut)
Return sOut

Any advice?

Thanks


Jesse Houwing

A PDF file has no notion of a header or footer, so you'll have to detect them yourself and remove them. A PDF page is just text placed on a page.
sindhuxyz

ASKER

Thanks for the message.

How can I detect them, given that the header and footer can be any text?

Any advice?
ASKER CERTIFIED SOLUTION
Jesse Houwing
How can this be done with a regular expression, given that all I get is plain text? I have attached a sample PDF file with a header and footer, along with the text file I get from it using iTextSharp.

Any advice?


sample003.pdf
extracted.txt
If you have multiple pages, you might be able to figure it out, since headers and footers typically repeat from page to page. If they have specific styling, you might be able to use that as well. With just a single page, it is very hard to do in a precise and automated fashion.
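The multi-page idea can be sketched roughly as follows. This is only an illustration under stated assumptions: it assumes the header is the first extracted line and the footer the last extracted line of a page, and that they recur on most pages. The names `HeaderFooterStripper`, `StripRepeatedEdges`, and `Bump`, as well as the "more than half of the pages" threshold, are made up for this example. It works on the per-page strings returned by `GetTextFromPage` and uses a regular expression only to make page numbers compare equal.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports System.Text.RegularExpressions

Module HeaderFooterStripper
    ' Replace digit runs so "Page 1 of 9" and "Page 2 of 9" compare equal.
    Private Function Normalize(line As String) As String
        Return Regex.Replace(line.Trim(), "\d+", "#")
    End Function

    ' Increment the counter stored under key (missing keys start at 0).
    Private Sub Bump(counts As Dictionary(Of String, Integer), key As String)
        Dim n As Integer
        counts.TryGetValue(key, n)
        counts(key) = n + 1
    End Sub

    ' pages: one extracted text string per page.
    ' Drops the first/last line of a page when the same normalized line is
    ' the first/last line on more than half of the pages (assumed threshold).
    Function StripRepeatedEdges(pages As IList(Of String)) As List(Of String)
        Dim separators As String() = New String() {vbCrLf, vbLf}
        Dim split = pages.Select(
            Function(p) p.Split(separators, StringSplitOptions.RemoveEmptyEntries).ToList()).ToList()

        ' Count how often each line occurs as the first or last line of a page.
        Dim firsts As New Dictionary(Of String, Integer)
        Dim lasts As New Dictionary(Of String, Integer)
        For Each lines In split
            If lines.Count = 0 Then Continue For
            Bump(firsts, Normalize(lines(0)))
            Bump(lasts, Normalize(lines(lines.Count - 1)))
        Next

        ' At least two occurrences are required, so a single page is left untouched.
        Dim threshold As Integer = Math.Max(1, pages.Count \ 2)
        Dim result As New List(Of String)
        For Each lines In split
            If lines.Count > 0 AndAlso firsts(Normalize(lines(0))) > threshold Then
                lines.RemoveAt(0)
            End If
            If lines.Count > 0 AndAlso lasts(Normalize(lines(lines.Count - 1))) > threshold Then
                lines.RemoveAt(lines.Count - 1)
            End If
            result.Add(String.Join(vbLf, lines))
        Next
        Return result
    End Function
End Module
```

To use it, collect the text of each page into a list instead of concatenating it into one string, then pass that list to `StripRepeatedEdges`. Headers or footers that span several lines, or that differ in more than their digits, would need a more elaborate comparison.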
Is there any way to parse the PDF document so that I can extract the body and leave out the header and footer?
In the end, a PDF document consists of pages, and on those pages there is a tree of objects, roughly like the DOM tree in HTML but with fewer distinct elements. The PdfRectangle objects usually specify where on the page a piece of text is placed, so you might be able to use those to figure out the top and bottom text blocks.

The problem is that if your text is flowing around an image or has multiple columns, it becomes quite hard to correctly parse the documents. That is why the PdfTextExtractor can be supplied with a strategy to influence how the document should be read.

Again, the strategy that works with one document might not work with another, depending on how the document is structured internally.
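One way to combine the placement idea with an extraction strategy is to restrict extraction to a rectangle that leaves out the top and bottom bands of each page. The sketch below assumes iTextSharp 5.x, where `RegionTextRenderFilter`, `FilteredTextRenderListener`, and `LocationTextExtractionStrategy` are found in `iTextSharp.text.pdf.parser`. The names `BodyTextExtractor` and `ExtractBodyText` and the two margin parameters are hypothetical, and suitable margin values (in points) have to be determined by inspecting your own documents.

```vb
Imports System.Text
Imports iTextSharp.text.pdf
Imports iTextSharp.text.pdf.parser

Module BodyTextExtractor
    ' Extracts only the text whose position falls inside the page area that
    ' remains after cutting topMargin and bottomMargin (both in points,
    ' assumed values chosen by the caller) from every page.
    Function ExtractBodyText(filepath As String,
                             topMargin As Single,
                             bottomMargin As Single) As String
        Dim reader As New PdfReader(filepath)
        Dim sb As New StringBuilder()
        Try
            For i As Integer = 1 To reader.NumberOfPages
                Dim page As iTextSharp.text.Rectangle = reader.GetPageSize(i)

                ' Body region: full page width, page height minus both margins.
                ' PDF coordinates start at the bottom-left corner of the page.
                Dim body As New System.util.RectangleJ(
                    page.Left,
                    page.Bottom + bottomMargin,
                    page.Width,
                    page.Height - topMargin - bottomMargin)

                ' Text outside the region is filtered out before it reaches
                ' the wrapped location-based strategy.
                Dim filters() As RenderFilter = {New RegionTextRenderFilter(body)}
                Dim strategy As ITextExtractionStrategy =
                    New FilteredTextRenderListener(New LocationTextExtractionStrategy(), filters)

                sb.AppendLine(PdfTextExtractor.GetTextFromPage(reader, i, strategy))
            Next
        Finally
            reader.Close()
        End Try
        Return sb.ToString()
    End Function
End Module
```

A call such as `ExtractBodyText(filepath, 72, 72)` would skip roughly one inch at the top and bottom of each page. As noted above, fixed margins that suit one document may cut off body text or let header text through in another.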
Thanks