iTextSharp problem

Hi,

My code is written in VB.NET, but I am comfortable with C# too.
I am extracting text from a PDF document with iTextSharp. The extraction works, but the output includes the header and footer of each page. I want only the page content, without the header and footer.

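My code is below:

        ' Open the PDF; filepath is the path to the source document.
        Dim oReader As New iTextSharp.text.pdf.PdfReader(filepath)
        Dim sOut As String = ""
        ' iTextSharp page numbers start at 1.
        For i = 1 To oReader.NumberOfPages
            ' Use a fresh strategy per page so text does not carry over between pages.
            Dim its As New iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy
            sOut &= iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(oReader, i, its)
        Next
        ' Release the file handle once extraction is done.
        oReader.Close()
        System.Diagnostics.Debug.WriteLine(sOut)
        Return sOut

Any advice?

Thanks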

Asked by: sindhuxyz
Jesse Houwing (Scrum Trainer | Microsoft MVP | ALM Ranger | Consultant) commented:
In a PDF file, there isn't a notion of a header and footer, so you'll have to detect that yourself and remove it. PDF is just text on a page.
sindhuxyz (author) commented:
Thanks for the message.

How can I detect them, given that the header and footer can be any text?

Any advice?
Jesse Houwing commented:
That's the general problem. If the header and footer follow some recognizable pattern or tend to be the same for multiple pages at a time, you could try to write a set of regular expressions to filter them out. It's easiest if you're extracting the text page by page and not all at once.
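
A minimal sketch of the regular-expression idea, applied to the text of one page at a time. The module name, the function name and both patterns are hypothetical; the patterns in particular are only guesses at what a header or footer might look like and would need to be replaced with ones that match your documents.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module HeaderFooterRegexFilter
    ' Hypothetical patterns: a "Page N" / "Page N of M" line and a line
    ' starting with "Confidential". Adjust them to your own documents.
    Private ReadOnly NoisePatterns As Regex() = {
        New Regex("^\s*Page\s+\d+(\s+of\s+\d+)?\s*$", RegexOptions.IgnoreCase),
        New Regex("^\s*Confidential\b.*$", RegexOptions.IgnoreCase)
    }

    ' Returns the text of one page without the lines that match a noise pattern.
    Public Function StripNoiseLines(pageText As String) As String
        Dim kept As New List(Of String)
        ' Normalize line endings, then test each line separately.
        For Each textLine As String In pageText.Replace(vbCr, "").Split(ControlChars.Lf)
            Dim isNoise As Boolean = False
            For Each pattern As Regex In NoisePatterns
                If pattern.IsMatch(textLine) Then
                    isNoise = True
                    Exit For
                End If
            Next
            If Not isNoise Then kept.Add(textLine)
        Next
        Return String.Join(vbLf, kept)
    End Function
End Module
```

In the loop from the question, this would be called on the result of each GetTextFromPage call before it is appended to sOut.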

You could also let the user drag a box around the actual text area and only extract text that is placed there. If you have Acrobat Professional, there's a tool which allows you to do just that.
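
The same "only extract text inside a box" idea can be sketched in code, assuming iTextSharp 5.x, which provides RegionTextRenderFilter and FilteredTextRenderListener. The module name, the function name and the headerHeight and footerHeight parameters are hypothetical; the two heights are values you would have to measure or guess for your own documents.

```vbnet
Imports System.Text
Imports iTextSharp.text.pdf
Imports iTextSharp.text.pdf.parser

Module BodyRegionExtractor
    ' headerHeight and footerHeight are in PDF points (72 points per inch).
    ' They are assumptions about the layout, not values read from the file.
    Public Function ExtractBodyText(filepath As String,
                                    headerHeight As Single,
                                    footerHeight As Single) As String
        Dim reader As New PdfReader(filepath)
        Dim sb As New StringBuilder()
        Try
            For i As Integer = 1 To reader.NumberOfPages
                Dim page As iTextSharp.text.Rectangle = reader.GetPageSize(i)
                ' RectangleJ takes x, y, width, height. The PDF y axis points up,
                ' so the footer band is at the bottom of the page.
                Dim body As New System.util.RectangleJ(
                    page.Left, page.Bottom + footerHeight,
                    page.Width, page.Height - headerHeight - footerHeight)
                Dim filters As RenderFilter() = {New RegionTextRenderFilter(body)}
                ' Only text that falls inside the body rectangle reaches the strategy.
                Dim strategy As ITextExtractionStrategy =
                    New FilteredTextRenderListener(New LocationTextExtractionStrategy(), filters)
                sb.AppendLine(PdfTextExtractor.GetTextFromPage(reader, i, strategy))
            Next
        Finally
            reader.Close()
        End Try
        Return sb.ToString()
    End Function
End Module
```

The filter works on whole text chunks, so a chunk that straddles the edge of the rectangle may still be included. Rotated pages or pages with differing header heights would need extra handling.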

There is no clear-cut way that will work for every document, though.

sindhuxyz (author) commented:
How can regular expressions be applied when all I get is plain text? Attached are a sample PDF file with a header and footer, and a text file with the text I extracted from it using iTextSharp.

Any advice?

Attachments: sample003.pdf, extracted.txt
Jesse Houwing commented:
If you have multiple pages, you might be able to figure it out; if there is specific styling, you might be able to use that. With just a single page, it is very hard to do in a precise and automated fashion.
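
One way the multi-page idea could be sketched: treat the first and last non-empty line of each page as header and footer candidates, and drop a candidate when the same line appears at the edge of several pages. All names below are hypothetical, and the approach assumes that headers and footers occupy exactly one line and differ between pages only in their digits (for example, page numbers).

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module RepeatedEdgeLineFilter
    ' Digits become "#" so that "Page 1" and "Page 2" count as the same line.
    Private Function ToKey(textLine As String) As String
        Return Regex.Replace(textLine.Trim(), "\d+", "#")
    End Function

    Private Function NonEmptyLines(pageText As String) As List(Of String)
        Dim result As New List(Of String)
        For Each textLine As String In pageText.Replace(vbCr, "").Split(ControlChars.Lf)
            If textLine.Trim().Length > 0 Then result.Add(textLine)
        Next
        Return result
    End Function

    Private Function IsRepeated(counts As Dictionary(Of String, Integer),
                                textLine As String, minPages As Integer) As Boolean
        Dim seen As Integer = 0
        Return counts.TryGetValue(ToKey(textLine), seen) AndAlso seen >= minPages
    End Function

    ' Drops the first and/or last line of each page when the same line
    ' (after ToKey) is an edge line on at least minPages pages.
    Public Function RemoveRepeatedEdgeLines(pages As IList(Of String),
                                            minPages As Integer) As List(Of String)
        Dim counts As New Dictionary(Of String, Integer)
        Dim allLines As New List(Of List(Of String))
        ' First pass: count on how many pages each edge line occurs.
        For Each pageText As String In pages
            Dim pageLines As List(Of String) = NonEmptyLines(pageText)
            allLines.Add(pageLines)
            If pageLines.Count = 0 Then Continue For
            Dim keys As New HashSet(Of String)
            keys.Add(ToKey(pageLines(0)))
            keys.Add(ToKey(pageLines(pageLines.Count - 1)))
            For Each k As String In keys
                Dim seen As Integer = 0
                counts.TryGetValue(k, seen)
                counts(k) = seen + 1
            Next
        Next
        ' Second pass: remove edge lines that repeat often enough.
        Dim cleaned As New List(Of String)
        For Each pageLines As List(Of String) In allLines
            If pageLines.Count > 0 AndAlso IsRepeated(counts, pageLines(pageLines.Count - 1), minPages) Then
                pageLines.RemoveAt(pageLines.Count - 1)
            End If
            If pageLines.Count > 0 AndAlso IsRepeated(counts, pageLines(0), minPages) Then
                pageLines.RemoveAt(0)
            End If
            cleaned.Add(String.Join(vbLf, pageLines))
        Next
        Return cleaned
    End Function
End Module
```

The input would be the per-page strings returned by GetTextFromPage, collected into a list instead of being concatenated. As noted above, this cannot work with a single page, because there is nothing to compare against.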
sindhuxyz (author) commented:
Is there any way to parse the PDF document so that I can read only the body and leave out the header and footer?
Jesse Houwing commented:
In the end, a PDF document consists of pages, and on those pages there is a DOM tree just like HTML, but with fewer distinct elements. The PdfRectangle object usually specifies where on the page a piece of text is placed; you might be able to use those to figure out the top and bottom text blocks.

The problem is that if your text is flowing around an image or has multiple columns, it becomes quite hard to correctly parse the documents. That is why the PdfTextExtractor can be supplied with a strategy to influence how the document should be read.

Again, the strategy that works with one document might not work with another, depending on how the document is structured internally.
sindhuxyz (author) commented:
Thanks