sindhuxyz
asked on
itextsharp problem
Hi,
My code is written in VB.NET, but I am comfortable with C# too.
I have tried to extract the text of a PDF document. The text is extracted, but the header and footer come along with it. I do not want the header and footer, only the content of each page.
My code is below:
Any advice?
Thanks
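' Extract the text of every page with iTextSharp (runs inside a Function that returns String).
Dim oReader As New iTextSharp.text.pdf.PdfReader(filepath)
Dim sOut As String = ""
For i As Integer = 1 To oReader.NumberOfPages
    ' A fresh strategy per page, so text from earlier pages is not carried over.
    Dim its As New iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy
    sOut &= iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(oReader, i, its)
Next
oReader.Close() ' release the file handle
System.Diagnostics.Debug.WriteLine(sOut)
Return sOut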
In a PDF file, there isn't a notion of a header and footer, so you'll have to detect that yourself and remove it. PDF is just text on a page.
ASKER
Thanks for the message.
How can I detect them when the header and footer can be any text?
Any advice?
ASKER
How can this be done with a regular expression, given that I am getting plain text? Here is a sample PDF file with a header and footer, and a text file with the text I get from it using iTextSharp.
Any advice?
sample003.pdf
extracted.txt
If you have multiple pages, you might be able to figure it out; if there is specific styling, you might be able to use that. With just a single page, it is very hard to do in a precise and automated fashion.
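One way to read the multi-page suggestion is to look for lines that recur at the top or bottom of most pages and drop them. The following is only a rough sketch of that idea, not something from this thread: the names StripRepeatedEdges, NormalizeLine, EdgeIndexes and the edgeLines parameter are made up for illustration, and it assumes you keep the text of each page in a separate string instead of concatenating everything. Digits are replaced with "#" so that a running page number such as "Page 3 of 9" still counts as the same line.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module RepeatedEdgeFilter

    ' Normalize a line so that lines differing only in numbers compare equal.
    Private Function NormalizeLine(line As String) As String
        Return Regex.Replace(line.Trim(), "\d+", "#")
    End Function

    ' Indexes of the first and last edgeLines non-blank lines of a page.
    Private Function EdgeIndexes(lines As String(), edgeLines As Integer) As List(Of Integer)
        Dim nonBlank = Enumerable.Range(0, lines.Length).
            Where(Function(i) lines(i).Trim().Length > 0).ToList()
        Return nonBlank.Take(edgeLines).
            Concat(nonBlank.Skip(Math.Max(0, nonBlank.Count - edgeLines))).
            Distinct().ToList()
    End Function

    ' pageTexts: one extracted string per page (hypothetical input).
    ' Returns the pages with edge lines removed when the same normalized
    ' line sits at an edge on more than half of the pages (at least two).
    Public Function StripRepeatedEdges(pageTexts As IList(Of String),
                                       Optional edgeLines As Integer = 2) As List(Of String)
        Dim pages = pageTexts.
            Select(Function(t) t.Replace(vbCr, "").Split(ControlChars.Lf)).ToList()

        ' First pass: count, per normalized line, the pages where it is an edge line.
        Dim counts As New Dictionary(Of String, Integer)
        For Each lines In pages
            Dim seen As New HashSet(Of String)
            For Each i In EdgeIndexes(lines, edgeLines)
                Dim k = NormalizeLine(lines(i))
                If seen.Add(k) Then
                    Dim n As Integer = 0
                    counts.TryGetValue(k, n)
                    counts(k) = n + 1
                End If
            Next
        Next

        ' Second pass: drop the edge lines that recur often enough.
        Dim threshold = Math.Max(2, pages.Count \ 2 + 1)
        Dim result As New List(Of String)
        For Each lines In pages
            Dim drop As New HashSet(Of Integer)
            For Each i In EdgeIndexes(lines, edgeLines)
                If counts(NormalizeLine(lines(i))) >= threshold Then drop.Add(i)
            Next
            Dim kept As New List(Of String)
            For i = 0 To lines.Length - 1
                If Not drop.Contains(i) Then kept.Add(lines(i))
            Next
            result.Add(String.Join(vbLf, kept))
        Next
        Return result
    End Function
End Module
```

To try it, collect the result of each GetTextFromPage call in a List(Of String) and pass that list to StripRepeatedEdges. With a single page nothing is removed, and a body line that genuinely repeats at the top or bottom of most pages would be dropped too, so check the output against your own documents.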
ASKER
Is there any way to parse the PDF document so that I can parse the body and leave out the header and footer?
In the end, a PDF document consists of pages, and on those pages there is a DOM tree just like HTML, but with fewer distinct elements. The PdfRectangle object usually specifies where on the page a piece of text is placed; you might be able to use those to figure out the top and bottom text blocks.
The problem is that if your text is flowing around an image or has multiple columns, it becomes quite hard to correctly parse the documents. That is why the PdfTextExtractor can be supplied with a strategy to influence how the document should be read.
Again, the strategy that works with one document might not work with another, depending on how the document is structured internally.
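As an illustration of supplying a strategy, iTextSharp 5.x lets you wrap an extraction strategy in a region filter so that only text inside a given rectangle is returned. The sketch below assumes iTextSharp 5.x and assumes the header and footer each occupy a fixed band at the top and bottom of the page. The function name ExtractBodyText and the 72 point defaults for headerHeight and footerHeight are placeholders, not values taken from the sample file, so measure your own documents.

```vbnet
Imports iTextSharp.text
Imports iTextSharp.text.pdf
Imports iTextSharp.text.pdf.parser

Module BodyTextExtractor

    ' headerHeight and footerHeight are in points (1/72 inch).
    ' Both defaults are assumptions; adjust them for your documents.
    Function ExtractBodyText(filepath As String,
                             Optional headerHeight As Single = 72.0F,
                             Optional footerHeight As Single = 72.0F) As String
        Dim reader As New PdfReader(filepath)
        Dim sb As New System.Text.StringBuilder()
        Try
            For i As Integer = 1 To reader.NumberOfPages
                Dim page As Rectangle = reader.GetPageSize(i)
                ' Keep only the band between the assumed header and footer.
                Dim body As New Rectangle(page.Left, page.Bottom + footerHeight,
                                          page.Right, page.Top - headerHeight)
                Dim filter As New RegionTextRenderFilter(body)
                ' The filter passes only text inside body to the wrapped strategy.
                Dim strategy As New FilteredTextRenderListener(
                    New LocationTextExtractionStrategy(), filter)
                sb.AppendLine(PdfTextExtractor.GetTextFromPage(reader, i, strategy))
            Next
        Finally
            reader.Close()
        End Try
        Return sb.ToString()
    End Function
End Module
```

This only works when the header and footer stay inside the same bands on every page. Rotated pages, pages with a crop box that differs from the page size, or headers of varying height would need their own handling.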
ASKER
Thanks