Solved

Is IFilter or Parsing Faster?

Posted on 2006-05-05
Last Modified: 2008-02-26
Hello,

I am trying to create a parser that parses documents such as .html, .asp, or .doc (Word).

So if a document contained text such as: "Today we will be going over the new APPLE technology. EXPERT is a new technology coming out in 06."
it would find words like Apple and Expert.


I have looked into the IFilter technology, and it seems like it would work.  

But I was wondering: would it be faster to use a technology such as IFilter, or to just parse the text word by word (that is, take each group of letters between spaces, e.g. in "Hi APPLE bye" the word APPLE sits between two spaces), and then check whether that word exists in my list of known words?
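The word-by-word approach described above can be sketched roughly as follows. This is a minimal illustration in Python, not a tuned implementation: the `KNOWN_WORDS` set and the `find_known_words` function are hypothetical names, and the sketch splits on any run of non-letters rather than only on spaces so that "APPLE," or "technology." still match.

```python
import re

# Hypothetical list of known words. A set gives fast membership tests.
KNOWN_WORDS = {"apple", "expert"}

def find_known_words(text, known=KNOWN_WORDS):
    """Return the known words found in text, in order of first appearance."""
    found = []
    seen = set()
    # Split on runs of non-letters so punctuation does not stick to a word.
    for token in re.split(r"[^A-Za-z]+", text):
        word = token.lower()  # case-insensitive comparison
        if word in known and word not in seen:
            seen.add(word)
            found.append(word)
    return found

sample = ("Today we will be going over the new APPLE technology.  "
          "EXPERT is a new technology coming out in 06.")
print(find_known_words(sample))  # prints ['apple', 'expert']
```

This only works on text that has already been extracted; getting plain text out of a .doc file is the part that a filter component would handle.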


Thanks,

Question by:alexthecodepoet
LVL 9

Accepted Solution

by:
pallosp earned 2000 total points
ID: 16618433
IFilter has some overhead that you should account for:
- Calling a COM component is slower than calling your own method, but if you use in-process components, this performance loss is negligible.
- The component that implements IFilter may check the validity of the document and do extra work that you don't need. I think this takes the most time.

If high performance is a very important factor, HTML parsing can be done faster by native code designed explicitly to extract words from an HTML document. But a Word document is very complex, so IFilter is the only economical solution.
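As a rough illustration of what a purpose-built HTML word extractor involves, here is a minimal sketch using Python's standard `html.parser` module. It shows the shape of the approach only; it is not native code and says nothing about relative speed. The class name `WordExtractor` and the function `extract_words` are made up for this example.

```python
from html.parser import HTMLParser
import re

class WordExtractor(HTMLParser):
    """Collects words from text nodes, skipping <script> and <style> bodies."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.words = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only text outside script/style elements is treated as content.
        if self._skip_depth == 0:
            self.words.extend(re.findall(r"[A-Za-z]+", data))

def extract_words(html):
    """Return the words found in the visible text of an HTML string."""
    parser = WordExtractor()
    parser.feed(html)
    parser.close()
    return parser.words

page = "<html><body><p>Hi <b>APPLE</b> bye</p><script>var x = 1;</script></body></html>"
print(extract_words(page))  # prints ['Hi', 'APPLE', 'bye']
```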

The CPU time spent in particular methods can be measured with CLR Profiler. Try this first, before you start implementing custom parsers.
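The "measure first" advice applies in any language. As a stand-in for a .NET profiler, the sketch below times a candidate extraction function with Python's `time.perf_counter`. The helper `time_calls` and the candidate `split_on_spaces` are hypothetical names used only for illustration.

```python
import time

def time_calls(func, arg, repeats=1000):
    """Call func(arg) `repeats` times; return (last result, avg seconds per call)."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    start = time.perf_counter()
    for _ in range(repeats):
        result = func(arg)
    elapsed = time.perf_counter() - start
    return result, elapsed / repeats

def split_on_spaces(text):
    """Candidate under test: the simplest possible word splitter."""
    return text.split()

words, seconds_per_call = time_calls(split_on_spaces, "Hi APPLE bye")
print(words, f"{seconds_per_call:.2e} s per call")
```

Timing each candidate on the same set of real documents gives a concrete basis for deciding whether a custom parser is worth writing.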

Author Comment

by:alexthecodepoet
ID: 16618472
Thank you.

Is IFilter the norm?
LVL 9

Expert Comment

by:pallosp
ID: 16618592
IFilter is a widely used standard interface. The manufacturers of all popular document types offer COM components that implement it.
Naturally, there are dedicated solutions for certain document types, especially HTML, but being able to handle HTML and .doc files in the same way is a bigger advantage than the small performance gain from using third-party components.
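The benefit of handling every document type in the same way can be illustrated with a small dispatch table. This is a hypothetical Python sketch of the design idea, not the IFilter API: the names `EXTRACTORS` and `extract_words_from_file` are invented, and the HTML extractor strips tags crudely for brevity.

```python
import os
import re

def _words(text):
    return re.findall(r"[A-Za-z]+", text)

def words_from_text(data):
    return _words(data.decode("utf-8", errors="replace"))

def words_from_html(data):
    # Crude tag stripping, for illustration only.
    text = re.sub(r"<[^>]*>", " ", data.decode("utf-8", errors="replace"))
    return _words(text)

# One extractor per file extension; calling code never needs to know which.
EXTRACTORS = {
    ".txt": words_from_text,
    ".html": words_from_html,
    ".htm": words_from_html,
}

def extract_words_from_file(filename, data):
    """Pick an extractor by extension and return the words in `data` (bytes)."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in EXTRACTORS:
        raise ValueError(f"no extractor registered for {ext!r}")
    return EXTRACTORS[ext](data)

print(extract_words_from_file("page.HTML", b"<p>Hi <b>APPLE</b></p>"))  # prints ['Hi', 'APPLE']
```

Supporting a new format means registering one more extractor; the code that looks up known words stays unchanged.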

Author Comment

by:alexthecodepoet
ID: 16618617
Thank you Pallosp
Question has a verified solution.