Oracle Text XML and .DOC


I am very new to XML, but I am familiar with HTML and SQL. I am designing a small application that will use Oracle Text to index and search documents for our department.

I am running into a roadblock from a design perspective. I started writing some documents in MS Word (.doc) format to eventually load into Oracle and index, but then I started thinking that XML is much more structured, would probably be indexed more efficiently, and, with the included metadata, would make search more powerful.

Am I correct in this assumption, and is it worth picking XML over any other format for Oracle Text?

You don't necessarily have to pick .doc over .xml or vice versa. From Office 2007 on, the default format is XML-based:

slightwv (Netminder) Commented:
I don't see a huge performance benefit either way when it comes to Oracle Text. Oracle Text works by parsing the words (it calls them tokens) out of each document and storing those in the index.

As far as searching goes, it's a draw between XML and Word docs. Queries run against the stored tokens; Oracle never accesses the documents directly.
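To illustrate the idea of indexing tokens and then searching only the index, here is a minimal Python sketch. It is not Oracle Text's implementation or API: the function names (tokenize, build_index, search), the sample file names, and the sample text are all hypothetical, and it assumes the text has already been extracted from each file.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, word):
    """Look the word up in the index only; the documents are not reopened."""
    return sorted(index.get(word.lower(), set()))


# Hypothetical text, assumed to be already extracted from a .doc and an .xml file.
docs = {
    "handbook.doc": "Vacation requests go to the department manager",
    "handbook.xml": "Vacation requests are approved by the manager",
}

index = build_index(docs)
print(search(index, "vacation"))
print(search(index, "approved"))
```

Once the index is built, the lookup costs the same no matter which file format the tokens came from, which is why the format mostly affects indexing time rather than search time.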

The only performance difference you may see is the time and overhead needed to index the different document types.

Another consideration: version compatibility. If you go with Word docs (not XML-based), you need to make sure the Word version is compatible with the Oracle Text filters, which extract the text from binary formats before the lexer tokenizes it. There is usually a lag between the latest Microsoft versions and Oracle's ability to parse them.
mjfigur (Author) Commented:
Thank you so much for your help. That makes perfect sense and will be a huge help in building this search.