A good solution for organizing and searching a lot of PDF and Docx files

Hello experts
We have a lot of DOCX (Word) and PDF files, mostly Unicode (UTF-8).
We need a structure to store, sort, and search these files (full-text and title search), like a big library database.
The solution should have a local and an online database that can be synced with each other.
I know a little about databases, and I can develop a Windows application that uses a database (such as Access).

How do you suggest implementing a good structure? We may later develop an Android app too, so the system should be able to handle different kinds of requests. It also needs to be able to connect to other services, such as an online shop.

Thank you in advance
Arman KhodabandeIT Manager and ConsultantAsked:
Shalom CarmelCTOCommented:
Try Solr as the underlying index and search engine:
http://lucene.apache.org/solr/

Then develop your own front end to query and display results.
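As a rough illustration of what such a front end would do, here is a minimal Python sketch that builds a query for Solr's /select handler and reads the matching documents from the JSON response. The core name "documents" and the field names "title" and "content" are assumptions made for the example, not something given in this thread; adjust them to your own schema.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumptions: Solr runs locally on its default port, and a core named
# "documents" exists with "title" and "content" fields (all hypothetical).
SOLR_BASE = "http://localhost:8983/solr/documents"


def build_select_url(text, rows=10, base=SOLR_BASE):
    """Build a URL for Solr's /select handler that searches title and content."""
    # Escape backslashes and double quotes so the text stays one quoted phrase.
    phrase = text.replace("\\", "\\\\").replace('"', '\\"')
    params = {
        "q": f'title:"{phrase}" OR content:"{phrase}"',
        "rows": rows,
        "wt": "json",
    }
    return f"{base}/select?{urlencode(params)}"


def parse_docs(raw_json):
    """Pull the list of matching documents out of a Solr JSON response."""
    return json.loads(raw_json)["response"]["docs"]


def search(text, rows=10):
    """Run the query against a live Solr server and return the matching docs."""
    with urlopen(build_select_url(text, rows)) as resp:
        return parse_docs(resp.read().decode("utf-8"))
```

A Windows client, a web page, or a later Android app could all send the same kind of HTTP request, which is one reason a search server is convenient here.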
Mohammed KhawajaManager - Infrastructure:  Information TechnologyCommented:
You could use Alfresco document management, which indexes PDF and DOCX files. You can get it from http://www.alfresco.com
gheistCommented:
On Windows, that would mean adding IFilters to extract the text on the machine where the files are stored.
Search indexes are not SQL databases.
Theo KouwenhovenApplication ConsultantCommented:
You need a full-text database. In the past Lotus Agenda did this, but the disadvantage is that a full-text DB can't be converted to other DBs.
Please also check these links:
https://technet.microsoft.com/nl-nl/library/ms189520(v=sql.105).aspx
https://technet.microsoft.com/nl-nl/library/ms187317(v=sql.105).aspx

Maybe it will help
ZberteocCommented:
FULLTEXT CATALOGs and FULLTEXT INDEXes have nothing to do with documents. They work on text columns of tables that reside inside the database. In this case I would definitely not load all those documents into a database; it is just a waste of resources. You need something that deals with files in the Windows file system, and software that specializes in that. If you really need a reference to those documents inside the database, their location (path) would be more than enough.
Sjef BosmanGroupware ConsultantCommented:
nociSoftware EngineerCommented:
While .DOCX files hold mainly text, which is easily scanned, .PDF files might consist of bitmap images that are not that easy to index. If Lucene/Solr is used for storage, this is needed first: https://www.snowtide.com/help/indexing-pdf-documents-with-lucene-and-pdfxstream (It will also need a bitmap-to-text scanning (OCR) tool to convert the text in the images.)

On a site I worked at, there were problems with these kinds of documents: crucial parts were graphics because of the way the documents were produced. We ended up merging the PDF documents with some added white-on-white text fields. A print won't show them, but a computerised scanner can index those fields because they stay text.
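To illustrate why DOCX is the easy case: a .docx file is a ZIP archive whose main text lives in word/document.xml, so the text can be pulled out with the Python standard library alone. The sketch below is only a minimal illustration (it ignores headers, footers, and footnotes), and make_sample_docx is a hypothetical helper that exists just to try the function without a real file. Image-only PDFs have no equivalent shortcut and need OCR.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part of a .docx file.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_text(source):
    """Return the plain text of a .docx file, one line per paragraph.

    `source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split over one or more <w:t> run elements.
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def make_sample_docx(paragraphs):
    """Hypothetical helper: build a tiny in-memory archive shaped like a .docx.

    The paragraph strings are not XML-escaped, so keep them free of < and &.
    """
    body = "".join(f"<w:p><w:r><w:t>{p}</w:t></w:r></w:p>" for p in paragraphs)
    xml = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)  # str is stored as UTF-8
    buffer.seek(0)
    return buffer
```

For example, docx_text(make_sample_docx(["Hello", "مرحبا"])) returns the two paragraphs separated by a newline, with the Arabic text intact.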
Arman KhodabandeIT Manager and ConsultantAuthor Commented:
Thank you all for your comments. I'm gonna study those solutions and see which one is better. I'll get back to you as soon as possible...
Arman KhodabandeIT Manager and ConsultantAuthor Commented:
Can any of the above solutions be used with mobile apps? (I mean a mobile app designed to connect to the server and search.)
Has anyone tried Sphinx?
Has anyone heard of Elasticsearch and Swiftype?

@shalomc
I'm currently comparing sphinx and solr ..
They both seem to be solid and perfect systems for doc management. And they're both free. That's perfect.
Do you have any experience in them?
Which is better in terms of security and ease of use?
In one of the articles about solr I read that it requires MySQL, Jetty and some more components.
Do I need to know Java, MySQL, etc., or will basic Linux and scripting knowledge do the trick?

@noci
I don't know about snowtide and pdfxstream. Are they plugins/add-ons for lucene/solr?
I hope they're free and can be used by solr. I'm getting more and more fascinated by solr.
Can you clarify the pros and cons of lucene/solr? What's the difference between them, if you have experience using them?

@Mohammed Khawaja
Alfresco is a paid solution, but the mobile app support including iOS and Android is good.
I may go for Community edition, if convinced enough.
Do you have any experience using their software?

@Zberteoc
Do you know a working method by experience?

@theo kouwenhoven
Lotus Agenda seems to be an old DOS solution!

@gheist
@Sjef Bosman
Local search is not intended.
Mohammed KhawajaManager - Infrastructure:  Information TechnologyCommented:
I used Alfresco years ago, and I know people who are still using it. At my work, we are a Microsoft shop and thus we use SharePoint with SurfRay Ontolica for indexing and searching.
ZberteocCommented:
Using FILESTREAM is a good compromise if you really want to "marry" an SQL database with outside documents. It makes a link between the database and the folders where the documents are but it doesn't really load the documents into the database:

http://social.technet.microsoft.com/wiki/contents/articles/9809.store-and-index-documents-in-sql-server-2012-an-end-to-end-walkthrough.aspx

Other than that you would simply create a network location for your files and then some tables that will quickly identify their name and location.
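The "some tables" idea can be sketched in a few lines. This example uses Python's built-in sqlite3 module purely for illustration (the thread is about SQL Server, and the table and column names here are made up): the files stay on the network share and the table stores only title, type, and path.

```python
import sqlite3
from pathlib import Path


def open_catalog(db_path=":memory:"):
    """Create (if needed) a table holding each document's title and location."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS documents (
               id    INTEGER PRIMARY KEY,
               title TEXT NOT NULL,
               kind  TEXT NOT NULL,        -- 'pdf' or 'docx'
               path  TEXT NOT NULL UNIQUE  -- location on the file share
           )"""
    )
    return conn


def register(conn, path):
    """Record one file; the file itself stays on disk, only its path is stored."""
    p = Path(path)
    conn.execute(
        "INSERT OR IGNORE INTO documents (title, kind, path) VALUES (?, ?, ?)",
        (p.stem, p.suffix.lstrip(".").lower(), str(p)),
    )
    conn.commit()


def find_by_title(conn, fragment):
    """Return (title, path) rows whose title contains the given fragment."""
    cur = conn.execute(
        "SELECT title, path FROM documents WHERE title LIKE ? ORDER BY title",
        (f"%{fragment}%",),
    )
    return cur.fetchall()
```

This covers title search only; searching inside the documents still needs a full-text index of some kind.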
Arman KhodabandeIT Manager and ConsultantAuthor Commented:
Still waiting for other experts input!

@Mohammed Khawaja
Was there any reason to set Alfresco aside?
You seem to be from an Arab country, and that's perfect. Can that software handle UTF-8 Word and PDF files? How about Arabic?

@Zberteoc
Thank you that was very valuable information. Learned a lot.
gheistCommented:
The biggest problem with Arabic was (and is) rendering; the whole alphabet has been in Unicode since very old releases. The rendering problem is pretty minor compared to rendering Korean script intermixed with Latin letters...

Why put Alfresco aside? Because Microsoft promises reduced prices and sponsored certifications if you sink into their products.
Mohammed KhawajaManager - Infrastructure:  Information TechnologyCommented:
Where I work, the company has decided to reduce the number of partners it keeps. We have a few partners (i.e. SAP for ERP, Windows for OS, SQL for DB, SharePoint for collaboration, Kaspersky for anti-virus, etc.).
Shalom CarmelCTOCommented:
I really don't know anything about sphinx, neither good nor bad.
solr has been around for a longer time, and has proven itself over and over again. One advantage solr has over sphinx is that it has hundreds of solid integrations. For example, indexing and searching of office files and PDFs is built into solr, whereas with sphinx you get a code sample to work from.

If I were to develop a web site or product that needs custom searching, I would consider the search engine to be part of the product r&d, and would seriously consider one or the other on the merit of whichever features are important for the product.

However, if I were in your position and had to implement what seems to be an IT solution of the "fire and forget" sort, I would be looking for something that does not require extensive coding during installation and post installation.

If I had the money I would take a commercial tool like Alfresco, and if I did not have the budget for that I would take solr. It will require less maintenance work than sphinx.
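For a sense of what the built-in office/PDF indexing looks like in practice, here is a hedged Python sketch that posts a file to Solr's extracting handler (/update/extract). It assumes the extraction handler is enabled for the core, that the core is called "documents", and that the file name is acceptable as the document id; all three are assumptions for illustration, not details from this thread.

```python
from pathlib import Path
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical core name on a local Solr with the extraction handler enabled.
SOLR_BASE = "http://localhost:8983/solr/documents"

CONTENT_TYPES = {
    ".pdf": "application/pdf",
    ".docx": "application/vnd.openxmlformats-officedocument"
             ".wordprocessingml.document",
}


def build_extract_request(path, data, base=SOLR_BASE):
    """Build a POST request that sends raw file bytes to /update/extract."""
    p = Path(path)
    # literal.id sets the document id; commit=true makes it searchable at once.
    params = urlencode({"literal.id": p.name, "commit": "true"})
    return Request(
        f"{base}/update/extract?{params}",
        data=data,
        headers={"Content-Type": CONTENT_TYPES[p.suffix.lower()]},
        method="POST",
    )


def index_file(path):
    """Send one PDF or DOCX file to a live Solr server; returns the HTTP status."""
    with open(path, "rb") as fh:
        req = build_extract_request(path, fh.read())
    with urlopen(req) as resp:
        return resp.status
```

Committing after every file is simple but slow; for a large library it would be more sensible to commit once per batch.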

nociSoftware EngineerCommented:
Alfresco still has a community edition, you need a candle to find it, but it is still there.
The community edition is free.

https://www.alfresco.com/alfresco-community-download
Arman KhodabandeIT Manager and ConsultantAuthor Commented:
Thank you all

@shalomc
Thanks for your comprehensive explanation.

@noci
Yeah, I even saw their comparison chart. I should try it out sometime.