Arman Khodabande asked:

A good solution for organizing and searching a lot of PDF and Docx files

Hello experts
We have a lot of DOCX (Word documents) and PDF files (mostly Unicode, UTF-8).
We need to build a structure to hold, sort and search these files (full-text and title search), like a big library database.
The solution needs a local and an online database that can be kept in sync.
I know a little about databases, and I can develop a Windows application that uses a database (such as an Access database).

How would you suggest implementing a good structure? We may develop an Android app later too, so this system should be able to handle different kinds of requests. It also needs to be able to connect to other services, such as an online shop.

Thank you in advance
Shalom Carmel

Try Solr as the underlying index and search engine:
http://lucene.apache.org/solr/

Then develop your own front end to query and display results.
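
To make the "own front end" idea a bit more concrete, here is a minimal Python sketch of how a client might query Solr over HTTP. It assumes a Solr core named `library` on the default port 8983 and a schema with `id`, `title`, `text` and `path` fields; those names are made up for illustration and do not come from this thread. The `/select` handler and the `q`, `defType`, `qf`, `fl`, `rows`, `start` and `wt` parameters are standard Solr query parameters.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_select_url(base_url, core, user_query, rows=10, start=0):
    """Build a Solr /select URL that searches title and body text."""
    params = {
        "q": user_query,
        "defType": "edismax",     # query parser that accepts plain user input
        "qf": "title^2 text",     # hypothetical fields; title matches weigh double
        "fl": "id,title,path",    # hypothetical stored fields to return
        "rows": rows,
        "start": start,
        "wt": "json",
    }
    return f"{base_url.rstrip('/')}/solr/{core}/select?{urlencode(params)}"


def parse_results(body):
    """Return (total_hits, docs) from a Solr JSON response body."""
    response = json.loads(body)["response"]
    return response["numFound"], response["docs"]


def search(base_url, core, user_query, **kwargs):
    """Run the query against a live Solr server (needs network access)."""
    with urlopen(build_select_url(base_url, core, user_query, **kwargs)) as resp:
        return parse_results(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Only prints the URL; call search(...) once a server is actually running.
    print(build_select_url("http://localhost:8983", "library", "annual report"))
```

Because this is plain HTTP and JSON, a Windows application and a later Android app could send the same kind of request, ideally through a small server-side layer rather than exposing Solr directly to the internet.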
You could use Alfresco document management, which does index PDF and DOCX files. You can get it from http://www.alfresco.com
On Windows, that would mean adding IFilters to extract text on the machine where the files are stored.
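
Whatever does the extraction, the step itself is simple in principle: pull plain text out of the file so an indexer can see it. The sketch below is not an IFilter; it is a different, standard-library-only Python illustration that reads the main body of a DOCX file, which is a ZIP archive containing `word/document.xml`. It ignores headers, footers, footnotes and tables-within-text-boxes subtleties, and PDF files would need a dedicated library that is not shown here.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(source):
    """Return the body text of a DOCX file, one line per paragraph.

    `source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)
```

The XML inside a DOCX file is normally UTF-8, so Arabic or Persian text comes out as ordinary Python strings without extra handling.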
Search indexes are not SQL databases.
You need a full-text database. In the past Lotus Agenda did this, but the disadvantage is that a full-text database can't be converted to other databases.
Please also check these links:
https://technet.microsoft.com/nl-nl/library/ms189520(v=sql.105).aspx
https://technet.microsoft.com/nl-nl/library/ms187317(v=sql.105).aspx

Maybe it will help
FULLTEXT CATALOGs and FULLTEXT INDEXes have nothing to do with document files. They work on text columns of tables that reside inside the database. In this case I would definitely not load all those documents into a database; it is just a waste of resources. You need software that specializes in dealing with files in the Windows file system. If you really need a reference to those documents inside the database, their location (path) would be more than enough.
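
As a rough illustration of the "store only the path" idea, here is a small Python sketch using SQLite. The table layout, column names and example path are assumptions made up for this example; the files themselves stay on disk and the database only holds a title and a location for each one.

```python
import sqlite3


def open_catalog(db_path=":memory:"):
    """Open (or create) a catalog that stores document metadata only."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS documents (
            id        INTEGER PRIMARY KEY,
            title     TEXT NOT NULL,
            path      TEXT NOT NULL UNIQUE,  -- location on disk, not the file content
            file_type TEXT NOT NULL          -- e.g. 'pdf' or 'docx'
        )
        """
    )
    return conn


def add_document(conn, title, path):
    """Register a file by title and path; returns the new row id."""
    file_type = path.rsplit(".", 1)[-1].lower()
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO documents (title, path, file_type) VALUES (?, ?, ?)",
            (title, path, file_type),
        )
    return cur.lastrowid


def find_by_title(conn, fragment):
    """Return (title, path) pairs whose title contains the fragment."""
    rows = conn.execute(
        "SELECT title, path FROM documents WHERE title LIKE ? ORDER BY title",
        (f"%{fragment}%",),
    )
    return rows.fetchall()
```

Title lookups can go through a table like this, while searching inside the document text is left to a dedicated search index.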
SOLUTION
noci

This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Arman Khodabande (ASKER)

Thank you all for your comments. I'm going to study those solutions and see which one is better. I'll get back to you as soon as possible...
Can any of the above solutions be used to serve mobile apps? (I mean, can a mobile app be designed to connect to the server and search?)
Has anyone tried Sphinx?
Has anyone heard of Elasticsearch and Swiftype?

@shalomc
I'm currently comparing Sphinx and Solr.
They both seem to be solid and perfect systems for doc management. And they're both free. That's perfect.
Do you have any experience with them?
Which is better in terms of security and ease of use?
In one article about Solr I read that it requires MySQL, Jetty and some other components.
Do I need to know Java, MySQL, etc., or will basic Linux and scripting knowledge do the trick?

@noci
I don't know about Snowtide and PDFxStream. Are they plugins/add-ons for Lucene/Solr?
I hope they're free and can be used with Solr. I'm getting more and more fascinated by Solr.
Can you clarify the pros and cons of Lucene and Solr? What is the difference between them, if you have experience using them?

@Mohammed Khawaja
Alfresco is a paid solution, but the mobile app support including iOS and Android is good.
I may go for Community edition, if convinced enough.
Do you have any experience using their software?

@Zberteoc
Do you know of a method that works, from your own experience?

@theo kouwenhoven
Lotus Agenda seems to be an old DOS solution!

@gheist
@Sjef Bosman
Local search is not intended.
SOLUTION
SOLUTION
Still waiting for other experts' input!

@Mohammed Khawaja
Was there any reason to set Alfresco aside?
You seem to be from an Arab country, which is perfect. Can that software handle UTF-8 Word and PDF files? How about Arabic?

@Zberteoc
Thank you that was very valuable information. Learned a lot.
The biggest problem with Arabic was (and is) rendering; the whole alphabet has been in Unicode since very early releases. The rendering problem is pretty minor compared to rendering Korean script intermixed with Latin letters...

Why put Alfresco aside? Because Microsoft promises reduced prices and sponsored certifications if you sink into their products.
Where I work, the company has decided to reduce the number of partners it keeps.  We have few partners (i.e. SAP for ERP, Windows for OS, SQL for DB, SharePoint for collaboration, Kaspersky for anti-virus, etc.).
ASKER CERTIFIED SOLUTION
noci

Alfresco still has a community edition; you need a candle to find it, but it is still there.
The community edition is free.

https://www.alfresco.com/alfresco-community-download
Thank you all

@shalomc
Thanks for your comprehensive explanation.

@noci
Yeah, I even saw their comparison chart. I should try it out sometime.