asked on

How to develop online dictionary

I have a small translation agency and have accumulated various bilingual glossaries over the past few years (about 1 million records; German<>English, Spanish<>English). Now I want to make them available online on my website as a gadget to attracked more customers. I really like the results output to be similar in appearance to leo.org or dict.cc (they display results in different categories and allow fuzzy matching) We also want to restrict the maximum number of outputs (e.g. to 100; esp. when running a wildcard query) and need to take care of some security issues (crawlers that copy the content of our database, which is proprietory; about 1 million records).

I think at least one of the above two sites uses a weblimpse script (webglimpse.net) to run a fuzzy match query on the search term (the vocabulary) that is entered. Some info is given here (http://www.utils.ex.ac.uk/german/dict/), but I don't know how to handle this information (too general).
I also don't know how to tackle the results part and the security issues.

Any feedback on this is appreciated.
Thank you in advance!

Regards,
Ray

SOLUTION

ahoffmann

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

RayTX

ASKER

Thanks for your comment on this!!

Yes, I've been using leo for my daily work since 1998 or '99, and - as far as it concerns speed and user-friendlyness - the site is my personal favorite.

It is not my primary objective to create something faster than leo (wouldn't mind of course), and scaleability also would only be of secondary importance, since we try to target a rather small amount of people (visitors of our site who can be turned into new prospects and clients).

Sorry about the strange 'security' question: what I meant are security holes that can allow remote users to read content of files stored on my server (I'm planning to add the future solution to my current 1&1 webhosting service).

Example: the guy who runs dict.cc complains about (apparently successful) attacks on his site Some folks copied the whole content of his database. Others accused him of having done the same.
It took me more than eight years to accumulate the data that I am talking about, and I'm willing to allow small peaks (20 to 50 records per query), but don't want to give the whole stuff away - as a result of a deliberate attack - in a matter minutes or hours.

I am not well versed in the field of web development. I hope that your and the other experts's feedback can provide me with some sort of knowledge base when it comes to hiring a programmer for this (I definitely won't be able to do it).
The offers I received prior to posting my question on this site differed quite a lot ( between $750 and $10,000).

SOLUTION

ahoffmann

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

ASKER CERTIFIED SOLUTION

mrcoffee365

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

SOLUTION

ahoffmann

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

RayTX

ASKER

(sorry for the late comment - had to leave town for a while)

First of all many thanks to ahoffmann and mrcoffee365.
Your comments have been very helpful to me (and hopefully also for some future readers of this blog).

Since I don't want to spend a fortune on this dictionary tool, I will probably go for a db setup and queries, depending on how much flexibility a db query accepts and if db queries allow for fuzzy searches (like Glimpse and Lucene do).

As for the security issues I will probably use a scaled-down version of my dictionary database for unregistered users and will limit the wildcard queries to 20 to 50 results (will have to do some testing on that). No IP blocking, due to the trade-offs that ahoffmann mentioned in his/her 2-Feb-2008 comment.

Registered users will have access to my extended database (that includes all the good stuff, which I don't want to be 'stolen'). They will only be allowed to run a fixed number of queries per time intervall to prevent stealing of the glossaries either manually or attacks through scripts that could use the users login information. For heavy users CAPTCHAs could be integrated (depening on how complex this would be to implement).

The idea of adding key phrases to the content seems simple and effective when it comes to prooving the proprietary of the content.

Last but not least I would like to thank ahoffmann and mrcoffee365 again for their input.
I very much benefited from both party's comments and would like to split the amount of expert points 50-50.

Regards,
Ray

RayTX

ASKER

My general knowledge of the technological issues that were discussed in this thread is not sufficient to grade the first two questions. However, since an answer is required I marked them "Yes". "n.a." (not applicable) would have been a more appropriate.

mrcoffee365

Thanks. Good luck with your project.