• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 384

How to develop an online dictionary

I have a small translation agency and have accumulated various bilingual glossaries over the past few years (about 1 million records; German<>English, Spanish<>English). Now I want to make them available online on my website as a gadget to attract more customers. I would really like the results output to be similar in appearance to leo.org or dict.cc (they display results in different categories and allow fuzzy matching). We also want to restrict the maximum number of results (e.g. to 100, especially when running a wildcard query) and need to take care of some security issues (crawlers that copy the content of our database, which is proprietary).

I think at least one of the above two sites uses a Webglimpse script (webglimpse.net) to run a fuzzy-match query on the search term (the vocabulary entry) that is entered. Some info is given here (http://www.utils.ex.ac.uk/german/dict/), but I don't know how to handle this information (too general).
I also don't know how to tackle the results part and the security issues.

Any feedback on this is appreciated.
Thank you in advance!

Regards,
Ray
RayTX Asked:
4 Solutions
 
ahoffmann Commented:
Fuzzy search in natural language is a really sophisticated task. Are you aware that leo.org has been online for more than 10 years and is based on earlier studies and projects at LMU? Are you really trying to make this work yourself in a couple of months? I doubt it.

That said, I'll focus on your other questions:
  - .. want to restrict the maximum number of outputs
    well, this is a simple task and just depends on the database you use and your programming skills (see the sketch after this list)

  - .. take care of some security issues
    this is a strange question: a crawler is not a security issue, or do you mean property issues (like copyright, trademark)?
    For the latter you had better consult a lawyer, as it is more a social/legal problem than a technical one. Technically it is still (nearly) impossible to make your content available to everyone except crawlers. The only reliable solution is to ask for a password for each query; anything else can somehow be broken by automated scripts.

  - .. above two sites uses a Webglimpse script ..
    I doubt it; see http://dict.leo.org/about.html

  - .. security issues.
    if you mean web application security, such as vulnerabilities and threats, let me know where you need help
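
Coming back to the point about restricting the maximum number of outputs: here is a minimal sketch of what that can look like at the database level. Everything specific in it (PHP with PDO/MySQL, the table name glossary, the columns source_term and target_term, the connection details) is a made-up assumption for illustration, not something from this thread.

<?php
// Cap every query at 100 rows, no matter how broad the search term is.
// Host, credentials, table and column names are placeholders.
$term = isset($_GET['q']) ? $_GET['q'] : '';
$pdo  = new PDO('mysql:host=localhost;dbname=dict;charset=utf8', 'dict_user', 'secret');
$stmt = $pdo->prepare(
    'SELECT source_term, target_term
       FROM glossary
      WHERE source_term LIKE :pattern
      LIMIT 100'
);
$stmt->execute(array(':pattern' => $term . '%'));   // simple prefix match, just for the sketch
foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
    echo htmlspecialchars($row['source_term'] . ' = ' . $row['target_term']) . "<br>\n";
}

The LIMIT clause is the whole trick: however broad the wildcard, the database never returns more than the cap.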
 
RayTX (Author) Commented:
Thanks for your comment on this!!

Yes, I've been using leo for my daily work since 1998 or '99, and - as far as speed and user-friendliness are concerned - the site is my personal favorite.

It is not my primary objective to create something faster than leo (though I wouldn't mind, of course), and scalability would also only be of secondary importance, since we are targeting a rather small number of people (visitors to our site who can be turned into new prospects and clients).

Sorry about the strange 'security' question: what I meant were security holes that could allow remote users to read the content of files stored on my server (I'm planning to add the future solution to my current 1&1 web hosting service).

Example: the guy who runs dict.cc complains about (apparently successful) attacks on his site. Some folks copied the whole content of his database. Others accused him of having done the same.
It took me more than eight years to accumulate the data that I am talking about, and I'm willing to allow small peeks (20 to 50 records per query), but I don't want to give the whole thing away - as the result of a deliberate attack - in a matter of minutes or hours.

I am not well versed in the field of web development. I hope that your feedback and that of the other experts can give me some sort of knowledge base for hiring a programmer for this (I definitely won't be able to do it myself).
The offers I received prior to posting my question on this site differed quite a lot (between $750 and $10,000).
 
ahoffmann Commented:
> .. are security holes that can allow remote users to read content of files stored on my server
ok, then we're in the business of web application security ;-)
I'd first recommend getting web space that is *not* a name-based virtual host, but one with a dedicated IP.
Then you need a hardened web server. After that, you'd best make your programs/scripts (cgi-bin or whatever) secure by design, meaning that you make security part of the design rather than trying to bolt it on at the end, right before you go live.
Web application security is rarely understood by programmers, administrators and web hosting providers/ISPs; I'd recommend hiring a professional service here, which is very expensive (I guess starting somewhere around €150 per hour).
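
To make "secure by design" at the script level a bit more concrete, here is a minimal sketch of two habits worth building in from day one: validating input before it touches anything, and passing it to the database only as a bound parameter. The table name, column names, length limit and connection details are made-up assumptions, and this is nowhere near a complete hardening checklist.

<?php
// 1) Validate/normalise user input before using it anywhere.
// 2) Hand it to the database only as a bound parameter, never by string concatenation.
$term = isset($_GET['q']) ? trim($_GET['q']) : '';

// Reject obviously unusable input early (empty queries, absurd lengths, ...).
if ($term === '' || strlen($term) > 64) {
    header('HTTP/1.1 400 Bad Request');
    exit('Invalid search term.');
}

$pdo  = new PDO('mysql:host=localhost;dbname=dict;charset=utf8', 'dict_user', 'secret');
$stmt = $pdo->prepare('SELECT source_term, target_term FROM glossary WHERE source_term = :term LIMIT 50');
$stmt->execute(array(':term' => $term));   // bound parameter: no SQL injection via the search box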

> I am not well versed in the field of web development.
No offence at all, but then you're prone to falling into every trap around web application vulnerabilities. If you really care about your data, familiarize yourself with the threats first. Keep in mind that web application security covers everything starting at the network level, affects the operating system, obviously the web server (software) and its scripts, as well as some logical attacks based on the logic of your program flow (including authentication/authorisation/permissions).


 
mrcoffee365 Commented:
Very interesting question.  ahoffmann has made excellent comments.

After looking at www.leo.org, it looks as if they decided to build their search in-house.  However, had I not seen that page, I would have assumed that they used a standard text search engine, possibly even Lucene, which is free.  Their translations are simple word-based offerings, which is easily done with a text search engine, or even a plain db search.

Still, they've done a nice job, with an interesting display for results.

You don't want to offer a dictionary or translation software product for sale, right?  So your goals with your Web site are a little different.  And I think you're right to worry about someone downloading your glossaries -- unscrupulous people seem to do that sort of thing regularly.

So you want a Web app for submitting queries to your glossaries in a fun tools area on your site, and you want some controls to prevent your glossaries from being stolen.

You could certainly do something similar to what leo.org has done, and provide a text search engine lookup to your glossary.  However, just to show a few phrases, I don't think it's necessary.  A more simple db lookup would be fine.  The reason I suggest looking at simpler lookups is that it is more development time and expertise to set up a text search than a db query.

The Web form query and display is something any reasonably competent Web developer should be able to do.  Db setup and queries are more difficult.  Text search setup and applications are much more difficult.

In terms of controls to prevent your glossaries from being stolen:
1)  Having the results displayed only in response to a query prevents normal crawling from stealing your glossaries.  Crawling only works through Web links.
2)  Limiting wildcards in your queries (or the results you will return -- to, say, 10 or 20) will help against an individual stealing your glossary by hand.
3)  Automated data stealing might be of greater concern.  Someone with a program and a Web scraper could submit a query for each word in their own dictionary and collect your results, automatically.  To prevent this, you could do what Google does -- only allow a limited number of requests in a time period from the same IP address.  For example, only 5 requests within a 1 or 2 minute period.  Or only allow 20 queries total from any IP address in a 24 hour period.
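
As one illustration of the throttle described in point 3, here is a minimal sketch. The limits (5 queries per 2 minutes), the query_log table and the PHP/PDO/MySQL stack are all assumptions chosen for the example; any server-side language with a database could do the same.

<?php
// Allow at most 5 queries per IP address within a 2-minute window.
// Assumes a table like: CREATE TABLE query_log (ip VARCHAR(45), queried_at DATETIME)
$pdo = new PDO('mysql:host=localhost;dbname=dict;charset=utf8', 'dict_user', 'secret');
$ip  = $_SERVER['REMOTE_ADDR'];          // note: behind a proxy this may need extra care

$count = $pdo->prepare(
    'SELECT COUNT(*) FROM query_log
      WHERE ip = :ip AND queried_at > NOW() - INTERVAL 2 MINUTE'
);
$count->execute(array(':ip' => $ip));

if ((int)$count->fetchColumn() >= 5) {
    exit('Query limit reached, please try again in a few minutes.');
}

$log = $pdo->prepare('INSERT INTO query_log (ip, queried_at) VALUES (:ip, NOW())');
$log->execute(array(':ip' => $ip));
// ... then run the actual dictionary lookup ...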

All of the protections from automated stealing require custom programming.  There are a few Web servers I've seen that will offer some limits on access by IP address, and there are some expensive routers that have that, too, so that could augment your programming.

As ahoffmann said, your final recourse might be suing someone, which would mean figuring out how to identify your results as yours.  In the cases of content theft that I've seen, the automatic content thieves have left the content just as it is, so if you include a key phrase in all of your answers, it's quite possible that the phrase would identify your content as yours.  Of course, then you have to figure out how to get them to comply with a cease and desist order which might only work in North America, Western Europe, and possibly Australia and Japan.  But there are plenty of content thieves in those countries as well, so it might be worth it.

With all of this, I think you're looking at a minimum of the higher end of the proposals you received for developing your site.
 
ahoffmann Commented:
to follow up/comment on mrcoffee365:

> .. Crawling only works through Web links.
that's true for search engines' crawlers, but here we're talking about manually crafted crawlers built to attack/penetrate the specific site. In this case you have to implement special protections. They need not be 101% bulletproof; it may be enough if they just raise the bar.

> .. 2)  Limiting wildcards in your queries (or the results you will return -- to, say, 10 or 20) will help against an individual stealing your glossary by hand.
I guess not; see my previous comment about crafted queries.
It's simple to write a little script (3-10 lines of shell code) that performs continuous page queries to gather all the data behind a specific query. Some crawlers (for webapp testing, fuzzing) are even freely available which can do that out of the box.

> 3) ..
I agree with the limiting approach in general, but in the details you run into a couple of problems. Restriction by IP can simply be circumvented by using anonymous proxies.
And once the IP bar has been raised that way, the other burst limits (10 queries per minute) will no longer match.
CAPTCHAs are a reasonable protection against this, at least as long as the current efforts to break them automatically don't improve much. And you still have the disadvantage that the user has to key in something extra.

Currently the best-known protection against such crawlers and data theft is some kind of authentication, which requires user intervention (registration, password) or a PKI (client certificates). That's the high-end protection.

Regarding mrcoffee365's suggestion about using a db, I'd also use a db, let's say MySQL, Postgres, ... That's sufficient for some millions of records, and you get simple wildcard searches. The performance should also not be too bad for your purpose. Most web languages (java, perl, php, tcl, ...) have APIs for it, so you don't need to reinvent the wheel.
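
To illustrate what such a simple wildcard search might look like with one of those APIs, here is a minimal sketch using PHP/PDO against MySQL. Exposing * to the user and mapping it to SQL's %, as well as the table and column names, are assumptions for the example only, not anything the sites above actually do.

<?php
// Map the user's "*" wildcard onto SQL LIKE syntax, escaping the LIKE
// metacharacters "%" and "_" that the user may have typed literally.
function toLikePattern($term) {
    $escaped = addcslashes($term, '%_\\');     // neutralise literal %, _ and \
    return str_replace('*', '%', $escaped);    // user "*" becomes SQL "%"
}

$term = isset($_GET['q']) ? $_GET['q'] : '';
$pdo  = new PDO('mysql:host=localhost;dbname=dict;charset=utf8', 'dict_user', 'secret');
$stmt = $pdo->prepare(
    'SELECT source_term, target_term FROM glossary
      WHERE source_term LIKE :pattern
      LIMIT 50'
);
$stmt->execute(array(':pattern' => toLikePattern($term)));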
 
RayTX (Author) Commented:
(sorry for the late comment - had to leave town for a while)

First of all many thanks to ahoffmann and mrcoffee365.
Your comments have been very helpful to me (and hopefully will also be to some future readers of this thread).

Since I don't want to spend a fortune on this dictionary tool, I will probably go for a db setup and queries, depending on how much flexibility a db query offers and whether db queries allow for fuzzy searches (like Glimpse and Lucene do).
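
Plain db queries don't do real fuzzy matching the way Glimpse or Lucene do, but they can approximate it. A minimal sketch of one such approximation, again with made-up table/column names and PHP/PDO/MySQL as the assumed stack: pre-select phonetically similar candidates with SOUNDEX, then rank them by edit distance in the script. Note that SOUNDEX is tuned for English and will be much weaker for German or Spanish terms.

<?php
// Rough fuzzy lookup: SOUNDEX narrows the candidates, levenshtein() ranks them.
$term = isset($_GET['q']) ? trim($_GET['q']) : '';
$pdo  = new PDO('mysql:host=localhost;dbname=dict;charset=utf8', 'dict_user', 'secret');

$stmt = $pdo->prepare(
    'SELECT source_term, target_term FROM glossary
      WHERE SOUNDEX(source_term) = SOUNDEX(:term)
      LIMIT 200'
);
$stmt->execute(array(':term' => $term));
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

// Sort the candidates by how close they are to the typed term (smaller distance = better).
$distances = array();
foreach ($rows as $i => $row) {
    $distances[$i] = levenshtein($term, $row['source_term']);
}
asort($distances);
$results = array();
foreach (array_slice(array_keys($distances), 0, 20) as $i) {
    $results[] = $rows[$i];               // keep only the 20 closest matches
}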

As for the security issues, I will probably use a scaled-down version of my dictionary database for unregistered users and will limit wildcard queries to 20 to 50 results (will have to do some testing on that). No IP blocking, due to the trade-offs that ahoffmann mentioned in his/her 2-Feb-2008 comment.

Registered users will have access to my extended database (which includes all the good stuff that I don't want to be 'stolen'). They will only be allowed to run a fixed number of queries per time interval, to prevent the glossaries from being stolen either manually or through scripted attacks that use a user's login information. For heavy users, CAPTCHAs could be integrated (depending on how complex this would be to implement).

The idea of adding key phrases to the content seems simple and effective when it comes to proving ownership of the content.

Last but not least I would like to thank ahoffmann and mrcoffee365 again for their input.
I very much benefited from both parties' comments and would like to split the expert points 50-50.

Regards,
Ray
 
RayTX (Author) Commented:
My general knowledge of the technological issues that were discussed in this thread is not sufficient to grade the first two questions. However, since an answer is required, I marked them "Yes". "n.a." (not applicable) would have been more appropriate.
 
mrcoffee365 Commented:
Thanks.  Good luck with your project.
