Link to home
Start Free TrialLog in
Avatar of lassetyr
lassetyr

asked on

Similarity search - PL/SQL

Hi all!
I am developing a database containg hundreds of thousands of rows - will increase to over a million within the year.
They each have a 'title'-value which is possible to perform a search on. However, I am looking for a way to provide a search that is not an exact match (either on the whole field or parts of the field), but som sort of query that returns all rows that are similar. This is to allow for users to write the search criteria like they think it is written (for instance a search for 'misisipi' should return 'Mississippi'). The title-field is a varchar2 column.

I have installed the Intermedia Context-option (on Oracle 8.1.7), but I can't seem to make it work properly. As far as I have understood, this is supposed to be what I am looking for.

I have also tried to use a procedure that calculates the number of characters that must be altered to make a match (the Levenshtein Distance) - details can be found at http://www.merriampark.com/ld.htm , which is excactly what I want, but it is far too slow in my case due to the amount of data.

Any ideas?

Help will be greatly appreciated!!!
Avatar of venkotch
venkotch

Probably these examples may help ...

SELECT resume
  FROM emp WHERE CONTAINS (resume, 'DBA') > 0;

-- Order results by fuzzy score (range: 0 to 80)
SELECT SCORE(1), empname FROM emp
 WHERE CONTAINS(resume, 'dog', 1 ) > 0
 ORDER BY SCORE(1) DESC;

-- Use ABOUT to narrow things down
SELECT empname, resume FROM emp
 WHERE  CONTAINS (resume, 'about(EJB)', 1) > 0;

-- $ = Stem searching (eg, $sing: sing, sang, sung)
SELECT empname, resume FROM emp
 WHERE  CONTAINS (resume, '$consultant', 1) > 0;

-- ? = Fuzzy matching (dog, doug, dojo)
SELECT empname, resume FROM emp
 WHERE  CONTAINS (resume, '?developer', 1) > 0;

I've got them from
http://www.orafaq.org/faqctx.htm

Hi lassetyr,

You need not worry about the Levenshtein distance algorithm and the implementation of similar stuff. Use the advance search facility in Oracle; use fuzzy, stem ... etc. Oracle will do it for you.

For more details check out :

http://metalink.oracle.com/help/usaeng/Search/search.html


~Sanal
Hi lassetyr,
I think you did most of the work. Try partition your table as if a dictionary if you can.
use the fuzzy search query operator on your CONTEXT type
index.  

select score(1), title, id
  from all_titles /* don't know your table name */
 where contains(title,
                '?misisipi',  
                1)
 order by score(1) desc            
Avatar of lassetyr

ASKER

The problem is that the fuzzy-search included in Oracle is for English language text only (please tell me if I am wrong) - most of the title-fields are in Norwegian, and thus the fuzzy query returns unnatural results due to the linguistic differences.

Am I wrong? Is the fuzzy search available for several languages?

The reason I started on the Levenshtein distance algorithm is that it is language independent - it only looks at the separate characters in the strings...

Found a solution to provide support several languages on http://ccm.redhat.com/bboard-archive/webdb/000cQl.html

Thanks for the help anyway...
ASKER CERTIFIED SOLUTION
Avatar of SpideyMod
SpideyMod

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial