lassetyr
asked on
Similarity search - PL/SQL
Hi all!
I am developing a database containg hundreds of thousands of rows - will increase to over a million within the year.
They each have a 'title'-value which is possible to perform a search on. However, I am looking for a way to provide a search that is not an exact match (either on the whole field or parts of the field), but som sort of query that returns all rows that are similar. This is to allow for users to write the search criteria like they think it is written (for instance a search for 'misisipi' should return 'Mississippi'). The title-field is a varchar2 column.
I have installed the Intermedia Context-option (on Oracle 8.1.7), but I can't seem to make it work properly. As far as I have understood, this is supposed to be what I am looking for.
I have also tried to use a procedure that calculates the number of characters that must be altered to make a match (the Levenshtein Distance) - details can be found at http://www.merriampark.com/ld.htm , which is excactly what I want, but it is far too slow in my case due to the amount of data.
Any ideas?
Help will be greatly appreciated!!!
I am developing a database containg hundreds of thousands of rows - will increase to over a million within the year.
They each have a 'title'-value which is possible to perform a search on. However, I am looking for a way to provide a search that is not an exact match (either on the whole field or parts of the field), but som sort of query that returns all rows that are similar. This is to allow for users to write the search criteria like they think it is written (for instance a search for 'misisipi' should return 'Mississippi'). The title-field is a varchar2 column.
I have installed the Intermedia Context-option (on Oracle 8.1.7), but I can't seem to make it work properly. As far as I have understood, this is supposed to be what I am looking for.
I have also tried to use a procedure that calculates the number of characters that must be altered to make a match (the Levenshtein Distance) - details can be found at http://www.merriampark.com/ld.htm , which is excactly what I want, but it is far too slow in my case due to the amount of data.
Any ideas?
Help will be greatly appreciated!!!
Hi lassetyr,
You need not worry about the Levenshtein distance algorithm and the implementation of similar stuff. Use the advance search facility in Oracle; use fuzzy, stem ... etc. Oracle will do it for you.
For more details check out :
http://metalink.oracle.com/help/usaeng/Search/search.html
~Sanal
Hi lassetyr,
I think you did most of the work. Try partition your table as if a dictionary if you can.
I think you did most of the work. Try partition your table as if a dictionary if you can.
use the fuzzy search query operator on your CONTEXT type
index.
select score(1), title, id
from all_titles /* don't know your table name */
where contains(title,
'?misisipi',
1)
order by score(1) desc
index.
select score(1), title, id
from all_titles /* don't know your table name */
where contains(title,
'?misisipi',
1)
order by score(1) desc
ASKER
The problem is that the fuzzy-search included in Oracle is for English language text only (please tell me if I am wrong) - most of the title-fields are in Norwegian, and thus the fuzzy query returns unnatural results due to the linguistic differences.
Am I wrong? Is the fuzzy search available for several languages?
The reason I started on the Levenshtein distance algorithm is that it is language independent - it only looks at the separate characters in the strings...
Am I wrong? Is the fuzzy search available for several languages?
The reason I started on the Levenshtein distance algorithm is that it is language independent - it only looks at the separate characters in the strings...
ASKER
Found a solution to provide support several languages on http://ccm.redhat.com/bboard-archive/webdb/000cQl.html
Thanks for the help anyway...
Thanks for the help anyway...
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
SELECT resume
FROM emp WHERE CONTAINS (resume, 'DBA') > 0;
-- Order results by fuzzy score (range: 0 to 80)
SELECT SCORE(1), empname FROM emp
WHERE CONTAINS(resume, 'dog', 1 ) > 0
ORDER BY SCORE(1) DESC;
-- Use ABOUT to narrow things down
SELECT empname, resume FROM emp
WHERE CONTAINS (resume, 'about(EJB)', 1) > 0;
-- $ = Stem searching (eg, $sing: sing, sang, sung)
SELECT empname, resume FROM emp
WHERE CONTAINS (resume, '$consultant', 1) > 0;
-- ? = Fuzzy matching (dog, doug, dojo)
SELECT empname, resume FROM emp
WHERE CONTAINS (resume, '?developer', 1) > 0;
I've got them from
http://www.orafaq.org/faqctx.htm