Link to home
Start Free TrialLog in
Avatar of YZlat
YZlatFlag for United States of America

asked on

Function for word by word string comparison in T-SQL

I need to write a SQL Server function that will take in two fields and perform a logical operation in this manner
 - take first field value, split it into words
 - make sure all of these words exist in the second field.

For example: the first field of "le chien" and the second is "le chien joue avec la balle" The function would return true in this case, since all the words from the first one are in the second.

I could simply use SUBSTRING:

SUBSTRING(field1, 0, 5)=SUBSTRING(field2, 0,5)

But in some cases it would not work; if I have field1 as la maison (1) and field2 as la maison (2) SUBSTRING(field1, 0, 5)=SUBSTRING(field2, 0,5) would return true even though the strings do not match.

CHARINDEX(field1, field2) > 0 works but not for all the cases, since it only works for those strings where words are in the same order.
for example it works for "le chien" and "le chien joue avec la balle" but does not for "le chien" and "chien le joue avec la balle"

Increasing number of characters in some cases would work, but would not in others. So it all boils down to writing a function that will split the first string into separate words and ensure that all of them exist in the second string.

Can someone help me build a function or point me in the right direction? I'd really appreciate any help

P.S> I need something that I can just plug in to the select statement as a condition in a where clause
Avatar of dsacker
dsacker
Flag of United States of America image

I wrote a function sometime back called udf_SplitFn. You might like it. It is zero-based, so for the first and second words, you would call it as follows:

SELECT  *
FROM   (SELECT  KeyFields,
                dbo.udf_SplitFn(' ', FirstField, 0) AS FirstWord,
                dbo.udf_SplitFn(' ', FirstField, 1) AS SecondWord,
                SecondField
        FROM    YourTable) t
WHERE   CHARINDEX(FirstWord, SecondField) > 0
AND     CHARINDEX(SecondWord, SecondField) > 0

Open in new window

If you want both words to be there together, you can replace the above WHERE statement with:

WHERE   CHARINDEX(FirstWord + ' ' + SecondWord, SecondField) > 0
ASKER CERTIFIED SOLUTION
Avatar of dsacker
dsacker
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Much easier if you have a key value with original strings, some type of identifier.  Then you just need the "standard" splitter, a CROSS APPLY and a GROUP BY.  

I'll assume for now that space is enough of a word delimiter.  That is, that you do not have to worry about:
"le chien, ..."

But, if necessary, you could run each Item returned from the second string through a function that would remove commas and periods and/or other chars, such as colons, dashes, etc.

SELECT id, MAX(string1) AS string1, MAX(string2) AS string2
FROM (
    SELECT 1 AS id, 'le chien' AS string1, 'le chien joue avec la balle' AS string2 UNION ALL
    SELECT 2, 'le chien', 'chien le joue avec la balle' UNION ALL
    SELECT 3, 'le chat', 'le chien joue avec la balle blanche' UNION ALL
    SELECT 4, 'das hund', 'le chien joue avec la balle blanche'
) AS test_data
CROSS APPLY (
    SELECT s1.Item, s2.Item AS Item2
    FROM dbo.DelimitedSplit8K ( string1, ' ' ) s1
    LEFT OUTER JOIN dbo.DelimitedSplit8K ( string2, ' ' ) AS s2 ON
        s2.Item = s1.Item
) AS ca
GROUP BY id
HAVING SUM(CASE WHEN Item2 IS NULL THEN 1 ELSE 0 END) = 0
YZlat, do you still need help with this question
Avatar of YZlat

ASKER

sorry, I was pulled off to something else. Will test the solutions offered today
Avatar of YZlat

ASKER

Didn't get a chance to test it out but won't in a while, so I am just going to accept those answers that could work out - functions. Other answers won't work because I need to use that function in a query join