YZlat
asked on
Function for word by word string comparison in T-SQL
I need to write a SQL Server function that will take in two fields and perform a logical operation in this manner
- take first field value, split it into words
- make sure all of these words exist in the second field.
For example: the first field of "le chien" and the second is "le chien joue avec la balle" The function would return true in this case, since all the words from the first one are in the second.
I could simply use SUBSTRING:
SUBSTRING(field1, 0, 5)=SUBSTRING(field2, 0,5)
But in some cases it would not work; if I have field1 as la maison (1) and field2 as la maison (2) SUBSTRING(field1, 0, 5)=SUBSTRING(field2, 0,5) would return true even though the strings do not match.
CHARINDEX(field1, field2) > 0 works but not for all the cases, since it only works for those strings where words are in the same order.
for example it works for "le chien" and "le chien joue avec la balle" but does not for "le chien" and "chien le joue avec la balle"
Increasing number of characters in some cases would work, but would not in others. So it all boils down to writing a function that will split the first string into separate words and ensure that all of them exist in the second string.
Can someone help me build a function or point me in the right direction? I'd really appreciate any help
P.S> I need something that I can just plug in to the select statement as a condition in a where clause
- take first field value, split it into words
- make sure all of these words exist in the second field.
For example: the first field of "le chien" and the second is "le chien joue avec la balle" The function would return true in this case, since all the words from the first one are in the second.
I could simply use SUBSTRING:
SUBSTRING(field1, 0, 5)=SUBSTRING(field2, 0,5)
But in some cases it would not work; if I have field1 as la maison (1) and field2 as la maison (2) SUBSTRING(field1, 0, 5)=SUBSTRING(field2, 0,5) would return true even though the strings do not match.
CHARINDEX(field1, field2) > 0 works but not for all the cases, since it only works for those strings where words are in the same order.
for example it works for "le chien" and "le chien joue avec la balle" but does not for "le chien" and "chien le joue avec la balle"
Increasing number of characters in some cases would work, but would not in others. So it all boils down to writing a function that will split the first string into separate words and ensure that all of them exist in the second string.
Can someone help me build a function or point me in the right direction? I'd really appreciate any help
P.S> I need something that I can just plug in to the select statement as a condition in a where clause
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Much easier if you have a key value with original strings, some type of identifier. Then you just need the "standard" splitter, a CROSS APPLY and a GROUP BY.
I'll assume for now that space is enough of a word delimiter. That is, that you do not have to worry about:
"le chien, ..."
But, if necessary, you could run each Item returned from the second string through a function that would remove commas and periods and/or other chars, such as colons, dashes, etc.
SELECT id, MAX(string1) AS string1, MAX(string2) AS string2
FROM (
SELECT 1 AS id, 'le chien' AS string1, 'le chien joue avec la balle' AS string2 UNION ALL
SELECT 2, 'le chien', 'chien le joue avec la balle' UNION ALL
SELECT 3, 'le chat', 'le chien joue avec la balle blanche' UNION ALL
SELECT 4, 'das hund', 'le chien joue avec la balle blanche'
) AS test_data
CROSS APPLY (
SELECT s1.Item, s2.Item AS Item2
FROM dbo.DelimitedSplit8K ( string1, ' ' ) s1
LEFT OUTER JOIN dbo.DelimitedSplit8K ( string2, ' ' ) AS s2 ON
s2.Item = s1.Item
) AS ca
GROUP BY id
HAVING SUM(CASE WHEN Item2 IS NULL THEN 1 ELSE 0 END) = 0
I'll assume for now that space is enough of a word delimiter. That is, that you do not have to worry about:
"le chien, ..."
But, if necessary, you could run each Item returned from the second string through a function that would remove commas and periods and/or other chars, such as colons, dashes, etc.
SELECT id, MAX(string1) AS string1, MAX(string2) AS string2
FROM (
SELECT 1 AS id, 'le chien' AS string1, 'le chien joue avec la balle' AS string2 UNION ALL
SELECT 2, 'le chien', 'chien le joue avec la balle' UNION ALL
SELECT 3, 'le chat', 'le chien joue avec la balle blanche' UNION ALL
SELECT 4, 'das hund', 'le chien joue avec la balle blanche'
) AS test_data
CROSS APPLY (
SELECT s1.Item, s2.Item AS Item2
FROM dbo.DelimitedSplit8K ( string1, ' ' ) s1
LEFT OUTER JOIN dbo.DelimitedSplit8K ( string2, ' ' ) AS s2 ON
s2.Item = s1.Item
) AS ca
GROUP BY id
HAVING SUM(CASE WHEN Item2 IS NULL THEN 1 ELSE 0 END) = 0
YZlat, do you still need help with this question
ASKER
sorry, I was pulled off to something else. Will test the solutions offered today
ASKER
Didn't get a chance to test it out but won't in a while, so I am just going to accept those answers that could work out - functions. Other answers won't work because I need to use that function in a query join
Open in new window
If you want both words to be there together, you can replace the above WHERE statement with: