Link to home
Start Free TrialLog in
Avatar of PeterBaileyUk
PeterBaileyUk

asked on

Sql Server Dedupe function

I need to dedupe a string made from concatenated fields before I send the string off to be split.

The leftmost word needs to be kept of the dupes.

'XF XF 250 ATV-2 TERRAHAWK' and this 'XF 250 ATV-2 XF TERRAHAWK XF '

would become ''XF 250 ATV-2 TERRAHAWK''

For now I only want to dedupe actual words XF XF and not ++ but that position may change

I am in Sql server express latest edition.
Avatar of Sharath S
Sharath S
Flag of United States of America image

If it is just that specific word XF, you can try something like this.

SELECT CASE WHEN col LIKE 'XF%' THEN 'XF' ELSE '' END + REPLACE(col, 'XF','') FROM your_table
Avatar of PeterBaileyUk
PeterBaileyUk

ASKER

I think on reflection its the concatenation of fields I have that is creating dupes so I wonder if a better concat can be made. The example I gave is just on one client code but to get to the data which is spread across 3 fields (for this one client with no consistency in the data).

I have
CONCAT(
						 ISNULL(RTRIM(LTRIM([Model Range]))+' ','')
						,ISNULL(RTRIM(LTRIM([RANGE SERIES]))+' ','')
						,ISNULL(RTRIM(LTRIM([VARIANT])),'')

Open in new window


I know the dupes are getting created here.

for the terrahawk code I gave the fields are like: [Model Range]='XF' + [RANGE SERIES]='' + [VARIANT]='XF 250 ATV-2 TERRAHAWK' so is it possible to create a function that will concat but not concat where it sees a dupe as opposed to creating dupes and then retrospectively removing them.
You can use the same function I gave you in the other question:
declare
	@str varchar(1000)='XF XF 250 ATV-2 TERRAHAWK'	/* comment/uncomment this line to toggle between the 2 string versions
	@str varchar(1000)='XF 250 ATV-2 XF TERRAHAWK XF'	--*/

select
	*,
	(select count(*) from [dbo].[fnParseStringToSet](@str,' ') where Elem=f.Elem) dupe
from
	[dbo].[fnParseStringToSet](@str,' ') f

Open in new window

The problem here is which one, what position, you keep among the repeating words.
ASKER CERTIFIED SOLUTION
Avatar of Zberteoc
Zberteoc
Flag of Canada image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
do i calll it at the top select or at the at the concat part?

I just tried the concat part but it said wrong number of arguments.
Select 
	 ClientCodeWordPosition + '_' + cast(d.ElemID as varchar(5))
   , [MVRIS CODE]
   , ClientName
   , UPPER(Word)
   , WordLength AS WordLen
   , ConcatDesc AS strfull
	, NoOfWords
   , MARQUE 
From 
(
	SELECT  
	     smmt.[MVRIS CODE] + '_' + s.Elem + '_' AS ClientCodeWordPosition
	   , SMMT.[MVRIS CODE]
	   , 'SMMT' AS ClientName
	   , s.Elem AS [Word]
	   , LEN(s.Elem) AS WordLength
	   , ConcatDesc
	   , count(*)  over (partition by SMMT.[MVRIS CODE]) NoOfWords
	   , SMMT.MARQUE
	   , s.ElemID
	FROM 
		(
			SELECT 
				 [MVRIS CODE]	
				,[MARQUE]
				,ltrim(rtrim(
					CONCAT(
						 ISNULL(RTRIM(LTRIM([Model Range]))+' ','')
						,ISNULL(RTRIM(LTRIM([RANGE SERIES]))+' ','')
						,ISNULL(RTRIM(LTRIM([VARIANT])),'')
					 ) 
				 )) as ConcatDesc
			FROM
				ClientData.[dbo].SMMT
		) SMMT
		CROSS APPLY [dbo].[fnParseStringToSet](ConcatDesc,' ') s 
	WHERE 
		SMMT.[MVRIS CODE]='1B1AD' 
) D

Open in new window

I got it:
FROM 
		(
			SELECT 
				 [MVRIS CODE]	
				,[MARQUE]
				,dbo.fnDedupeString(ltrim(rtrim(
					CONCAT(
						 ISNULL(RTRIM(LTRIM([Model Range]))+' ','')
						,ISNULL(RTRIM(LTRIM([RANGE SERIES]))+' ','')
						,ISNULL(RTRIM(LTRIM([VARIANT])),'')
					 ) 
				 )), ' ') as ConcatDesc

Open in new window

Thank you!
OK, I just realized that there is no need for that function in this query. What you want is to eliminate the duplicate words from the rows it returns. You only need an extra condition in the where clause:
Select 
	 ClientCodeWordPosition + '_' + cast(d.ElemID as varchar(5))
   , [MVRIS CODE]
   , ClientName
   , UPPER(Word) as Word
   , WordLength AS WordLen
   , ConcatDesc AS strfull
	, NoOfWords
   , MARQUE 
From 
(
	SELECT  
	     smmt.[MVRIS CODE] + '_' + s.Elem + '_' AS ClientCodeWordPosition
	   , SMMT.[MVRIS CODE]
	   , 'SMMT' AS ClientName
	   , s.Elem AS [Word]
	   , LEN(s.Elem) AS WordLength
	   , ConcatDesc
	   , count(*)  over (partition by SMMT.[MVRIS CODE]) NoOfWords
	   , SMMT.MARQUE
	   , s.ElemID
	FROM 
		(
			SELECT 
				 [MVRIS CODE]	
				,[MARQUE]
				,ltrim(rtrim(
					CONCAT(
						 ISNULL(RTRIM(LTRIM([Model Range]))+' ','')
						,ISNULL(RTRIM(LTRIM([RANGE SERIES]))+' ','')
						,ISNULL(RTRIM(LTRIM([VARIANT])),'')
					 ) 
				 )) as ConcatDesc
			FROM
				ClientData.[dbo].SMMT
		) SMMT
		CROSS APPLY [dbo].[fnParseStringToSet](ConcatDesc,' ') s 
	WHERE 
		SMMT.[MVRIS CODE]='1B1AD' 
		-- ### filter out duplicate words
		and (select count(*) from [dbo].[fnParseStringToSet](@str,' ') where Elem=s.Elem and ElemID<s.ElemID)=0
) D

Open in new window

However, you can keep the dedupe function for different situations.
Yes, the query as you posted will also work. Use the query that is more efficient.
One correction, if you use my last query. The new where clause should be:
...
	WHERE 
		SMMT.[MVRIS CODE]='1B1AD' 
		-- ### filter out duplicate words
		and (select count(*) from [dbo].[fnParseStringToSet](ConcatDesc,' ') where Elem=s.Elem and ElemID<s.ElemID)=0

Open in new window

I would opt for this qyery as is doing less parsing than using the new function.