Stripping individual words from a string into a file with counts

I have a column in a database called Description and I want to strip out all the individual words that exist in this column into a separate table and count the occurances.
As part of this process the column with be turned to UPPER(Decsription)
so for example

Description
magento integration
integration magento
amazon and ebay exist
magento/integration

returns temp table with 2 fields

Words                    count
magento                   2
integration               2
amazon                     1
and                             1  
ebay                            1
magento/integration   1
exists                            1

how and is this possible
Chris MichalczukConsultantAsked:
Who is Participating?
 
PortletPaulConnect With a Mentor freelancerCommented:
Please note, the following is a variant of the technique contained in the comment by ste5an above.

There are countless variations of "split string" functions, here's another that I believe to be efficient and simple to use using your sample data and table name as an example.

Sample data:
CREATE TABLE AllOpportunities 
	([Description] varchar(200))
;
	
INSERT INTO AllOpportunities 
	([Description])
VALUES
	('magento integration'),
	('integration magento'),
	('amazon and ebay exist'),
	('magento/integration')
;

Open in new window


Split String function:
CREATE FUNCTION [dbo].[SplitStringIntoTable]  (
    @string varchar(8000),
    @delimiter varchar(5)
    )
RETURNS TABLE WITH SCHEMABINDING 
AS

RETURN
--Inline-CTE-Driven "tally table" produces values from 0 to 9999  ... enough to cover varchar(8000)
WITH cteDigits AS (
    SELECT 0 AS digit UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL
    SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9
),
cteTally AS (
    SELECT [1000s].digit * 1000 + [100s].digit * 100 + [10s].digit * 10 + [1s].digit AS num
    FROM cteDigits [1s] --0 thru 9
    CROSS JOIN cteDigits [10s] --thru 99
    CROSS JOIN cteDigits [100s] --thru 999
    CROSS JOIN cteDigits [1000s] --thru 9999
),
cteBase(num) AS ( --Limit the number of rows up front, for both a performance gain and prevention of accidental "overruns"
        SELECT 0 UNION ALL
        SELECT TOP (DATALENGTH(ISNULL(@string, 1))) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM cteTally
), 
cteStart(num1) AS ( --This returns num + 1 (starting position of each "value" just once for each delimiter)
        SELECT b.num + 1
        FROM cteBase b
        WHERE (SUBSTRING(@string, b.num, 1) = @delimiter OR b.num = 0)
)
--Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found. 
SELECT ROW_NUMBER() OVER(ORDER BY s.num1) AS value_seq,
       LTRIM(RTRIM(SUBSTRING(@string, s.num1, ISNULL(NULLIF(CHARINDEX(@delimiter, @string, s.num1), 0) - s.num1, 8000)))) AS value
FROM cteStart s;

GO

Open in new window


Query:
SELECT
      CA.value
    , COUNT(*) AS COUNT_OF
FROM dbo.AllOpportunities AS AO
      CROSS APPLY dbo.SplitStringIntoTable(AO.description, SPACE(1)) AS CA
WHERE LEN(CA.value) > 1
GROUP BY
      CA.value
ORDER BY
      COUNT_OF DESC
    , CA.value ASC
;

Open in new window


Result:
|               VALUE | COUNT_OF |
|---------------------|----------|
|         integration |        2 |
|             magento |        2 |
| magento/integration |        1 |
|              amazon |        1 |
|                 and |        1 |
|                ebay |        1 |
|               exist |        1 |

Open in new window


see it working here: http://sqlfiddle.com/#!3/e4eb4/3

full details:
**MS SQL Server 2008 Schema Setup**:

    
    
    CREATE TABLE AllOpportunities 
    	([Description] varchar(200))
    ;
    	
    INSERT INTO AllOpportunities 
    	([Description])
    VALUES
    	('magento integration'),
    	('integration magento'),
    	('amazon and ebay exist'),
    	('magento/integration')
    ;
    
    
    GO
    
    CREATE FUNCTION [dbo].[SplitStringIntoTable]  (
        @string varchar(8000),
        @delimiter varchar(5)
        )
    RETURNS TABLE WITH SCHEMABINDING 
    AS
    
    RETURN
    --Inline-CTE-Driven "tally table" produces values from 0 to 9999  ... enough to cover varchar(8000)
    WITH cteDigits AS (
        SELECT 0 AS digit UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL
        SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9
    ),
    cteTally AS (
        SELECT [1000s].digit * 1000 + [100s].digit * 100 + [10s].digit * 10 + [1s].digit AS num
        FROM cteDigits [1s] --0 thru 9
        CROSS JOIN cteDigits [10s] --thru 99
        CROSS JOIN cteDigits [100s] --thru 999
        CROSS JOIN cteDigits [1000s] --thru 9999
    ),
    cteBase(num) AS ( --Limit the number of rows up front, for both a performance gain and prevention of accidental "overruns"
            SELECT 0 UNION ALL
            SELECT TOP (DATALENGTH(ISNULL(@string, 1))) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM cteTally
    ), 
    cteStart(num1) AS ( --This returns num + 1 (starting position of each "value" just once for each delimiter)
            SELECT b.num + 1
            FROM cteBase b
            WHERE (SUBSTRING(@string, b.num, 1) = @delimiter OR b.num = 0)
    )
    --Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found. 
    SELECT ROW_NUMBER() OVER(ORDER BY s.num1) AS value_seq,
           LTRIM(RTRIM(SUBSTRING(@string, s.num1, ISNULL(NULLIF(CHARINDEX(@delimiter, @string, s.num1), 0) - s.num1, 8000)))) AS value
    FROM cteStart s;
    
    GO
    
    

**Query 1**:

    SELECT
          CA.value
        , COUNT(*) AS COUNT_OF
    FROM dbo.AllOpportunities AS AO
          CROSS APPLY dbo.SplitStringIntoTable(AO.description, SPACE(1)) AS CA
    WHERE LEN(CA.value) > 1
    GROUP BY
          CA.value
    ORDER BY
          COUNT_OF DESC
    

**[Results][2]**:
    
    |               VALUE | COUNT_OF |
    |---------------------|----------|
    |         integration |        2 |
    |             magento |        2 |
    | magento/integration |        1 |
    |              amazon |        1 |
    |                 and |        1 |
    |                ebay |        1 |
    |               exist |        1 |



  [1]: http://sqlfiddle.com/#!3/e4eb4/3

Open in new window


refer: PAQ
http://www.experts-exchange.com/Database/MS-SQL-Server/Q_28272656.html#a39589088
0
 
Scott PletcherSenior DBACommented:
Hmm, if we are to use UPPER() on the description, why would you expect the result to come back as lower case?!
0
 
ste5anSenior DeveloperCommented:
hmm, I would guess that you should go for a CLR solution. Otherwise you need a split function:

 
DECLARE @Sample TABLE
    (
      [Description] NVARCHAR(MAX)
    );

INSERT  INTO @Sample
VALUES  ( 'magento integration' ),
        ( 'integration magento' ),
        ( 'amazon and ebay exist' ),
        ( 'magento/integration' );

SELECT  R.Value ,
        COUNT(*)
FROM    @Sample S
        CROSS APPLY dbo.fn_inline_split(S.[Description], ' ') R
GROUP BY R.Value;

Open in new window


with the split function

 
CREATE FUNCTION dbo.fn_inline_split
    (
      @param NVARCHAR(MAX) ,
      @splitchar NCHAR(1) = ','
    )
RETURNS TABLE
AS
RETURN
    ( SELECT    LTRIM(RTRIM(CONVERT(NVARCHAR(4000), SUBSTRING(@param, Number,
                                                              CHARINDEX(@splitchar COLLATE Slovenian_BIN2, @param + CONVERT(NVARCHAR(MAX), @splitchar), Number)
                                                              - Number)))) AS Value
      FROM      dbo.Numbers
      WHERE     Number <= CONVERT(INT, LEN(@param))
                AND SUBSTRING(CONVERT(NVARCHAR(MAX), @splitchar) + @param, Number, 1) = @splitchar COLLATE Slovenian_BIN2
    );

Open in new window


and the number table

CREATE TABLE dbo.Numbers
    (
      Number INT NOT NULL ,
      PRIMARY KEY CLUSTERED ( number ASC )
    );
GO

DECLARE @i INT = 1;

WHILE ( @i < 1000000 )
    BEGIN
        INSERT  INTO dbo.Numbers
        VALUES  ( @i );
        SET @i += 1;
    END;
GO

Open in new window


Caveat: The numbers table must contain as much numbers as the maximum of chars which can occur in on of your values.
0
Cloud Class® Course: MCSA MCSE Windows Server 2012

This course teaches how to install and configure Windows Server 2012 R2.  It is the first step on your path to becoming a Microsoft Certified Solutions Expert (MCSE).

 
Chris MichalczukConsultantAuthor Commented:
Scott the Upper was a red herring I assumed I'd turn everything into the same case to make the counts easier.
0
 
Chris MichalczukConsultantAuthor Commented:
My table is call AllOpportunities and the data is held in AllOpportunities.description

How do I use the above to achieve a seperate table of words and counts
0
 
Chris MichalczukConsultantAuthor Commented:
thanks
0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.