Solved

Stripping individual words from a string into a file with counts

Posted on 2015-02-06
6
63 Views
Last Modified: 2016-03-02
I have a column in a database called Description and I want to strip out all the individual words that exist in this column into a separate table and count the occurances.
As part of this process the column with be turned to UPPER(Decsription)
so for example

Description
magento integration
integration magento
amazon and ebay exist
magento/integration

returns temp table with 2 fields

Words                    count
magento                   2
integration               2
amazon                     1
and                             1  
ebay                            1
magento/integration   1
exists                            1

how and is this possible
0
Comment
Question by:chrismichalczuk
6 Comments
 
LVL 69

Expert Comment

by:ScottPletcher
ID: 40593925
Hmm, if we are to use UPPER() on the description, why would you expect the result to come back as lower case?!
0
 
LVL 32

Expert Comment

by:ste5an
ID: 40593948
hmm, I would guess that you should go for a CLR solution. Otherwise you need a split function:

 
DECLARE @Sample TABLE
    (
      [Description] NVARCHAR(MAX)
    );

INSERT  INTO @Sample
VALUES  ( 'magento integration' ),
        ( 'integration magento' ),
        ( 'amazon and ebay exist' ),
        ( 'magento/integration' );

SELECT  R.Value ,
        COUNT(*)
FROM    @Sample S
        CROSS APPLY dbo.fn_inline_split(S.[Description], ' ') R
GROUP BY R.Value;

Open in new window


with the split function

 
CREATE FUNCTION dbo.fn_inline_split
    (
      @param NVARCHAR(MAX) ,
      @splitchar NCHAR(1) = ','
    )
RETURNS TABLE
AS
RETURN
    ( SELECT    LTRIM(RTRIM(CONVERT(NVARCHAR(4000), SUBSTRING(@param, Number,
                                                              CHARINDEX(@splitchar COLLATE Slovenian_BIN2, @param + CONVERT(NVARCHAR(MAX), @splitchar), Number)
                                                              - Number)))) AS Value
      FROM      dbo.Numbers
      WHERE     Number <= CONVERT(INT, LEN(@param))
                AND SUBSTRING(CONVERT(NVARCHAR(MAX), @splitchar) + @param, Number, 1) = @splitchar COLLATE Slovenian_BIN2
    );

Open in new window


and the number table

CREATE TABLE dbo.Numbers
    (
      Number INT NOT NULL ,
      PRIMARY KEY CLUSTERED ( number ASC )
    );
GO

DECLARE @i INT = 1;

WHILE ( @i < 1000000 )
    BEGIN
        INSERT  INTO dbo.Numbers
        VALUES  ( @i );
        SET @i += 1;
    END;
GO

Open in new window


Caveat: The numbers table must contain as much numbers as the maximum of chars which can occur in on of your values.
0
 

Author Comment

by:chrismichalczuk
ID: 40593959
Scott the Upper was a red herring I assumed I'd turn everything into the same case to make the counts easier.
0
Complete VMware vSphere® ESX(i) & Hyper-V Backup

Capture your entire system, including the host, with patented disk imaging integrated with VMware VADP / Microsoft VSS and RCT. RTOs is as low as 15 seconds with Acronis Active Restore™. You can enjoy unlimited P2V/V2V migrations from any source (even from a different hypervisor)

 

Author Comment

by:chrismichalczuk
ID: 40593978
My table is call AllOpportunities and the data is held in AllOpportunities.description

How do I use the above to achieve a seperate table of words and counts
0
 
LVL 48

Accepted Solution

by:
PortletPaul earned 500 total points
ID: 40596413
Please note, the following is a variant of the technique contained in the comment by ste5an above.

There are countless variations of "split string" functions, here's another that I believe to be efficient and simple to use using your sample data and table name as an example.

Sample data:
CREATE TABLE AllOpportunities 
	([Description] varchar(200))
;
	
INSERT INTO AllOpportunities 
	([Description])
VALUES
	('magento integration'),
	('integration magento'),
	('amazon and ebay exist'),
	('magento/integration')
;

Open in new window


Split String function:
CREATE FUNCTION [dbo].[SplitStringIntoTable]  (
    @string varchar(8000),
    @delimiter varchar(5)
    )
RETURNS TABLE WITH SCHEMABINDING 
AS

RETURN
--Inline-CTE-Driven "tally table" produces values from 0 to 9999  ... enough to cover varchar(8000)
WITH cteDigits AS (
    SELECT 0 AS digit UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL
    SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9
),
cteTally AS (
    SELECT [1000s].digit * 1000 + [100s].digit * 100 + [10s].digit * 10 + [1s].digit AS num
    FROM cteDigits [1s] --0 thru 9
    CROSS JOIN cteDigits [10s] --thru 99
    CROSS JOIN cteDigits [100s] --thru 999
    CROSS JOIN cteDigits [1000s] --thru 9999
),
cteBase(num) AS ( --Limit the number of rows up front, for both a performance gain and prevention of accidental "overruns"
        SELECT 0 UNION ALL
        SELECT TOP (DATALENGTH(ISNULL(@string, 1))) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM cteTally
), 
cteStart(num1) AS ( --This returns num + 1 (starting position of each "value" just once for each delimiter)
        SELECT b.num + 1
        FROM cteBase b
        WHERE (SUBSTRING(@string, b.num, 1) = @delimiter OR b.num = 0)
)
--Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found. 
SELECT ROW_NUMBER() OVER(ORDER BY s.num1) AS value_seq,
       LTRIM(RTRIM(SUBSTRING(@string, s.num1, ISNULL(NULLIF(CHARINDEX(@delimiter, @string, s.num1), 0) - s.num1, 8000)))) AS value
FROM cteStart s;

GO

Open in new window


Query:
SELECT
      CA.value
    , COUNT(*) AS COUNT_OF
FROM dbo.AllOpportunities AS AO
      CROSS APPLY dbo.SplitStringIntoTable(AO.description, SPACE(1)) AS CA
WHERE LEN(CA.value) > 1
GROUP BY
      CA.value
ORDER BY
      COUNT_OF DESC
    , CA.value ASC
;

Open in new window


Result:
|               VALUE | COUNT_OF |
|---------------------|----------|
|         integration |        2 |
|             magento |        2 |
| magento/integration |        1 |
|              amazon |        1 |
|                 and |        1 |
|                ebay |        1 |
|               exist |        1 |

Open in new window


see it working here: http://sqlfiddle.com/#!3/e4eb4/3

full details:
**MS SQL Server 2008 Schema Setup**:

    
    
    CREATE TABLE AllOpportunities 
    	([Description] varchar(200))
    ;
    	
    INSERT INTO AllOpportunities 
    	([Description])
    VALUES
    	('magento integration'),
    	('integration magento'),
    	('amazon and ebay exist'),
    	('magento/integration')
    ;
    
    
    GO
    
    CREATE FUNCTION [dbo].[SplitStringIntoTable]  (
        @string varchar(8000),
        @delimiter varchar(5)
        )
    RETURNS TABLE WITH SCHEMABINDING 
    AS
    
    RETURN
    --Inline-CTE-Driven "tally table" produces values from 0 to 9999  ... enough to cover varchar(8000)
    WITH cteDigits AS (
        SELECT 0 AS digit UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL
        SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9
    ),
    cteTally AS (
        SELECT [1000s].digit * 1000 + [100s].digit * 100 + [10s].digit * 10 + [1s].digit AS num
        FROM cteDigits [1s] --0 thru 9
        CROSS JOIN cteDigits [10s] --thru 99
        CROSS JOIN cteDigits [100s] --thru 999
        CROSS JOIN cteDigits [1000s] --thru 9999
    ),
    cteBase(num) AS ( --Limit the number of rows up front, for both a performance gain and prevention of accidental "overruns"
            SELECT 0 UNION ALL
            SELECT TOP (DATALENGTH(ISNULL(@string, 1))) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM cteTally
    ), 
    cteStart(num1) AS ( --This returns num + 1 (starting position of each "value" just once for each delimiter)
            SELECT b.num + 1
            FROM cteBase b
            WHERE (SUBSTRING(@string, b.num, 1) = @delimiter OR b.num = 0)
    )
    --Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found. 
    SELECT ROW_NUMBER() OVER(ORDER BY s.num1) AS value_seq,
           LTRIM(RTRIM(SUBSTRING(@string, s.num1, ISNULL(NULLIF(CHARINDEX(@delimiter, @string, s.num1), 0) - s.num1, 8000)))) AS value
    FROM cteStart s;
    
    GO
    
    

**Query 1**:

    SELECT
          CA.value
        , COUNT(*) AS COUNT_OF
    FROM dbo.AllOpportunities AS AO
          CROSS APPLY dbo.SplitStringIntoTable(AO.description, SPACE(1)) AS CA
    WHERE LEN(CA.value) > 1
    GROUP BY
          CA.value
    ORDER BY
          COUNT_OF DESC
    

**[Results][2]**:
    
    |               VALUE | COUNT_OF |
    |---------------------|----------|
    |         integration |        2 |
    |             magento |        2 |
    | magento/integration |        1 |
    |              amazon |        1 |
    |                 and |        1 |
    |                ebay |        1 |
    |               exist |        1 |



  [1]: http://sqlfiddle.com/#!3/e4eb4/3

Open in new window


refer: PAQ
http://www.experts-exchange.com/Database/MS-SQL-Server/Q_28272656.html#a39589088
0
 

Author Closing Comment

by:chrismichalczuk
ID: 40679874
thanks
0

Featured Post

Get up to 2TB FREE CLOUD per backup license!

An exclusive Black Friday offer just for Expert Exchange audience! Buy any of our top-rated backup solutions & get up to 2TB free cloud per system! Perform local & cloud backup in the same step, and restore instantly—anytime, anywhere. Grab this deal now before it disappears!

Join & Write a Comment

If you have heard of RFC822 date formats, they can be quite a challenge in SQL Server. RFC822 is an Internet standard format for email message headers, including all dates within those headers. The RFC822 protocols are available in detail at:   ht…
In this article we will get to know that how can we recover deleted data if it happens accidently. We really can recover deleted rows if we know the time when data is deleted by using the transaction log.
This video shows how to set up a shell script to accept a positional parameter when called, pass that to a SQL script, accept the output from the statement back and then manipulate it in the Shell.
Via a live example, show how to extract insert data into a SQL Server database table using the Import/Export option and Bulk Insert.

743 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

11 Experts available now in Live!

Get 1:1 Help Now