Link to home
Start Free TrialLog in
Avatar of tbaseflug
tbaseflugFlag for United States of America

asked on

Determine word frequency/count in table

I have a column in table - which contains a product description - what the user wants is a way to list and count the frequency of use of words  - so that in the end, they can eliminate words such as this, these, that, etc and any other "custom" defined words to be excluded from search against... Is this possible -any examples?  Other than parsing out every single work and grouping/ counting, etc.?
ASKER CERTIFIED SOLUTION
Avatar of lludden
lludden
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
--i usually use charindex and keyword lookup table for something like this

declare @keywords table (id int, keyword nvarchar(100))

insert into @keywords values (1, 'this')
insert into @keywords values (2, 'that')
insert into @keywords values (3, 'the')

declare @products table (id int, code nvarchar(10), descr nvarchar(1000))

insert into @products values (1, 'abc', 'this grape popsicle')
insert into @products values (2, 'cgd', 'the pickled cherry poppers')
insert into @products values (3, 'jerb', 'derk er jerbs')
insert into @products values (4, 'asdf', 'that cherry jam')  


--count products with keywords
select      COUNT(*)
from      @products products
inner join @keywords keywords on
      charINDEX(keywords.keyword, products.descr) > 0



--replace words
select      products.*, rtrim(ltrim(REPLACE(products.descr, keywords.keyword, ''))) as MatchReplacedWithEmpthString
from      @products products
left outer join @keywords keywords on
      charINDEX(keywords.keyword, products.descr) > 0

      
Try this: (Modified from http://stanbiron.com/2010/07/29/SimpleWayToCountCharactersAndWordsUsingTSQL.aspx)
SELECT (LEN(<field>) - LEN(REPLACE(<field>, <key word>, ''))) / LEN(<key word)
FROM <table>

Open in new window

what is the ultimate purpose?

has fulltext indexing been considered... where "noise" words would be largely ignored...
Inclined to agree with Lowfatspread.

It sounds very much like Full Text Search is being sought. It is a bit different from 2005 to 2008 (so it becomes important to make that distinction).

There are "noise words" in pre 2008 and then stoplists from 2008 onwards. Have a read of : http://msdn.microsoft.com/en-us/library/ms142551.aspx look towards the bottom, you will also see "Viewing Stoplists and Stoplist Metadata"

You also have a thesaurus facility...

But to find if you have custom requirements or instances of words you want to ignore, then a query over the data is about the only way. And that can take some time to execute depending on size of data and number of rows.

Do you want some assistance with creating a query to list all the words and their frequency ? (is that frequency in terms of rows, or instance within a field, or throughout the column for the entire table)...


as an example... using a dummy table which we create first (just for testing)

create table #products (productID int identity primary key, Name varchar(200))
insert #products values ('This Product is a bewdy')
insert #products values ('Product 2 has this feature and that feature 2')
insert #products values ('Product 3 has nuts and bolts')
insert #products values ('Product 4 is only nuts and no bolts')
insert #products values ('Product 5 has no nuts and no bolts')
insert #products values ('Product 6 has this and that and more')

-- now the real query which is good for up to 2048 characters per column, if needing more then have to use a different numbers table approach

select word_of_name, count(*) as word_count
from (
      select productID, name, ltrim(rtrim(substring(name+' ',n,charindex(' ',name+' ',n+1)- n))) as word_of_name, count(*) over (partition by productid) as word_count
      FROM #products
      cross join (select number n from master..spt_values where type = 'P' and number > 0) numbers
      where substring(' '+rtrim(name),n,1) = ' '
      and n <= len(name)
     ) w
group by word_of_name
order by word_of_name

Open in new window


for a description of the spt_values please see : https://www.experts-exchange.com/Microsoft/Development/MS-SQL-Server/A_1221-Fun-with-MS-SQL-spt-values-for-delimited-strings-and-virtual-calendars.html
Interesting choice...

Just a small correction to the procedure (to use the delimiter) :

alter FUNCTION [dbo].[ParseString] (@String varchar(MAX), @Delimiter char(1) =',')  
RETURNS @StringList Table (
      String Varchar(50) ) 
AS  
BEGIN
      DECLARE @X1 as XML
      SET @X1 = (SELECT CAST('<i>' + REPLACE(@String,@delimiter,'</i><i>') + '</i>' AS XML))
      INSERT INTO @StringList
            SELECT ParamValues.ID.value('.','VARCHAR(20)')
            FROM @X1.nodes('/i') as ParamValues(ID) 
      RETURN
END

Open in new window



And then a word of warning about contents of the string. If it has XML unfriendly characters then it will run into problems.

For example, using my previous example #product file as a test, then check what *could* happen :

-- use tempdb so we dont get in the way of anything important...

USE TEMPDB
GO

if object_id('dbo.parsestring') is not null drop function dbo.parsestring
GO

CREATE FUNCTION [dbo].[ParseString] (@String varchar(MAX), @Delimiter char(1) =',')  
RETURNS @StringList Table (
      String Varchar(50) ) 
AS  
BEGIN
      DECLARE @X1 as XML
      SET @X1 = (SELECT CAST('<i>' + REPLACE(@String,@delimiter,'</i><i>') + '</i>' AS XML))
      INSERT INTO @StringList
            SELECT ParamValues.ID.value('.','VARCHAR(20)')
            FROM @X1.nodes('/i') as ParamValues(ID) 
      RETURN
END
GO

if object_id('tempdb..#products') is not null drop table #products
GO

create table #products (productID int identity primary key, Name varchar(200))
insert #products values ('This Product is a bewdy')
insert #products values ('Product 2 has this feature and that feature 2')
insert #products values ('Product 3 has nuts and bolts')
insert #products values ('Product 4 is only nuts and no bolts')
insert #products values ('Product 5 has no nuts and no bolts')
insert #products values ('Product 6 has this and that and more')

insert #products values ('Product 7 has < this > and < that > and / or lots more')    -- this line will abend the function

-- now the real query which is good for up to 2048 characters per column, if needing more then have to use a different numbers table approach

select word_of_name, count(*) as word_count
from (
      select productID, name, ltrim(rtrim(substring(name+' ',n,charindex(' ',name+' ',n+1)- n))) as word_of_name --, count(*) over (partition by productid) as word_count
      FROM #products
      cross apply (select number n from master..spt_values where type = 'P' and number > 0) numbers
      where substring(' '+rtrim(name),n,1) = ' '
      and n <= len(name)
     ) w
group by word_of_name
order by word_of_name 


select s.string, count(*)
from #products p
cross apply (select string from dbo.parsestring(p.name,' ')) s
group by s.string
order by s.string

-- now cleanup our tempdb (and will have to manually execute if above abended)

if object_id('dbo.parsestring') is not null drop function dbo.parsestring
GO

if object_id('tempdb..#products') is not null drop table #products
GO

Open in new window


Anyway, hope the above information will help any others coming along afterwards thinking that the "accepted" solution will solve the question being asked. Dont get me wrong, it is a good approach, but, it does have inherent problems and might not be suited. Oh,  and there was that small problem using the @delimiter.


Oh, and in terms of removing those "noise" words can either build a table (possible to import a text file that users maintain themselves), or, incorporate into the selection process (either as part of the function, or, as part of the selection)

e.g.

-- use tempdb so we dont get in the way of anything important...

USE TEMPDB
GO

if object_id('dbo.parsestring') is not null drop function dbo.parsestring
GO

if object_id('tempdb..#products') is not null drop table #products
GO

create table #products (productID int identity primary key, Name varchar(200))
insert #products values ('This Product is a bewdy')
insert #products values ('Product 2 has this feature and that feature 2')
insert #products values ('Product 3 has nuts and bolts')
insert #products values ('Product 4 is only nuts and no bolts')
insert #products values ('Product 5 has no nuts and no bolts')
insert #products values ('Product 6 has this and that and more')

insert #products values ('Product 7 has < this > and < that > and / or lots more')    -- this line will abend the function

-- now the real query which is good for up to 2048 characters per column, if needing more then have to use a different numbers table approach

select word_of_name, count(*) as word_count
from (
      select productID, name, ltrim(rtrim(substring(name+' ',n,charindex(' ',name+' ',n+1)- n))) as word_of_name
      FROM #products
      cross apply (select number n from master..spt_values where type = 'P' and number > 0) numbers
      where substring(' '+rtrim(name),n,1) = ' '
      and n <= len(name)
     ) w

where len(word_of_name ) > 1   -- or two, three, or four (whatever minimum length should be and suggest minimum three ie > 2)
and word_of_name not in ('is','or','yes','no','and','but','etc') -- build up a list as long as it is reasonable...

group by word_of_name
order by word_of_name 


-- now cleanup our tempdb (and will have to manually execute if above abended)

if object_id('tempdb..#products') is not null drop table #products
GO

Open in new window