Solved

select keywords on row

Posted on 2011-09-14
13
192 Views
Last Modified: 2012-05-12
hi, this is difficult for me, but i don't think is for u guys. here we go.

i want an sql query or storeprocedure that select 15 words that repeat more on some of the rows.

sample

id                text
1                 we will eat cheese today
2                 you guys are great developers
3                cheese is the best food ever
4                computer is great for developers
5                 food like cheese is never to eat alone

ok, so the query should give me something like this:  " cheese, developers, is, food, great " 

since the text is kind bigger i will only have 10 words most repeated between the range selected.

any ideas?
0
Comment
Question by:rafaelrgl
  • 7
  • 3
  • 2
  • +1
13 Comments
 
LVL 59

Expert Comment

by:Kevin Cross
ID: 36540756
Something like this may work. I convert each line into XML and then pull the words apart using XML .nodes() functionality. Once broken into rows, you can do simple aggregate to count the number of occurrences of a word. With a HAVING clause filtering only words that appear more than once, you can grab the top 15 with respect to the count of times they appear in your data set.
select top 15 word
from (
/* pull words apart using xml */
select r.w.value('.', 'varchar(50)') word
from (
   select id
        , convert(xml, 
                  '<word>'+replace(txt, ' ', '</word><word>')+'</word>'
          ) as words
   from sampletbl
) t
cross apply t.words.nodes('/word') r(w)

) derived
group by word
having count(1) > 1
order by count(1) desc
;

Open in new window

0
 
LVL 59

Expert Comment

by:Kevin Cross
ID: 36540847
Here are some additional thoughts to making this better:

select top 15 word, count(distinct id) cnt
from (
/* pull words apart using xml */
select id, r.w.value('.', 'varchar(50)') word
from (
   select id
        , convert(xml,
                  '<w>'+replace(
                           replace(
                              replace(
                                 replace(txt, ' ', '|')
                              ,',', '|')
                           , '.', '|')
                        , '|', '</w><w>')+'</w>'

          ) as words
   from sampletbl
) t
cross apply t.words.nodes('/w') r(w)

) derived
group by word
having count(distinct id) > 1
order by cnt desc
;

- replace other punctuation like '.', ',', etc.
- use count(distinct id), so that you don't get hits where a word is repeated several times but all on the same line -- unless of course you want to count that. Based on "that repeat more on some of the rows" I suspect that you want words on multiple rows only.
0
 
LVL 9

Assisted Solution

by:mimran18
mimran18 earned 100 total points
ID: 36541054
Hi,
   Here we go.
 
-- Create Table for  Sample Data 
Drop Table Test
Go
CREATE TABLE Test 
( 
ID INT, 
AllNames VARCHAR(100) 
) 
GO 
-- Load Sample Data 
INSERT INTO test SELECT 
1, 'we will eat cheese today' UNION ALL SELECT 
2, 'you guys are great developers'  UNION ALL SELECT 
3, 'cheese is the best food ever' UNION ALL SELECT 
4, 'computer is great for developers'  UNION ALL SELECT 
5, 'food like cheese is never to eat alone'

GO 

-- Verify the Sample Data 
SELECT Id, AllNames 
FROM Test
 
 GO

;WITH Cte AS 
( 
    SELECT 
        id, 
        CAST('<M>' + REPLACE( Allnames, ' ' , '</M><M>') + '</M>' AS XML) AS Names 
    FROM Test 
) 
,Cte2 as (
Select 
    ID, 
    Split.a.value('.', 'VARCHAR(100)') AS Names
FROM Cte 
CROSS APPLY Names.nodes('/M') Split(a))
Select top 15 [Names],COUNT(*) from Cte2
Group By [Names]
Order By COUNT(*) DESC

Open in new window

0
The Eight Noble Truths of Backup and Recovery

How can IT departments tackle the challenges of a Big Data world? This white paper provides a roadmap to success and helps companies ensure that all their data is safe and secure, no matter if it resides on-premise with physical or virtual machines or in the cloud.

 
LVL 59

Expert Comment

by:Kevin Cross
ID: 36541102
@mimran18: all it appears you did was take my suggestion and move it common table expressions?
Anyway, see my notes as you are missing some key pieces like ensuring that you don't get [Names] that only appear ONCE and that you account for a [Names] value appearing multiple times on the same row only.
0
 
LVL 9

Expert Comment

by:mimran18
ID: 36541215
@mwvisa1:You are absolutly right,The credit goes to you and
'http://mangalpardeshi.blogspot.com/2009/03/how-to-split-comma-delimited-string.html'

If you are going to remove "top 15" it will give you the records that appear only once.

cheers
0
 
LVL 59

Expert Comment

by:Kevin Cross
ID: 36541245
Not really worried about the credit...just wondering why you posted the same solution. That link appears broken for some reason...Might be the quotes, I will try again. The TOP 15 is not the issue. The issue is that ordering by COUNT(*) DESC alone assumes that you have 15 words that appear more than once. To ensure that invalid entries don't make it into the TOP 15, you need to add a HAVING or similar filter on COUNT() to make sure it is greater than 1.
0
 
LVL 9

Expert Comment

by:mimran18
ID: 36541272
ya, you are right will alter it.
0
 
LVL 59

Expert Comment

by:Kevin Cross
ID: 36541293
By the way, if you do not have permission of the Author of that blog, then it is inappropriate to copy solution verbatim here. As you can see, the technique is well-known, so that is not the issue. What is the issue is you didn't cite the source. Please be careful of this; the Moderators tend to frown on such activity. :)

Anyway, no harm no foul as we have citation of the source in the thread now.
0
 
LVL 50

Assisted Solution

by:Lowfatspread
Lowfatspread earned 50 total points
ID: 36543571
will the text contain personal names?

if so i don't think the xml split approach will work... since you would need to deal with O'Neil , O'brian etc...

if you need to deal with contractions (isn't , is not ... haven't have not) ..?

then you would need to write you own word split function beforehand...

do you want to exclude numbers ? ... similarly


could you expand on why you are doing this and we maybe able to assist further?

ps
please confirm this isn't homework/coursework...
0
 
LVL 59

Expert Comment

by:Kevin Cross
ID: 36543949
Good thoughts as usual, LFS! :)
0
 
LVL 1

Author Comment

by:rafaelrgl
ID: 36546687
that's not homework/coursework. This is an project of an website that i need to have the keywords to put on the metatag keywords. So search engines will find my content better. hope i explain.
so give me one day i will test it and get back to u guys. btw. thanks a lots for the quick response and all your help.

0
 
LVL 59

Accepted Solution

by:
Kevin Cross earned 350 total points
ID: 36546697
You are most welcome. I tested and didn't have any issues with single quotes per LFS's comment, but do let us know if you run into any trouble. See my second post for clarification/refinement of the code to account for punctuation, etc. Things like ampersands that give XML difficult might cause problems, but we can probably get around that with a little work, so again you know where to find us.
0
 
LVL 1

Author Closing Comment

by:rafaelrgl
ID: 36564069
Thanks a lots. you guys are always great. thanks for all the help.
0

Featured Post

U.S. Department of Agriculture and Acronis Access

With the new era of mobile computing, smartphones and tablets, wireless communications and cloud services, the USDA sought to take advantage of a mobilized workforce and the blurring lines between personal and corporate computing resources.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Hi all, It is important and often overlooked to understand “Database properties”. Often we see questions about "log files" or "where is the database" and one of the easiest ways to get general information about your database is to use “Database p…
This article explains how to reset the password of the sa account on a Microsoft SQL Server.  The steps in this article work in SQL 2005, 2008, 2008 R2, 2012, 2014 and 2016.
This video shows how to use Hyena, from SystemTools Software, to bulk import 100 user accounts from an external text file. View in 1080p for best video quality.
In an interesting question (https://www.experts-exchange.com/questions/29008360/) here at Experts Exchange, a member asked how to split a single image into multiple images. The primary usage for this is to place many photographs on a flatbed scanner…

860 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question