Still celebrating National IT Professionals Day with 3 months of free Premium Membership. Use Code ITDAY17

x
?
Solved

select keywords on row

Posted on 2011-09-14
13
Medium Priority
?
197 Views
Last Modified: 2012-05-12
hi, this is difficult for me, but i don't think is for u guys. here we go.

i want an sql query or storeprocedure that select 15 words that repeat more on some of the rows.

sample

id                text
1                 we will eat cheese today
2                 you guys are great developers
3                cheese is the best food ever
4                computer is great for developers
5                 food like cheese is never to eat alone

ok, so the query should give me something like this:  " cheese, developers, is, food, great " 

since the text is kind bigger i will only have 10 words most repeated between the range selected.

any ideas?
0
Comment
Question by:rafaelrgl
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 7
  • 3
  • 2
  • +1
13 Comments
 
LVL 60

Expert Comment

by:Kevin Cross
ID: 36540756
Something like this may work. I convert each line into XML and then pull the words apart using XML .nodes() functionality. Once broken into rows, you can do simple aggregate to count the number of occurrences of a word. With a HAVING clause filtering only words that appear more than once, you can grab the top 15 with respect to the count of times they appear in your data set.
select top 15 word
from (
/* pull words apart using xml */
select r.w.value('.', 'varchar(50)') word
from (
   select id
        , convert(xml, 
                  '<word>'+replace(txt, ' ', '</word><word>')+'</word>'
          ) as words
   from sampletbl
) t
cross apply t.words.nodes('/word') r(w)

) derived
group by word
having count(1) > 1
order by count(1) desc
;

Open in new window

0
 
LVL 60

Expert Comment

by:Kevin Cross
ID: 36540847
Here are some additional thoughts to making this better:

select top 15 word, count(distinct id) cnt
from (
/* pull words apart using xml */
select id, r.w.value('.', 'varchar(50)') word
from (
   select id
        , convert(xml,
                  '<w>'+replace(
                           replace(
                              replace(
                                 replace(txt, ' ', '|')
                              ,',', '|')
                           , '.', '|')
                        , '|', '</w><w>')+'</w>'

          ) as words
   from sampletbl
) t
cross apply t.words.nodes('/w') r(w)

) derived
group by word
having count(distinct id) > 1
order by cnt desc
;

- replace other punctuation like '.', ',', etc.
- use count(distinct id), so that you don't get hits where a word is repeated several times but all on the same line -- unless of course you want to count that. Based on "that repeat more on some of the rows" I suspect that you want words on multiple rows only.
0
 
LVL 9

Assisted Solution

by:mimran18
mimran18 earned 400 total points
ID: 36541054
Hi,
   Here we go.
 
-- Create Table for  Sample Data 
Drop Table Test
Go
CREATE TABLE Test 
( 
ID INT, 
AllNames VARCHAR(100) 
) 
GO 
-- Load Sample Data 
INSERT INTO test SELECT 
1, 'we will eat cheese today' UNION ALL SELECT 
2, 'you guys are great developers'  UNION ALL SELECT 
3, 'cheese is the best food ever' UNION ALL SELECT 
4, 'computer is great for developers'  UNION ALL SELECT 
5, 'food like cheese is never to eat alone'

GO 

-- Verify the Sample Data 
SELECT Id, AllNames 
FROM Test
 
 GO

;WITH Cte AS 
( 
    SELECT 
        id, 
        CAST('<M>' + REPLACE( Allnames, ' ' , '</M><M>') + '</M>' AS XML) AS Names 
    FROM Test 
) 
,Cte2 as (
Select 
    ID, 
    Split.a.value('.', 'VARCHAR(100)') AS Names
FROM Cte 
CROSS APPLY Names.nodes('/M') Split(a))
Select top 15 [Names],COUNT(*) from Cte2
Group By [Names]
Order By COUNT(*) DESC

Open in new window

0
VIDEO: THE CONCERTO CLOUD FOR HEALTHCARE

Modern healthcare requires a modern cloud. View this brief video to understand how the Concerto Cloud for Healthcare can help your organization.

 
LVL 60

Expert Comment

by:Kevin Cross
ID: 36541102
@mimran18: all it appears you did was take my suggestion and move it common table expressions?
Anyway, see my notes as you are missing some key pieces like ensuring that you don't get [Names] that only appear ONCE and that you account for a [Names] value appearing multiple times on the same row only.
0
 
LVL 9

Expert Comment

by:mimran18
ID: 36541215
@mwvisa1:You are absolutly right,The credit goes to you and
'http://mangalpardeshi.blogspot.com/2009/03/how-to-split-comma-delimited-string.html'

If you are going to remove "top 15" it will give you the records that appear only once.

cheers
0
 
LVL 60

Expert Comment

by:Kevin Cross
ID: 36541245
Not really worried about the credit...just wondering why you posted the same solution. That link appears broken for some reason...Might be the quotes, I will try again. The TOP 15 is not the issue. The issue is that ordering by COUNT(*) DESC alone assumes that you have 15 words that appear more than once. To ensure that invalid entries don't make it into the TOP 15, you need to add a HAVING or similar filter on COUNT() to make sure it is greater than 1.
0
 
LVL 9

Expert Comment

by:mimran18
ID: 36541272
ya, you are right will alter it.
0
 
LVL 60

Expert Comment

by:Kevin Cross
ID: 36541293
By the way, if you do not have permission of the Author of that blog, then it is inappropriate to copy solution verbatim here. As you can see, the technique is well-known, so that is not the issue. What is the issue is you didn't cite the source. Please be careful of this; the Moderators tend to frown on such activity. :)

Anyway, no harm no foul as we have citation of the source in the thread now.
0
 
LVL 50

Assisted Solution

by:Lowfatspread
Lowfatspread earned 200 total points
ID: 36543571
will the text contain personal names?

if so i don't think the xml split approach will work... since you would need to deal with O'Neil , O'brian etc...

if you need to deal with contractions (isn't , is not ... haven't have not) ..?

then you would need to write you own word split function beforehand...

do you want to exclude numbers ? ... similarly


could you expand on why you are doing this and we maybe able to assist further?

ps
please confirm this isn't homework/coursework...
0
 
LVL 60

Expert Comment

by:Kevin Cross
ID: 36543949
Good thoughts as usual, LFS! :)
0
 
LVL 1

Author Comment

by:rafaelrgl
ID: 36546687
that's not homework/coursework. This is an project of an website that i need to have the keywords to put on the metatag keywords. So search engines will find my content better. hope i explain.
so give me one day i will test it and get back to u guys. btw. thanks a lots for the quick response and all your help.

0
 
LVL 60

Accepted Solution

by:
Kevin Cross earned 1400 total points
ID: 36546697
You are most welcome. I tested and didn't have any issues with single quotes per LFS's comment, but do let us know if you run into any trouble. See my second post for clarification/refinement of the code to account for punctuation, etc. Things like ampersands that give XML difficult might cause problems, but we can probably get around that with a little work, so again you know where to find us.
0
 
LVL 1

Author Closing Comment

by:rafaelrgl
ID: 36564069
Thanks a lots. you guys are always great. thanks for all the help.
0

Featured Post

Free Tool: IP Lookup

Get more info about an IP address or domain name, such as organization, abuse contacts and geolocation.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

INTRODUCTION: While tying your database objects into builds and your enterprise source control system takes a third-party product (like Visual Studio Database Edition or Red-Gate's SQL Source Control), you can achieve some protection using a sing…
In this article I will describe the Detach & Attach method as one possible migration process and I will add the extra tasks needed for an upgrade when and where is applied so it will cover all.
Video by: ITPro.TV
In this episode Don builds upon the troubleshooting techniques by demonstrating how to properly monitor a vSphere deployment to detect problems before they occur. He begins the show using tools found within the vSphere suite as ends the show demonst…
Want to learn how to record your desktop screen without having to use an outside camera. Click on this video and learn how to use the cool google extension called "Screencastify"! Step 1: Open a new google tab Step 2: Go to the left hand upper corn…

715 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question