Solved

select keywords on row

Posted on 2011-09-14
13
186 Views
Last Modified: 2012-05-12
hi, this is difficult for me, but i don't think is for u guys. here we go.

i want an sql query or storeprocedure that select 15 words that repeat more on some of the rows.

sample

id                text
1                 we will eat cheese today
2                 you guys are great developers
3                cheese is the best food ever
4                computer is great for developers
5                 food like cheese is never to eat alone

ok, so the query should give me something like this:  " cheese, developers, is, food, great "

since the text is kind bigger i will only have 10 words most repeated between the range selected.

any ideas?
0
Comment
Question by:rafaelrgl
  • 7
  • 3
  • 2
  • +1
13 Comments
 
LVL 59

Expert Comment

by:Kevin Cross
Comment Utility
Something like this may work. I convert each line into XML and then pull the words apart using XML .nodes() functionality. Once broken into rows, you can do simple aggregate to count the number of occurrences of a word. With a HAVING clause filtering only words that appear more than once, you can grab the top 15 with respect to the count of times they appear in your data set.
select top 15 word
from (
/* pull words apart using xml */
select r.w.value('.', 'varchar(50)') word
from (
   select id
        , convert(xml, 
                  '<word>'+replace(txt, ' ', '</word><word>')+'</word>'
          ) as words
   from sampletbl
) t
cross apply t.words.nodes('/word') r(w)

) derived
group by word
having count(1) > 1
order by count(1) desc
;

Open in new window

0
 
LVL 59

Expert Comment

by:Kevin Cross
Comment Utility
Here are some additional thoughts to making this better:

select top 15 word, count(distinct id) cnt
from (
/* pull words apart using xml */
select id, r.w.value('.', 'varchar(50)') word
from (
   select id
        , convert(xml,
                  '<w>'+replace(
                           replace(
                              replace(
                                 replace(txt, ' ', '|')
                              ,',', '|')
                           , '.', '|')
                        , '|', '</w><w>')+'</w>'

          ) as words
   from sampletbl
) t
cross apply t.words.nodes('/w') r(w)

) derived
group by word
having count(distinct id) > 1
order by cnt desc
;

- replace other punctuation like '.', ',', etc.
- use count(distinct id), so that you don't get hits where a word is repeated several times but all on the same line -- unless of course you want to count that. Based on "that repeat more on some of the rows" I suspect that you want words on multiple rows only.
0
 
LVL 9

Assisted Solution

by:mimran18
mimran18 earned 100 total points
Comment Utility
Hi,
   Here we go.
 
-- Create Table for  Sample Data 
Drop Table Test
Go
CREATE TABLE Test 
( 
ID INT, 
AllNames VARCHAR(100) 
) 
GO 
-- Load Sample Data 
INSERT INTO test SELECT 
1, 'we will eat cheese today' UNION ALL SELECT 
2, 'you guys are great developers'  UNION ALL SELECT 
3, 'cheese is the best food ever' UNION ALL SELECT 
4, 'computer is great for developers'  UNION ALL SELECT 
5, 'food like cheese is never to eat alone'

GO 

-- Verify the Sample Data 
SELECT Id, AllNames 
FROM Test
 
 GO

;WITH Cte AS 
( 
    SELECT 
        id, 
        CAST('<M>' + REPLACE( Allnames, ' ' , '</M><M>') + '</M>' AS XML) AS Names 
    FROM Test 
) 
,Cte2 as (
Select 
    ID, 
    Split.a.value('.', 'VARCHAR(100)') AS Names
FROM Cte 
CROSS APPLY Names.nodes('/M') Split(a))
Select top 15 [Names],COUNT(*) from Cte2
Group By [Names]
Order By COUNT(*) DESC

Open in new window

0
 
LVL 59

Expert Comment

by:Kevin Cross
Comment Utility
@mimran18: all it appears you did was take my suggestion and move it common table expressions?
Anyway, see my notes as you are missing some key pieces like ensuring that you don't get [Names] that only appear ONCE and that you account for a [Names] value appearing multiple times on the same row only.
0
 
LVL 9

Expert Comment

by:mimran18
Comment Utility
@mwvisa1:You are absolutly right,The credit goes to you and
'http://mangalpardeshi.blogspot.com/2009/03/how-to-split-comma-delimited-string.html'

If you are going to remove "top 15" it will give you the records that appear only once.

cheers
0
 
LVL 59

Expert Comment

by:Kevin Cross
Comment Utility
Not really worried about the credit...just wondering why you posted the same solution. That link appears broken for some reason...Might be the quotes, I will try again. The TOP 15 is not the issue. The issue is that ordering by COUNT(*) DESC alone assumes that you have 15 words that appear more than once. To ensure that invalid entries don't make it into the TOP 15, you need to add a HAVING or similar filter on COUNT() to make sure it is greater than 1.
0
Enabling OSINT in Activity Based Intelligence

Activity based intelligence (ABI) requires access to all available sources of data. Recorded Future allows analysts to observe structured data on the open, deep, and dark web.

 
LVL 9

Expert Comment

by:mimran18
Comment Utility
ya, you are right will alter it.
0
 
LVL 59

Expert Comment

by:Kevin Cross
Comment Utility
By the way, if you do not have permission of the Author of that blog, then it is inappropriate to copy solution verbatim here. As you can see, the technique is well-known, so that is not the issue. What is the issue is you didn't cite the source. Please be careful of this; the Moderators tend to frown on such activity. :)

Anyway, no harm no foul as we have citation of the source in the thread now.
0
 
LVL 50

Assisted Solution

by:Lowfatspread
Lowfatspread earned 50 total points
Comment Utility
will the text contain personal names?

if so i don't think the xml split approach will work... since you would need to deal with O'Neil , O'brian etc...

if you need to deal with contractions (isn't , is not ... haven't have not) ..?

then you would need to write you own word split function beforehand...

do you want to exclude numbers ? ... similarly


could you expand on why you are doing this and we maybe able to assist further?

ps
please confirm this isn't homework/coursework...
0
 
LVL 59

Expert Comment

by:Kevin Cross
Comment Utility
Good thoughts as usual, LFS! :)
0
 
LVL 1

Author Comment

by:rafaelrgl
Comment Utility
that's not homework/coursework. This is an project of an website that i need to have the keywords to put on the metatag keywords. So search engines will find my content better. hope i explain.
so give me one day i will test it and get back to u guys. btw. thanks a lots for the quick response and all your help.

0
 
LVL 59

Accepted Solution

by:
Kevin Cross earned 350 total points
Comment Utility
You are most welcome. I tested and didn't have any issues with single quotes per LFS's comment, but do let us know if you run into any trouble. See my second post for clarification/refinement of the code to account for punctuation, etc. Things like ampersands that give XML difficult might cause problems, but we can probably get around that with a little work, so again you know where to find us.
0
 
LVL 1

Author Closing Comment

by:rafaelrgl
Comment Utility
Thanks a lots. you guys are always great. thanks for all the help.
0

Featured Post

6 Surprising Benefits of Threat Intelligence

All sorts of threat intelligence is available on the web. Intelligence you can learn from, and use to anticipate and prepare for future attacks.

Join & Write a Comment

Suggested Solutions

Data architecture is an important aspect in Software as a Service (SaaS) delivery model. This article is a study on the database of a single-tenant application that could be extended to support multiple tenants. The application is web-based develope…
In this article I will describe the Backup & Restore method as one possible migration process and I will add the extra tasks needed for an upgrade when and where is applied so it will cover all.
Get a first impression of how PRTG looks and learn how it works.   This video is a short introduction to PRTG, as an initial overview or as a quick start for new PRTG users.
Access reports are powerful and flexible. Learn how to create a query and then a grouped report using the wizard. Modify the report design after the wizard is done to make it look better. There will be another video to explain how to put the final p…

728 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

12 Experts available now in Live!

Get 1:1 Help Now