Solved

Keyword algorithm to understand a web page.

Posted on 2011-03-23
Last Modified: 2013-11-15
Hello,

I'm working on a project that will extract different components from a webpage to try to come up with a set of relevant keywords to summarize the page.

Example URL:
http://www.hollywoodreporter.com/news/charlie-sheen-fired-two-a-165014

Example Title:
Charlie Sheen Fired From 'Two and a Half Men'

Example Article Text:
"After a weeks-long media circus that had Charlie Sheen attacking everything from Alcoholics Anonymous to Two and a Half Men’s co-creator Chuck Lorre, Warner Bros. Television has decided to fire the No. 1 comedy’s star.
"After careful consideration, Warner Bros. Television has terminated Charlie Sheen’s services on Two and a Half Men effective immediately," Warner Bros  Television said in a statement Monday.
Reached by TMZ, Sheen, 45, said:  "This is very good news. They continue to be in breach, like so many whales. It is a big day of gladness at the Sober Valley Lodge because now I can take all of the bazillions, never have to look at whatshiscock again and I never have to put on those silly shirts for as long as this warlock exists in the terrestrial dimension...."

Example keyword summary:
- sheen - 71
- charlie - 68
- fired - 61
- half - 45
- hollywood - 29
- two - 25
- news - 25
- reporter - 23
- warner - 7
- bros - 7
- television - 7
- cbs - 5

My hope is to create a keyword list that will be relatively accurate in summarizing the article.

So my question to the EE community is what are some things I should do to make this list better or more accurate?

Any and all ideas, or suggestions will be great!

I will be using PHP and MySQL, though I'm open to suggestions. For this discussion, however, I would like to stick to the logic / algorithm.

Thanks.
Question by:jambla
11 Comments
 
LVL 11

Expert Comment

by:lenordiste
You could modify the algorithm to handle groups of keywords, since groups are more relevant. For instance, instead of having two keywords "warner" and "bros", you could have just one, "warner bros", which is more meaningful.

To achieve this, as soon as you find two identical words, say at positions A and B, look at the first word after A and see if it matches the first word after B. If so, then start searching for the combined words, and repeat the process until your document is processed.

Also, it might be a good idea to store some words in a table in your database to systematically exclude them from the result. For instance numbers "one","two","three", and adjectives "big","small" are not really interesting keywords and should be removed.
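A minimal sketch of the exclusion-list idea, written in Python for brevity (the asker plans to use PHP, where an associative array would play the same role). The function name `keyword_counts` and the small `STOP_WORDS` set are hypothetical; in the design described above the list would live in a database table.

```python
import re
from collections import Counter

# Hypothetical exclusion list; the thread suggests storing it in a database table.
STOP_WORDS = {"a", "and", "the", "from", "to", "of",
              "one", "two", "three", "big", "small"}

def keyword_counts(text, stop_words=STOP_WORDS):
    """Count lowercase words in text, skipping anything in stop_words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(w for w in words if w not in stop_words)

counts = keyword_counts(
    "Charlie Sheen fired from Two and a Half Men. Sheen said this is a big day."
)
```

Note the trade-off: excluding "two" also drops a word from the show title "Two and a Half Men", which is one reason to group words before filtering them.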
 
LVL 37

Expert Comment

by:TommySzalapski
You basically need some kind of massive dictionary that has all the words and combinations of words that you might see, each with a number that says how interesting it is. Then you can scale that by the count of each word in the article divided by the average count of that word in all the articles. Most of this type of work has been done somewhere. Web data mining is a popular research topic.
 

Author Comment

by:jambla
Hello,

@lenordiste thanks for the ideas.  Yeah, grouping words will be something important for sure.  I will look more into that.  I know that having a list of non-important words is something I will have to do, but I have been holding off on that for a bit.  If I have a list of words that are not important, I limit myself to one language or to a few languages.  I would like a rough algorithm that would not be dependent on a specific language.  Probably a tall task, I know.


@TommySzalapski thanks for the ideas.  Although your idea is sound, I think the task of doing this would be immense!  Another problem with this idea is that it limits the keywords to a specific language or a group of languages.  While I am in this thought/planning stage I want to try to think of something a little more elegant that would include other or all languages.  I might not be able to do this, but for round 1 I will try.


Any other suggestions or ideas about this?  I'm very anxious to hear them!
 
LVL 37

Expert Comment

by:TommySzalapski
"I would like a rough algorithm that would not be dependent on a specific language."
Unfortunately, this is basically impossible. How is the algorithm to know how important a word is? Here is one thing you could do. Get the local support of each word on a page (local support = countOfWordOnPage/totalWordsOnPage) and compare it to the global support for the same word (global support = average local support for all pages).
That will give you an idea of how 'interesting' the word is. The only problem with that is that if someone uses a word like "It was superbamendous" then that word will look very important, but it will pull out proper names fairly well. Of course common names like John in English or Vimal in Hindi won't seem as important.
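A rough sketch of the two supports, assuming the pages are already available as plain strings. It is in Python rather than PHP, and the function names (`local_support`, `global_support`, `interest`) are made up for illustration. The score here is simply local support divided by global support, which is one possible way to "compare" the two.

```python
import re
from collections import Counter

def local_support(text):
    """Map each word to countOfWordOnPage / totalWordsOnPage."""
    words = re.findall(r"\w+", text.lower())
    total = len(words)
    if total == 0:
        return {}
    return {w: c / total for w, c in Counter(words).items()}

def global_support(pages):
    """Average local support of each word over all pages (0 where absent)."""
    totals = Counter()
    for page in pages:
        for word, support in local_support(page).items():
            totals[word] += support
    return {w: s / len(pages) for w, s in totals.items()}

def interest(page, pages):
    """Local support divided by global support; higher = more page-specific."""
    glob = global_support(pages)
    return {w: s / glob[w] for w, s in local_support(page).items() if w in glob}

pages = ["sheen fired sheen", "the news the day", "the sheen"]
scores = interest(pages[0], pages)
```

With this ratio, a word that appears on only one of N pages always scores N, which is the "superbamendous" effect described above: rare words and one-off inventions rank as highly as proper names.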
 
LVL 37

Expert Comment

by:TommySzalapski
If you can get your hands on a large database of names used then you could add that to your dictionary to compare against. This would also help you to group first and last names since you could find them more easily.

If you look around using terms like "Text Mining", "Semantic Analysis", "Web Data Mining", etc, you may find a list similar to the one I mentioned, but if you are determined to start from scratch, then you'll have to go with something more simple (like the supports I just mentioned).

Your problem is a highly researched academic topic. Many people smarter than either of us have been working on similar things for a while.

 
LVL 11

Expert Comment

by:lenordiste
I really like the idea of not using a dictionary. It's not impossible; it just changes how you look at the problem and the level of quality you're expecting from the algorithm. One fun way to determine if a group of words is important might be to compare it to other keywords found in articles, a bit like what Google does for ranking websites.

Also you could take user feedback into account, allowing them to rate how relevant keywords are.
 

Author Comment

by:jambla
@TommySzalapski  Thanks for your lengthy response.  It was very insightful.  I will do some data crunching on "local support = countOfWordOnPage/totalWordsOnPage" and "global support = average local support for all pages".  Also I will Google terms like "Text Mining", "Semantic Analysis" and "Web Data Mining" to see if I can find any research papers or anything else to help me.

@lenordiste  "Also you could take user feedback into account, allowing them to rate how relevant keywords are."  That's interesting, creating a user feedback system to improve the quality of the keywords is worth a look into.

Over the past day or so I have found that I can create a more accurate result by combining the most frequently used words with word groups.  Words like Tom and Cruise are relatively meaningless alone, but grouped together they are of course "Tom Cruise", which has a significant meaning.  Giving weight to article titles and URL words also helps to improve the results, from what I have seen.

So from the above article we would get a result something like this:

Title: Charlie Sheen Fired From 'Two and a Half Men'
URL: http://www.hollywoodreporter.com/news/charlie-sheen-fired-two-a-165014
Text: "After a weeks-long media circus that had Charlie Sheen attacking everything from Alcoholics Anonymous to Two and a Half Men’s co-creator Chuck Lorre, Warner Bros. Television has decided to fire the No. 1 comedy’s star.
"After careful consideration, Warner Bros. Television has terminated Charlie Sheen’s services on Two and a Half Men effective immediately," Warner Bros  Television said in a statement Monday.
Reached by TMZ, Sheen, 45, said:  "This is very good news. They continue to be in breach, like so many whales. It is a big day of gladness at the Sober Valley Lodge because now I can take all of the bazillions, never have to look at whatshiscock again and I never have to put on those silly shirts for as long as this warlock exists in the terrestrial dimension...."


Just by looking at this example we can see that these words are used more:

- sheen - 71, - charlie - 68, - fired - 61, - half - 45, - hollywood - 29, - two - 25, - news - 25, - reporter - 23, - warner - 7, - bros - 7, - television - 7, - cbs - 5


And if we combine groups of words that happen often we see things like "Charlie Sheen", "Two and a Half Men", "Warner Bros"


Getting the word frequencies is quite easy, but now I'm trying to write a script that will look at groupings of words within the article.  If you have any advice, suggestions or code, that would be fantastic!

Thanks again.
 
LVL 11

Accepted Solution

by:
lenordiste earned 250 total points
It's kind of hard to explain without actual code (I don't have VS on this computer), but grouping words is a bit like lossless data compression, for instance LZW compression. Here is an example: http://www.cs.cf.ac.uk/Dave/Multimedia/node214.html#SECTION04247000000000000000

The difference is that your test case for when you want to add a word to your dictionary is that the word (or word group) must occur at least twice.
 
LVL 37

Assisted Solution

by:TommySzalapski
TommySzalapski earned 250 total points
What he's suggesting is to track every pair of words. If you see the same pair twice, add it to the list of pairs and also start tracking three-word combos that start with that pair, and so on. (That's kind of how LZW works.) This is a pretty decent way of doing it. I would suggest using a hash table of some kind to look up the word pairs (and the words in general), since the list of words will get long. .NET has a Dictionary class that will be excellent for this. The algorithm would be something like:

for each word
  s = that word + the next word
  CheckString(s)
  remove all but last word from s
end for

CheckString(string s)
If s is in dictionary
  add 1 to count of s
  if there is a next word in file, CheckString(s + next word in file)
else
  add s to dictionary with count of 1
end if
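The pseudocode above could be sketched as runnable code roughly like this. It is Python, not the .NET the answer mentions or the PHP the asker uses, and the name `count_phrases` is made up. The recursion is unrolled into a loop, and an end-of-text check is added as an assumption about the intended behavior.

```python
from collections import Counter

def count_phrases(words):
    """LZW-style phrase growth: extend a phrase only after it was seen before."""
    seen = Counter()
    for i in range(len(words) - 1):
        end = i + 2                       # phrase is words[i:end], starting as a pair
        while end <= len(words):          # stop at the end of the text
            phrase = " ".join(words[i:end])
            seen[phrase] += 1
            if seen[phrase] == 1:         # first sighting: record it, stop growing
                break
            end += 1                      # seen before: also try the longer phrase
    return seen

phrases = count_phrases(
    "warner bros television said warner bros television".split()
)
```

Because a longer phrase is only recorded once its shorter prefix has repeated, the count of a three-word phrase lags one behind its true number of occurrences. That is harmless for ranking but worth knowing when reading the numbers.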
 

Author Comment

by:jambla
I have been reading about data mining for the past day or so.  Very interesting.  There have been a few people that have tried what I have been thinking about.  However they went to a level that I can't reach for sure.  But I'm not interested in getting anything that's like 90+% accurate.  I'm looking for a group of keywords that will relate to the article.

@TommySzalapski  I like the algorithm that you have there.

I am not a strong PHP programmer.  Right now, I have a script that will look at the URL and the title (if there is an H1, H2 or H3 present).  I give points (maybe x2) to each separate word in the URL (after the suffix) and in the title.  Then I count the words and give a point each time they occur on the page.
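The scoring just described might look something like the following sketch. It is Python rather than PHP, `score_page` and `boost` are hypothetical names, and "after the suffix" is read here as "the URL path after the host name", which is an assumption.

```python
import re
from collections import Counter
from urllib.parse import urlparse

def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score_page(url, title, body, boost=2):
    """One point per occurrence in the body, plus `boost` per URL-path or title word."""
    scores = Counter(tokenize(body))
    path_words = tokenize(urlparse(url).path)   # skip the host name, keep the path
    for word in path_words + tokenize(title):
        scores[word] += boost
    return scores

scores = score_page(
    "http://www.hollywoodreporter.com/news/charlie-sheen-fired-two-a-165014",
    "Charlie Sheen Fired From 'Two and a Half Men'",
    "Warner Bros. Television has terminated Charlie Sheen's services",
)
```

In this example "sheen" gets 1 point from the body, 2 from the URL and 2 from the title, while "warner" only gets its single body point.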

My problem now is I don't know how to code a word grouping counter.

Example Text:  "After a weeks long media circus that had Charlie Sheen..."

It would start with "After + a" (the first two words in the text) and search for any other "After a" matches.  If there is a match then the count will go up by one (++).  Then it will add one word to the string, giving "After a weeks", and do the same.  It will do this maybe 5 times...

1. After + a
2. After + a + weeks
3. After + a + weeks + long
4. After + a + weeks + long + media
5. After + a + weeks + long + media + circus

Then it would drop "After" and start back at step one:

1. a + weeks
2. a + weeks + long
3. a + weeks + long + media

and so on...

As I can imagine, this could take some time, so I wouldn't do it on the fly but would rather run it in the background, possibly as a cron job or something.

Is there anyone out there that could show me what something like this would look like in PHP?

I hope I haven't worn out my welcome yet.

If necessary, I could compensate you with a small gift possibly an Amazon wish list item or something.  Please forgive me if this is against the TOS.  I didn't read them and I'm new here.
 
LVL 37

Expert Comment

by:TommySzalapski
It actually wouldn't take all that long (especially if you are using some kind of hash table or dictionary). Your idea seems okay (my last one works best for very large datasets, so maybe not as good for individual articles).
I would modify your approach like this:
Since you are only concerned with up to 5 words, just have a dictionary/list for each size. So each time you look at the next word, you add that word to the current word group for each size and remove the first word from each group. Then you add the group to the correct dictionary or increment the count if it's there already.
In this way, you only have to loop through the article once so it will be very fast (assuming you keep the dictionaries in some good way). If you do it in Java or C# (or any .NET language), then I know there is a Dictionary class that will work perfectly for it. I don't know PHP at all, there might be something there.
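A minimal single-pass sketch of that approach, with one counter per group size. It is written in Python (in PHP, plain associative arrays keyed by the phrase would presumably serve the same purpose), and `ngram_counts` and `max_size` are illustrative names.

```python
from collections import Counter, deque

def ngram_counts(words, max_size=5):
    """Count every word group of 2..max_size words in one pass over the article."""
    counts = {n: Counter() for n in range(2, max_size + 1)}
    window = deque(maxlen=max_size)       # holds the most recent max_size words
    for word in words:
        window.append(word)               # the oldest word drops off automatically
        recent = list(window)
        for n in range(2, len(recent) + 1):
            counts[n][" ".join(recent[-n:])] += 1   # group of n words ending here
    return counts

counts = ngram_counts("charlie sheen said charlie sheen".split())
# counts[2].most_common(1) gives the most frequent two-word group
```

Each word triggers at most `max_size - 1` dictionary updates, so the work grows linearly with the length of the article.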

To address your TOS question:
This question asked for an algorithm, not code. The best way to handle this would be to accept the solutions in this thread that gave you the algorithm you wanted, then ask a related question in the PHP zone asking for help coding the algorithm.
Giving a gift isn't totally against the ideas in the TOS, but if you want to give compensation for work here, the correct way is to hit the "Hire Me" button (if it's there) in the profile of an expert that you think could/would do it. I can't write PHP, so don't click mine. Go to the PHP zone and look at the stats for who got a lot of points recently (month or year) and ask them.
I'd just post the question in that zone first, though. It won't be a long script, and you will probably find someone who will do it for 500 points.