Keyword algorithm to understand a web page.

Hello,

I'm working on a project that will extract different components from a webpage to try to come up with a set of relevant keywords to summarize the page.

Example URL:
http://www.hollywoodreporter.com/news/charlie-sheen-fired-two-a-165014

Example Title:
Charlie Sheen Fired From 'Two and a Half Men'

Example Article Text:
"After a weeks-long media circus that had Charlie Sheen attacking everything from Alcoholics Anonymous to Two and a Half Men’s co-creator Chuck Lorre, Warner Bros. Television has decided to fire the No. 1 comedy’s star.
"After careful consideration, Warner Bros. Television has terminated Charlie Sheen’s services on Two and a Half Men effective immediately," Warner Bros Television said in a statement Monday.
Reached by TMZ, Sheen, 45, said: "This is very good news. They continue to be in breach, like so many whales. It is a big day of gladness at the Sober Valley Lodge because now I can take all of the bazillions, never have to look at whatshiscock again and I never have to put on those silly shirts for as long as this warlock exists in the terrestrial dimension...."

Example keyword summary:
- sheen - 71
- charlie - 68
- fired - 61
- half - 45
- hollywood - 29
- two - 25
- news - 25
- reporter - 23
- warner - 7
- bros - 7
- television - 7
- cbs - 5

My hope is to create a keyword list that will be relatively accurate in summarizing the article.

So my question to the EE community is what are some things I should do to make this list better or more accurate?

Any and all ideas or suggestions would be great!

I will be using PHP and MySQL, though I'm open to suggestions. For this discussion, however, I would like to stick to the logic / algorithm.

Thanks.

lenordiste

You could modify the algorithm to handle groups of keywords, since they will be more relevant. For instance, instead of having two keywords "warner" and "bros", you could have just one, "warner bros", which is more meaningful.

To achieve this, as soon as you find two identical words, say at positions A and B, look at the first word after A and see if it matches the first word after B. If so, then start searching for the combined words and repeat the process until your document is processed.
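Here is a rough sketch of that position-matching idea in Python (the thread itself has no code, so this is only one possible reading of it). The function name `repeated_phrases`, the `max_len` cap and the simple `\w+` tokenizer are assumptions made for illustration.

```python
import re
from collections import defaultdict


def repeated_phrases(text, max_len=5):
    """Return the set of multi-word runs that start at a repeated word."""
    words = re.findall(r"\w+", text.lower())

    # Record every position at which each word occurs.
    positions = defaultdict(list)
    for i, word in enumerate(words):
        positions[word].append(i)

    phrases = set()
    for spots in positions.values():
        # Compare each occurrence (A) with each later occurrence (B).
        for n, a in enumerate(spots):
            for b in spots[n + 1:]:
                length = 1
                # Extend while the word after A matches the word after B.
                while (length < max_len
                       and b + length < len(words)
                       and words[a + length] == words[b + length]):
                    length += 1
                if length > 1:
                    phrases.add(" ".join(words[a:a + length]))
    return phrases
```

Note that this also reports contained runs (for example "bros television" next to "warner bros television"), so you would probably want to keep only the longest match or filter afterwards.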

Also, it might be a good idea to store some words in a table in your database so you can systematically exclude them from the result. For instance, numbers ("one", "two", "three") and adjectives ("big", "small") are not really interesting keywords and should be removed.

TommySzalapski

You basically need some kind of massive dictionary that has all the words and combinations of words you might see, each with a number that says how interesting it is. Then you can scale that by the count of each word in the article divided by the average count of that word across all the articles. Most of this type of work has been done somewhere. Web data mining is a popular research topic.

jambla

ASKER

Hello,

@lenordiste thanks for the ideas. Yeah, grouping words will be important for sure, and I will look into it more. I know that a list of unimportant words is something I will have to build, but I have been holding off on that for a bit. If I rely on such a list, I limit myself to one language or to a few languages. I would like a rough algorithm that would not be dependent on a specific language. Probably a tall task, I know.

@TommySzalapski thanks for the ideas. Although your idea is sound, I think the task of doing this would be immense! Another problem is that it limits the keywords to a specific language or group of languages. While I am in this thought/planning stage, I want to try to think of something a little more elegant that would cover other, or all, languages. I might not be able to do this, but for round 1 I will try.

Any other suggestions or ideas about this? I'm very anxious to hear them!

TommySzalapski

"I would like a rough algorithm that would not be dependent on a specific language."
Unfortunately, this is basically impossible. How is the algorithm to know how important a word is? Here is one thing you could do: get the local support of each word on a page (local support = countOfWordOnPage/totalWordsOnPage) and compare it to the global support for the same word (global support = average local support for all pages).
That will give you an idea of how 'interesting' the word is. The one problem is that if someone writes something like "It was superbamendous", the made-up word will look very important. On the other hand, it will pull out proper names fairly well, although common names like John in English or Vimal in Hindi won't seem as important.
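To make the two supports concrete, here is a minimal sketch in Python. The function names are made up for illustration, and dividing local support by global support is just one assumed way to "compare" the two.

```python
import re
from collections import Counter


def local_support(text):
    """Map each word to countOfWordOnPage / totalWordsOnPage."""
    words = re.findall(r"\w+", text.lower())
    total = len(words)
    if total == 0:
        return {}
    return {word: count / total for word, count in Counter(words).items()}


def global_support(pages):
    """Average local support of each word across all pages."""
    totals = Counter()
    for page in pages:
        for word, support in local_support(page).items():
            totals[word] += support
    return {word: total / len(pages) for word, total in totals.items()}


def interest(page, global_sup):
    """Assumed comparison: local support divided by global support."""
    return {
        word: support / global_sup[word]
        for word, support in local_support(page).items()
        if word in global_sup
    }
```

A word that shows up on every page at about the same rate scores close to 1, while a word concentrated on one page scores higher. A one-off invented word would also score high, as noted above.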

TommySzalapski

If you can get your hands on a large database of names used then you could add that to your dictionary to compare against. This would also help you to group first and last names since you could find them more easily.

If you look around using terms like "Text Mining", "Semantic Analysis", "Web Data Mining", etc., you may find a list similar to the one I mentioned. If you are determined to start from scratch, though, you'll have to go with something simpler (like the supports I just mentioned).

Your problem is a highly researched academic topic. Many people smarter than either of us have been working on similar things for a while.

lenordiste

I really like the idea of not using a dictionary. It's not impossible; it just changes how you look at the problem and the level of quality you're expecting from the algorithm. One fun way to determine whether a group of words is important might be to compare it to the keywords found in other articles, a bit like what Google does for ranking websites.

Also you could take user feedback into account, allowing them to rate how relevant keywords are.

jambla

ASKER

@TommySzalapski Thanks for your lengthy response. It was very insightful. I will do some data crunching on "local support = countOfWordOnPage/totalWordsOnPage" and "global support = average local support for all pages". I will also Google terms like "Text Mining", "Semantic Analysis" and "Web Data Mining" to see if I can find any research papers or anything else to help me.

@lenordiste "Also you could take user feedback into account, allowing them to rate how relevant keywords are." That's interesting. Creating a user feedback system to improve the quality of the keywords is worth looking into.

Over the past day or so I have found that I can get a more accurate result by combining the most-used words with word groups. Words like "Tom" and "Cruise" are relatively meaningless alone, but grouped together they are of course "Tom Cruise", which has a significant meaning. Giving extra weight to words in the article title and URL also helps to improve the results, from what I have seen.

So from the above article we would get a result something like this:

Title: Charlie Sheen Fired From 'Two and a Half Men'
URL: http://www.hollywoodreporter.com/news/charlie-sheen-fired-two-a-165014
Text: "After a weeks-long media circus that had Charlie Sheen attacking everything from Alcoholics Anonymous to Two and a Half Men’s co-creator Chuck Lorre, Warner Bros. Television has decided to fire the No. 1 comedy’s star.
"After careful consideration, Warner Bros. Television has terminated Charlie Sheen’s services on Two and a Half Men effective immediately," Warner Bros Television said in a statement Monday.
Reached by TMZ, Sheen, 45, said: "This is very good news. They continue to be in breach, like so many whales. It is a big day of gladness at the Sober Valley Lodge because now I can take all of the bazillions, never have to look at whatshiscock again and I never have to put on those silly shirts for as long as this warlock exists in the terrestrial dimension...."

Just by looking at this example we can see that these words are used more:

- sheen - 71
- charlie - 68
- fired - 61
- half - 45
- hollywood - 29
- two - 25
- news - 25
- reporter - 23
- warner - 7
- bros - 7
- television - 7
- cbs - 5

And if we combine groups of words that occur together often, we see things like "Charlie Sheen", "Two and a Half Men" and "Warner Bros".

Getting the word frequencies is quite easy, but now I'm trying to write a script that will look at groupings of words within the article. If you have any advice, suggestions or code, that would be fantastic!

Thanks again.

ASKER CERTIFIED SOLUTION

lenordiste

This solution is only available to members.

SOLUTION

TommySzalapski

This solution is only available to members.

jambla

ASKER

I have been reading about data mining for the past day or so. Very interesting. A few people have tried what I have been thinking about, but they went to a level that I certainly can't reach. I'm not aiming for anything like 90+% accuracy, though. I'm looking for a group of keywords that will relate to the article.

@TommySzalapski I like the algorithm that you have there.

I am not a strong PHP programmer. Right now, I have a script that will look at the URL and the title (if there is an H1, H2 or H3 present). I give points (maybe x2) to each separate word in the URL (after the suffix) and in the title. Then I count the words on the page, giving a point each time one occurs.
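For reference, the scoring described above could look roughly like this in Python (not PHP, and only a sketch). The `boost` value of 2 mirrors the "maybe x2" above. Reading "after the suffix" as "the path part of the URL" is an assumption, as are the function names.

```python
import re
from collections import Counter
from urllib.parse import urlparse


def tokenize(text):
    """Lowercase the text and split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score_words(url, title, body, boost=2):
    """One point per occurrence in the body, plus a boost for URL and title words."""
    scores = Counter(tokenize(body))
    # Assumption: only the path counts, so the domain itself is ignored.
    path = urlparse(url).path
    for word in tokenize(path) + tokenize(title):
        scores[word] += boost
    return scores
```

With the example article, "sheen" would pick up the boost twice (once from the URL, once from the title) on top of its count in the text, while "warner" would only get its body count.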

My problem now is I don't know how to code a word grouping counter.

Example Text: "After a weeks long media circus that had Charlie Sheen..."

It would start with "After + a" (the first two words in the text) and search for any other "After a" matches. If there is a match, the count goes up by one (++). Then it would add one word to the string, giving "After a weeks", and do the same. It would repeat this maybe 5 times...

1. After + a
2. After + a + weeks
3. After + a + weeks + long
4. After + a + weeks + long + media
5. After + a + weeks + long + media + circus

Then it would drop "After" and start back at step one:

1. a + weeks
2. a + weeks + long
3. a + weeks + long + media

and so on...

As I can imagine, this could take some time, so I wouldn't do it on the fly but would rather run it in the background, possibly as a cron job or something.

Is there anyone out there that could show me what something like this would look like in PHP?

I hope I haven't worn out my welcome yet.

If necessary, I could compensate you with a small gift possibly an Amazon wish list item or something. Please forgive me if this is against the TOS. I didn't read them and I'm new here.

TommySzalapski

It actually wouldn't take all that long (especially if you are using some kind of hash table or dictionary). Your idea seems okay (my last one works best for very large datasets, so maybe not as well for individual articles).
I would modify your approach like this:
Since you are only concerned with up to 5 words, just have a dictionary/list for each size. So each time you look at the next word, you add that word to the current word group for each size and remove the first word from each group. Then you add the group to the correct dictionary or increment the count if it's there already.
In this way, you only have to loop through the article once, so it will be very fast (assuming you keep the dictionaries in some efficient way). If you do it in Java or C# (or any .NET language), then I know there is a Dictionary class that will work perfectly for it. I don't know PHP at all, but there might be something similar there.
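A minimal sketch of that single-pass idea, written in Python rather than PHP: one dictionary of counts and one rolling window per group size. The name `count_word_groups` and the `\w+` tokenizer are illustrative assumptions.

```python
import re
from collections import Counter, deque


def count_word_groups(text, max_size=5):
    """Count every run of 2..max_size consecutive words in one pass."""
    words = re.findall(r"\w+", text.lower())
    sizes = range(2, max_size + 1)
    # One dictionary of counts and one rolling window per group size.
    counts = {size: Counter() for size in sizes}
    windows = {size: deque(maxlen=size) for size in sizes}

    for word in words:
        for size, window in windows.items():
            # A full deque drops its oldest word when a new one is appended.
            window.append(word)
            if len(window) == size:
                counts[size][" ".join(window)] += 1
    return counts
```

For example, `count_word_groups(text)[2].most_common(10)` would list the ten most frequent two-word groups. This simple version lets groups run across sentence boundaries, which you may want to prevent by resetting the windows at punctuation.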

To address your TOS question:
This question asked for an algorithm, not code. The best way to handle this would be to accept the solutions in this thread that gave you the algorithm you wanted, then ask a related question in the PHP zone asking for help coding the algorithm.
Giving a gift isn't totally against the ideas in the TOS, but if you want to give compensation for work here, the correct way is to hit the "Hire Me" button (if it's there) in the profile of an expert that you think could/would do it. I can't write PHP, so don't click mine. Go to the PHP zone and look at the stats for who got a lot of points recently (month or year) and ask them.
I'd just post the question in that zone first, though. It won't be a long script, and you will probably find someone who will do it for 500 points.