I'm working on a project that extracts different components from a webpage and uses them to come up with a set of relevant keywords that summarize the page.
Example Article Title:
Charlie Sheen Fired From 'Two and a Half Men'
Example Article Text:
"After a weeks-long media circus that had Charlie Sheen attacking everything from Alcoholics Anonymous to Two and a Half Men’s co-creator Chuck Lorre, Warner Bros. Television has decided to fire the No. 1 comedy’s star.
"After careful consideration, Warner Bros. Television has terminated Charlie Sheen’s services on Two and a Half Men effective immediately," Warner Bros Television said in a statement Monday.
Reached by TMZ, Sheen, 45, said: "This is very good news. They continue to be in breach, like so many whales. It is a big day of gladness at the Sober Valley Lodge because now I can take all of the bazillions, never have to look at whatshiscock again and I never have to put on those silly shirts for as long as this warlock exists in the terrestrial dimension...."
Example keyword summary:
- sheen - 71
- charlie - 68
- fired - 61
- half - 45
- hollywood - 29
- two - 25
- news - 25
- reporter - 23
- warner - 7
- bros - 7
- television - 7
- cbs - 5
My hope is to create a keyword list that will be relatively accurate in summarizing the article.
So my question to the EE community is: what should I do to make this list better or more accurate?
Any and all ideas or suggestions would be great!
I will be using PHP and MySQL, though I'm open to other suggestions. For this discussion, however, I would like to stick to the logic / algorithm.
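
To make the logic concrete, here is a minimal sketch of one way a weighted keyword list like the one above could be produced. It is written in Python rather than PHP purely to illustrate the algorithm, and it is not necessarily how the scores in the example were computed. The stop-word list, the TITLE_WEIGHT and BODY_WEIGHT values, and the function names are all assumptions made up for this illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {
    "a", "an", "and", "the", "of", "to", "from", "in", "on",
    "has", "had", "that", "is", "it",
}

# Hypothetical weights: a word in the title counts more than one in the body.
TITLE_WEIGHT = 5
BODY_WEIGHT = 1


def tokenize(text):
    """Lowercase the text and return its alphabetic words,
    dropping stop words and single-letter fragments such as the "s" in "Sheen's"."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if len(w) > 1 and w not in STOP_WORDS]


def keyword_scores(title, body, top_n=10):
    """Return up to top_n (word, score) pairs, highest score first."""
    scores = Counter()
    for word in tokenize(title):
        scores[word] += TITLE_WEIGHT
    for word in tokenize(body):
        scores[word] += BODY_WEIGHT
    return scores.most_common(top_n)


title = "Charlie Sheen Fired From 'Two and a Half Men'"
body = (
    "Warner Bros. Television has terminated Charlie Sheen's services "
    "on Two and a Half Men effective immediately."
)

for word, score in keyword_scores(title, body):
    print(f"{word} - {score}")
```

Other page components (meta tags, headings, link text) could be added the same way, each with its own weight. Which weights work well is exactly the kind of thing I would like feedback on.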