Looking for a way to see if two strings are similar (using a better way than similar_text)

Posted on 2005-05-17
Last Modified: 2008-01-09
I've developed a script that grabs news headlines from different websites.
Now I'd like to match similar headlines.

I've tried similar_text, but the results are not good enough.
How does Google News do it?

Is there any code I should use to prepare the strings before comparing them, in order to get a more accurate result? (The content is in Spanish.)

Thank you!!
Question by:sean__sean
    LVL 20

    Expert Comment

    I know of a few elements that play into a probabilistic search (which is what you're trying to accomplish):
    1. strip out noise words (e.g. the, a, an, of, to, from, ...)
    2. normalize words to a common form (returned -> return, spent -> spend, etc.)
    3. also, I'm guessing that capitalized words should get priority in match importance (though I don't know if this carries over into Spanish).
    4. if you have access to a thesaurus, it could also prove helpful, as you could map synonyms onto a single word

    e.g. two articles that both have the word "Japan" in the headline have more in common than two articles that both have "spending" -> "spend" (via the normalization rule) in the title.

    and soundex()
    definitely won't help you on a full-string basis, unless the titles are REALLY similar

    my advice would be:
    1. explode each string
    2. normalize its contents (as per the rules above)
    3. compare the normalized words to come up with a probability match
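
The three steps above can be sketched roughly as follows. This is a minimal Python illustration, not an existing script: the `NOISE_WORDS` and `BASE_FORMS` tables are hypothetical and deliberately tiny (a real setup would need much longer, Spanish-specific lists), and using the share of common words as the "probability" is just one assumed way to do step 3.

```python
import re

# Hypothetical, deliberately tiny tables; real lists would be much longer
# and language-specific (Spanish, in the asker's case).
NOISE_WORDS = {"the", "a", "an", "of", "to", "from"}
BASE_FORMS = {"returned": "return", "spent": "spend", "spending": "spend"}


def normalize(headline):
    """Steps 1 and 2: explode a headline into a set of normalized words."""
    tokens = re.findall(r"\w+", headline.lower())
    return {BASE_FORMS.get(t, t) for t in tokens if t not in NOISE_WORDS}


def overlap(a, b):
    """Step 3: share of normalized words the two headlines have in common (0.0 to 1.0)."""
    words_a, words_b = normalize(a), normalize(b)
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)
```

For example, `overlap("Japan spent a fortune", "Spending in Japan")` compares the sets {japan, spend, fortune} and {spend, in, japan}, which share two of four distinct words.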

    for instance:
    take four articles:
    1. "Koizumi played baseball today."
    2. "Baseball spring practice started today"
    3. "Koizumi was out having fun today."
    4. "cardinals getting ready to play baseball"

    obviously the two articles having the most similarity would be 1 and 3.
    The only two words in common between them are "Koizumi" and "today"
    but today also appears in article 2.

    so you'd want to score the matches, e.g. exact word match = 2 points, exact capitalized word match = 5 points.

    then you'd want to set a similarity threshold above which two headlines count as a match.
    with that scoring:
    articles 1 and 3 have a similarity of 7 (Koizumi and today)
    articles 2 and 3 have a similarity of 2 (today)
    articles 1 and 2 have a similarity of 4 (baseball and today)
    articles 4 and 2 have a similarity of 2 (baseball)

    since this is news, the word "today" should probably be culled as white noise.
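
The scoring in this example could be sketched like this in Python. It is only an illustration under a few assumptions: words are matched case-insensitively, a match earns the 5 points only when the word is capitalized in both headlines, and the function names and the `noise_words` parameter are made up for the sketch.

```python
import re

WORD_POINTS = 2         # exact word match
CAPITALIZED_POINTS = 5  # word that is capitalized in both headlines


def words(headline):
    """Map each lowercased word to True if it appeared capitalized."""
    found = {}
    for token in re.findall(r"\w+", headline):
        key = token.lower()
        found[key] = found.get(key, False) or token[0].isupper()
    return found


def similarity(a, b, noise_words=frozenset()):
    """Sum the points for every word the two headlines share."""
    words_a, words_b = words(a), words(b)
    score = 0
    for word in words_a.keys() & words_b.keys():
        if word in noise_words:
            continue  # culled as noise, e.g. "today" in news headlines
        both_capitalized = words_a[word] and words_b[word]
        score += CAPITALIZED_POINTS if both_capitalized else WORD_POINTS
    return score


headlines = [
    "Koizumi played baseball today.",
    "Baseball spring practice started today",
    "Koizumi was out having fun today.",
    "cardinals getting ready to play baseball",
]
```

With these four headlines, `similarity(headlines[0], headlines[2])` gives the 7 from the example, and passing `noise_words={"today"}` drops it to 5 while the score for articles 2 and 3 falls to 0.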

    hope these ideas are helpful.
    LVL 1

    Author Comment

    Thank you very much for your time.

    What you suggested is what I'm trying to do, but I would rather use an existing script than write it myself.

    I'm pretty sure someone has already done it!
    LVL 7

    Expert Comment

    you might have some success, for example, by taking headline A and searching the headline database for matches using something like a MySQL FULLTEXT index. Headlines containing some of the words from headline A will naturally rank higher than headlines matching NO words. This doesn't tell you how to make GROUPS of matching headlines, but it does tell you how to do pairwise comparisons. You could then use some graph-theoretical methods to group 'neighbour' headlines together, perhaps?
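
One simple graph-theoretical reading of that last step is to treat every pair of headlines scoring at or above a threshold as connected, and take the connected components as groups. The Python sketch below assumes exactly that; `group_headlines` and `shared_words` are hypothetical names, and `shared_words` is only a stand-in for whatever pairwise score is actually used (for instance a relevance value from a full-text search).

```python
from itertools import combinations


def shared_words(a, b):
    """Stand-in pairwise score: number of words the two headlines share."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def group_headlines(headlines, score, threshold):
    """Group headlines into connected components of the 'similar enough' graph."""
    parent = list(range(len(headlines)))

    def find(i):
        # Follow parent links to the representative, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair whose score reaches the threshold.
    for i, j in combinations(range(len(headlines)), 2):
        if score(headlines[i], headlines[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, headline in enumerate(headlines):
        groups.setdefault(find(i), []).append(headline)
    return list(groups.values())
```

Note that components are transitive: if A matches B and B matches C, all three land in one group even when A and C share nothing, so the threshold needs some care.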

    LVL 20

    Accepted Solution

    you could use the similarity values that FULLTEXT supplies to provide your basis.
    LVL 7

    Expert Comment

    that's what i'm saying... :)
    LVL 7

    Expert Comment

    Hey, FWIW, I think the Google News thing probably does this based on the whole document text and not just the document headline. There's just not going to be enough reliable information in the headline alone.

    LVL 14

    Expert Comment

    No comment has been added to this question in more than 21 days, so it is now classified as abandoned.
    I will leave the following recommendation for this question in the Cleanup topic area:
    Accept: virmaior

    Any objections should be posted here in the next 4 days. After that time, the question will be closed.

    EE Cleanup Volunteer
