

Looking for a way to see if two strings are similar (using a better way than similar_text)

Posted on 2005-05-17
Medium Priority
Last Modified: 2008-01-09
I've developed a script that grabs news headlines from different websites.
Now I'd like to match similar headlines.

I've tried similar_text, but the results are not good enough.
How does Google News do it?

Is there any code I should use to prepare the strings before comparing them, in order to get more accurate results? (The content is in Spanish.)

Thank you!!
Question by:sean__sean

Expert Comment

ID: 14019790
I know of a few elements that play into a probabilistic search (which is what you're trying to accomplish):
1. strip out white noise words, i.e. stop words (e.g. the, a, an, of, to, from, ...)
2. normalize words to a common form (returned -> return, spent -> spend, etc.)
3. also, I'm guessing that capitalized words should get priority in match importance (though I don't know if this carries over into Spanish).
4. if you have access to a thesaurus, it could also prove helpful, as you could map words onto their synonyms

e.g. two articles that both have the word Japan in the headline have more in common than two articles that both have the word "spending" (normalized to "spend") in the title.

and soundex() definitely won't help you on a full-string basis, unless the titles are REALLY similar

my advice would be:
1. explode each string
2. normalize its contents (as per the rules above)
3. compare the normalized words to come up with a probability match
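
The first two steps could be sketched roughly as below. This is only an illustration, not a finished tool: normalizeHeadline() is a made-up helper name, the stop-word list is a tiny sample rather than a complete Spanish list, and it assumes PHP 7+ with the mbstring extension. Reducing words to a common form (stemming) is left out, since that would need a Spanish stemmer.

```php
<?php
// Hypothetical helper: split a headline into lowercase words and drop noise words.
function normalizeHeadline(string $headline, array $stopWords): array
{
    // Lowercase with multibyte support so accented letters are handled.
    $lower = mb_strtolower($headline, 'UTF-8');

    // "Explode" on anything that is not a letter or digit (Unicode-aware).
    $words = preg_split('/[^\p{L}\p{N}]+/u', $lower, -1, PREG_SPLIT_NO_EMPTY);

    // Remove the noise words and re-index the result.
    return array_values(array_diff($words, $stopWords));
}

// Tiny illustrative sample of Spanish stop words (an assumption, not a full list).
$stopWords = ['el', 'la', 'los', 'las', 'de', 'del', 'en', 'un', 'una', 'y', 'a'];

print_r(normalizeHeadline('El presidente visita la ciudad de Madrid', $stopWords));
```

The returned word lists are what step 3 would then compare.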

for instance:
take four articles:
1. "Koizumi played baseball today."
2. "Baseball spring practice started today"
3. "Koizumi was out having fun today."
4. "cardinals getting ready to play baseball"

obviously the two articles having the most similarity would be 1 and 3.
The only two words in common between them are "Koizumi" and "today"
but today also appears in article 2.

so you'd want to score the matching, e.g. precise word match = 2 points, precise capitalized word match = 5 points.

then you'd want to set a threshold of similarity.
so then articles 1 and 3 have a similarity of 7
articles 2 and 3 have a similarity of 2 (today)
articles 1 and 2 have a similarity of 4 (baseball and today)
articles 4 and 2 have a similarity of 2 (baseball)

since this is news, the word "today" should probably be culled as white noise.
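
A rough sketch of that scoring, for anyone who wants to try it. The function names are made up, the weights are just the 2 and 5 from above, and I'm assuming a word only earns the capitalized score when it is capitalized in both headlines (that is what makes "baseball" vs. "Baseball" worth 2). Assumes PHP 7+ with mbstring; no stemming is done.

```php
<?php
// Weights from the example above: plain match = 2, capitalized match = 5.
const PLAIN_MATCH = 2;
const CAPITALIZED_MATCH = 5;

// Map each lowercased word of a headline to true if it appeared capitalized.
function wordFlags(string $headline, array $noise): array
{
    $flags = [];
    preg_match_all('/[\p{L}\p{N}]+/u', $headline, $matches);
    foreach ($matches[0] as $word) {
        $key = mb_strtolower($word, 'UTF-8');
        if (in_array($key, $noise, true)) {
            continue; // culled as noise
        }
        $first = mb_substr($word, 0, 1, 'UTF-8');
        $isCapitalized = $first !== mb_strtolower($first, 'UTF-8');
        $flags[$key] = ($flags[$key] ?? false) || $isCapitalized;
    }
    return $flags;
}

// Sum the points for every word the two headlines share.
function similarityScore(string $a, string $b, array $noise = []): int
{
    $flagsA = wordFlags($a, $noise);
    $flagsB = wordFlags($b, $noise);
    $score = 0;
    foreach ($flagsA as $word => $capitalizedInA) {
        if (!array_key_exists($word, $flagsB)) {
            continue;
        }
        $score += ($capitalizedInA && $flagsB[$word]) ? CAPITALIZED_MATCH : PLAIN_MATCH;
    }
    return $score;
}

$headlines = [
    1 => 'Koizumi played baseball today.',
    2 => 'Baseball spring practice started today',
    3 => 'Koizumi was out having fun today.',
    4 => 'cardinals getting ready to play baseball',
];

echo similarityScore($headlines[1], $headlines[3]), "\n";            // 7
echo similarityScore($headlines[1], $headlines[2]), "\n";            // 4
echo similarityScore($headlines[1], $headlines[3], ['today']), "\n"; // 5
```

With "today" culled, 1 and 3 still score 5 while 2 and 3 drop to 0, which is the separation you'd want before applying a threshold.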

hope these ideas are helpful.

Author Comment

ID: 14022748
Thank you very much for your time.

What you suggested is what I´m trying to do, but I would rather use an existing script than do it myself.

I´m pretty sure someone has already done it!

Expert Comment

ID: 14027122
You might have some success by, for example, taking headline A and searching the headline database for matches using something like a MySQL FULLTEXT index. Headlines containing some of the words from headline A will naturally rank higher than headlines matching NO words. This doesn't tell you how to make GROUPS of matching headlines, but it does tell you how to do pairwise comparisons. You could then use some graph-theoretical methods to group 'neighbour' headlines together, perhaps?
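
One simple graph-style way to do the grouping, as a sketch only: treat every pair whose score reaches a threshold as an edge and take the connected components. groupHeadlines() is a made-up name, the scores are the sample numbers from earlier in this thread, and the threshold of 5 is arbitrary. Assumes PHP 7.1+ and integer headline ids.

```php
<?php
// Hypothetical helper: group headline ids whose pairwise score reaches a threshold.
// $pairs is a list of [idA, idB, score] triples from any pairwise comparison.
function groupHeadlines(array $ids, array $pairs, float $threshold): array
{
    // Every id starts as its own group (union-find parent table).
    $parent = array_combine($ids, $ids);

    // Follow parent links up to the representative of an id's group.
    $find = function ($id) use (&$parent) {
        while ($parent[$id] !== $id) {
            $id = $parent[$id];
        }
        return $id;
    };

    // Merge the groups of every pair that is similar enough.
    foreach ($pairs as [$a, $b, $score]) {
        if ($score >= $threshold) {
            $parent[$find($a)] = $find($b);
        }
    }

    // Collect the ids under their representative.
    $groups = [];
    foreach ($ids as $id) {
        $groups[$find($id)][] = $id;
    }
    return array_values($groups);
}

// Sample pairwise scores (headline ids 1 to 4).
$pairs = [[1, 3, 7], [2, 3, 2], [1, 2, 4], [2, 4, 2]];
print_r(groupHeadlines([1, 2, 3, 4], $pairs, 5));
```

Here only headlines 1 and 3 end up in the same group. Note that connected components chain: if A matches B and B matches C, all three land together even when A and C share nothing, so the threshold matters.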


Accepted Solution

virmaior earned 2000 total points
ID: 14027298
you could use the similarity values that FULLTEXT supplies to provide your basis.
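
Something along these lines, as a sketch. The table name headlines, the columns id and title, and the function name are all assumptions for illustration; it presumes a FULLTEXT index already exists on title and that you connect through PDO. Keep in mind that MySQL applies its own stop-word list and minimum word length, and the built-in stop-word list is English, so Spanish noise words may need separate handling.

```php
<?php
// Hypothetical schema: table `headlines` with columns `id` and `title`,
// and a FULLTEXT index on `title`.
function findSimilarHeadlines(PDO $db, string $headline, int $limit = 10): array
{
    // MATCH ... AGAINST returns a relevance value that serves as the similarity.
    // Two placeholders are used because the search text appears twice.
    $sql = 'SELECT id, title, MATCH(title) AGAINST(:q1) AS relevance
            FROM headlines
            WHERE MATCH(title) AGAINST(:q2)
            ORDER BY relevance DESC
            LIMIT ' . (int) $limit;

    $stmt = $db->prepare($sql);
    $stmt->execute([':q1' => $headline, ':q2' => $headline]);

    // Each row: id, title and the relevance score, best match first.
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}
```

The headline you search with will normally match itself, so skip that row (or exclude its id in the WHERE clause) before applying a relevance threshold.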

Expert Comment

ID: 14027369
that's what i'm saying... :)

Expert Comment

ID: 14027387
Hey, FWIW, I think the Google News thing probably does this based on the whole document text and not just the document headline. There's just not going to be enough reliable information in the headline alone.


Expert Comment

ID: 16214119
No comment has been added to this question in more than 21 days, so it is now classified as abandoned.
I will leave the following recommendation for this question in the Cleanup topic area:
Accept: virmaior

Any objections should be posted here in the next 4 days. After that time, the question will be closed.

EE Cleanup Volunteer


Question has a verified solution.
