Looking for a way to see if two strings are similar (using a better way than similar_text)

I've developed a script that grabs news headlines from different websites.
Now I'd like to match similar headlines.

I've tried similar_text, but the results are not good enough.
How does Google News do it?

Is there any code I should use to prepare the strings before comparing them, in order to get a more accurate result? (The content is in Spanish.)

Thank you!!
sean__sean Asked:
 
virmaior Commented:
You could use the relevance scores that a MySQL FULLTEXT search returns as the basis for your comparison.
 
virmaior Commented:
I know of a few elements that play into a probabilistic search (which is what you're trying to accomplish):
1. Strip out noise words (e.g. the, a, an, of, to, from, ...).
2. Normalize words to a common form (returned -> return, spent -> spend, etc.).
3. I'm guessing that capitalized words should also get priority in match importance (though I don't know whether this carries over into Spanish).
4. If you have access to a thesaurus, it could also prove helpful, since you could map synonyms onto a single word.

e.g. two articles that both have the word "Japan" in the headline have more in common than two articles that only share "spending" / "spend" (matched via the normalization rule).
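
A minimal sketch of rules 1 and 2, written in Python for brevity (the same steps map onto PHP string functions). The STOP_WORDS list is a hypothetical, deliberately tiny Spanish sample, and the helper names are made up for illustration. It only lowercases, folds accents and drops noise words; real stemming (visitó -> visitar) would need a stemming library and is not shown.

```python
import re
import unicodedata

# Hypothetical, very small Spanish stop-word list; a real one would be much longer.
STOP_WORDS = {"el", "la", "los", "las", "un", "una", "de", "del", "a", "en", "y", "hoy"}

def strip_accents(text):
    """Remove combining accent marks, e.g. 'Japón' -> 'Japon'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def normalize(headline):
    """Lowercase, fold accents, split into words and drop noise words."""
    words = re.findall(r"\w+", strip_accents(headline.lower()))
    return [w for w in words if w not in STOP_WORDS]

print(normalize("El presidente visitó Japón hoy"))
```

Note that accent folding also turns "ñ" into "n", which can merge distinct Spanish words; whether that trade-off is acceptable depends on your headlines.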

similar_text(),
levenshtein()
and soundex()
definitely won't help you on a full-string basis, unless the titles are REALLY similar.

my advice would be:
1. explode each string
2. normalize its contents (as per the rules above)
3. compare the normalized words to come up with a probability match
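
One possible way to turn those three steps into a single number between 0 and 1 is the Jaccard overlap of the two word sets. Jaccard is my own substitution here, not something named above, and the normalization is reduced to lowercasing to keep the sketch short (Python; function names are illustrative).

```python
import re

def word_set(headline):
    """Step 1 and a bare-bones step 2: split into words and lowercase them."""
    return {w.lower() for w in re.findall(r"\w+", headline)}

def jaccard(a, b):
    """Step 3: shared words divided by all distinct words in both headlines."""
    set_a, set_b = word_set(a), word_set(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

print(jaccard("Koizumi played baseball today.",
              "Koizumi was out having fun today."))
```

This treats every word as equally important, so it is only a starting point; weighting words differently is a separate refinement.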

for instance:
take four articles:
1. "Koizumi played baseball today."
2. "Baseball spring practice started today"
3. "Koizumi was out having fun today."
4. "cardinals getting ready to play baseball"

obviously the two articles having the most similarity would be 1 and 3.
The only two words in common between them are "Koizumi" and "today"
but today also appears in article 2.

so you'd want to score the matching, e.g. precise word match = 2 points, precise capitalized word match = 5 points.

then you'd want to set a threshold of similarity.
so then articles 1 and 3 have a similarity of 7
articles 2 and 3 have a similarity of 2 (today)
articles 1 and 2 have a similarity of 4 (baseball and today)
articles 4 and 2 have a similarity of 2 (baseball)
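
The scoring above can be sketched roughly like this (Python; the function and constant names are made up for illustration). I'm assuming words are compared case-insensitively and that a match counts as "capitalized" only when the word is capitalized in both headlines, which is what reproduces the numbers above.

```python
import re

WORD_POINTS = 2         # plain word match
CAPITALIZED_POINTS = 5  # match where the word is capitalized in both headlines

def tokens(headline):
    """Split a headline into words, dropping punctuation."""
    return re.findall(r"\w+", headline)

def score(a, b):
    """Sum the points for every word the two headlines share."""
    words_a = {w.lower(): w for w in tokens(a)}
    words_b = {w.lower(): w for w in tokens(b)}
    total = 0
    for key in words_a.keys() & words_b.keys():
        if words_a[key][0].isupper() and words_b[key][0].isupper():
            total += CAPITALIZED_POINTS
        else:
            total += WORD_POINTS
    return total

articles = [
    "Koizumi played baseball today.",
    "Baseball spring practice started today",
    "Koizumi was out having fun today.",
    "cardinals getting ready to play baseball",
]
print(score(articles[0], articles[2]))  # articles 1 and 3
```

One caveat: a word that merely starts both headlines would also earn the capitalized bonus, so you may want to treat the first word specially.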

since this is news, the word "today" should probably be culled as white noise.

hope these ideas are helpful.
 
sean__sean (Author) Commented:
Thank you very much for your time.

What you suggested is what I'm trying to do, but I would rather use an existing script than write it myself.

I'm pretty sure someone has already done it!
 
jdpipe Commented:
You might have some success by, for example, taking headline A and searching your headline database for matches using something like a MySQL FULLTEXT index. Headlines containing some of the words from headline A will naturally rank higher than headlines matching NO words. This doesn't tell you how to make GROUPS of matching headlines, but it does tell you how to do pairwise comparisons. You could then use some graph-theoretical methods to group 'neighbour' headlines together, perhaps?
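
As a rough sketch of the grouping idea: if you treat every pair of headlines that scored above your threshold as an edge, the groups are the connected components of that graph. The Python below uses a small union-find for this; group_headlines and its arguments are hypothetical names, and connected components is just one graph method among several you could pick.

```python
def group_headlines(count, similar_pairs):
    """Group headline indexes 0..count-1 that are linked by similar pairs."""
    parent = list(range(count))

    def find(i):
        # Follow parent links to the group's root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in similar_pairs:
        parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for i in range(count):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Headlines 0-2 and 2-4 were judged similar, so 0, 2 and 4 end up together.
print(group_headlines(5, [(0, 2), (2, 4)]))
```

Keep in mind that this chains matches: A and C land in the same group whenever both match B, even if A and C share nothing directly.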

JP
 
jdpipe Commented:
that's what i'm saying... :)
 
jdpipe Commented:
Hey, FWIW, I think the Google News thing probably does this based on the whole document text and not just the document headline. There's just not going to be enough reliable information in the headline alone.

JP
 
huji Commented:
No comment has been added to this question in more than 21 days, so it is now classified as abandoned.
I will leave the following recommendation for this question in the Cleanup topic area:
Accept: virmaior

Any objections should be posted here in the next 4 days. After that time, the question will be closed.

Huji
EE Cleanup Volunteer
Question has a verified solution.
