cyber-33 asked:

Python algorithm

I need a Python function that finds a MINIMUM set of words in a collection of sentences. The set must include at least one word from each sentence. The output should be a set of tuples, with the first element identifying the word and the second element listing the sentence ids the word corresponds to.

Input Example:
1. "Expert exchange is a nice site"
2. "There are few sites like expert exchange"
3. "google is a search engine"
4. "i can search for anything"
5. "cold day"

Output:
(exchange, {1,2}), (search, {3,4}), (cold, {5})

Thank you, experts!
ozo:

This is equivalent to Minimum Set Cover
https://en.wikipedia.org/wiki/Set_cover_problem
cyber-33 (asker):

Not really: the idea here is not to cover the sets, as in the set cover problem, but to find common values.

So in the example given by the article, where S = {{1, 2, 3}, {2, 4}, {3, 4}, {4, 5}}, the output that I need would be just two numbers, 2 and 4: 2 appears in the first two subsets, and 4 appears in the last two.

Interesting read, though.
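What the {2, 4} answer above describes is a minimum hitting set over ozo's example. As a brute-force sketch (`min_hitting_set` is a hypothetical name, not code from this thread), one can try element subsets of increasing size until every set is hit:

```python
from itertools import chain, combinations

def min_hitting_set(sets):
    """Smallest collection of elements intersecting every input set.
    Brute force over candidate subsets of increasing size; exact but
    exponential in the worst case."""
    universe = sorted(set(chain.from_iterable(sets)))
    for size in range(1, len(universe) + 1):
        for combo in combinations(universe, size):
            # a valid hitting set intersects every one of the input sets
            if all(set(combo) & s for s in sets):
                return set(combo)
    return set()

S = [{1, 2, 3}, {2, 4}, {3, 4}, {4, 5}]
print(min_hitting_set(S))  # a size-2 answer; {2, 4} is another minimum
```

Note there can be several minima of the same size; this sketch just returns the first one found in lexicographic order.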
How is Python going to receive the input, and exactly what format will it be in? If it is identical to your example, then I assume you are reading this data in from a file or something?

Secondly, does it matter which word is selected from each sentence? For example, would it matter if the first word from each sentence is selected, or does the function need to find the word common to the most sentences and choose that one to build the sets on?

A little more guidance and I think I can help you (I hope).
1. "Expert exchange is a nice site"
2. "There are few sites like expert exchange"
3. "google is a search engine"
4. "i can search for anything"
5. "cold day"
exchange = {1, 2}
search = {3, 4}
cold = {5}
Expert={1}
is={1,3}
a={1,3}
nice={1}
site={1}
There = {2}
are = {2}
few = {2}
sites={2}
like={2}
expert={2}
...
S={{1, 2}, {3, 4}, {5},{1},{1,3},{1,3},{1},{1},{2},{2},{2},{2},...}
Mark - thank you so much!
The sentences will come from a text file, one sentence per line.

I am planning to use a simple function that verifies whether a word is "interesting" and thus should be considered. For example, that function can return true if the word is not on a list of noise words. That's pretty simple, though, and I didn't feel comfortable asking the experts to spend their time on it. So, for example, the word "is" would likely appear in most sentences, but it's not helpful.
Ozo, I must have confused you with my instructions - my apologies. The answer that you provided simply lists counts of words in sentences - that's not what I need. I really need the MINIMUM set of words that would include at least one word from each sentence. So in my example, after

exchange = {1, 2}
search = {3, 4}
cold = {5}

no other words are necessary because sentences 1,2,3,4 and 5 are already referenced.
So the Minimum Set Cover of
S={{1, 2}, {3, 4}, {5},{1},{1,3},{1,3},{1},{1},{2},{2},{2},{2},...}
is
{{1, 2}, {3, 4}, {5}}
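That minimum cover can also be found with the standard greedy set-cover heuristic (a sketch; `greedy_cover` is a hypothetical name, and greedy is only an approximation in general, though it is exact on this example): repeatedly pick the word whose sentence set covers the most still-uncovered sentences.

```python
def greedy_cover(word_sets):
    """Greedy set cover: repeatedly pick the set covering the most
    uncovered elements. An approximation, not guaranteed minimal."""
    uncovered = set().union(*word_sets.values())
    chosen = {}
    while uncovered:
        # pick the word whose sentence set hits the most uncovered ids
        best = max(word_sets, key=lambda w: len(word_sets[w] & uncovered))
        chosen[best] = word_sets[best]
        uncovered -= word_sets[best]
    return chosen

word_sets = {"exchange": {1, 2}, "search": {3, 4}, "cold": {5},
             "nice": {1}, "google": {3}, "day": {5}}
print(greedy_cover(word_sets))  # three words covering sentences 1-5
```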
Ozo,
Assuming the set S represents the collection of sentences, the minimum set in your example would be

1, 3, 5, 2

Note, 4 is NOT needed because the set {3, 4} is already referenced via 3.
"is" is more helpful here than either "nice" or "engine", since any cover that contained "nice" could be replaced with a cover that contained "is"
There is no word that covers sentences {1, 3, 5, 2}

search = {3,4} = {"google is a search engine", "i can search for anything"}
Either "search" or "i" or "can" or "for" or "anything" is required to cover 4
Of these, "search" is the most useful because it also covers 3
Ozo, re search - you are correct. In my prior comment I was referring to the set listed in your example:
S={{1, 2}, {3, 4}, {5},{1},{1,3},{1,3},{1},{1},{2},{2},{2},{2},...}
S={{1, 2}, {3, 4}, {5},{1},{1,3},{1,3},{1},{1},{2},{2},{2},{2},...}  does not contain the set {1, 3, 5, 2}

It does contain the set {1} ("nice" or "site"), the set {3} ("google" or "engine"), the set {5} ("cold" or "day"), and the set {2} ("There" or "are" or "few" or "sites" or "like")
but {{1}, {3}, {5}, {2}} is neither minimal nor a cover
Does case sensitivity really matter? Although you used "exchange", you might have used "Expert", since it appears before "exchange" in {1, 2}, differing only by case.
Deciding when two words count as the same may be beyond the scope of this question.
"Expert" may be the same as "expert", but would "Nice" be the same as "nice"?  (even if you've been there and think Nice is nice)
And maybe "sites" could be considered a variant of "site", and if we allow pluralizing, "are" might be considered the same as "is".
Good questions re similarity of words. I can convert them to lower case to avoid case sensitivity. Also, it would be nice to check for synonyms - perhaps they can be grouped together. But as ozo pointed out, that's beyond the scope of this question and is a "very nice to have" rather than a must.
If there are no more than 32 sentences, I might represent each set as a bitmapped int,
eliminate any word whose set is a subset of another word's,
and do a greedy ordered depth-first brute-force search.
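That bitmap idea might be sketched like this (a sketch only; `exact_min_cover` is my own name, not ozo's actual code): each word's sentence set becomes an int with one bit per sentence, subset words are dropped, and a depth-first search with pruning finds an exact minimum cover.

```python
def exact_min_cover(masks, full):
    """Exact minimum cover by depth-first search over bitmapped sets.
    masks: list of (word, bitmask); full: bitmask with one bit per sentence."""
    # drop any word whose sentence set is a strict subset of another word's
    masks = [(w, m) for w, m in masks
             if not any(m != m2 and m | m2 == m2 for _, m2 in masks)]
    masks.sort(key=lambda wm: -bin(wm[1]).count("1"))  # biggest sets first
    best = [None]

    def dfs(i, covered, chosen):
        if best[0] is not None and len(chosen) >= len(best[0]):
            return  # can't beat the best cover found so far
        if covered == full:
            best[0] = list(chosen)
            return
        if i == len(masks):
            return
        w, m = masks[i]
        if m & ~covered:                      # only try words adding coverage
            dfs(i + 1, covered | m, chosen + [w])
        dfs(i + 1, covered, chosen)           # also try skipping this word

    dfs(0, 0, [])
    return best[0]

# sentences 1-5 as bits 0-4
masks = [("exchange", 0b00011), ("search", 0b01100),
         ("cold", 0b10000), ("nice", 0b00001), ("google", 0b00100)]
print(exact_min_cover(masks, 0b11111))  # → ['exchange', 'search', 'cold']
```

The subset elimination matters: "nice" ({1}) can never beat "exchange" ({1, 2}), so it is pruned before the search even starts.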
The word part might look like the following.  I'm using print to debug the code.  Also, your sentences are hard-coded rather than being read from a file.

I think the next step is to use a combination/permutation library to combine these until you find a covering set.

    from collections import defaultdict

    stopwords = {'a','able','about','across','after','all','almost','also','am','among','an','and','any','are','as','at','be','because','been','but','by','can','cannot','could','dear','did','do','does','either','else','ever','every','for','from','get','got','had','has','have','he','her','hers','him','his','how','however','i','if','in','into','is','it','its','just','least','let','like','likely','may','me','might','most','must','my','neither','no','nor','not','of','off','often','on','only','or','other','our','own','rather','said','say','says','she','should','since','so','some','than','that','the','their','them','then','there','these','they','this','tis','to','too','twas','us','wants','was','we','were','what','when','where','which','while','who','whom','why','will','with','would','yet','you','your'}
    slist = ["Expert exchange is a nice site", "There are few sites like expert exchange",
             "google is a search engine", "i can search for anything", "cold day"]
    d = defaultdict(list)                     # word -> list of sentence ids
    for posn, s in enumerate(slist):
        for w in s.lower().split():
            if w not in stopwords:
                d[w].append(posn)
    print(dict(d))
    dd = sorted(d, key=lambda w: len(d[w]), reverse=True)  # sort by list length
    print(dd)
    for w in dd:
        print(w, d[w])
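The combination step suggested above could be sketched with itertools (`find_min_cover` is my own name; the toy dict below mirrors the shape of the `d` mapping built by the code above): try word combinations of increasing size until one covers every sentence id.

```python
from itertools import combinations

def find_min_cover(d, n_sentences):
    """Brute force: try word combinations of increasing size until
    one covers all sentence ids. Exact but exponential in the worst case."""
    all_ids = set(range(n_sentences))
    words = list(d)
    for size in range(1, len(words) + 1):
        for combo in combinations(words, size):
            covered = set()
            for w in combo:
                covered.update(d[w])
            if covered == all_ids:
                # return (word, sentence ids) pairs, as in the question
                return [(w, sorted(d[w])) for w in combo]
    return None

# toy word -> sentence-id mapping (0-based ids, as enumerate produces)
d = {"exchange": [0, 1], "search": [2, 3], "cold": [4], "nice": [0]}
print(find_min_cover(d, 5))
```

Because sizes are tried in increasing order, the first cover found is guaranteed minimal; the cost is that the search can blow up with many words, which is where ozo's pruning ideas help.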


ASKER CERTIFIED SOLUTION
aikimark
Have you had a chance to test this?
This is a pretty cool algorithm. I tested it with additional sentences and got pretty good results. Thank you, aikimark, for a great solution!!!