Fuzzy search on a Python lookup

I am using this solution to provide a lookup: http://www.experts-exchange.com/Programming/Languages/Scripting/Python/Q_28507029.html

I have a list of categories in a table.
I have a lookup now, thanks to two awesome individuals in the other question.

My problem now is that the categories in the table do not exactly match the keys in the lookup.
For example:

I have "Sports" and "Outdoors" in the lookup, but I have "Sports & Outdoors" in the table.
I have "Electrical Wires" and "Electrical Cables" in the lookup, but I have "Electrical" in the table.

What I want to do is run through the lookup to find all the possible values, then take the MAX of the potential output values and use that.

I have the MAX thing.  What I need to do is the fuzzy matching thing.  How can I do this in python?
Thanks
Evan CutlerVolunteer Chief Information OfficerAsked:

aikimarkCommented:
Use a regular expression to find/match the shorter name anywhere within the longer name.

Your pattern would be something like this: ".*?" + shortername + ".*?"
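A minimal sketch of that idea, assuming a small stand-in `lookup` dict (the keys and formulas below are hypothetical examples taken from the question). The shorter name is escaped with `re.escape` so that characters such as `&` are treated literally, and `re.search` already finds the pattern anywhere in the string, so the surrounding `.*?` is not needed:

```python
import re

# Hypothetical stand-in for the in-memory lookup table.
lookup = {
    "Sports": "0.15*Price + 2.00",
    "Outdoors": "0.17*Price + 2.00",
    "Electrical Wires": "0.18*Price + 2.00",
    "Electrical Cables": "0.19*Price + 2.00",
}

def substring_matches(category, lookup):
    """Return the lookup keys where the shorter name occurs inside the longer one."""
    matches = []
    for key in lookup:
        # Decide which of the two names is shorter; search for it in the other.
        shorter, longer = sorted((key, category), key=len)
        if re.search(re.escape(shorter), longer, re.IGNORECASE):
            matches.append(key)
    return matches

print(substring_matches("Sports & Outdoors", lookup))
print(substring_matches("Electrical", lookup))
```

This only catches exact substrings; a name such as "Electric wires" would not match "Electrical Wires" this way.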
peprCommented:
I did not use it personally; anyway, I would try the FuzzyWuzzy project  http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/ The git repository is at https://github.com/seatgeek/fuzzywuzzy.
Evan CutlerVolunteer Chief Information OfficerAuthor Commented:
the problem I am having is that the lookup table is set in memory.  CSV files are read to load it:
lookup[row[0]] = fixed_formula(row[1])

To that end, the expectation is that the category name will match the key by string matching:
lookup[category]

This is where I have the problem.  How do I fuzzy match to the keys in the lookup table?
Thanks
aikimarkCommented:
keys are exact matched.  (think hash tables)

You will have to iterate the keys to do fuzzy or pattern matching.
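As a rough illustration of iterating the keys, here is a sketch that uses `difflib.SequenceMatcher` from the standard library instead of a third-party package. The `lookup` contents and the helper name `best_key` are assumptions made for the example, not part of the original code:

```python
from difflib import SequenceMatcher

# Hypothetical stand-in for the in-memory lookup table.
lookup = {
    "Sports": "0.15*Price + 2.00",
    "Outdoors": "0.17*Price + 2.00",
    "Electrical": "0.18*Price + 2.00",
}

def best_key(category, lookup):
    """Return (key, ratio) for the lookup key most similar to category."""
    def score(key):
        # ratio() is a similarity between 0.0 and 1.0; compare case-insensitively.
        return SequenceMatcher(None, category.lower(), key.lower()).ratio()

    best = max(lookup, key=score)   # iterates over every key in the dict
    return best, score(best)

key, ratio = best_key("Electric wires", lookup)
print(key, round(ratio, 2))
formula = lookup[key]               # exact-match access with the key that was found
```

The dictionary access itself stays an exact match; the fuzzy part only decides which existing key to use.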
Evan CutlerVolunteer Chief Information OfficerAuthor Commented:
pepr, I like your fuzzywuzzy, but I'm on Python 3. Is there a set of instructions for Python 3?
Thanks
peprCommented:
@Evan: FuzzyWuzzy should also work with Python 3. I will try it when I am at a normal computer.

As aikimark mentioned, you should fuzzy match the categories from your lookup table against the string and use the category with the maximum match value.
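If you want every plausible category rather than only the single best one, and then the MAX of the resulting values as described in the question, something like the following sketch could work. It uses `difflib.get_close_matches` from the standard library; the `lookup` with callables, the function name `max_candidate_value`, and the cutoff of 0.5 are all assumptions for illustration and would need tuning on real data:

```python
from difflib import get_close_matches

# Hypothetical lookup: category -> function of the price (no eval needed here).
lookup = {
    "Sports": lambda price: 0.15 * price + 2.00,
    "Outdoors": lambda price: 0.17 * price + 2.00,
    "Electrical": lambda price: 0.18 * price + 2.00,
}

def max_candidate_value(category, price, lookup, cutoff=0.5):
    """Evaluate every key that is similar enough and return the largest value."""
    # get_close_matches is case-sensitive; it returns the keys whose
    # similarity ratio is at least `cutoff`, best match first.
    candidates = get_close_matches(category, list(lookup),
                                   n=len(lookup), cutoff=cutoff)
    if not candidates:
        return None                 # nothing similar enough
    return max(lookup[key](price) for key in candidates)

print(max_candidate_value("Sports & Outdoors", 25.0, lookup))
```

With this cutoff both "Sports" and "Outdoors" qualify for "Sports & Outdoors", so the larger of the two computed values is used.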
peprCommented:
Evan, I did install the FuzzyWuzzy on Windows with Python 3.4 and it works. Describe what you did.

I tried the following code:

c.py
#!python3

import csv
from fuzzywuzzy import process

class Calculator:

    def __init__(self, csv_fname):

        # Load the lookup table of the calculator, fill
        # it with formulas from csv_fname.
        self.lookup = {}
        with open(csv_fname, newline='') as f:
            reader = csv.reader(f)
            for row in reader:
                self.lookup[row[0]] = self.fixed_formula(row[1])

        # Get the list of all possible categories.
        self.categories = list(self.lookup)

    def fixed_formula(self, formula):
        '''Auxiliary function to check/fix the formula syntax.'''
        return formula.strip().replace('x', '*')

    def price(self, Price, goods_name):
        category, probability = process.extractOne(
                                goods_name, self.categories)
        result = eval(self.lookup[category])   #!!! WARNING eval() is Evil
        print('{}: {} --> {} ({}, {} %)'.format(
            goods_name, Price, result, category, probability))
        return result


if __name__ == '__main__':

    # Set the calculator that uses the formulas
    # for categories defined in the file.
    calc = Calculator('data.csv')
    p1 = calc.price(16.5, 'Electric wires')
    p2 = calc.price(20.0, 'Electric cables')
    p3 = calc.price(25.0, 'Sports & Outdoors')


and the data.csv
Sports, .15xPrice + 2.00
Outdoors, .17xPrice + 2.00
Electrical, .18xPrice + 2.00


The script prints
c:\...\Q_28508267>c.py
Electric wires: 16.5 --> 4.97 (Electrical, 67 %)
Electric cables: 20.0 --> 5.6 (Electrical, 80 %)
Sports & Outdoors: 25.0 --> 5.75 (Sports, 90 %)


However, it is fuzzy. It requires further investigation. Warning: this was my first time using FuzzyWuzzy, and I did not study its internals.
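About the eval() warning in the code: one possible way to avoid it is to parse the formula with the standard `ast` module and allow only numbers, named variables, and basic arithmetic. This is only a sketch under those assumptions (the name `safe_eval` is made up here, and `ast.Constant` requires Python 3.8 or newer):

```python
import ast
import operator

# The only operators the evaluator accepts.
_BINARY = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}
_UNARY = {ast.USub: operator.neg, ast.UAdd: operator.pos}

def safe_eval(formula, variables):
    """Evaluate a formula made of numbers, known variables and + - * / only."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.Name) and node.id in variables:
            return variables[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](walk(node.operand))
        # Function calls, attribute access, unknown names, etc. end up here.
        raise ValueError('unsupported expression: ' + ast.dump(node))

    return walk(ast.parse(formula.strip(), mode='eval'))

print(safe_eval('.18*Price + 2.00', {'Price': 16.5}))
```

In the Calculator class, the eval() line could then become something like `safe_eval(self.lookup[category], {'Price': Price})`.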
Evan CutlerVolunteer Chief Information OfficerAuthor Commented:
that is awesome pepr....
I tried to install fuzzywuzzy (ok, that name is funny)
using pip:

pip install fuzzywuzzy.  I am on python 3.4
it's giving me a decoding cp1252 error.  the actual error says:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1983: character maps to <undefined>

I've seen this error before, and in the past I was told it was a python2 vs python3 thing.
peprCommented:
http://en.wikipedia.org/wiki/Talk%3AFuzzy_Wuzzy   :)

I did not try pip on Windows. Try to download the zip from http://github.com/seatgeek/fuzzywuzzy/zipball/master
unzip it and if you have both Python 2 and Python 3, then install it via

py -3 setup.py install

from within the unpacked directory.
peprCommented:
I also recommend cleaning up the data.csv -- the formulas. The xPrice is really fragile.
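To show why it is fragile: a plain replace('x', '*') changes every x, including one inside a variable name. If the files cannot be cleaned up, a narrower substitution is one option. The sketch below is an assumption about what the formulas look like (an x between a number and a name), and the helper name `fix_multiplication` is made up:

```python
import re

def fix_multiplication(formula):
    """Replace an 'x' that sits between a number and a name with '*'."""
    # (?<=[\d.])    the x must follow a digit or a decimal point
    # (?=[A-Za-z_]) and must be followed by the start of a name
    return re.sub(r'(?<=[\d.])\s*x\s*(?=[A-Za-z_])', '*', formula.strip())

print(fix_multiplication(' .15xPrice + 2.00'))
print(fix_multiplication('.15xmax_price'))
print('.15xmax_price'.replace('x', '*'))   # the naive version for comparison
```

The last line shows the naive replacement also breaking the x inside the name.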
peprCommented:
In what situation did you observe the UnicodeDecodeError? It may be that you tried to print some text to the cmd console and Python could not convert the text to cp1252 for the console. If this is the case, try writing the result to a UTF-8 file instead:
    ... 
    with open('output.txt', 'w', encoding='utf-8') as f:
        ...   
        f.write('the text')
        ...
    ...

Evan CutlerVolunteer Chief Information OfficerAuthor Commented:
Thanks. No. It gave the error when I tried to pip install.
I'll try the Python setup method.
Thanks.
I'll award points now, and if I have issues during install I'll re-ask for more points.
Thanks.
Evan CutlerVolunteer Chief Information OfficerAuthor Commented:
This is definitely going in my library.
Thanks
Evan CutlerVolunteer Chief Information OfficerAuthor Commented:
Oh, and I used x by mistake.  I have already replaced it with * in my formulas.  Thanks
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.