Solved

Fuzzy Search on python lookup

Posted on 2014-08-29
Last Modified: 2014-08-30
I am using this solution to provide a lookup: http://www.experts-exchange.com/Programming/Languages/Scripting/Python/Q_28507029.html

I have a list of categories in a table.
I have a lookup now, thanks to two awesome individuals in the other question.

My problem now is that the categories in the table do not exactly match the keys in the lookup.
For example:

I have "Sports" and "Outdoors" in the lookup, but I have "Sports & Outdoors" in the table.
I have "Electrical Wires" and "Electrical Cables" in the lookup, but I have "Electrical" in the table.

What I want to do is run through the lookup to find all the possible matches, then take the MAX of the potential output values and use that.

I have the MAX part working. What I still need is the fuzzy matching. How can I do this in Python?
Thanks
Question by:Evan Cutler
14 Comments
 

Expert Comment

by:aikimark
ID: 40293537
Use a regular expression to find/match the shorter name anywhere within the longer name.

Your pattern would be something like this: ".*?" + shortername + ".*?"
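As a rough sketch of this idea (the function name and sample keys are made up for illustration): `re.search` already scans the whole string, so the `.*?` wrappers are not strictly needed, and `re.escape` keeps characters such as `&` or `.` in a category name from being read as regex syntax. The shorter of the two names is searched for inside the longer one, so it works in both directions.

```python
import re

def substring_matches(category, keys):
    """Return the keys that contain `category` or are contained in it."""
    matches = []
    for key in keys:
        # Always search for the shorter name inside the longer one.
        shorter, longer = sorted((key, category), key=len)
        if re.search(re.escape(shorter), longer, re.IGNORECASE):
            matches.append(key)
    return matches

# Hypothetical lookup keys, taken from the examples in the question.
lookup_keys = ["Sports", "Outdoors", "Electrical Wires", "Electrical Cables"]
print(substring_matches("Sports & Outdoors", lookup_keys))
print(substring_matches("Electrical", lookup_keys))
```

This only catches names where one is literally contained in the other; "Electric wires" against "Electrical" would not match this way.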

Expert Comment

by:pepr
ID: 40293763
I have not used it personally, but I would try the FuzzyWuzzy project: http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/ The git repository is at https://github.com/seatgeek/fuzzywuzzy.

Author Comment

by:Evan Cutler
ID: 40293914
The problem I am having is that the lookup table is held in memory. CSV files are read to load it:
lookup[row[0]] = fixed_formula(row[1])

To that end, the expectation is that the category name will match the key by string matching:
lookup[category]

This is where I have the problem.  How do I fuzzy match to the keys in the lookup table?
Thanks

Expert Comment

by:aikimark
ID: 40294049
Dictionary keys are matched exactly (think hash tables).

You will have to iterate the keys to do fuzzy or pattern matching.
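A minimal sketch of what that could look like, using the standard library's `difflib` rather than any third-party package. The helper name `fuzzy_lookup`, the sample formulas, and the cutoff of 0.6 are assumptions for illustration. The exact hash lookup is tried first, and the keys are only iterated when it fails.

```python
import difflib

def fuzzy_lookup(lookup, category, cutoff=0.6):
    """Return lookup[category], falling back to the closest key, or None."""
    if category in lookup:
        # Exact hash lookup: no iteration needed.
        return lookup[category]
    # get_close_matches() compares `category` against every key and keeps
    # the best one whose similarity ratio is at least `cutoff`.
    candidates = difflib.get_close_matches(category, list(lookup), n=1, cutoff=cutoff)
    return lookup[candidates[0]] if candidates else None

# Hypothetical lookup table in the shape used in this thread.
lookup = {
    "Sports": ".15*Price + 2.00",
    "Outdoors": ".17*Price + 2.00",
    "Electrical": ".18*Price + 2.00",
}
print(fuzzy_lookup(lookup, "Electric wires"))
print(fuzzy_lookup(lookup, "zzz"))
```

Returning None for anything below the cutoff lets the caller decide what to do with categories that have no reasonable match.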

Author Comment

by:Evan Cutler
ID: 40294054
pepr, I like your fuzzywuzzy, but I'm on Python 3. Is there a set of instructions for Python 3?
Thanks

Expert Comment

by:pepr
ID: 40294355
@Evan: FuzzyWuzzy should also work with Python 3. I will try it when I am at a normal computer.

As aikimark mentioned, you should fuzzy match the categories from your lookup table against the string and use the category with the maximum match value.
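To make the "maximum match value" step concrete, here is a small sketch that scores every category and keeps the highest one. It uses `difflib.SequenceMatcher` from the standard library as a stand-in scorer, so the numbers will not in general agree with FuzzyWuzzy's; the function name and sample data are assumptions.

```python
from difflib import SequenceMatcher

def best_category(name, categories):
    """Return (category, score) for the category most similar to `name`."""
    def score(category):
        # ratio() is a similarity between 0.0 and 1.0; compare case-insensitively.
        return SequenceMatcher(None, name.lower(), category.lower()).ratio()
    best = max(categories, key=score)
    return best, score(best)

# Hypothetical categories, as in the lookup discussed in this thread.
categories = ["Sports", "Outdoors", "Electrical"]
for goods_name in ("Electric wires", "Electric cables"):
    category, score = best_category(goods_name, categories)
    print("{}: {} ({:.0f} %)".format(goods_name, category, score * 100))
```

Since `max` always returns something, it is probably worth rejecting results whose score falls below some threshold you choose.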

Accepted Solution

by:
pepr earned 2000 total points
ID: 40294634
Evan, I installed FuzzyWuzzy on Windows with Python 3.4 and it works. Please describe what you did.

I tried the following code:

c.py
#!python3

import csv
from fuzzywuzzy import process

class Calculator:

    def __init__(self, csv_fname):

        # Load the lookup table of the calculator, fill
        # it with formulas from csv_fname.
        self.lookup = {}
        with open(csv_fname, newline='') as f:
            reader = csv.reader(f)
            for row in reader:
                self.lookup[row[0]] = self.fixed_formula(row[1])

        # Get the list of all possible categories.
        self.categories = list(self.lookup)

    def fixed_formula(self, formula):
        '''Auxiliary function to check/fix the formula syntax.'''
        return formula.strip().replace('x', '*')

    def price(self, Price, goods_name):
        category, probability = process.extractOne(
                                goods_name, self.categories)
        result = eval(self.lookup[category])   #!!! WARNING eval() is Evil
        print('{}: {} --> {} ({}, {} %)'.format(
            goods_name, Price, result, category, probability))
        return result


if __name__ == '__main__':

    # Set the calculator that uses the formulas
    # for categories defined in the file.
    calc = Calculator('data.csv')
    p1 = calc.price(16.5, 'Electric wires')
    p2 = calc.price(20.0, 'Electric cables')
    p3 = calc.price(25.0, 'Sports & Outdoors')


and the data.csv
Sports, .15xPrice + 2.00
Outdoors, .17xPrice + 2.00
Electrical, .18xPrice + 2.00


The script prints
c:\...\Q_28508267>c.py
Electric wires: 16.5 --> 4.97 (Electrical, 67 %)
Electric cables: 20.0 --> 5.6 (Electrical, 80 %)
Sports & Outdoors: 25.0 --> 5.75 (Sports, 90 %)


However, the matching is fuzzy, so the results require further investigation. Warning: this was my first time using FuzzyWuzzy, and I did not study its internals.

Author Comment

by:Evan Cutler
ID: 40294638
That is awesome, pepr.
I tried to install fuzzywuzzy (ok, that name is funny) using pip:

pip install fuzzywuzzy

I am on Python 3.4, and it gives me a cp1252 decoding error. The actual error says:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1983: character maps to <undefined>

I've seen this error before, and in the past I was told it was a python2 vs python3 thing.

Expert Comment

by:pepr
ID: 40294664
http://en.wikipedia.org/wiki/Talk%3AFuzzy_Wuzzy   :)

I have not tried pip on Windows. Try downloading the zip from http://github.com/seatgeek/fuzzywuzzy/zipball/master
and unzipping it. If you have both Python 2 and Python 3 installed, install it via

py -3 setup.py install

from within the unpacked directory.

Expert Comment

by:pepr
ID: 40294669
I also recommend cleaning up the formulas in data.csv. The xPrice notation is really fragile.
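One possible way to make this less fragile, and to avoid eval() at the same time, is to parse each formula into numbers when the CSV is loaded. This is only a sketch under the assumption that every formula has the shape "rate*Price + fee" (with x still tolerated in place of *); the names `parse_formula` and `apply_formula` are made up for illustration.

```python
import re

# Accepts e.g. ".15*Price + 2.00" or ".15xPrice + 2.00" and nothing else.
FORMULA_RE = re.compile(r"^\s*(\d*\.?\d+)\s*[x*]\s*Price\s*\+\s*(\d*\.?\d+)\s*$")

def parse_formula(formula):
    """Return (rate, fee) for a formula of the form 'rate*Price + fee'."""
    match = FORMULA_RE.match(formula)
    if match is None:
        raise ValueError("Unsupported formula: {!r}".format(formula))
    return float(match.group(1)), float(match.group(2))

def apply_formula(parsed, price):
    """Evaluate a parsed (rate, fee) pair for the given price."""
    rate, fee = parsed
    return rate * price + fee

parsed = parse_formula(" .18xPrice + 2.00")
print(apply_formula(parsed, 16.5))
```

Anything that does not fit the expected shape raises ValueError at load time instead of being executed later.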

Expert Comment

by:pepr
ID: 40294676
What was the situation when you observed the UnicodeDecodeError? It may be that you tried to print some text to the cmd console and Python could not convert the text to cp1252 for the console. If this is the case, try writing the result to a UTF-8 file instead:
    ... 
    with open('output.txt', 'w', encoding='utf-8') as f:
        ...   
        f.write('the text')
        ...
    ...
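For reference, here is a small sketch of how this kind of error can arise on the reading side. Byte 0x9d has no character assigned in cp1252, but it does occur in UTF-8 text, for example as the last byte of a closing curly quote. Whether that is what happened in this particular case is only a guess; the sample text and temporary file are made up for the demonstration.

```python
import os
import tempfile

# UTF-8 bytes for text with curly quotes; the closing quote ends in 0x9d.
raw = "FuzzyWuzzy \u201cfuzzy\u201d matching".encode("utf-8")

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
path = tmp.name

# Reading with the cp1252 codec fails on the undefined byte 0x9d.
try:
    with open(path, encoding="cp1252") as f:
        f.read()
    decoded_as_cp1252 = True
except UnicodeDecodeError:
    decoded_as_cp1252 = False

# Naming the real encoding explicitly reads the same file without error.
with open(path, encoding="utf-8") as f:
    text = f.read()

os.remove(path)
print(decoded_as_cp1252, text == raw.decode("utf-8"))
```

So passing `encoding='utf-8'` explicitly helps when reading files as well as when writing them.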


Author Comment

by:Evan Cutler
ID: 40294696
Thanks. No. It gave the error when I tried to pip install.
I'll try the python setup method
Thanks.
I'll award points now, and if I have issues during the install I'll ask again for more points.
Thanks.

Author Closing Comment

by:Evan Cutler
ID: 40294697
This is definitely going in my library.
Thanks

Author Comment

by:Evan Cutler
ID: 40294698
Oh, and I used x by mistake. I have already replaced it with * in my formulas. Thanks